Prosecution Insights
Last updated: April 19, 2026
Application No. 17/969,591

REINFORCEMENT LEARNING-BASED ENHANCED DISTRIBUTED CHANNEL ACCESS

Status: Final Rejection (§103)
Filed: Oct 19, 2022
Examiner: VANWORMER, SKYLAR K
Art Unit: 2146
Tech Center: 2100 — Computer Architecture & Software
Assignee: Qualcomm Incorporated
OA Round: 2 (Final)
Grant Probability: 39% (At Risk)
Expected OA Rounds: 3-4
Median Time to Grant: 4y 4m
Grant Probability With Interview: 62%

Examiner Intelligence

Career Allow Rate: 39% (11 granted of 28 resolved; -15.7% vs TC avg)
Interview Lift: +22.5% among resolved cases with an interview
Typical Timeline: 4y 4m average prosecution; 29 applications currently pending
Career History: 57 total applications across all art units

Statute-Specific Performance

§101: 27.7% (-12.3% vs TC avg)
§103: 61.4% (+21.4% vs TC avg)
§102: 2.8% (-37.2% vs TC avg)
§112: 8.1% (-31.9% vs TC avg)

Tech Center averages are estimates. Based on career data from 28 resolved cases.

Office Action (§103)

DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

In regard to the 35 U.S.C. § 103 rejection, applicant's arguments with respect to claims 1-5 and 7-30 have been considered. However, Guo has been remapped in combination with the reference Ganesan to teach the amended features. Examiner respectfully directs Applicant to the detailed rejection for an explanation of how the references disclose the argued limitations. Guo teaches an indicator that allows an output to transmit within a certain window of time. Specifically:

receive an indication that the wireless communication device is allowed to use an output of the machine learning model to derive one or more of a transmission probability, a backoff counter, a duration of a contention window, or one or more parameters associated with the duration of the contention window; (Guo, pg. 1591, Col. 2, paragraph 2: "In this sense, the optimal action for user i at time slot t can be described as a_i^t = 1 if i = argmax_j P_j^t, and 0 otherwise, where P_i^t ≜ U_i^t / V_i^t denotes the priority of user i at time t. U_i^t is an indicator [receive an indication] of the buffer emptiness [one or more parameters associated with the duration of the contention window], i.e., U_i^t = 0 if the buffer of user i at time t is empty, U_i^t = 1 if user i has packet to transmit at time t [allowed to use an output of the machine learning model]. V_i^t is the average throughput of user i over last second.")

Therefore, the 35 U.S.C. § 103 rejection is maintained.
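The priority rule quoted above is a proportional-fair-style selection: among users with queued data, the one with the lowest recent average throughput wins the slot. A minimal sketch of that rule, assuming P_i = U_i / V_i as reconstructed above (the function name and example values are illustrative, not from Guo):

```python
# Sketch of the argmax-priority rule quoted from Guo (pg. 1591): user i
# transmits at slot t iff it has the highest priority P_i = U_i / V_i, where
# U_i indicates a non-empty buffer and V_i is the recent average throughput.
def select_transmitter(buffer_nonempty: list, avg_throughput: list):
    """Return the index of the user that should transmit, or None if all buffers are empty."""
    priorities = [
        (1.0 if u else 0.0) / max(v, 1e-9)  # U_i / V_i, guarding against zero throughput
        for u, v in zip(buffer_nonempty, avg_throughput)
    ]
    best = max(range(len(priorities)), key=priorities.__getitem__)
    return best if priorities[best] > 0 else None

# User 1 has data queued and the lowest recent throughput, so it wins the slot.
print(select_transmitter([True, True, False], [2.0, 0.5, 1.0]))  # -> 1
```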
Claim Rejections - 35 U.S.C. § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 8-15, 17-28 and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. (Multi-Agent Reinforcement Learning-Based Distributed Channel Access for Next Generation Wireless Networks, "Guo") [May 2022], in view of Ganesan et al. (US Published Patent Application No. 2020/0012748, "Ganesan") [Jan. 9, 2020].

In regard to claim 1 and analogous claim 21, Guo teaches A wireless communication device, comprising: (Guo, pg. 1590, III., paragraph 1: "As shown in Fig. 1, we consider a time-slotted wireless network where n stations (STAs) seek channel access opportunity to transmit data packets to their associated Access Point (AP)") [...]

receive information associated with a machine learning model, wherein the machine learning model is associated with performing a distributed channel access procedure at the wireless communication device in a wireless local area network in accordance with the information; and (Guo, pg. 1590, III., paragraph 1: "As shown in Fig. 1, we consider a time-slotted wireless network where n stations (STAs) seek channel access opportunity to transmit data packets to their associated Access Point (AP)… A transmission once starts will last for multiple time slots, the number of which is denoted as packet length. A transmission is successful if and only if there is only one STA transmitting during the whole packet length." And Abstract: "Therefore, we propose a new MAC protocol, QMIX-advanced Listen-Before-Talk (QLBT), based on the cutting-edge multi-agent reinforcement learning (MARL) algorithm. It employs a centralized training with decentralized execution (CTDE) framework to exploit the overall information of all agents during training, and ensure that each agent can independently infer the optimal channel access behavior [distributed channel access procedure] based on its local observation." And the first paragraph of Section V, which teaches: "Based on the Dec-POMDP formulation introduced above, we propose a CTDE QLBT algorithm to find the optimal channel access policy. The framework of the proposed algorithm is shown in Fig. 2a. The centralized training is performed at AP side based on the experiences consisting of joint action-observation history, joint action, global environment state, and reward." And Fig. 2, which teaches: "(a) CTDE framework: Centralized training is performed at AP side based on experiences reported by each STA. After training, AP sends out the agent network parameters [information] to corresponding STA [wireless communication device].")

receive an indication that the wireless communication device is allowed to use an output of the machine learning model to derive one or more of a transmission probability, a backoff counter, a duration of a contention window, or one or more parameters associated with the duration of the contention window; (Guo, pg. 1591, Col. 2, paragraph 2: "In this sense, the optimal action for user i at time slot t can be described as a_i^t = 1 if i = argmax_j P_j^t, and 0 otherwise, where P_i^t ≜ U_i^t / V_i^t denotes the priority of user i at time t. U_i^t is an indicator [receive an indication] of the buffer emptiness [one or more parameters associated with the duration of the contention window], i.e., U_i^t = 0 if the buffer of user i at time t is empty, U_i^t = 1 if user i has packet to transmit at time t [allowed to use an output of the machine learning model]. V_i^t is the average throughput of user i over last second.")
transmit a protocol data unit in accordance with the distributed channel access procedure and during a slot that is based at least in part on the output of the machine learning model. (Guo, pg. 1592, Col. 1, A., paragraph 2: "At each time step, agent network i takes c_i^t as an input and feeds it to a gated recurrent unit (GRU) layer. Then, the output of GRU passes two fully-connected (FC) layers and outputs Q_i ≜ {Q_i(τ_i^t, Wait), Q_i(τ_i^t, Transmit)}. Action a_i^t is selected using ε-greedy algorithm and corresponding Q_i(τ_i^t, a_i^t) is fed to the mixing network. The rectified linear unit (ReLU) is used as the activation function of the first FC layer and a linear activation is adopted by the second FC layer. All agent networks can share the same parameters, however, the parameters of each agent network are different by default." And Guo, pg. 1590, IV. A.: "The action of agent i ∈ {1, 2, ..., n} at time slot t is defined as a_i^t ∈ A ≡ {Transmit, Wait}, where Transmit means that agent i transmits at current time slot, and Wait means that agent i waits for a time slot.")

However, Guo does not explicitly teach one or more processors; one or more memories coupled with one or more processors; and instructions stored in the one or more memories and executable by the one or more processors to cause the wireless communication device to:

Ganesan teaches one or more processors; (Ganesan, paragraph 0028: "The device may comprise one or more network interfaces 210 (e.g., wired, wireless, cellular, PLC, etc.), at least one processor 220, and a memory 240 interconnected by a system bus 250, as well as a power supply 260 (e.g., battery, plug-in, etc.).")

one or more memories coupled with one or more processors; and (Ganesan, paragraph 0028: "The device may comprise one or more network interfaces 210 (e.g., wired, wireless, cellular, PLC, etc.), at least one processor 220, and a memory 240 interconnected by a system bus 250, as well as a power supply 260 (e.g., battery, plug-in, etc.).")

instructions stored in the one or more memories and executable by the one or more processors to cause the wireless communication device to: (Ganesan, paragraph 0043: "computer executable instructions executed by the processor 220 (or independent processor of interfaces 210) to perform functions relating to the techniques described herein.")

Guo and Ganesan are related to the same field of endeavor (i.e., wireless networks). In view of the teachings of Ganesan, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Ganesan to Guo before the effective filing date of the claimed invention in order to perform the techniques needed for reinforcement learning with distributed channels. (Ganesan, paragraph 0043: "…as in accordance with the client emulation process 248, which may include computer executable instructions executed by the processor 220 (or independent processor of interfaces 210) to perform functions relating to the techniques described herein.")
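The agent network the rejection maps for the transmit limitation (observation into a GRU, then two FC layers producing Q-values over {Wait, Transmit}, with ε-greedy selection) can be sketched in a few lines. A minimal PyTorch sketch, with illustrative layer sizes and an assumed action labeling:

```python
# Minimal PyTorch sketch of the per-agent network quoted from Guo (pg. 1592):
# observation -> GRU -> FC (ReLU) -> FC (linear) -> Q-values over {Wait, Transmit},
# with epsilon-greedy action selection. Sizes are illustrative assumptions.
import random
import torch
import torch.nn as nn

class AgentNet(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 64, n_actions: int = 2):
        super().__init__()
        self.gru = nn.GRUCell(obs_dim, hidden)
        self.fc1 = nn.Linear(hidden, hidden)      # ReLU activation, per the quote
        self.fc2 = nn.Linear(hidden, n_actions)   # linear output layer

    def forward(self, obs: torch.Tensor, h: torch.Tensor):
        h = self.gru(obs, h)                      # recurrent hidden state carries history
        q = self.fc2(torch.relu(self.fc1(h)))
        return q, h

def epsilon_greedy(q: torch.Tensor, eps: float) -> int:
    """Pick a random action with probability eps, else the argmax-Q action."""
    if random.random() < eps:
        return random.randrange(q.shape[-1])      # explore; 0 = Wait, 1 = Transmit (assumed)
    return int(q.argmax(dim=-1).item())
```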
In regard to claim 8, Guo and Ganesan teach the apparatus of claim 1. Guo further teaches receive an indication of one or more parameters associated with a reinforcement learning technique that the wireless communication device is to follow when training or retraining the machine learning model, (Guo, pg. 1592, V.: "The centralized training is performed at AP side based on the experiences consisting of joint action observation history, joint action, global environment state and reward. An experience memory (EM) is used to store experience tuple (s, τ, a, r, s′, τ′) [parameters].")

wherein the reinforcement learning technique is associated with a Q learning technique, a policy gradient, an actor-critic technique, or a contextual multi-armed bandit (MAB) or context-less MAB technique, wherein the instructions are further executable by the one or more processors to cause the apparatus to: (Guo, pg. 1588: "In order to converge to a cooperative learning behavior of agents, we solve the DCA problem on the basis of a multi-agent reinforcement learning (MARL) method QMIX [12], which adopts a centralized training with decentralized execution (CTDE) framework. Specifically, the central unit makes use of the global environment information to train the neural network parameters, while each agent makes independent channel access decision only based on its local observations. We name the proposed MAC layer protocol as QMIX-advanced LBT (QLBT) [e.g., Q learning technique] as it complies with LBT protocol.")

train or retrain the machine learning model in accordance with the one or more parameters associated with the reinforcement learning technique. (Guo, pg. 1592: "The centralized training [train or retrain] is performed at AP side based on the experiences consisting of joint action observation history, joint action, global environment state and reward [parameters].")

In regard to claim 9 and analogous claim 22, Guo and Ganesan teach the apparatus of claim 1. Guo further teaches attempt to transmit the protocol data unit during one or more idle slots in accordance with the transmission probability, wherein the transmission probability is derived from the output of the machine learning model. (Guo, pg. 1589, II.A., paragraph 2: "To solve this problem, the action-value function, also known as Q-value, is defined as the expected cumulative discounted return from undertaking action a at state s"; and pg. 1590, Col. 2, paragraph 1: "For the aforementioned system model, our goal is to design a distributed channel access strategy that maximizes the aggregate network throughput while maintaining fairness among STAs. For these two optimization objectives, i.e., maximizing network throughput and ensuring fairness, we formulate two reward functions accordingly. One is a total reward function that encourages the successful transmission behavior and punishes the actions leading to collisions. The other is an individual reward function based on proportional-fair (PF) scheduler. The individual reward function ensures fairness by awarding the actions consistent with PF scheduler and penalizing the other inconsistent actions.")

In regard to claim 10 and analogous claim 23, Guo and Ganesan teach the apparatus of claim 1. Guo further teaches transmit the protocol data unit during the slot in accordance with an expiration of a backoff counter, wherein the backoff counter is derived from the output of the machine learning model. (Guo, pg. 1593, Col. 2, 3), paragraph 1: "The optimal contention window to achieve CSMA/CA performance bound can be obtained by solving equations (28) and (46) in [7] and we omit the tedious derivation due to limited space. The key parameters therein are set as follows. The holding time in collision state T_F equals to the packet length plus a DIFS. The maximum backoff stage K is set to 5 [expiration of a backoff counter], which aligns with AC_BE in Wi-Fi standard.")
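For context on the CSMA/CA baseline cited for the backoff limitation, the maximum backoff stage K = 5 caps how far the contention window doubles after repeated collisions. A hedged sketch of standard binary exponential backoff (CW_MIN is an assumed value; the claim itself derives the counter from a model output instead):

```python
# Binary exponential backoff with maximum backoff stage K = 5, matching the
# AC_BE setting the Office Action quotes from Guo. CW_MIN is an assumption.
import random

CW_MIN = 16
K_MAX = 5  # maximum backoff stage

def draw_backoff_counter(stage: int) -> int:
    """Draw a backoff counter uniformly from [0, CW - 1] for the given stage."""
    cw = CW_MIN * (2 ** min(stage, K_MAX))  # window doubles per stage, capped at K_MAX
    return random.randrange(cw)

# The stage increments after each collision and resets after a success.
stage = 0
for outcome in ["collision", "collision", "success"]:
    counter = draw_backoff_counter(stage)
    stage = 0 if outcome == "success" else min(stage + 1, K_MAX)
```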
In regard to claim 11 and analogous claim 24, Guo and Ganesan teach the apparatus of claim 1. Guo further teaches transmit the protocol data unit during the contention window, wherein a duration of the contention window is derived from the output of the machine learning model. (Guo, pg. 1593, Col. 2, 3), paragraph 1: "The optimal contention window [the contention window] to achieve CSMA/CA performance bound can be obtained by solving equations [output] (28) and (46) in [7] and we omit the tedious derivation due to limited space. The key parameters therein are set as follows. The holding time in collision state T_F equals to the packet length plus a DIFS. The maximum backoff stage K is set to 5, which aligns with AC_BE in Wi-Fi standard.")

In regard to claim 12 and analogous claim 25, Guo and Ganesan teach the apparatus of claim 1. Guo further teaches derive one or more rewards associated with the machine learning model in accordance with one or more of a signal-to-interference-plus-noise ratio, a throughput metric, a delay metric, a quantity of collisions, a ratio between a quantity of successful protocol data unit transmissions and a total quantity of protocol data unit transmissions, or a ratio between a quantity of unsuccessful protocol data unit transmissions and the total quantity of protocol data unit transmissions, wherein the output of the machine learning model is based at least in part on the one or more rewards. (Guo, pg. 1591, 2nd column: "Note that the total reward encourages the behavior of successful transmission by assigning positive rewards and punishes the behavior of collision by assigning negative rewards. Moreover, it also guarantees that a larger reward will be received if an agent with larger D2LT transmits successfully, which is consistent with the intuition.")

In regard to claim 13, Guo and Ganesan teach the apparatus of claim 1. Guo further teaches transmit a second protocol data unit in accordance with the distributed channel access procedure and during a second slot that is based at least in part on the updated output of the machine learning model. (Guo, pg. 1592, Col. 1, 2): "In this way, the mixing network is a two-layer feedforward neural network, which combines agent network outputs and generates [Q_ind, Q_tot]. The exponential linear unit (ELU) is used as the activation function of the first hidden layer and a linear activation is adopted by the output layer.")

obtain an updated state associated with the wireless communication device in accordance with transmitting the protocol data unit; (Guo, pg. 1592, A.: "1) Agent Network: At each time step, agent network i takes c_i^t as an input and feeds it to a gated recurrent unit (GRU) layer. Then, the output of GRU passes two fully-connected (FC) layers and outputs Q_i ≜ {Q_i(τ_i^t, Wait), Q_i(τ_i^t, Transmit)}. Action a_i^t is selected using ε-greedy algorithm and corresponding Q_i(τ_i^t, a_i^t) is fed to the mixing network.")

input, into the machine learning model, the updated state to obtain an updated output of the machine learning model; and (Guo, pg. 1592, A.: "2) Mixing Network: The mixing network of original QMIX algorithm takes the output of each agent network Q_i(τ_i^t, a_i^t), i ∈ {1, 2, ..., n} and global environment state s^t as inputs to generate Q_tot(τ^t, a^t, s^t) based on parameter θ.")
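The quoted reward behavior (positive reward for a successful transmission, scaled up for agents with larger D2LT, negative reward for collisions) can be illustrated with a toy function. This is not Guo's exact formula, only a sketch consistent with the quoted description:

```python
# Toy reward consistent with the behavior quoted from Guo (pg. 1591): reward
# success, punish collisions, and give a larger reward when an agent with a
# larger D2LT (delay since its last successful transmission) succeeds.
def total_reward(success: bool, collision: bool, d2lt: float) -> float:
    if success:
        return d2lt      # larger D2LT -> larger positive reward (illustrative scaling)
    if collision:
        return -1.0      # punish actions leading to collisions
    return 0.0           # idle slot: no reward
```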
In regard to claim 14 and analogous claim 27, Guo teaches A wireless communication device, comprising: (Guo, pg. 1590, III., paragraph 1: "As shown in Fig. 1, we consider a time-slotted wireless network where n stations (STAs) seek channel access opportunity to transmit data packets to their associated Access Point (AP) [wireless communication device]") [...]

transmit information associated with a machine learning model, wherein the machine learning model is associated with performing a distributed channel access procedure at a second wireless communication device in a wireless local area network in accordance with the information; and (Guo, pg. 1592, Col. 1, A., paragraph 2: "At each time step, agent network i takes c_i^t as an input and feeds it to a gated recurrent unit (GRU) layer. Then, the output of GRU passes two fully-connected (FC) layers and outputs Q_i ≜ {Q_i(τ_i^t, Wait), Q_i(τ_i^t, Transmit)}. Action a_i^t is selected using ε-greedy algorithm and corresponding Q_i(τ_i^t, a_i^t) is fed to the mixing network. The rectified linear unit (ReLU) is used as the activation function of the first FC layer and a linear activation is adopted by the second FC layer. All agent networks can share the same parameters, however, the parameters of each agent network are different by default." And Abstract: "Therefore, we propose a new MAC protocol, QMIX-advanced Listen-Before-Talk (QLBT), based on the cutting-edge multi-agent reinforcement learning (MARL) algorithm. It employs a centralized training with decentralized execution (CTDE) framework to exploit the overall information of all agents during training, and ensure that each agent can independently infer the optimal channel access behavior [distributed channel access procedure] based on its local observation." And the first paragraph of Section V, which teaches: "Based on the Dec-POMDP formulation introduced above, we propose a CTDE QLBT algorithm to find the optimal channel access policy. The framework of the proposed algorithm is shown in Fig. 2a. The centralized training is performed at AP side based on the experiences consisting of joint action-observation history, joint action, global environment state, and reward." And Fig. 2, which teaches: "(a) CTDE framework: Centralized training is performed at AP side based on experiences reported by each STA. After training, AP sends out the agent network parameters [information] to corresponding STA [wireless communication device].")

transmit an indication that the second wireless communication device is allowed to use an output of the machine learning model to derive one or more of a transmission probability, a backoff counter, a duration of a contention window, or one or more parameters associated with the duration of the contention window; (Guo, pg. 1591, Col. 2, paragraph 2: "In this sense, the optimal action for user i at time slot t can be described as a_i^t = 1 if i = argmax_j P_j^t, and 0 otherwise, where P_i^t ≜ U_i^t / V_i^t denotes the priority of user i at time t. U_i^t is an indicator of the buffer emptiness, i.e., U_i^t = 0 if the buffer of user i at time t is empty, U_i^t = 1 if user i has packet to transmit at time t [allowed to use an output of the machine learning model]. V_i^t is the average throughput of user i over last second.")
receive, from the second wireless communication device, a protocol data unit in accordance with the distributed channel access procedure and during a slot that is based at least in part on the output of the machine learning model. (Guo, pg. 1590, III., paragraph 1: "As shown in Fig. 1, we consider a time-slotted wireless network where n stations (STAs) [second wireless communication device] seek channel access opportunity to transmit data packets to their associated Access Point (AP)… A transmission once starts will last for multiple time slots, the number of which is denoted as packet length. A transmission is successful if and only if there is only one STA transmitting during the whole packet length." And Guo, pg. 1590, IV. A.: "The action of agent i ∈ {1, 2, ..., n} at time slot t is defined as a_i^t ∈ A ≡ {Transmit, Wait}, where Transmit means that agent i transmits at current time slot, and Wait means that agent i waits for a time slot.")

However, Guo does not specifically teach one or more processors; one or more memories coupled with the one or more processors; and instructions stored in the one or more memories and executable by the one or more processors to cause the apparatus to:

Ganesan teaches one or more processors; (Ganesan, paragraph 0028: "The device may comprise one or more network interfaces 210 (e.g., wired, wireless, cellular, PLC, etc.), at least one processor 220, and a memory 240 interconnected by a system bus 250, as well as a power supply 260 (e.g., battery, plug-in, etc.).")

one or more memories coupled with the one or more processors; and (Ganesan, paragraph 0028: "The device may comprise one or more network interfaces 210 (e.g., wired, wireless, cellular, PLC, etc.), at least one processor 220, and a memory 240 interconnected by a system bus 250, as well as a power supply 260 (e.g., battery, plug-in, etc.).")

instructions stored in the one or more memories and executable by the one or more processors to cause the apparatus to: (Ganesan, paragraph 0043: "computer executable instructions executed by the processor 220 (or independent processor of interfaces 210) to perform functions relating to the techniques described herein.")

Guo and Ganesan are combinable for the same rationale as set forth above with respect to claim 1.

In regard to claim 15 and analogous claim 28, Guo and Ganesan teach the apparatus of claim 14. Guo further teaches transmit an indication that the second wireless communication device is allowed to develop the machine learning model and use the machine learning model for the distributed channel access procedure. (Guo, pg. 1592, A.: "2) Mixing Network: The mixing network of original QMIX algorithm takes the output of each agent network Q_i(τ_i^t, a_i^t), i ∈ {1, 2, ..., n} and global environment state s^t as inputs to generate Q_tot(τ^t, a^t, s^t) based on parameter θ.")

In regard to claim 17 and analogous claim 30, Guo and Ganesan teach the apparatus of claim 14. Guo further teaches transmit an indication that the second wireless communication device is allowed to retrain the machine learning model for the distributed channel access procedure, wherein the machine learning model is pre-loaded at the second wireless communication device. (Guo, pg. 1595, Col. 2, paragraph 1: "For instance there are two stations in the network at the beginning of the simulation and the agent network parameters are initialized using a pre-trained 2-agent QLBT model [pre-loaded at the second wireless communication device]. At 10s, another station joins the network. In this case, the agent network parameters are reinitialized using a pre-trained 3-agent QLBT model [allowed to retrain the machine learning model]. In practical implementation, AP is aware of the number of connected stations, so the agent network parameter update can be achieved via beacon frames carrying such information.")
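The deployment pattern quoted for claim 17 (the AP holds pre-trained n-agent QLBT parameter sets and pushes the matching one when the station count changes, e.g., via beacon frames) reduces to a lookup keyed by station count. A sketch with hypothetical file names:

```python
# Sketch of the pre-trained model swap quoted from Guo (pg. 1595): when the
# station count changes, the AP reinitializes the agents from the matching
# pre-trained QLBT parameter set. File names below are hypothetical.
PRETRAINED = {2: "qlbt_2agent.pt", 3: "qlbt_3agent.pt"}

def params_for_station_count(n_stations: int) -> str:
    """Return the parameter set the AP should broadcast (e.g., in beacon frames)."""
    if n_stations not in PRETRAINED:
        raise ValueError(f"no pre-trained {n_stations}-agent QLBT model available")
    return PRETRAINED[n_stations]

# At 10 s a third station joins, so the AP switches to the 3-agent model.
print(params_for_station_count(3))  # -> "qlbt_3agent.pt"
```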
In regard to claim 18, Guo and Ganesan teach the apparatus of claim 14. Guo further teaches transmit an indication of one or more parameters associated with a reinforcement learning technique that the second wireless communication device is to follow when training or retraining the machine learning model, (Guo, pg. 1592, V.: "The centralized training is performed at AP side based on the experiences consisting of joint action observation history, joint action, global environment state and reward. An experience memory (EM) is used to store experience tuple (s, τ, a, r, s′, τ′) [parameters].")

wherein the reinforcement learning technique is associated with a Q learning technique, a policy gradient, an actor-critic technique, or a contextual multi-armed bandit (MAB) or context-less MAB technique. (Guo, pg. 1588: "In order to converge to a cooperative learning behavior of agents, we solve the DCA problem on the basis of a multi-agent reinforcement learning (MARL) method QMIX [12], which adopts a centralized training with decentralized execution (CTDE) framework. Specifically, the central unit makes use of the global environment information to train the neural network parameters, while each agent makes independent channel access decision only based on its local observations. We name the proposed MAC layer protocol as QMIX-advanced LBT (QLBT) [e.g., Q learning technique] as it complies with LBT protocol.")

In regard to claim 19, Guo and Ganesan teach the apparatus of claim 14. Guo further teaches transmit an indication of one or more rewards associated with the machine learning model, wherein the one or more rewards include one or more of a signal-to-interference-plus-noise ratio, a throughput metric, a delay metric, a quantity of collisions, a ratio between a quantity of successful protocol data unit transmissions and a total quantity of protocol data unit transmissions, or a ratio between a quantity of unsuccessful protocol data unit transmissions and the total quantity of protocol data unit transmissions, and wherein the output of the machine learning model is based at least in part on the one or more rewards. (Guo, pg. 1591, 2nd column: "Note that the total reward encourages the behavior of successful transmission by assigning positive rewards and punishes the behavior of collision by assigning negative rewards. Moreover, it also guarantees that a larger reward will be received if an agent with larger D2LT transmits successfully, which is consistent with the intuition.")

In regard to claim 20 and analogous claim 26, Guo and Ganesan teach the apparatus of claim 14. Guo further teaches transmit an indication of one or more parameters associated with an environment of the second wireless communication device, wherein the output of the machine learning model is associated with a use of the one or more parameters as inputs into the machine learning model. (Guo, pg. 1592, A.: "1) Agent Network: At each time step, agent network i takes c_i^t as an input and feeds it to a gated recurrent unit (GRU) layer. Then, the output of GRU passes two fully-connected (FC) layers and outputs Q_i ≜ {Q_i(τ_i^t, Wait), Q_i(τ_i^t, Transmit)}. Action a_i^t is selected using ε-greedy algorithm and corresponding Q_i(τ_i^t, a_i^t) is fed to the mixing network. The rectified linear unit (ReLU) is used as the activation function of the first FC layer and a linear activation is adopted by the second FC layer.")
Claims 2-3 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Guo, in view of Ganesan, and in further view of Wang et al. (Deep Reinforcement Learning for Dynamic Multichannel Access in Wireless Networks, "Wang") [June 2018].

In regard to claim 2, Guo and Ganesan teach the apparatus of claim 1. However, Guo and Ganesan do not explicitly teach receive an indication that the wireless communication device is allowed to develop the machine learning model and use the machine learning model for the distributed channel access procedure.

Wang teaches receive an indication that the wireless communication device is allowed to develop the machine learning model and use the machine learning model for the distributed channel access procedure. (Wang, pg. 263, Col. 1, paragraph 2: "The sender can detect [indication] the mismatch event if no ACK/NAK is received. Once the mismatch happens, the sender stops updating its DQN model [use the machine learning model] as well as training dataset and transmits data in the future using one single good channel - or a small set of channels known so far to have better channel conditions [for the distributed channel access procedure] [26]. Along with the data messages, the sender also sends the timestamp when the channel mismatch was perceived. The sender keeps sending this channel mismatch time information until an ACK being received, which indicates the receiver is on the same channel again and receives the channel mismatch information.")

Guo, Ganesan and Wang are related to the same field of endeavor (i.e., wireless networks). In view of the teachings of Wang, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Wang to Guo and Ganesan before the effective filing date of the claimed invention in order to allow for optimal performance with the reinforcement model. (Wang, abstract: "We then compare the performance of DQN with a Myopic policy and a Whittle Index-based heuristic through both more general simulations as well as real data trace and show that DQN achieves near-optimal performance in more complex situations.")

In regard to claim 3, Guo and Ganesan teach the apparatus of claim 1. However, Guo and Ganesan do not explicitly teach develop the machine learning model at the wireless communication device in accordance with receiving the indication that the wireless communication device is allowed to develop the machine learning model.

Wang teaches develop the machine learning model at the wireless communication device in accordance with receiving the indication that the wireless communication device is allowed to develop the machine learning model. (Wang, pg. 263, Col. 1, paragraph 2: "The sender can detect the mismatch event if no ACK/NAK is received. Once the mismatch happens, the sender stops updating its DQN model as well as training dataset and transmits data in the future using one single good channel - or a small set of channels known so far to have better channel conditions [26] [receiving the indication that the wireless communication device is allowed]. Along with the data messages, the sender also sends the timestamp when the channel mismatch was perceived. The sender keeps sending this channel mismatch time information until an ACK being received, which indicates the receiver is on the same channel again and receives the channel mismatch information.")

Guo, Ganesan and Wang are combinable for the same rationale as set forth above with respect to claim 2.
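The Wang passage quoted for claims 2-3 describes a concrete fallback: on a detected channel mismatch (no ACK/NAK), the sender freezes DQN updates and transmits on a known-good channel until an ACK indicates resynchronization. A hedged sketch of that control flow, with assumed class and attribute names:

```python
# Sketch of the mismatch fallback quoted from Wang (pg. 263): freeze DQN
# training on mismatch and fall back to a known-good channel until an ACK
# arrives. Names are assumptions for illustration, not Wang's code.
class Sender:
    def __init__(self, best_known_channel: int):
        self.training_frozen = False
        self.fallback_channel = best_known_channel

    def choose_channel(self, ack_or_nak_received: bool, dqn_choice: int) -> int:
        if not ack_or_nak_received:
            self.training_frozen = True    # mismatch detected: stop updating the DQN
        elif self.training_frozen:
            self.training_frozen = False   # receiver resynchronized: resume learning
        return self.fallback_channel if self.training_frozen else dqn_choice
```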
In regard to claim 7, Guo and Ganesan teach the apparatus of claim 1. Guo further teaches train or retrain the machine learning model in accordance with the policy. (Guo, pg. 1592: "The centralized training is performed at AP side based on the experiences consisting of joint action observation history, joint action, global environment state and reward [policy].")

However, Guo does not explicitly teach receive an indication of a policy associated with training or retraining the machine learning model, wherein the instructions are further executable by the one or more processors to cause the wireless communication device to:

Wang teaches receive an indication of a policy associated with training or retraining the machine learning model, wherein the instructions are further executable by the one or more processors to cause the wireless communication device to: (Wang, pg. 259, Col. 2, A.: "It is shown that the Myopic policy is optimal for any number of channels when the channel state transitions are positively correlated, i.e., p11 ≥ p01.")

Guo, Ganesan and Wang are combinable for the same rationale as set forth above with respect to claim 2.

Claims 4-5, 16 and 29 are rejected under 35 U.S.C. 103 as being unpatentable over Guo, in view of Ganesan, and in further view of Zhang et al. (CFR-RL: Traffic Engineering With Reinforcement Learning in SDN, "Zhang").

In regard to claim 4, Guo and Ganesan teach the apparatus of claim 1. Guo and Ganesan do not explicitly teach receive a configuration associated with the machine learning model and an indication of whether the wireless communication device is allowed to retrain the machine learning model for the distributed channel access procedure, wherein the instructions are further executable by the one or more processors to cause the wireless communication device to: selectively retrain the machine learning model based at least in part on whether the wireless communication device is allowed to retrain the machine learning model.

Zhang teaches receive a configuration associated with the machine learning model and an indication of whether the wireless communication device is allowed to retrain the machine learning model for the distributed channel access procedure, wherein the instructions are further executable by the one or more processors to cause the wireless communication device to: (Zhang, pg. 2258, Col. 1, C., paragraph 1: "This self-learning technique will enable CFR-RL to further adapt itself to the dynamic conditions in the network after being deployed in real networks. CFR-RL can be retrained by including new traffic matrices [allowed to retrain the machine learning model]. For example, the outlier traffic matrices (e.g., the 235th-240th traffic matrices in Day 2) presented in Fig. 10 should be included for retraining, while the generalization results shown in Section VI-B.3 suggest that retraining frequently might not be necessary. Techniques to determine when to retrain and which new/old traffic matrix should be included/excluded in/from the training dataset should be further investigated.")

selectively retrain the machine learning model based at least in part on whether the wireless communication device is allowed to retrain the machine learning model. (Zhang, pg. 2258, Col. 1, C., paragraph 1: "This self-learning technique will enable CFR-RL to further adapt itself to the dynamic conditions in the network after being deployed in real networks. CFR-RL can be retrained by including new traffic matrices. For example, the outlier traffic matrices (e.g., the 235th-240th traffic matrices in Day 2) presented in Fig. 10 should be included for retraining, while the generalization results shown in Section VI-B.3 suggest that retraining frequently might not be necessary [selectively retrain the machine learning model]. Techniques to determine when to retrain and which new/old traffic matrix should be included/excluded in/from the training dataset should be further investigated.")

Guo, Ganesan and Zhang are related to the same field of endeavor (i.e., wireless networks). In view of the teachings of Zhang, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Zhang to Guo and Ganesan before the effective filing date of the claimed invention in order to be able to determine if retraining the model is necessary. (Zhang, pg. 2258, Col. 1: "Techniques to determine when to retrain and which new/old traffic matrix should be included/excluded in/from the training dataset should be further investigated.")
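The selective-retraining behavior mapped to claim 4 (retrain only when permitted and, per the quoted Zhang passage, only when new outlier data makes it worthwhile) can be summarized as a gate. A minimal sketch; the outlier test and training call are hypothetical placeholders:

```python
# Gate sketch for the selective retraining mapped to claim 4: retrain only if
# allowed, and only when outlier data appears (Zhang notes frequent retraining
# may be unnecessary). `is_outlier` and `model.fit` are hypothetical hooks.
def maybe_retrain(model, new_samples, retrain_allowed: bool, is_outlier) -> bool:
    """Return True if the model was retrained."""
    if not retrain_allowed:
        return False                      # indication says retraining is not permitted
    outliers = [s for s in new_samples if is_outlier(s)]
    if not outliers:
        return False                      # nothing new enough to justify retraining
    model.fit(outliers)                   # hypothetical training call
    return True
```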
In regard to claim 5, Guo and Ganesan teach the apparatus of claim 1. Guo and Ganesan do not explicitly teach receive an indication that the wireless communication device is allowed to retrain the machine learning model for the distributed channel access procedure, wherein the machine learning model is pre-loaded at the wireless communication device, wherein the instructions are further executable by the one or more processors to cause the wireless communication device to: retrain the machine learning model based at least in part on the wireless communication device being allowed to retrain the machine learning model.

Zhang teaches receive an indication that the wireless communication device is allowed to retrain the machine learning model for the distributed channel access procedure, wherein the machine learning model is pre-loaded at the wireless communication device, wherein the instructions are further executable by the one or more processors to cause the wireless communication device to: (Zhang, pg. 2258, Col. 1, paragraph 1: "This self-learning technique will enable CFR-RL to further adapt itself to the dynamic conditions in the network after being deployed in real networks. CFR-RL can be retrained by including new traffic matrices [allowed to retrain the machine learning model]. For example, the outlier traffic matrices (e.g., the 235th-240th traffic matrices in Day 2) presented in Fig. 10 should be included for retraining, while the generalization results shown in Section VI-B.3 suggest that retraining frequently might not be necessary. Techniques to determine when to retrain and which new/old traffic matrix should be included/excluded in/from the training dataset should be further investigated.")

retrain the machine learning model based at least in part on the wireless communication device being allowed to retrain the machine learning model. (Zhang, pg. 2258, Col. 1, paragraph 1: "This self-learning technique will enable CFR-RL to further adapt itself to the dynamic conditions in the network after being deployed in real networks. CFR-RL can be retrained by including new traffic matrices. For example, the outlier traffic matrices (e.g., the 235th-240th traffic matrices in Day 2) presented in Fig. 10 should be included for retraining, while the generalization results [retrain the machine learning model] shown in Section VI-B.3 suggest that retraining frequently might not be necessary. Techniques to determine when to retrain and which new/old traffic matrix should be included/excluded in/from the training dataset should be further investigated.")

Guo, Ganesan and Zhang are combinable for the same rationale as set forth above with respect to claim 4.
In regard to claim 16 and analogous claim 29, Guo and Ganesan teach the apparatus of claim 14. Guo and Ganesan do not explicitly teach transmit information associated with the machine learning model and an indication of whether the second wireless communication device is allowed to retrain the machine learning model for the distributed channel access procedure.

Zhang teaches transmit information associated with the machine learning model and an indication of whether the second wireless communication device is allowed to retrain the machine learning model for the distributed channel access procedure. (Zhang, pg. 2258, Col. 1, paragraph 1: "This self-learning technique will enable CFR-RL to further adapt itself to the dynamic conditions in the network after being deployed in real networks. CFR-RL can be retrained by including new traffic matrices [allowed to retrain the machine learning model for the distributed channel access procedure]. For example, the outlier traffic matrices (e.g., the 235th-240th traffic matrices in Day 2) presented in Fig. 10 should be included for retraining, while the generalization results shown in Section VI-B.3 suggest that retraining frequently might not be necessary. Techniques to determine when to retrain and which new/old traffic matrix should be included/excluded in/from the training dataset should be further investigated.")

Guo, Ganesan and Zhang are combinable for the same rationale as set forth above with respect to claim 4.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SKYLAR K VANWORMER, whose telephone number is (703) 756-1571. The examiner can normally be reached M-F, 6:00 am to 3:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Usmaan Saeed, can be reached at (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/S.K.V./ Examiner, Art Unit 2146
/USMAAN SAEED/ Supervisory Patent Examiner, Art Unit 2146

Prosecution Timeline

Oct 19, 2022: Application Filed
Sep 05, 2025: Non-Final Rejection (§103)
Dec 11, 2025: Response Filed
Mar 18, 2026: Final Rejection (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology:

Patent 12591789: Knowledge distillation in multi-arm bandit, neural network models for real-time online optimization. Granted Mar 31, 2026 (2y 5m to grant).
Patent 12541680: REDUCED COMPUTATION REAL TIME RECURRENT LEARNING. Granted Feb 03, 2026 (2y 5m to grant).
Patent 12524655: ARTIFICIAL NEURAL NETWORK PROCESSING METHODS AND SYSTEM. Granted Jan 13, 2026 (2y 5m to grant).
Patent 12511554: Complex System for End-to-End Causal Inference. Granted Dec 30, 2025 (2y 5m to grant).
Patent 12505358: Methods and Systems for Approximating Embeddings of Out-Of-Knowledge-Graph Entities for Link Prediction in Knowledge Graph. Granted Dec 23, 2025 (2y 5m to grant).

Study what changed in these cases to get past this examiner. Based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 39%
With Interview: 62% (+22.5%)
Median Time to Grant: 4y 4m
PTA Risk: Moderate

Based on 28 resolved cases by this examiner. Grant probability is derived from the career allow rate.
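Worked out, the derivation appears to be: 11 granted ÷ 28 resolved ≈ 39.3%, shown as 39%; adding the +22.5-point interview lift gives ≈ 61.8%, shown as 62%.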
