DETAILED ACTION
This Office Action is in response to Applicant's Response filed on 01/08/2026 for the above-identified application.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/08/2026 has been entered.
Response to Amendment
The amendment filed on 01/08/2026 has been entered.
Claims 1, 2, 10-12, and 20 are amended. Claims 9 and 19 are canceled. Claims 1-7, 10-17, and 20 are pending in the application.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-7, 10-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (“A Survey of Deep Reinforcement Learning in Recommender Systems: A Systematic Review and Future Directions” hereinafter Chen) in view of Sengupta et al. (US 2021/0056651 A1 hereinafter Sengupta), further in view of Bambha et al. (US 2024/0273575 A1 hereinafter Bambha).
Regarding Claim 1, Chen teaches a computer-implemented method (page 2, para. 1 - Deep Reinforcement Learning (DRL) based recommendation systems), the computer-implemented method comprising:
for each reference user of a plurality of reference users of an online service, computing an action embedding based on current impression interaction data of the reference user (page 2, section 2.1, para. 1 - recommender systems (i.e., online service) require coping with dynamic environments by estimating rapidly changing users’ preferences and proactively recommending items to users; let U be a set of users of cardinality |U| (i.e., plurality of reference users of the online service) and I be a set of items of cardinality |I|; for each user u ∈ U (i.e., each reference user), we observe a sequence of user actions X^u = [x^u_1, x^u_2, ..., x^u_{T_u}] with item x^u_t ∈ I, i.e., each event in a user sequence comes from the item set; we refer to a user making a decision as an interaction with an item; suppose the feedback (e.g., ratings or clicking behavior) provided by users is F (i.e., action embedding based on current impression interaction data of the reference user); then a dynamic recommender system maintains the corresponding recommendation policy π^u_t, which will be updated systematically based on the feedback f^u_i ∈ F received during the interaction for item i ∈ I at timestamp t; page 6, section 2.3, para. 1 - given a set of users U = {u, u1, u2, u3, ...} and a set of items I = {i, i1, i2, i3, ...}, the system first recommends item i to user u and then gets feedback f^u_i; page 23, section 5.4, para. 2 - for instance, users in a video recommender system watch, rate, and comment on those movies that they are interested in), the current impression interaction data indicating a reference item that has been selected by a recommendation model at a current time step for display to the reference user (pages 1 & 2, section 1 - deep reinforcement learning (DRL) aims to train an agent that can learn from interaction trajectories provided by the environment by combining the power of deep learning and reinforcement learning; an agent in DRL can actively learn from users’ real-time feedback to infer dynamic user preferences; page 2, section 2.1, para. 1 - recommender systems (equivalent to the recommendation model) require coping with dynamic environments by estimating rapidly changing users’ preferences and proactively recommending items to users; for each user u ∈ U, we observe a sequence of user actions X^u = [x^u_1, x^u_2, ..., x^u_{T_u}] with item x^u_t ∈ I, i.e., each event in a user sequence comes from the item set; we refer to a user making a decision as an interaction with an item; suppose the feedback (e.g., ratings or clicking behavior) provided by users is F; page 6, section 2.3, para. 1 - given a set of users U = {u, u1, u2, u3, ...} and a set of items I = {i, i1, i2, i3, ...}, the system first recommends item i (i.e., reference item that has been selected by the recommendation model at a current time step for display to the reference user) to user u and then gets feedback f^u_i (i.e., current impression interaction); page 6, section 3 - DRL-based recommendation includes model-based and model-free methods), the selected reference item having been displayed using a selectable user interface element (page 2, section 2.1, para. 1 - for each user u ∈ U, we observe a sequence of user actions X^u = [x^u_1, x^u_2, ..., x^u_{T_u}] with item x^u_t ∈ I, i.e., each event in a user sequence comes from the item set; we refer to a user making a decision as an interaction with an item; suppose the feedback (e.g., ratings or clicking behavior) provided by users is F (i.e., the selected reference item is displayed using a selectable user interface element for clicking/rating); page 6, section 2.3, para. 1 - given a set of users U = {u, u1, u2, u3, ...} and a set of items I = {i, i1, i2, i3, ...}, the system first recommends item i to user u and then gets feedback f^u_i);
training a recommendation model using deep reinforcement learning and a Markov decision process (pages 1 & 2, section 1 - deep reinforcement learning (DRL) aims to train an agent that can learn from interaction trajectories provided by the environment by combining the power of deep learning and reinforcement learning; an agent in DRL (equivalent to the recommendation model) can actively learn from users’ real-time feedback to infer dynamic user preferences; page 3, section 2.2, para. 1 - DRL uses deep learning to approximate reinforcement learning’s value function and solve high-dimensional Markov Decision Processes (MDPs); an MDP can be represented as a tuple (S, A, P, R, γ); the agent chooses an action a_t ∈ A according to the policy π_t(s_t) at state s_t ∈ S; the environment receives the action, produces a reward r_{t+1} ∈ R, and transitions to the next state s_{t+1} according to the transition probability P(s_{t+1}|s_t, a_t) ∈ P), the Markov decision process having an action space and a reward function (page 3, section 2.2, para. 1 - an MDP can be represented as a tuple (S, A, P, R, γ); the agent chooses an action a_t ∈ A (i.e., action space) according to the policy π_t(s_t) at state s_t ∈ S; the environment receives the action, produces a reward r_{t+1} ∈ R (i.e., reward function), and transitions to the next state s_{t+1} according to the transition probability P(s_{t+1}|s_t, a_t) ∈ P), the action space including action embeddings of the plurality of reference users (page 6, section 2.3, para. 1 - DRL is normally formulated as a Markov Decision Process (MDP); given a set of users U = {u, u1, u2, u3, ...} and a set of items I = {i, i1, i2, i3, ...}, the system first recommends item i to user u and then gets feedback f^u_i; the system aims to incorporate the feedback to improve future recommendations and needs to determine an optimal policy π* regarding which item to recommend to the user to achieve positive feedback; the MDP modelling of the problem treats the user as the environment and the system as the agent; Action A: an action a_t ∈ A represents users’ dynamic preference at time t as predicted by the agent; A represents the whole set of (potentially millions of) candidate items - thus, including action embeddings of the reference users), the reward function configured to issue a first reward based on the current impression interaction data indicating that the reference user selected the selectable user interface element displayed at the current time step (page 6, section 2.3, para. 1 - DRL is normally formulated as a Markov Decision Process (MDP); given a set of users U = {u, u1, u2, u3, ...} and a set of items I = {i, i1, i2, i3, ...}, the system first recommends item i to user u and then gets feedback f^u_i; the system aims to incorporate the feedback to improve future recommendations and needs to determine an optimal policy π* regarding which item to recommend to the user to achieve positive feedback; the MDP modelling of the problem treats the user as the environment and the system as the agent; Reward R: once the agent chooses a suitable action a_t based on the current state S_t at time t, the user will receive the item recommended by the agent; users’ feedback on the recommended item accounts for the reward r(S_t, a_t) (i.e., first reward based on the current impression interaction data)), the reward function also configured to issue a long-term reward that is different from the first reward (page 11, section 3.2, para. 2 - Deep Q-learning and its variants are typical value-based DRL methods widely used in DRL-based RS, utilizing Deep Q-Networks (DQN) in RS; page 11, section 3.2 - most studies do not consider users’ long-term engagement in the state representation, as they focus on the immediate reward; FeedRec is proposed, which combines both instant feedback and delayed feedback into the model to represent the long-term reward and optimize long-term engagement by using DQN), the long-term reward based on a measurement of engagement of the reference user with the online service (page 23, section 5.4, para. 2 - for instance, users in a video recommender system watch, rate, and comment on those movies that they are interested in; pages 1 & 2, section 1 - an agent in DRL can actively learn from users’ real-time feedback to infer dynamic user preferences; page 2, section 2.1, para. 1 - recommender systems estimate rapidly changing users’ preferences and proactively recommend items to users; page 11 - combine both instant feedback and delayed feedback (i.e., measurement of engagement) into the model to represent the long-term reward and optimize long-term engagement by using DQN; a time-LSTM is employed to track users’ hierarchical behavior over time (i.e., measurement of engagement) to represent the delayed feedback; page 12, para. 3 - decompose the slate Q-value to estimate a long-term value for individual items - thus, the long-term reward is based on user behavior over time/measurement of engagement of the user with the recommender system/online service);
performing a function of the online service using the trained recommendation model (pages 1 & 2, section 1 - deep reinforcement learning (DRL) aims to train an agent that can learn from interaction trajectories provided by the environment by combining the power of deep learning and reinforcement learning; an agent in DRL (equivalent to the trained recommendation model) can actively learn from users’ real-time feedback to infer dynamic user preferences; page 2, section 2.1, para. 1 - recommender systems require coping with dynamic environments by estimating rapidly changing users’ preferences and proactively recommending items to users; for each user u ∈ U, we observe a sequence of user actions X^u = [x^u_1, x^u_2, ..., x^u_{T_u}] with item x^u_t ∈ I, i.e., each event in a user sequence comes from the item set; we refer to a user making a decision as an interaction with an item; suppose the feedback (e.g., ratings or clicking behavior) provided by users is F; page 6, section 2.3, para. 1 - DRL is normally formulated as a Markov Decision Process (MDP); the MDP modelling of the problem treats the user as the environment and the system as the agent; given a set of users U = {u, u1, u2, u3, ...} and a set of items I = {i, i1, i2, i3, ...}, the system first recommends item i to user u and then gets feedback f^u_i; the system aims to incorporate the feedback to improve future recommendations and needs to determine an optimal policy π* regarding which item to recommend to the user to achieve positive feedback - thus, the trained agent/model provides improved future recommendations (i.e., performing a function of the online service) to the user); and
displaying a selectable user interface element based on an output of the function on a computing device (pages 1 & 2, section 1 - deep reinforcement learning (DRL) aims to train an agent that can learn from interaction trajectories provided by the environment by combining the power of deep learning and reinforcement learning; an agent in DRL can actively learn from users’ real-time feedback to infer dynamic user preferences; page 2, section 2.1, para. 1 - recommender systems require coping with dynamic environments by estimating rapidly changing users’ preferences and proactively recommending items to users; for each user u ∈ U, we observe a sequence of user actions X^u with item x^u_t ∈ I; we refer to a user making a decision as an interaction with an item; suppose the feedback (e.g., ratings or clicking behavior) provided by users is F; then a dynamic recommender system maintains the corresponding recommendation policy π^u_t, which will be updated systematically based on the feedback f^u_i ∈ F received during the interaction for item i ∈ I at timestamp t; page 6, section 2.3, para. 1 - DRL is normally formulated as a Markov Decision Process (MDP); the MDP modelling of the problem treats the user as the environment and the system as the agent; given a set of users U = {u, u1, u2, u3, ...} and a set of items I = {i, i1, i2, i3, ...}, the system first recommends item i to user u and then gets feedback f^u_i (i.e., displaying a selectable user interface element based on an output of the function on a computing device); the system aims to incorporate the feedback to improve future recommendations and needs to determine an optimal policy π* regarding which item to recommend to the user to achieve positive feedback).
However, Chen fails to expressly teach that the method is performed by a computer system having a memory and at least one hardware processor; a reference skill that has been selected by a recommendation model, the reference skill having been displayed using a selectable user interface element configured to add the reference skill to a profile of the reference user; and calling a profile update service for displaying a selectable user interface element.
In the same field of endeavor, Sengupta teaches the method being performed by a computer system having a memory and at least one hardware processor ([0010] one or more non-transitory computer readable storage media encoded with instructions that, when executed by a processor of a computing device, causes the processor to perform steps); a reference skill that has been selected by a recommendation model ([0072] FIG. 8 shows a web interface of a recommended skills output 800 as it may appear on interface 114 to user 120; model selection option 804 may allow worker 120 to view the various recommended skills output created by each model); the reference skill having been displayed using a selectable user interface element configured to add the reference skill to a profile of the reference user ([0072] recommended skill 1 810, recommended skill 2 812, recommended skill 3 814, recommended skill 4 816, and recommended skill 5 818 are each listed on a row; on each row is an input option 808 that allows worker 120 to agree or disagree with the recommendation; the “agree” option may add the recommended skill to the worker's user profile); and calling a profile update service for displaying a selectable user interface element ([0072] FIG. 8 shows a web interface of a recommended skills output 800 as it may appear on interface 114 to user 120; model selection option 804 (equivalent to the profile update service) may allow worker 120 to view the various recommended skills output created by each model).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the method being performed by a computer system having a memory and at least one hardware processor; a reference skill that has been selected by a recommendation model, the reference skill having been displayed using a selectable user interface element configured to add the reference skill to a profile of the reference user; and calling a profile update service for displaying a selectable user interface element, as taught by Sengupta, into Chen. Doing so would be desirable because it would allow a system to best track, develop, and promote acquisition of human capital professional skills among a large and diverse workforce/users (Sengupta [0004]).
However, Chen and Sengupta do not expressly teach wherein the measurement of engagement of the reference user with the online service is based on a number of sessions the reference user has had with the online service within a predetermined period of time.
In the same field of endeavor, Bambha teaches wherein the measurement of engagement of the reference user with the online service is based on a number of sessions the reference user has had with the online service within a predetermined period of time ([0047] the revenue optimization system uses reinforcement learning models for capturing delayed rewards generated by a sequence of actions, such as long-term expected streaming time, user retention, and long-term expected revenue; [0004] the long-term revenue not only depends on expected revenue per session, but also depends on the number of sessions (e.g., the activeness of the users) and the retention rate of the users; [0052] the activeness of the user on the media system (i.e., measurement of engagement of the user with the online service) can be calculated per week, per month, or per any other metric - thus, the activeness of the user/number of sessions is calculated per week/per month (i.e., predetermined period of time) for delayed rewards).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have incorporated wherein the measurement of engagement of the reference user with the online service is based on a number of sessions the reference user has had with the online service within a predetermined period of time, as taught by Bambha, into Chen and Sengupta. Doing so would be desirable because it would allow capturing delayed rewards generated by a sequence of actions, such as long-term expected streaming time, user retention, and long-term expected revenue, and optimizing long-term revenue (Bambha [0004]).
As to dependent Claim 2, Chen, Sengupta, and Bambha teach all the limitations of Claim 1. Chen further teaches wherein for each reference user of the plurality of reference users of the online service, computing a state embedding for the reference user based on profile data of the reference user, activity data of the reference user, and previous impression interaction data of the reference user (page 6, section 2.3, para. 1 - DRL is formulated as a Markov Decision Process (MDP); given a set of users U = {u, u1, u2, u3, ...} and a set of items I = {i, i1, i2, i3, ...}, the system first recommends item i to user u and then gets feedback f^u_i; the MDP modelling of the problem treats the user as the environment and the system as the agent; State S: a state S_t ∈ S is determined by both users’ information (i.e., profile data of the reference user) and the recent l items in which the user was interested before time t (i.e., activity data and previous impression interaction data of the reference user)), the activity data indicating interactions of the reference user with one or more applications of the online service (page 6, section 2.3, para. 1 - State S: a state S_t ∈ S is determined by both users’ information and the recent l items in which the user was interested before time t; page 23, section 5.4, para. 2 - for instance, users in a video recommender system watch, rate, and comment on those movies that they are interested in - thus, activity data indicating interactions of the user with one or more applications of the online service), the previous impression interaction data indicating interactions of the reference user with reference skills that have been selected by a recommendation model at one or more previous time steps for display to the reference user (page 6, section 2.3, para. 1 - given a set of users U = {u, u1, u2, u3, ...} and a set of items I = {i, i1, i2, i3, ...}, the system first recommends item i (equivalent to a skill) to user u and then gets feedback f^u_i; State S: a state S_t ∈ S is determined by both users’ information and the recent l items in which the user was interested before time t - thus, previous impression interaction data indicating interactions of the user with skills selected at one or more previous time steps for display to the user), wherein the Markov decision process has a state space including the state embeddings of the plurality of reference users (page 6, section 2.3, para. 1 - DRL is formulated as a Markov Decision Process (MDP); the MDP modelling of the problem treats the user as the environment and the system as the agent; State S: a state S_t ∈ S is determined by both users’ information and the recent l items in which the user was interested before time t (i.e., state embeddings of the plurality of users)). Sengupta further teaches the selected reference skills having been displayed using selectable user interface elements configured to add the selected reference skills to the profile of the reference user ([0072] FIG. 8 shows a web interface of a recommended skills output 800 as it may appear on interface 114 to user 120; model selection option 804 may allow worker 120 to view the various recommended skills output created by each model; recommended skill 1 810, recommended skill 2 812, recommended skill 3 814, recommended skill 4 816, and recommended skill 5 818 are each listed on a row; on each row is an input option 808 that allows worker 120 to agree or disagree with the recommendation; the “agree” option may add the recommended skill to the worker's user profile).
As to dependent Claim 3, Chen, Sengupta, and Bambha teach all the limitations of Claim 2. Sengupta further teaches wherein the profile data comprises at least one of a company, an educational institution, a job title, or one or more reference skills ([0075] a worker 120 first adds a particular skill to their user profile (i.e., reference skill); process 900 interfaces with skillset database 216 by storing the added skill as associated with the user profile).
As to dependent Claim 4, Chen, Sengupta, and Bambha teach all the limitations of Claim 2. Sengupta further teaches wherein the one or more applications of the online service comprise at least one of: a job search application configured to present online job postings, an online course application configured to present online courses published on the online service, or an online feed configured to present online content published on the online service ([0089] the skill training component 102 provides to the worker 120 a path for developing one or more existing skills associated with the workers users profile 214; this section of the digital platform may include information such as recommended training courses (i.e., online courses) for a given existing skill).
As to dependent Claim 5, Chen, Sengupta, and Bambha teach all the limitations of Claim 2. Sengupta further teaches wherein the previous impression interaction data identifies which reference skills were added to the profile of the reference user via user selection of the selectable user interface elements and which reference skills were not added to the profile of the reference user via user selection of the selectable user interface elements ([0072] FIG. 8 shows a web interface of a recommended skills output 800 as it may appear on interface 114 to user 120; model selection option 804 may allow worker 120 to view various recommended skills output created by each model; Recommended skill 1 810, recommended skill 2 812, recommended skill 3 814, recommended skill 4 816, and recommended skill 5 818 are each listed on a row; on each row is an input option 808 that allows worker 120 to agree or disagree with the recommendation; the “agree” option may add the recommended skill to the worker's user profile; the “disagree” option may include a popup option box 822 that allows worker 120 to provide feedback so that system 100 may generate even better recommendations in the future - thus, the previous interaction data identifies which skills were added and which skills were not added).
As to dependent Claim 6, Chen, Sengupta, and Bambha teach all the limitations of Claim 1. Chen further teaches wherein the recommendation model is trained using Q-learning and a deep convolutional neural network (page 10, section 3.2, para. 2 - Deep Q-learning and its variants are typical value-based DRL methods widely used in DRL-based RS; DRN utilizes Deep Q-Networks (DQN) (i.e., deep convolutional neural network) in RS/recommender systems).
As to dependent Claim 7, Chen, Sengupta, and Bambha teach all the limitations of Claim 1. Chen further teaches wherein the recommendation model is trained using a policy gradient algorithm (page 13, para. 3 - policy gradient-based methods are the other stream of policy-based DRL methods for RS/recommender systems; Policy Gradient for Contextual Recommendation).
As to dependent Claim 10, Chen, Sengupta, and Bambha teach all the limitations of Claim 1. Chen further teaches wherein the performing the function of the online service using the trained recommendation model (pages 1 & 2, section 1 - deep reinforcement learning (DRL) aims to train an agent that can learn from interaction trajectories provided by the environment by combining the power of deep learning and reinforcement learning; an agent in DRL can actively learn from users’ real-time feedback to infer dynamic user preferences; page 2, section 2.1, para. 1 - recommender systems (equivalent to the recommendation model) require coping with dynamic environments by estimating rapidly changing users’ preferences and proactively recommending items to users; for each user u ∈ U, we observe a sequence of user actions X^u = [x^u_1, x^u_2, ..., x^u_{T_u}] with item x^u_t ∈ I, i.e., each event in a user sequence comes from the item set; we refer to a user making a decision as an interaction with an item; suppose the feedback (e.g., ratings or clicking behavior) provided by users is F; page 6, section 2.3, para. 1 - DRL is normally formulated as a Markov Decision Process (MDP); the MDP modelling of the problem treats the user as the environment and the system as the agent; given a set of users U = {u, u1, u2, u3, ...} and a set of items I = {i, i1, i2, i3, ...}, the system first recommends item i to user u and then gets feedback f^u_i; the system aims to incorporate the feedback to improve future recommendations and needs to determine an optimal policy π* regarding which item to recommend to the user to achieve positive feedback - thus, the trained system/model provides improved future recommendations (i.e., performing the function of the online service) to the user) comprises: selecting a target skill using the trained recommendation model (page 6, section 2.3, para. 1 - DRL is normally formulated as a Markov Decision Process (MDP); the MDP modelling of the problem treats the user as the environment and the system as the agent; given a set of users U = {u, u1, u2, u3, ...} and a set of items I = {i, i1, i2, i3, ...}, the system first recommends item i to user u and then gets feedback f^u_i; the system aims to incorporate the feedback to improve future recommendations and needs to determine an optimal policy π* regarding which item to recommend to the user to achieve positive feedback - thus, selecting items (equivalent to target skills) to recommend in the future using the trained model and user feedback); and displaying the target skill on a computing device of a target user of the online service using a selectable user interface element (page 6, section 2.3, para. 1 - the system first recommends item i to user u and then gets feedback f^u_i; the system aims to incorporate the feedback to improve future recommendations and needs to determine an optimal policy π* regarding which item to recommend to the user to achieve positive feedback - thus, the future recommendations (equivalent to target skills) are displayed to the user for feedback). Sengupta further teaches wherein the selectable user interface element is configured to add the target skill to a profile of the target user ([0072] FIG. 8 shows a web interface of a recommended skills output 800 as it may appear on interface 114 to user 120; model selection option 804 may allow worker 120 to view the various recommended skills output created by each model; recommended skill 1 810, recommended skill 2 812, recommended skill 3 814, recommended skill 4 816, and recommended skill 5 818 are each listed on a row; on each row is an input option 808 that allows worker 120 to agree or disagree with the recommendation; the “agree” option may add the recommended skill to the worker's user profile).
Claims 11-17 are system claims corresponding to method claims 1-7 above and are therefore rejected for the same reasons. Sengupta further teaches at least one hardware processor; and a non-transitory machine-readable medium embodying a set of instructions that, when executed by the at least one hardware processor, cause the at least one hardware processor to perform operations ([0010] one or more non-transitory computer readable storage media encoded with instructions that, when executed by a processor of a computing device, causes the processor to perform steps).
Claim 20 is a medium claim corresponding to method claim 1 above and is therefore rejected for the same reasons. Sengupta further teaches a non-transitory machine-readable medium embodying a set of instructions that, when executed by at least one hardware processor, cause the at least one hardware processor to perform operations ([0010] one or more non-transitory computer readable storage media encoded with instructions that, when executed by a processor of a computing device, causes the processor to perform steps).
Response to Arguments
Claim Objections: Applicant’s amendments have overcome the claim objections previously set forth.
35 U.S.C. §112: Applicant’s amendment has overcome the 112(a) rejections previously set forth.
35 U.S.C. §101: Applicant’s amendments and arguments with respect to 101 rejections have been fully considered and are persuasive. The 101 rejections are withdrawn.
35 U.S.C. §103: In the remarks, applicant argues that: (a) The cited references fail to reasonably disclose, teach or suggest "the long-term reward based on a measurement of engagement of the reference user with the online service, wherein the measurement of engagement of the reference user with the online service is based on a number of sessions the reference user has had with the online service within a predetermined period of time" as recited in claim 1. (b) Chen is silent on the concept of a long-term reward defined as a measurement of engagement of a reference user with an online service. (c) Chen teaches away from the use of user behavior data, such as users that tend to watch, rate and comment on movies, in recommender systems, due to the inherent selection bias of such data.
As to point (a), Applicant's arguments have been considered but are moot in view of the new ground of rejection made under 35 U.S.C. § 103.

As to point (b), Examiner respectfully disagrees: Chen does teach the concept of a long-term reward based on a measurement of engagement of a reference user with an online service. Chen teaches that users in a video recommender system watch, rate, and comment on those movies that they are interested in; that the agent in DRL learns from user feedback and proactively recommends items to users; that both instant feedback and delayed feedback (i.e., a measurement of engagement) are combined into the model to represent the long-term reward and optimize long-term engagement; and that a time-LSTM is employed to track users' hierarchical behavior over time (i.e., a measurement of engagement) to represent the delayed feedback (see pages 1 & 2, section 1; page 2, section 2.1, para. 1; page 23, section 5.4, para. 2). Thus, the long-term reward is based on user behavior over time, i.e., a measurement of engagement of the user with the recommender system/online service.

As to point (c), Examiner respectfully disagrees that Chen teaches away from the use of user behavior data. Chen provides a comprehensive review of deep reinforcement learning in recommender systems, teaching that an agent in DRL can actively learn from users' real-time feedback to infer dynamic user preferences, and that DRL is especially suitable for learning from interactions (see section 1, Introduction). Chen does not discredit the use of user behavior data, but rather discloses a debiasing step that can correct the biases present in the logged data (see section 5.4, Bias). Per MPEP § 2123(II), "[t]he prior art's mere disclosure of more than one alternative does not constitute a teaching away from any of these alternatives because such disclosure does not criticize, discredit, or otherwise discourage the solution claimed…." In re Fulton, 391 F.3d 1195, 1201, 73 USPQ2d 1141, 1146 (Fed. Cir. 2004).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Applicant is required under 37 CFR § 1.111(c) to consider these references fully when responding to this action.
Daboll et al. (US 2011/0208585 A1) teaches: The tracking module 214 may be configured to track, per user, a user frequency associated with different user action categories over a predetermined period of time. The user frequency associated with a user action category is the number of times a user performs one or more activities associated with the user action category over a predetermined time. The owner of a web site identifies the business objectives and correlates the business objectives with a visit frequency. The visit frequency is the number of visits of a user or group of users during the predetermined period. The tracking module 214 may track the number of visits by a user to a web page or a document from a plurality of documents on a server. (see [0038], [0045], [0064]).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to REJI KARTHOLY whose telephone number is (571)272-3432. The examiner can normally be reached on Monday - Thursday from 7:30 am to 3:30 pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Welch, can be reached at telephone number 571-272-7212. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from Patent Center. Status information for published applications may be obtained from Patent Center; status information for unpublished applications is available through Patent Center to authorized users only. Should you have questions about access to Patent Center, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) Form at https://www.uspto.gov/patents/uspto-automated-interview-request-air-form.
/REJI KARTHOLY/Primary Examiner, Art Unit 2143