Prosecution Insights
Last updated: April 19, 2026
Application No. 18/395,346

CONTEXTUAL LONG-TERM SURVIVAL OPTIMIZATION FOR CONTENT MANAGEMENT SYSTEM CONTENT SELECTION

Status: Non-Final OA (§101, §103)
Filed: Dec 22, 2023
Examiner: DAGNEW, SABA
Art Unit: 3621
Tech Center: 3600 — Transportation & Electronic Commerce
Assignee: Microsoft Technology Licensing, LLC
OA Round: 3 (Non-Final)
Grant Probability: 38% (At Risk)
Expected OA Rounds: 3-4
Median Time to Grant: 3y 11m
Grant Probability With Interview: 56%

Examiner Intelligence

Career Allow Rate: 38% (225 granted / 594 resolved; -14.1% vs TC avg)
Interview Lift: +18.1% among resolved cases with an interview
Avg Prosecution (typical timeline): 3y 11m
Currently Pending: 47
Total Applications: 641 across all art units

Statute-Specific Performance

§101
31.0%
-9.0% vs TC avg
§103
40.7%
+0.7% vs TC avg
§102
12.9%
-27.1% vs TC avg
§112
8.7%
-31.3% vs TC avg
Black line = Tech Center average estimate • Based on career data from 594 resolved cases

Office Action

Rejections: §101, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims

This action is in response to the amendment filed 2 October 2025. Claims 1, 4, 9, 10, 12, 16, 19, 21 and 22 have been amended. Claims 3, 6-7, 11, 14-15, 18 and 23 have been cancelled. Claims 24-28 have been added. Claims 1-2, 4-5, 8-10, 12-13, 16-17, 19-22 and 24-28 are currently pending and have been examined.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 2 October 2025 has been entered.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Step 1: Claims 1-2, 4-5, 8, 9, 21-22 and 26-28 are directed to a method, claims 9-10, 12-13 and 24 are directed to a system, and claims 16-17, 19-20 and 24 are directed to a media. Thus, each independent claim, on its face, is directed to one of the statutory categories of 35 U.S.C. §101. However, claims 1-2, 4-5, 8-10, 12-13, 16-17, 19-22 and 24-28 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Step 2A-Prong 1: The claims recite the limitations of determining a second version of the content variant choice model based at least in part on the first set of content variant choice-reward data; determining a set of expected rewards of the set of content variant choices based at least in part on the set of contextual features, wherein a particular content variant choice aligns with the intent of the user associated with the request; and choosing the particular content variant choice from among the set of content variant choices based at least in part on the second version of the content variant choice model and the set of expected rewards, wherein choosing the particular content variant choice from among the set of content variant choices is based at least in part on: randomly sampling a set of sampled rewards from a set of probability distributions of the set of expected rewards, and selecting, as the particular content variant choice, a content variant choice of the set of content variant choices with a greatest sampled reward of the set of sampled rewards, wherein the second version of the content variant choice model comprises parameters of the set of probability distributions.

These limitations fall within the following groupings of abstract ideas. Mathematical concepts: this grouping includes mathematical formulas, relationships, and calculations; a mathematical algorithm, for instance, is considered an abstract idea on its own. Mental processes: these are concepts that can be performed in the human mind, such as observations, evaluations, or judgments. That is, other than reciting generic computer components, nothing in the claims precludes the determining step from practically being performed in the human mind. For example, but for the "computer components" language, the claim encompasses the user mentally comparing options to choose a particular content variant choice from among the set of content variant choices based at least in part on the second version of the content variant choice model. This limitation is a mental process.

Step 2A-Prong 2: The claims recite the additional limitations of receiving a first set of content variant choice-reward data, wherein the content variant choice-reward data comprises reward data for content variant choices chosen ..., and receiving a request to choose a content variant choice from among a set of content variant choices associated with the second version of the content variant choice model, wherein the request is associated with a set of contextual features and a device. The receiving steps are recited at a high level of generality (i.e., as a general means of gathering a set of content variant choice-reward data for use in the determining steps) and amount to mere data gathering, which is a form of insignificant extra-solution activity. The device that performs the displaying steps is also recited at a high level of generality and merely automates the outputting steps. Each of the additional limitations is no more than mere instructions to apply the exception using generic computer components, and the combination of these additional elements is likewise no more than mere instructions to apply the exception using a generic computer component. Accordingly, even in combination, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to the abstract idea.

Step 2B: As discussed with respect to Step 2A-Prong 2, the additional elements in the claim amount to no more than mere instructions to apply the exception using a generic computer component. The same analysis applies here in Step 2B: mere instructions to apply an exception on a generic computer cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept at Step 2B. For these reasons, there is no inventive concept in the claims, and the claims are thus ineligible. Furthermore, based on recent patent case law, including the Federal Circuit's Recentive Analytics decision, claims that use generic machine learning techniques on new data without a demonstrated improvement to the underlying technology are not patent eligible.

Dependent claims 2, 4-8, 10, 12-15, 17 and 19-23 recite limitations that further define the same abstract idea noted in claims 1, 9 and 16. Therefore, they are considered patent ineligible for the same reasons above.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-2, 8-10, 16, 17, 22 and 24-28 are rejected under 35 U.S.C. 103 as being unpatentable over Jain et al. (US Pub. No. 2020/0259777 A1) in view of Mnih et al. (US Pub. No. 2019/0258938 A1).

With respect to claim 1, Jain teaches a method comprising: receiving a first set of content variant choice-reward data, wherein the content variant choice-reward data comprises reward data for content variant choices chosen, wherein the content variant choices were chosen based on a first version of a content variant choice model, and wherein the first version of the content variant choice model was determined based at least in part on a second set of content variant choice-reward data (Fig. 1, 110 discloses receiving reward data for previously executed actions [content variant choices chosen]; paragraphs [0006], [0010]-[0011] disclose receiving reward data for arm actions taken as a first set of action-reward data, where the arm actions were chosen based on a previous version of an arm choice policy, and the previous version of the arm choice policy was determined at least in part on a previous set of reward data for a previous set of arm actions taken); determining a second version of the content variant choice model based at least in part on the first set of content variant choice-reward data (Fig. 1, 130 discloses determining a new arm choice policy based on reward data; paragraph [0023] discloses determining new arm choice policies [second version of the content variant choice model] based on reward data received, and paragraph [0046] discloses that each time a new policy will be determined); choosing a particular content variant choice from among the set of content variant choices based at least in part on the second version of the content variant choice model and the set of expected rewards (paragraph [0020] discloses that choosing arms that may not have the highest expected rewards (exploration) and choosing the arms with the highest expected reward will provide a balance of increased rewards as well as increased knowledge of which arms provide the best ...); and causing a content variant corresponding to the particular content variant choice to be displayed at a device in response to the request (paragraph [0050] discloses that the prompt chosen (the arm) can be displayed right away to the user).

Jain teaches the above elements, including receiving a request to choose a content variant choice from among a set of content variant choices associated with the second version of the content variant choice model (Fig. 1, 140 discloses receiving a request for an arm (choice); paragraph [0023] discloses that once a request for an arm choice is received, a determination 150 may be made as to which arm to choose for the request; paragraphs [0046]-[0047] disclose that requests for arm choices are received; paragraph [0010] discloses receiving a request for an arm action to take, where a particular arm action to take is determined based at least in part on the arm choice policy and is then provided in response to the request [intent of user associated with the request]; and paragraph [0047] discloses that the request for an arm choice may include receiving an indication that an arm choice is needed ..., and that the request for an arm choice may be received along with context that defines important information about the request); and determining a set of expected rewards of the set of content variant choices based at least in part on the set of contextual features (paragraph [0053] discloses determining the expected reward for multiple arms; paragraph [0032] discloses that upgrading to a more feature-rich "Pro" account may have a reward value of 1.33; paragraph [0039] discloses creating an initial policy as long as the logged feature (context), action and reward data are compatible (e.g., meaning and/or statistics); paragraph [0045] discloses that the probability distribution from the Thompson sampling may be based on counts or rewards; and paragraph [0044] discloses Thompson sampling with a linear or logistic model, a deep neural network model, or any other appropriate technique or algorithm).

Jain fails to teach training an expected rewards neural network model to produce a trained expected rewards neural network model based at least in part on a set of features extracted from the first set of content variant choice-reward data, a loss function of the reward data of the first set of content variant choice-reward data, and a set of sampled rewards generated by one or more content variant choice policies, wherein a set of labels used in the training is based at least in part on the reward data of the first set of content variant choice-reward data; wherein the contextual features include an intent of a user associated with the request and a corresponding expected reward of choosing the set of content variant choices based at least in part on the set of contextual features; and wherein the corresponding neural network is the trained expected rewards neural network model.

However, Mnih teaches training an expected rewards neural network model to produce a trained expected rewards neural network model based at least in part on a set of features extracted from the first set of content variant choice-reward data (Fig. 3, 301-306 discloses obtaining observations for a sequence of time steps and the actual reward received following the last observation, generating one or more intermediate outputs of the action selection policy neural network that characterize the sequence of observations, and processing the one or more intermediate outputs using the reward prediction neural network to generate a predicted reward), a loss function of the reward data of the first set of content variant choice-reward data, and a set of sampled rewards generated by one or more content variant choice policies, wherein a set of labels used in the training is based at least in part on the reward data of the first set of content variant choice-reward data (paragraph [0008] discloses that the gradient may be a gradient of a policy loss function ..., that an auxiliary control neural network may be used to compute a loss function for such backpropagation ..., and determining the reward prediction loss function; paragraph [0017] discloses receiving an actual reward received with the next or a subsequent observation image, and training the immediate reward neural network to decrease a loss between the actual reward and the estimated reward ..., the loss function being dependent on a difference between the actual reward and the estimated reward; and paragraph [0082] discloses that a loss function is given by the mean-squared error between the actual reward received with the observation following the last observation in the sequence and the prediction for the reward received), and the trained expected rewards neural network model (paragraph [0006] discloses training a reward prediction [expected] neural network), wherein the contextual features include an intent of a user associated with the request (paragraph [0007] discloses a plurality of action selection policy network parameters used in selecting actions to be performed by an agent interacting with an environment) and a reward of choosing the set of content variant choices based at least in part on the set of contextual features (paragraph [0039] discloses identifying a particular action that is predicted to yield the highest long-term time-discounted reward if performed by the agent in response to the observation, and paragraph [0051] discloses the set of parameters of the action selection policy neural network 112 to recognize observations that lead to receiving a high reward at a subsequent time step).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Jain, in which the probability distribution from the Thompson sampling may be based on counts or rewards reflecting how many times actions taken were successful or unsuccessful, by adding the gradient of the policy loss function feature of Mnih in order to compute and decrease a loss between the actual reward and the estimated reward (see Mnih, paragraph [0017]).
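For readers tracking the technology rather than the claim language: the training the examiner maps to Mnih is, in substance, fitting a reward-prediction model by minimizing a loss between estimated and actual rewards (the mean-squared-error formulation characterized above). Here is a minimal sketch of that idea, assuming a linear model, synthetic data, and plain gradient descent; every name and number is an illustrative assumption, not code from the application or the cited references.

```python
import numpy as np

# Minimal sketch (not from the record): fit an expected-rewards model to
# logged choice-reward data by gradient descent on a mean-squared-error
# loss, i.e., a loss between actual and estimated rewards of the kind the
# rejection attributes to Mnih (paragraphs [0017], [0082]).
rng = np.random.default_rng(0)

n_samples, n_features = 1000, 8
X = rng.normal(size=(n_samples, n_features))        # contextual features
true_w = rng.normal(size=n_features)
rewards = X @ true_w + rng.normal(scale=0.1, size=n_samples)  # observed rewards (training labels)

w = np.zeros(n_features)                            # expected-rewards model parameters
lr = 0.05
for _ in range(500):
    pred = X @ w                                    # estimated rewards
    grad = 2.0 * X.T @ (pred - rewards) / n_samples # gradient of the MSE loss
    w -= lr * grad                                  # adjust parameters using the loss output

print(f"final MSE: {np.mean((X @ w - rewards) ** 2):.4f}")
```

The update on the last line of the loop is driven entirely by the difference between predicted and actual rewards, which is the "decrease a loss between the actual reward and the estimated reward" behavior the rejection cites.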
With respect to claim 2, Jain in view of Mnih teaches the elements of claim 1; furthermore, Jain teaches the method further comprising: determining the set of expected rewards of choosing the set of content variant choices based at least in part on using a trained neural network to determine a respective expected reward of choosing each content variant choice of the set of content variant choices (paragraph [0044] discloses that the choice among other arms (those chosen epsilon percent of the time) may be done using a random distribution ... a deep neural network model).

With respect to claim 8, Jain in view of Mnih teaches the elements of claim 1; furthermore, Jain teaches the method wherein the set of contextual features of the request comprises one or more of: an identity of a user associated with the request, a date or time of the request, a specification of a type of device associated with the request, or a specification of a graphical user interface channel associated with the request (paragraph [0047] discloses that the request for an arm choice may be received along with context that defines important information about the request; for example, the context received may include the percent of quota used, quota size, location of the user, time of day and number of apps installed ...).

With respect to claim 9, Jain teaches a system comprising: at least one processor; memory; and instructions stored in the memory that when executed by the at least one processor (Fig. 4, processor 404, memory 406, communication interface 418 and storage devices 410; paragraph [0064] discloses a general-purpose hardware processor programmed to perform the techniques pursuant to program instructions ...) cause the at least one processor to: receive a first set of content variant choice-reward data, wherein the content variant choice-reward data comprises reward data for content variant choices chosen, wherein the content variant choices were chosen based on a first version of a content variant choice model, and wherein the first version of the content variant choice model was determined based at least in part on a second set of content variant choice-reward data (Fig. 1, 110 discloses receiving reward data for previously executed actions [content variant choices chosen]; paragraphs [0006], [0010]-[0011] disclose receiving reward data for arm actions taken as a first set of action-reward data, where the arm actions were chosen based on a previous version of an arm choice policy, and the previous version of the arm choice policy was determined at least in part on a previous set of reward data for a previous set of arm actions taken); determine a second version of the content variant choice model based at least in part on the first set of content variant choice-reward data (Fig. 1, 130 discloses determining a new arm choice policy based on reward data; paragraph [0023] discloses determining new arm choice policies [second version of the content variant choice model] based on reward data received, and paragraph [0046] discloses that each time a new policy will be determined); choose a particular content variant choice from among the set of content variant choices based at least in part on the second version of the content variant choice model and the set of expected rewards (paragraph [0020] discloses that choosing arms that may not have the highest expected rewards (exploration) and choosing the arms with the highest expected reward will provide a balance of increased rewards as well as increased knowledge of which arms provide the best ...); and cause a content variant corresponding to the particular content variant choice to be displayed at a device in response to the request (paragraph [0050] discloses that the prompt chosen (the arm) can be displayed right away to the user).

Jain teaches the above elements, including receive a request to choose a content variant choice from among a set of content variant choices associated with the second version of the content variant choice model (Fig. 1, 140 discloses receiving a request for an arm (choice); paragraph [0023] discloses that once a request for an arm choice is received, a determination 150 may be made as to which arm to choose for the request; paragraphs [0046]-[0047] disclose that requests for arm choices are received; paragraph [0010] discloses receiving a request for an arm action to take, where a particular arm action to take is determined based at least in part on the arm choice policy and is then provided in response to the request [intent of user associated with the request]; and paragraph [0047] discloses that the request for an arm choice may include receiving an indication that an arm choice is needed ..., and that the request for an arm choice may be received along with context that defines important information about the request); and determine a set of expected rewards of the set of content variant choices based at least in part on the set of contextual features (paragraph [0053] discloses determining the expected reward for multiple arms; paragraph [0032] discloses that upgrading to a more feature-rich "Pro" account may have a reward value of 1.33; paragraph [0039] discloses creating an initial policy as long as the logged feature (context), action and reward data are compatible (e.g., meaning and/or statistics); paragraph [0045] discloses that the probability distribution from the Thompson sampling may be based on counts or rewards; and paragraph [0044] discloses Thompson sampling with a linear or logistic model, a deep neural network model, or any other appropriate technique or algorithm).

Jain fails to teach training an expected rewards neural network model to produce a trained expected rewards neural network model based at least in part on a set of features extracted from the first set of content variant choice-reward data, a loss function of the reward data of the first set of content variant choice-reward data, and a set of sampled rewards generated by one or more content variant choice policies, wherein a set of labels used in the training is based at least in part on the reward data of the first set of content variant choice-reward data; wherein the contextual features include an intent of a user associated with the request and a corresponding expected reward of choosing the set of content variant choices based at least in part on the set of contextual features; and wherein the corresponding neural network is the trained expected rewards neural network model.

However, Mnih teaches training an expected rewards neural network model to produce a trained expected rewards neural network model based at least in part on a set of features extracted from the first set of content variant choice-reward data (Fig. 3, 301-306 discloses obtaining observations for a sequence of time steps and the actual reward received following the last observation, generating one or more intermediate outputs of the action selection policy neural network that characterize the sequence of observations, and processing the one or more intermediate outputs using the reward prediction neural network to generate a predicted reward), a loss function of the reward data of the first set of content variant choice-reward data, and a set of sampled rewards generated by one or more content variant choice policies, wherein a set of labels used in the training is based at least in part on the reward data of the first set of content variant choice-reward data (paragraph [0008] discloses that the gradient may be a gradient of a policy loss function ..., that an auxiliary control neural network may be used to compute a loss function for such backpropagation ..., and determining the reward prediction loss function; paragraph [0017] discloses receiving an actual reward received with the next or a subsequent observation image, and training the immediate reward neural network to decrease a loss between the actual reward and the estimated reward ..., the loss function being dependent on a difference between the actual reward and the estimated reward; and paragraph [0082] discloses that a loss function is given by the mean-squared error between the actual reward received with the observation following the last observation in the sequence and the prediction for the reward received), and the trained expected rewards neural network model (paragraph [0006] discloses training a reward prediction [expected] neural network), wherein the contextual features include an intent of a user associated with the request (paragraph [0007] discloses a plurality of action selection policy network parameters used in selecting actions to be performed by an agent interacting with an environment) and a reward of choosing the set of content variant choices based at least in part on the set of contextual features (paragraph [0039] discloses identifying a particular action that is predicted to yield the highest long-term time-discounted reward if performed by the agent in response to the observation, and paragraph [0051] discloses the set of parameters of the action selection policy neural network 112 to recognize observations that lead to receiving a high reward at a subsequent time step).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Jain, in which the probability distribution from the Thompson sampling may be based on counts or rewards reflecting how many times actions taken were successful or unsuccessful, by adding the gradient of the policy loss function feature of Mnih in order to compute and decrease a loss between the actual reward and the estimated reward (see Mnih, paragraph [0017]).

With respect to claim 10, Jain in view of Mnih teaches the elements of claim 9; furthermore, Jain teaches the system further comprising: determining the set of expected rewards of choosing the set of content variant choices based at least in part on using a trained neural network to determine a respective expected reward of choosing each content variant choice of the set of content variant choices (paragraph [0044] discloses that the choice among other arms (those chosen epsilon percent of the time) may be done using a random distribution ... a deep neural network model).

With respect to claim 16, Jain teaches a non-transitory computer-readable medium storing instructions which, when executed by at least one programmable electronic device, cause the at least one programmable electronic device to perform operations (paragraph [0066] discloses that, when stored in non-transitory storage media accessible to processor 404, the instructions render computer system 400 into a special-purpose machine that is customized to perform the operations specified in the instructions) comprising: receiving a first set of content variant choice-reward data, wherein the content variant choice-reward data comprises reward data for content variant choices chosen, wherein the content variant choices were chosen based on a first version of a content variant choice model, and wherein the first version of the content variant choice model was determined based at least in part on a second set of content variant choice-reward data (Fig. 1, 110 discloses receiving reward data for previously executed actions [content variant choices chosen]; paragraphs [0006], [0010]-[0011] disclose receiving reward data for arm actions taken as a first set of action-reward data, where the arm actions were chosen based on a previous version of an arm choice policy, and the previous version of the arm choice policy was determined at least in part on a previous set of reward data for a previous set of arm actions taken); determining a second version of the content variant choice model based at least in part on the first set of content variant choice-reward data (Fig. 1, 130 discloses determining a new arm choice policy based on reward data; paragraph [0023] discloses determining new arm choice policies [second version of the content variant choice model] based on reward data received, and paragraph [0046] discloses that each time a new policy will be determined); choosing a particular content variant choice from among the set of content variant choices based at least in part on the second version of the content variant choice model and the set of expected rewards (paragraph [0020] discloses that choosing arms that may not have the highest expected rewards (exploration) and choosing the arms with the highest expected reward will provide a balance of increased rewards as well as increased knowledge of which arms provide the best ...); and causing a content variant corresponding to the particular content variant choice to be displayed at a device in response to the request (paragraph [0050] discloses that the prompt chosen (the arm) can be displayed right away to the user).

Jain teaches the above elements, including receiving a request to choose a content variant choice from among a set of content variant choices associated with the second version of the content variant choice model (Fig. 1, 140 discloses receiving a request for an arm (choice); paragraph [0023] discloses that once a request for an arm choice is received, a determination 150 may be made as to which arm to choose for the request; paragraphs [0046]-[0047] disclose that requests for arm choices are received; paragraph [0010] discloses receiving a request for an arm action to take, where a particular arm action to take is determined based at least in part on the arm choice policy and is then provided in response to the request [intent of user associated with the request]; and paragraph [0047] discloses that the request for an arm choice may include receiving an indication that an arm choice is needed ..., and that the request for an arm choice may be received along with context that defines important information about the request); and determining a set of expected rewards of the set of content variant choices based at least in part on the set of contextual features (paragraph [0053] discloses determining the expected reward for multiple arms; paragraph [0032] discloses that upgrading to a more feature-rich "Pro" account may have a reward value of 1.33; paragraph [0039] discloses creating an initial policy as long as the logged feature (context), action and reward data are compatible (e.g., meaning and/or statistics); paragraph [0045] discloses that the probability distribution from the Thompson sampling may be based on counts or rewards; and paragraph [0044] discloses Thompson sampling with a linear or logistic model, a deep neural network model, or any other appropriate technique or algorithm).

Jain fails to teach training an expected rewards neural network model to produce a trained expected rewards neural network model based at least in part on a set of features extracted from the first set of content variant choice-reward data, a loss function of the reward data of the first set of content variant choice-reward data, and a set of sampled rewards generated by one or more content variant choice policies, wherein a set of labels used in the training is based at least in part on the reward data of the first set of content variant choice-reward data; wherein the contextual features include an intent of a user associated with the request and a corresponding expected reward of choosing the set of content variant choices based at least in part on the set of contextual features; and wherein the corresponding neural network is the trained expected rewards neural network model.

However, Mnih teaches training an expected rewards neural network model to produce a trained expected rewards neural network model based at least in part on a set of features extracted from the first set of content variant choice-reward data (Fig. 3, 301-306 discloses obtaining observations for a sequence of time steps and the actual reward received following the last observation, generating one or more intermediate outputs of the action selection policy neural network that characterize the sequence of observations, and processing the one or more intermediate outputs using the reward prediction neural network to generate a predicted reward), a loss function of the reward data of the first set of content variant choice-reward data, and a set of sampled rewards generated by one or more content variant choice policies, wherein a set of labels used in the training is based at least in part on the reward data of the first set of content variant choice-reward data (paragraph [0008] discloses that the gradient may be a gradient of a policy loss function ..., that an auxiliary control neural network may be used to compute a loss function for such backpropagation ..., and determining the reward prediction loss function; paragraph [0017] discloses receiving an actual reward received with the next or a subsequent observation image, and training the immediate reward neural network to decrease a loss between the actual reward and the estimated reward ..., the loss function being dependent on a difference between the actual reward and the estimated reward; and paragraph [0082] discloses that a loss function is given by the mean-squared error between the actual reward received with the observation following the last observation in the sequence and the prediction for the reward received), and the trained expected rewards neural network model (paragraph [0006] discloses training a reward prediction [expected] neural network), wherein the contextual features include an intent of a user associated with the request (paragraph [0007] discloses a plurality of action selection policy network parameters used in selecting actions to be performed by an agent interacting with an environment) and a reward of choosing the set of content variant choices based at least in part on the set of contextual features (paragraph [0039] discloses identifying a particular action that is predicted to yield the highest long-term time-discounted reward if performed by the agent in response to the observation, and paragraph [0051] discloses the set of parameters of the action selection policy neural network 112 to recognize observations that lead to receiving a high reward at a subsequent time step).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Jain, in which the probability distribution from the Thompson sampling may be based on counts or rewards reflecting how many times actions taken were successful or unsuccessful, by adding the gradient of the policy loss function feature of Mnih in order to compute and decrease a loss between the actual reward and the estimated reward (see Mnih, paragraph [0017]).

With respect to claim 17, Jain in view of Mnih teaches the elements of claim 16; furthermore, Jain teaches the non-transitory computer-readable medium further comprising: determining the set of expected rewards of choosing the set of content variant choices based at least in part on using a trained neural network to determine a respective expected reward of choosing each content variant choice of the set of content variant choices (paragraph [0044] discloses that the choice among other arms (those chosen epsilon percent of the time) may be done using a random distribution ... a deep neural network model).

With respect to claim 22, Jain in view of Mnih teaches the elements of claim 1; furthermore, Jain teaches the method wherein the intent of the user associated with the request comprises at least one of a job-seeking intent or a professional networking intent, further comprising: choosing the particular content variant choice from among the set of content variant choices (paragraph [0045] discloses choosing actions with a high return on reward) based at least in part on: randomly sampling a set of sampled rewards from a set of probability distributions of the set of expected rewards (paragraph [0042] discloses that the probability distribution from Thompson sampling may be based on counts or rewards ..., and that the Thompson sampling may be varied or sampled in order to introduce a variety or distribution in the actions suggested or taken; paragraph [0044] discloses using a random distribution among the arms; and paragraph [0051] discloses that sampling may be accomplished by introducing small, random variations in the coefficients); and selecting, as the particular content variant choice, a content variant choice of the set of content variant choices with a greatest sampled reward of the set of sampled rewards (paragraph [0020] discloses that choosing arms that may not have the highest expected rewards (exploration) and choosing the arms with the highest expected reward will provide a balance of increased reward as well as increased knowledge of which arms provide the best ...); wherein the second version of the content variant choice model comprises parameters of the set of probability distributions (paragraph [0042] discloses that determining a new arm policy may include performing Thompson sampling on the data under consideration (e.g., data from the most recent batches [second version]) ..., and that the probability distribution from the Thompson sampling may be based on counts or rewards, how many times actions taken were successful/unsuccessful, or any other appropriate measure [parameters]).
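Claims 22, 24 and 25 recite the same two-step selection the examiner reads onto Jain's Thompson sampling: draw one random sample per choice from that choice's reward distribution, then pick the choice with the greatest sample. A minimal sketch under the assumption of Gaussian per-choice distributions; the choice names and parameter values are invented for illustration and appear nowhere in the record.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal sketch: the "model" is just per-choice distribution parameters
# (here assumed Gaussian: mean and standard deviation of expected reward).
model = {
    "hero_banner":   (0.12, 0.04),
    "text_teaser":   (0.10, 0.02),
    "video_preview": (0.15, 0.08),
}

# Step 1: randomly sample one reward per choice from its distribution.
sampled = {choice: rng.normal(mu, sigma) for choice, (mu, sigma) in model.items()}

# Step 2: select the choice with the greatest sampled reward.
particular_choice = max(sampled, key=sampled.get)
print(particular_choice, sampled)
```

Because the draw is random, a choice with a wide (uncertain) distribution occasionally wins even when its mean is lower, which is the exploration/exploitation balance Jain's paragraph [0020] is cited for.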
With respect to claim 24, Jain in view of Mnih teaches the elements of claim 9; furthermore, Jain teaches the system further comprising instructions stored in the memory that when executed by the at least one processor cause the at least one processor to: choose the particular content variant choice from among the set of content variant choices (paragraph [0045] discloses choosing actions with a high return on reward) based at least in part on: randomly sample a set of sampled rewards from a set of probability distributions of the set of expected rewards (paragraph [0042] discloses that the probability distribution from Thompson sampling may be based on counts or rewards ..., and that the Thompson sampling may be varied or sampled in order to introduce a variety or distribution in the actions suggested or taken; paragraph [0044] discloses using a random distribution among the arms; and paragraph [0051] discloses that sampling may be accomplished by introducing small, random variations in the coefficients); and select, as the particular content variant choice, a content variant choice of the set of content variant choices with a greatest sampled reward of the set of sampled rewards (paragraph [0020] discloses that choosing arms that may not have the highest expected rewards (exploration) and choosing the arms with the highest expected reward will provide a balance of increased reward as well as increased knowledge of which arms provide the best ...); wherein the second version of the content variant choice model comprises parameters of the set of probability distributions (paragraph [0042] discloses that determining a new arm policy may include performing Thompson sampling on the data under consideration (e.g., data from the most recent batches [second version]) ..., and that the probability distribution from the Thompson sampling may be based on counts or rewards, how many times actions taken were successful/unsuccessful, or any other appropriate measure [parameters]).

With respect to claim 25, Jain in view of Mnih teaches the elements of claim 16; furthermore, Jain teaches the non-transitory computer-readable medium operations further comprising: choosing the particular content variant choice from among the set of content variant choices (paragraph [0045] discloses choosing actions with a high return on reward) based at least in part on: randomly sampling a set of sampled rewards from a set of probability distributions of the set of expected rewards (paragraph [0042] discloses that the probability distribution from Thompson sampling may be based on counts or rewards ..., and that the Thompson sampling may be varied or sampled in order to introduce a variety or distribution in the actions suggested or taken; paragraph [0044] discloses using a random distribution among the arms; and paragraph [0051] discloses that sampling may be accomplished by introducing small, random variations in the coefficients); and selecting, as the particular content variant choice, a content variant choice of the set of content variant choices with a greatest sampled reward of the set of sampled rewards (paragraph [0020] discloses that choosing arms that may not have the highest expected rewards (exploration) and choosing the arms with the highest expected reward will provide a balance of increased reward as well as increased knowledge of which arms provide the best ...); wherein the second version of the content variant choice model comprises parameters of the set of probability distributions (paragraph [0042] discloses that determining a new arm policy may include performing Thompson sampling on the data under consideration (e.g., data from the most recent batches [second version]) ..., and that the probability distribution from the Thompson sampling may be based on counts or rewards, how many times actions taken were successful/unsuccessful, or any other appropriate measure [parameters]).

With respect to claim 26, Jain in view of Mnih teaches the elements of claim 1, except for adjusting parameters of the trained expected rewards neural network model using output of the loss function. However, Mnih teaches adjusting parameters of the trained expected rewards neural network model using output of the loss function (paragraph [0017] discloses receiving an actual reward received with the next or a subsequent observation image, and training the immediate reward neural network to decrease a loss between the actual reward and the estimated reward ..., the loss function being dependent on a difference between the actual reward and the estimated reward; paragraph [0082] discloses that a loss function is given by the mean-squared error between the actual reward received with the observation following the last observation in the sequence and the prediction for the reward received). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Jain, in which the probability distribution from the Thompson sampling may be based on counts or rewards reflecting how many times actions taken were successful or unsuccessful, by adding the gradient of the policy loss function feature of Mnih in order to compute and decrease a loss between the actual reward and the estimated reward (see Mnih, paragraph [0017]).

With respect to claim 27, Jain in view of Mnih teaches the elements of claim 1, except wherein the training trains the expected rewards neural network model to minimize a difference between sampled rewards and actual rewards. However, Mnih teaches wherein the training trains the expected rewards neural network model to minimize a difference between sampled rewards and actual rewards (paragraph [0017] discloses receiving an actual reward received with the next or a subsequent observation image, and training the immediate reward neural network to decrease a loss between the actual reward and the estimated reward ..., the loss function being dependent on a difference between the actual reward and the estimated reward; paragraph [0082] discloses that a loss function is given by the mean-squared error between the actual reward received with the observation following the last observation in the sequence and the prediction for the reward received). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Jain, in which the probability distribution from the Thompson sampling may be based on counts or rewards reflecting how many times actions taken were successful or unsuccessful, by adding the gradient of the policy loss function feature of Mnih in order to compute and decrease a loss between the actual reward and the estimated reward (see Mnih, paragraph [0017]).

With respect to claim 28, Jain in view of Mnih teaches the elements of claim 1; furthermore, Jain teaches the method wherein the training trains the expected rewards neural network model to adapt to variances introduced by the content variant choice model (paragraph [0037] discloses that the information used to train the arm policy is the combination of the context, the chosen action and the rewards).

Claims 4, 5, 12, 13, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Jain et al. (US Pub. No. 2020/0259777 A1) in view of Mnih et al. (US Pub. No. 2019/0258938 A1), and further in view of Lee et al. (US Pub. No. 2017/0098236 A1).

With respect to claim 4, Jain in view of Mnih teaches the elements of claim 3; furthermore, Jain teaches the method wherein: the set of probability distributions is a set of beta distributions (paragraph [0042] discloses that the Thompson sampling may be done with a beta distribution ...), and Mnih teaches that the policy output may be a probability distribution over the set of possible actions (paragraph [0039]). Jain and Mnih fail to teach that the parameters of the set of distributions comprise a respective alpha parameter and a respective beta parameter for each content variant choice of the set of content variant choices. However, Lee teaches that the parameters of the set of distributions comprise a respective alpha parameter and a respective beta parameter for each content variant choice of the set of content variant choices (paragraph [0079] discloses that the beta distribution parameter generation module may be configured to generate an alpha parameter ... and a beta parameter pair for each arm ... alpha-beta parameter pair). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the Thompson sampling with a beta distribution of Jain and the probability distribution of Mnih with the beta distribution parameter generation module of Lee in order to generate an alpha parameter and beta parameter pair for each arm (see Lee, paragraph [0079]).
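Claim 4's alpha/beta parameter pairs (via Lee) and claim 5's derivation of the second model version from the first version's parameters plus new choice-reward data together describe a standard Beta-Bernoulli bandit update. A small sketch under that reading; the variant names and the batch data are fabricated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal sketch: each content variant choice carries a Beta(alpha, beta)
# posterior over a binary reward (e.g., click / no click).
first_version = {"A": [1.0, 1.0], "B": [1.0, 1.0], "C": [1.0, 1.0]}

# New batch of choice-reward data: (choice made, binary reward observed).
batch = [("A", 1), ("B", 0), ("A", 0), ("C", 1), ("C", 1), ("B", 0)]

# Second model version = first version's parameters plus the new batch:
# alpha accumulates successes, beta accumulates failures.
second_version = {k: list(v) for k, v in first_version.items()}
for choice, reward in batch:
    second_version[choice][0] += reward       # alpha += 1 on success
    second_version[choice][1] += 1 - reward   # beta  += 1 on failure

# Thompson sampling against the updated per-choice posteriors.
sampled = {c: rng.beta(a, b) for c, (a, b) in second_version.items()}
print(max(sampled, key=sampled.get), second_version)
```

On this reading, the "model" really is just the set of distribution parameters, which is what claim 22's wherein-clause says of the second model version.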
With respect to claim 5, Jain in view of Lee teaches the elements of claim 1; furthermore, Jain teaches the method wherein: the first content variant choice model comprises a first set of probability distribution parameters for the set of content variant choices (paragraph [0042] discloses that the probability distribution from the Thompson sampling may be based on counts or rewards ...), and Mnih teaches that the policy output may be a probability distribution over the set of possible actions (paragraph [0039]). Jain and Mnih fail to teach that the second content variant choice model comprises a second set of probability distribution parameters for the set of content variant choices, and that the method further comprises determining the second set of probability distribution parameters for the set of content variant choices based at least in part on the first set of probability distribution parameters for the set of content variant choices and the first set of content variant choice-reward data. However, Lee teaches these elements (paragraph [0050] discloses selecting advertisements in the advertisement pool that maximize the probability of a response event ..., such that the probability of the response event in the time period is maximized ..., and paragraph [0051] discloses selecting the advertisement that yields a maximum expected probability of response events for the M-number of ad requests ...). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the probability distribution of Jain and the probability distribution of Mnih with the probability distribution that maximizes the response rate of Lee in order to indicate the highest expected response rate (see Lee, paragraph [0080]).

With respect to claim 12, Jain in view of Mnih teaches the elements of claim 9; furthermore, Jain teaches the system wherein: the set of probability distributions is a set of beta distributions (paragraph [0042] discloses that the Thompson sampling may be done with a beta distribution ...), and Mnih teaches that the policy output may be a probability distribution over the set of possible actions (paragraph [0039]). Jain and Mnih fail to teach that the parameters of the set of distributions comprise a respective alpha parameter and a respective beta parameter for each content variant choice of the set of content variant choices. However, Lee teaches that the parameters of the set of distributions comprise a respective alpha parameter and a respective beta parameter for each content variant choice of the set of content variant choices (paragraph [0079] discloses that the beta distribution parameter generation module may be configured to generate an alpha parameter ... and a beta parameter pair for each arm ... alpha-beta parameter pair). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the Thompson sampling with a beta distribution of Jain and the probability distribution of Mnih with the beta distribution parameter generation module of Lee in order to generate an alpha parameter and beta parameter pair for each arm (see Lee, paragraph [0079]).

With respect to claim 13, Jain in view of Mnih teaches the elements of claim 9; furthermore, Jain teaches the system wherein: the first content variant choice model comprises a first set of probability distribution parameters for the set of content variant choices (paragraph [0042] discloses that the probability distribution from the Thompson sampling may be based on counts or rewards ...), and Mnih teaches that the policy output may be a probability distribution over the set of possible actions (paragraph [0039]). Jain and Mnih fail to teach that the second content variant choice model comprises a second set of probability distribution parameters for the s

Prosecution Timeline

Dec 22, 2023: Application Filed
Dec 28, 2024: Non-Final Rejection — §101, §103
Mar 10, 2025: Interview Requested
Mar 25, 2025: Applicant Interview (Telephonic)
Mar 25, 2025: Examiner Interview Summary
May 01, 2025: Response Filed
Jun 06, 2025: Final Rejection — §101, §103
Aug 06, 2025: Interview Requested
Aug 27, 2025: Interview Requested
Sep 02, 2025: Examiner Interview Summary
Sep 02, 2025: Applicant Interview (Telephonic)
Oct 02, 2025: Request for Continued Examination
Oct 13, 2025: Response after Non-Final Action
Oct 30, 2025: Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12572959: SYSTEMS AND METHODS FOR OPTIMAL AUTOMATIC ADVERTISING TRANSACTIONS ON NETWORKED DEVICES (granted Mar 10, 2026; 2y 5m to grant)
Patent 12505426: AUTOMATED MULTI-PARTY TRANSACTION DECISIONING SYSTEM (granted Dec 23, 2025; 2y 5m to grant)
Patent 12488149: SYSTEM AND METHOD FOR OPTIMIZING ONLINE PRIVACY RECOMMENDATIONS FOR ENTITY USERS (granted Dec 02, 2025; 2y 5m to grant)
Patent 12450633: RETAIL DIGITAL SIGNAGE AND AUTOMATIC PROMOTION SYSTEM (granted Oct 21, 2025; 2y 5m to grant)
Patent 12443972: USE OF LOCALIZED BROADCAST SIGNALS TO MODIFY MOBILE APPLICATION BEHAVIOR (granted Oct 14, 2025; 2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 38%
With Interview: 56% (+18.1%)
Median Time to Grant: 3y 11m
PTA Risk: High

Based on 594 resolved cases by this examiner. Grant probability is derived from the examiner's career allow rate.
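The interview-adjusted figure appears consistent with simple additive arithmetic on the career numbers (an assumption about this tool's method, not something stated on the page): the 18.1-point interview lift added to the 38% base rate gives roughly the 56% shown.

```python
# Assumed derivation (illustrative; the tool's actual method is not stated):
base_rate = 0.38        # examiner's career allow rate
interview_lift = 0.181  # observed lift in resolved cases with an interview
print(f"{base_rate + interview_lift:.1%}")  # 56.1%, displayed as 56%
```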

Free tier: 3 strategy analyses per month