Office Action Analysis: 16568292 - SYSTEM AND METHOD FOR A RECOMMENDER USING DEEP REINFORCEMENT LEARNING AND Q-LEARNING BASED ON USER ATTRIBUTES

Examiner Intelligence

Examiner: LEY, SALLY THI
Career Allowance Rate: 19% (7 granted / 36 resolved), -35.6% vs TC avg
Interview Lift: +33.3% (resolved cases with interview vs. without)
Avg Prosecution (typical timeline): 4y 8m
Currently pending: 17
Total Applications (career history, across all art units): 69
Statute-Specific Performance

Statute   Rate     vs TC avg
§101      10.3%    -29.7%
§103      83.2%    +43.2%
§102      3.8%     -36.2%
§112      2.7%     -37.3%
Based on career data from 36 resolved cases. The Tech Center average is an estimate.
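The figures above can be cross-checked with simple arithmetic. The sketch below assumes that each "vs TC avg" delta is a percentage-point difference, which the page does not state; all variable names are illustrative.

```python
# Cross-check of the dashboard figures. ASSUMPTION: "vs TC avg" deltas are
# percentage-point differences (the page does not say so explicitly).
granted, resolved = 7, 36
allowance_rate = 100 * granted / resolved  # career allowance rate, in percent

# (examiner rate %, delta vs TC avg in points), as listed per statute
statute_stats = {
    "101": (10.3, -29.7),
    "103": (83.2, 43.2),
    "102": (3.8, -36.2),
    "112": (2.7, -37.3),
}

# Implied Tech Center average = examiner rate minus the listed delta
implied_tc_avg = {
    statute: round(rate - delta, 1)
    for statute, (rate, delta) in statute_stats.items()
}

print(f"allowance rate: {allowance_rate:.1f}%")
print(implied_tc_avg)
```

Under that assumption, 7 of 36 works out to about 19.4%, consistent with the 19% shown, and every statute row implies the same Tech Center estimate of 40%.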
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
	This Office Action is in response to the communication filed on 02 Sep 2025.
	Claims 1-3, 5-7, and 9-22 are being considered on the merits.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-7, and 9-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao, et al. (“Deep Reinforcement Learning for List-wise Recommendations”, 27 Jun. 2019, arXiv:1801.00209v3; hereinafter “Zhao”) in view of Wilson, et al. (US 2015/0220835 A1; hereinafter “Wilson”).
Regarding Claim 1, Zhao teaches a method for: 
modifying at least one model value of the prediction model to maximize a reward score, (Zhao sec. 2.1, 2.5 and algorithm 3: “We study the recommendation task in which a recommender agent (RA) interacts with environment E (or users) by sequentially choosing recommendation items over a sequence of time steps, so as to maximize its cumulative reward.” “In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28).” Examiner notes that Zhao refers to algorithm 3 where the code calls for modification of the parameters (i.e. model values) to arrive at a maximum cumulative reward). 
the reward score computed using a plurality of expected content-based scores computed by applying a content-based filtering method to a plurality of training user attribute values and a plurality of training item properties of a plurality of training items; and (Zhao, sec. 1: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” Examiner notes that Zhao teaches arrival at a maximum reward score using a user profile, i.e. dynamic preferences of a changing profile, and content-based filtering, i.e. prediction based on a user profile as well as items.)
a plurality of training predictive collaborative-based scores computed by the prediction model for each one of the plurality of training items; and (Zhao, sec. 1 and 1.2: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” “The Actor inputs the current state and aims to output the parameters of a state-specific scoring function. Then the RA scores all items and selects an item with the highest score.”)
outputting the at least one composite score. (Zhao, sec. 1: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” Examiner notes that Zhao teaches maximizing an overall cumulative reward wherein the reward is a composite score)
Zhao does not explicitly disclose: 
A computer-implemented method comprising: 
receiving a user profile having a plurality of user attribute values; 
However, Wilson teaches: 
A computer-implemented method comprising: (Wilson, para. 0366: “The apparatus is optionally implemented in a computer program product tangibly embodied in an information carrier, e.g., in a machine readable storage device or in a propagated signal, for execution by a programmable processor; and method steps are performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output.”)
receiving a user profile having a plurality of user attribute values; (Wilson, pg. 8, para. 0106: “The system 100 may access a user profile to collect data from the user profile such as other venues liked, gender, profession, or age.”).
computing at least one composite score according to a similarity between the user profile and a plurality of other user profiles by inputting the user profile and a plurality of items into a prediction model trained by (Wilson, pg. 6, para. 0087: “Nodes in the data network represent venues, venue properties, users, user properties, reviewers, reviewer properties, and the like. Links or links represent relations between those nodes. The number of links between two items might therefore grow as data on two items grows. The strength of each link denotes the affinity between the two connected items, such as similarity of star rating (in a review of a venue), number of attributes held in common. Links can be either positive or negative in sign.” Examiner notes that the broadest reasonable interpretation of “computing [a] score” means to calculate and keep a running account such as calculating the strength of a link between two users).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wilson into Zhao. Zhao teaches a recommender system using deep reinforcement learning, integrating concepts of Q-learning; Wilson teaches a recommender system as a network of interrelationships between users and venues (items). One of ordinary skill would have been motivated to combine the teachings of Wilson into Zhao in order to execute more holistic searches and to generate more timely and accurate recommendations (Wilson, para. 008).
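The scoring step quoted from Zhao for Claim 1 (“the RA scores all items and selects an item with the highest score”) can be illustrated with a minimal sketch. It assumes a linear scoring function (a dot product between item embeddings and a weight vector standing in for the Actor's output); the function name, data, and dimensions are hypothetical and this is not Zhao's Algorithm 2 verbatim.

```python
import numpy as np

def recommend_top_k(item_embeddings, weights, k):
    """Score every item with a linear, state-specific scoring function and
    return the indices of the k highest-scoring items (best first)."""
    scores = item_embeddings @ weights               # one score per item
    top_k = np.argsort(-scores, kind="stable")[:k]   # sort descending, keep k
    return top_k, scores

# Hypothetical data: 4 items with 2-dimensional embeddings.
items = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]])
# Stands in for the parameters the Actor would output for the current state.
w = np.array([0.3, 0.7])

top, scores = recommend_top_k(items, w, k=2)
print(top, scores)
```

With these illustrative numbers the third item scores highest, followed by the second.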

Regarding Claim 2, Zhao and Wilson teach the method of claim 1 (above). Wilson further teaches: 
The method of claim 1, further comprising, in each of a plurality of training iterations: (Zhao, sec. 2.5: “The training algorithm for the proposed framework DEV is presented in Algorithm 3. In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28)”)
receiving a training user profile of a plurality of training user profiles (Zhao, sec. 1.3: “More specifically, we build the simulator by users’ historical records. The intuition is no matter what algorithms a recommender system adopt, given the same state (or a user’s historical records) and the same action (recommending the same items to the user)”), the training user profile having training user attribute values of the plurality of training user attribute values; (Wilson, para. 0106: “The system 100 may access a user profile to collect data from the user profile such as other venues liked, gender, profession, or age”)
computing by the prediction model a plurality of predicted scores, each for one of the plurality of training items, in response to inputting to the prediction model the training user profile and the plurality of training items, (Zhao, sec. 2.3: “The positive items represent key information about users’ preferences, i.e., which items the users prefer to. Thus, we only consider them for state-specific scoring function… Note that it is straightforward to extend it with non-linear relations.” Examiner notes that predicted scores are scores for training items calculated by a prediction model such as taught by Zhao).
where each of the plurality of training items is described by training item properties of the plurality of training item properties, said plurality of training item properties are selected according to relevancy of a property type of said plurality of training item properties with respect to a respective domain of the respective training item, (Zhao, sec. 2.3, 2.5 and sec. 4: “In the previous section, we have defined the state s as the whole browsing history, which can be infinite and inefficient. A better way is to only consider the positive items, e.g., previous 10 clicked/ordered items. A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to.” “The training algorithm for the proposed framework DEV is presented in Algorithm 3. In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28). For transition generating stage (line 8): given the current state $s_t$, the RA first recommends a list of items $a_t = \{a_t^1, \ldots, a_t^K\}$ according to Algorithm 2 (line 9); then the agent observes the reward $r_t$ from simulator (line 10) and updates the state to $s_{t+1}$ (lines 11-17) following the same strategy in Algorithm 1;” “Another common approach is content based filtering[15], which tries to recommend items with similar properties to those that a user ordered in the past”. Examiner notes that the broadest definition of “domain” means a particular field of thought, activity, or interest such as the particular field of interest being the browsing history of a user as in Zhao; Examiner additionally notes that “training item properties” means a quality, attribute, or feature of a training item such as proposed in section 2.5 of Zhao where the training includes the inclusion and removal of an item to be recommended is based on the item and browser history. Zhao also teaches the use of content-based filtering where items with similar properties are selected i.e. where the broadest definition of “relevant” means closely connected such that content based filtering will leave only those items that are closely connected to that selected)
wherein, for at least one of the training iterations, the method is configured to process the user attribute values using a first set of values corresponding to a first domain and wherein, for at least one of the training iterations, the method is configured to process the user attribute values using a second set of values corresponding to a second domain; (Zhao, sec. 2.1 and 2.5: “State space S: A state $s_t = \{s_t^1, \ldots, s_t^N\} \in S$ is defined as the browsing history of a user, i.e., previous $N$ items that a user browsed before time $t$. The items in $s_t$ are sorted in chronological order … we present a state-specific scoring function” “The training algorithm for the proposed framework DEV is presented in Algorithm 3. In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28). For transition generating stage (line 8): given the current state $s_t$, the RA first recommends a list of items $a_t = \{a_t^1, \ldots, a_t^K\}$ according to Algorithm 2 (line 9); then the agent observes the reward $r_t$ from simulator (line 10) and updates the state to $s_{t+1}$ (lines 11-17) following the same strategy in Algorithm 1;” Examiner notes that Zhao teaches discrete states, $s_t^1$, which are defined as the “previous $N$ items that a user browsed before time $t$.” where each $s_t^1$ is a domain and Zhao teaches at least a first and second domain where the items browsed are the user attributes of the time domain at each time t).
reducing training time of a corresponding system and increasing a prediction accuracy of the prediction model by: (Zhao, sec. 3.2 and 5: “This result suggests that introducing reinforcement learning can improve the performance of recommendations…the training speed of LIRD is much faster than DQN” “we propose a list-wise recommendation framework, which can be applied in scenarios with large and dynamic item space and can reduce redundant computation significantly.” Examiner notes that Zhao teaches LIRD which results in better performance i.e. prediction accuracy than other known methods, as well as a faster training speed).
computing a plurality of expected scores for the training user profile, by applying a content based filtering method to the plurality of training user attribute values and the plurality of training item properties of the plurality of training items, (Zhao, sec. 2.3 and sec. 4: “In the previous section, we have defined the state s as the whole browsing history, which can be infinite and inefficient. A better way is to only consider the positive items, e.g., previous 10 clicked/ordered items. A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to.” “Another common approach is content based filtering[15], which tries to recommend items with similar properties to those that a user ordered in the past. Knowledge-based systems[1] recommend items based on specific domain knowledge about how certain item features meet users’ needs and preferences and how the item is useful for the user”. Examiner notes that Zhao teaches content-based filtering using item features i.e. properties as well as users’ needs i.e. user attributes) 
each of the plurality of expected scores is computed for one of the plurality of training items according to the plurality of training user attribute values and the plurality of training item properties of the training item; and (Zhao, sec. 1.3 and 2.3: “Thus we cannot get the feedbacks (rewards) of items that are not in users’ historical records. This may result in inconsistent results between offline and online measurements. Our proposed online environment simulator can also mitigate this challenge by producing simulated online rewards given any state-action pair, so that the recommender system can rate items from the whole item space.” “In the previous section, we have defined the state s as the whole browsing history, which can be infinite and inefficient. A better way is to only consider the positive items, e.g., previous 10 clicked/ordered items. A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to.” Examiner notes that the broadest reasonable interpretation of “scores” means any accounting including quantifying and rating items; examiner additionally notes that “training item properties” means a quality, attribute, or feature of a training item such as whether the items were clicked or ordered).
collecting at least one feedback value from at least one training user associated with at least one of the plurality of training user profiles, (Zhao sec. 2.1 and 3.2: “After the recommender agent takes an action $a_t$ at the state $s_t$, i.e., recommending a list of items to a user, the user browses these items and provides her feedback. She can skip (not click), click, or order these items, and the agent receives immediate reward $r(s_t, a_t)$ according to the user’s feedback.”) 
where the at least one feedback value is indicative of a level of agreement of the at least one training user with at least some of the plurality of predicted scores computed by the prediction model in response to the respective training user profile and the plurality of training items and updating at least one training user attribute value of the respective at least one training user profile according to the at least one feedback value (Zhao, sec. 2.1 and 2.3: “A state $s_t = \{s_t^1, \ldots, s_t^N\} \in S$ is defined as the browsing history of a user, i.e., previous N items that a user browsed before time … An action state $a_t = \{a_t^1, \ldots, a_t^K\} \in A$ is to recommend a list of items to a user at time t based on current state $s_t$, where K is the number of items the RA recommends to user each time … After the recommender agent takes an action $a_t$ at the state $s_t$, i.e., recommending a list of items to a user, the user browses these items and provides her feedback. She can skip (not click), click, or order these items, and the agent receives immediate reward $r(s_t, a_t)$ according to the user’s feedback … If user skips all the recommended items, then the next state $s_{t+1} = s_t$; while if the user clicks/orders part of items, then the next state $s_{t+1}$ updates” “A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to. Thus, we only consider them for state-specific scoring function…”) 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wilson into Zhao as set forth above with respect to claim 1.
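The state transition quoted from Zhao sec. 2.1 for Claim 2 (the state is unchanged when the user skips everything, and updates when items are clicked or ordered) can be sketched as follows. The sliding window over the last n positive items is an assumption drawn from the quoted “previous 10 clicked/ordered items”; the function name and feedback labels are hypothetical.

```python
def next_state(state, recommended, feedback, n):
    """Return the next state given user feedback on a recommended list.

    state: list of the most recent positive items, oldest first.
    feedback: one of "skip", "click", "order" per recommended item.
    ASSUMPTION: the state keeps only the last n positive items.
    """
    new_state = list(state)
    for item, fb in zip(recommended, feedback):
        if fb in ("click", "order"):   # positive feedback extends the history
            new_state.append(item)
    return new_state[-n:]              # keep the n most recent positive items

# All items skipped: the state is unchanged.
unchanged = next_state(["a", "b", "c"], ["x", "y"], ["skip", "skip"], n=3)
# One item clicked: it enters the state and the oldest item drops out.
updated = next_state(["a", "b", "c"], ["x", "y"], ["click", "skip"], n=3)
print(unchanged, updated)
```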

Regarding Claim 3, Zhao and Wilson teach the method of claim 2 (above). Zhao further teaches: 
the prediction model comprises at least one deep reinforcement learning (DRL) network (Zhao, sec. 1: “Thus, we leverage Deep Reinforcement Learning[10] with (adapted) artificial neural networks as the non-linear approximators to estimate the action-value function in RL”)

Regarding Claim 5, Zhao and Wilson teach the method of claim 2 (above). Zhao further teaches: 
wherein applying the content-based filtering method comprises providing the plurality of training user attribute values and the plurality of training item properties to at least one neural network. (Zhao, sec. 1 and sec. 4: “Thus, we leverage Deep Reinforcement Learning[10] with (adapted) artificial neural networks as the non-linear approximators to estimate the action-value function in RL” “Another common approach is content based filtering[15], which tries to recommend items with similar properties to those that a user ordered in the past”)
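For orientation, content-based filtering as described in the quoted Zhao passage (recommending items with properties similar to a user's past preferences) is often illustrated with a similarity score between a user vector and item property vectors. The sketch below uses plain cosine similarity, not the neural network recited in claim 5; the function name and the vector encoding are assumptions made for illustration.

```python
import numpy as np

def content_based_scores(user_attributes, item_properties):
    """Cosine similarity between one user attribute vector and each row of
    the item property matrix; higher means a closer match."""
    user = np.asarray(user_attributes, dtype=float)
    items = np.asarray(item_properties, dtype=float)
    norms = np.linalg.norm(items, axis=1) * np.linalg.norm(user)
    # Guard against zero vectors so the division never hits zero.
    return np.divide(items @ user, norms,
                     out=np.zeros(len(items)), where=norms > 0)

# Hypothetical encoding: the user prefers the first property only.
example = content_based_scores([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(example)
```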

Regarding Claim 6, Zhao and Wilson teach the method of claim 2 (above). Zhao further teaches: 
training the prediction model comprises using a Q-learning method having a state, a plurality of actions, a reward and an output (Zhao, sec. 1.1 and 1.2: “DQN[14] can calculate Q-values of all recalled items separately and recommend a list of items with highest Q-values. However, these approaches recommend items based on one same state, and ignore relationship among the recommended items. As a consequence, the recommended items are similar. In practice, a bundling with complementary items may receive higher rewards than recommending all similar items.” “Traditional deep Q-learning adopts the first architecture as shown in Fig.1(a), which inputs only the state space and outputs Q-values of all actions.” Examiner notes that the broadest reasonable interpretation of an “output” includes the result of any computed input such as a calculated q-value)
wherein the state is a vector of state values indicative of a plurality of training user attribute values of the training user profile (Zhao sec. 2.2: “$N_x$ is the size of users’ historical browsing history group that $r = U_x$, $\bar{s}_x$ and $\bar{a}_x$ are the average state vector and average action vector for $r = U_x$”. Examiner notes that the broadest reasonable interpretation of “indicative” means any indication at all of training user attributes, including users’ historical browsing)
wherein the plurality of actions is a plurality of vectors of item values, each vector of item values indicative of a respective plurality of training item properties of one of the plurality of training items (Zhao sec. 2.2: “$N_x$ is the size of users’ historical browsing history group that $r = U_x$. $\bar{s}_x$ and $\bar{a}_x$ are the average state vector and average action vector for $r = U$”. Examiner notes that the broadest reasonable interpretation of “indicative” means any indication at all of item values, including an action vector based on item values)
the reward is the plurality of expected content-based scores (Zhao, sec. 1: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” Examiner notes that for examination purposes only “the reward” refers to the reward initially set forth in claim 1 as the maximum reward for the system) 
wherein the output is the plurality of training predictive collaborative-based scores (Zhao, sec. 2.2: “According to collaborative filtering techniques, users with similar interests will make similar decisions on the same item. With this intuition, we match the current state and action to existing historical state-action pairs, and stochastically generate a simulated reward.” Examiner notes that the broadest reasonable interpretation of “predictive collaborative-based scores” means an output of a prediction algorithm iteration using collaborative-filtering techniques such as taught by Zhao).
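
For illustration only, the simulated-reward step quoted from Zhao sec. 2.2 above (matching the current state-action pair to historical pairs and stochastically drawing a reward) might be sketched as follows. The cosine similarity measure, the `alpha` weighting between state and action similarity, the memory layout, and all function names are assumptions made for this sketch; the quoted passages do not specify them.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_similarity(state, action, hist_state, hist_action, alpha=0.5):
    """Assumed similarity: weighted mix of state similarity and action similarity."""
    return alpha * cosine(state, hist_state) + (1 - alpha) * cosine(action, hist_action)


def simulate_reward(state, action, memory, rng=random, alpha=0.5):
    """Draw a reward from historical (state, action, reward) triples.

    Each historical reward is drawn with probability proportional to the
    similarity between its pair and the current pair (negative values clipped).
    """
    weights = [
        max(pair_similarity(state, action, s, a, alpha), 0.0) for s, a, _ in memory
    ]
    if sum(weights) == 0:
        # Nothing resembles the current pair: fall back to a uniform draw.
        return rng.choice(memory)[2]
    rewards = [r for _, _, r in memory]
    return rng.choices(rewards, weights=weights, k=1)[0]
```

Under these assumptions, a current pair identical to one historical pair and orthogonal to all others always receives that historical pair's reward.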

Regarding Claim 7, Zhao and Wilson teach the method of claim 2 (above). Zhao further teaches: 
wherein training the prediction model comprises using a Q-learning method having another state, another plurality of actions, another reward and another output (Zhao, sec. 2.2: “then we can map the current state-action pair $p_t$ to a reward according the above probability…we assume that $r_i$ is a reward list containing user’s feedbacks of the recommended items”. Examiner notes that Zhao’s “current-action pair” implies non-current action pairs i.e. other states and actions which are impliedly mapped to other rewards and probability outputs). 
Wherein the other state is a vector of state values indicative of another plurality of training user attribute values of the training user profile and another plurality of training item properties of the plurality of training items (Zhao, sec. 2.2: “$N_x$ is the size of users’ historical browsing history group that $r = U_x$. $\bar{s}_x$ and $\bar{a}_x$ are the average state vector and average action vector for $r = U$”. Examiner notes that the broadest reasonable interpretation of “indicative” means any indication at all, including a group of users’ historical browsing where an average state and action vectors gives some indication of associated attributes and properties within the users’ historical browsing history group).
wherein the another plurality of actions is another plurality of vectors of item values, each vector of item values indicative of a respective plurality of training item properties of one of the plurality of training items (Zhao, sec. 2.2: “$N_x$ is the size of users’ historical browsing history group that $r = U_x$. $\bar{s}_x$ and $\bar{a}_x$ are the average state vector and average action vector for $r = U$”. Examiner notes that the broadest reasonable interpretation of “indicative” means any indication at all, including a group of users’ historical browsing where an average state and action vectors gives some indication of associated attributes and properties within the users’ historical browsing history group).
the reward is one of the plurality of expected content-based scores; and (Zhao sec. 2.2: “Thus if the $p_t$ is mapped to $U_x$ we calculate the overall reward $r_t$ of the whole recommended list.” Examiner notes that the broadest reasonable interpretation of “scores” means an accounting such as here where the reward is calculated using the plurality of accountings of each item on the list).
wherein the output is a predicted score computed for one of the plurality of training user profiles and one of the plurality of training items in at least one of the plurality of training iterations. (Zhao sec. 2.2 and 2.5 and Algorithm 3: “Thus if the $p_t$ is mapped to $U_x$ we calculate the overall reward $r_t$ of the whole recommended list…The intuition of Eq.(4) is that reward in the top of recommended list has a higher contribution to the overall rewards, which force RA arranging items that user may order in the top of the recommended list.” “The training algorithm for the proposed framework DEV is presented in Algorithm 3. In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28)”. Examiner notes that algorithm 3 illustrates a predicted score for each item given a user state in order to arrange the items vis a vis the user).
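
For illustration only, the intuition quoted above from Zhao (a reward near the top of the recommended list contributes more to the overall reward) can be sketched with a geometric position discount. The geometric form, the parameter name `position_discount`, and its default value are assumptions made for this sketch; Eq. (4) itself is not reproduced in the quoted passages.

```python
def overall_list_reward(item_rewards, position_discount=0.5):
    """Combine per-item rewards into one list-level reward.

    item_rewards[0] is the top of the recommended list. Each later position
    is multiplied by one more factor of position_discount (assumed in (0, 1]),
    so feedback on higher-ranked items counts for more.
    """
    return sum(
        (position_discount ** position) * reward
        for position, reward in enumerate(item_rewards)
    )
```

With these assumed values, a single positive reward at the top of a two-item list yields a larger overall reward than the same reward at the bottom.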

Regarding Claim 9, Zhao and Wilson teach the method of claim 2 (above). Wilson further teaches: 
at least one of the plurality of items is selected from a group of items consisting of: a restaurant identifier, a hospitality facility identifier, a movie identifier, a book identifier, a consumer appliance identifier, a retailer identifier, and a venue identifier (Wilson, pg. 39, para. 0334: “Turning next to matching objects to content pages, whenever the system is gathering data from target websites on an object of interest, the system should ensure that the data on the target site is actually referring to the object of interest. This is especially true when attempting to cross-reference objects across different sites. The system optionally utilizes a “likelihood of match” score to make this determination, taking into account multiple variables. For example, if the system is trying to match a venue on two different sites, the fact that they have the same phone number or address may tend to indicate that they are the same venue. Numeric identifiers on consistent scales are particularly valuable for this purpose, such as phone numbers, UPC symbols, and latitude/longitude.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wilson into Zhao as set forth above with respect to claim 1. 

Regarding Claim 10, Zhao and Wilson teach the method of claim 2 (above). Zhao further teaches: 
The method of claim 2, wherein outputting the at least one composite score comprises outputting for each of the at least one composite score a respective item of the at least one item (Zhao, sec. 2.3: “Then after computing scores of all items, the RA selects an item with highest score as the sub-action $a_k^t$ of action at $a_t$.”) 

Regarding Claim 11, Zhao and Wilson teach the method of claim 2 (above). Zhao further teaches: 
wherein inputting the user profile and the plurality of items into the prediction model comprises computing at least one set of state values indicative of the plurality of user attribute values and a plurality of item properties of the plurality of items (Zhao, sec. 2.1: “State space S: A state $s_t = \{s_t^1, \ldots, s_t^N\} \in S$ is defined as the browsing history of a user, i.e., previous $N$ items that a user browsed before time $t$. The items in $s_t$ are sorted in chronological order we present a state-specific scoring function”). 

Regarding Claim 12, Zhao and Wilson teach the method of claim 2 (above). Wilson further teaches: 
computing the at least one composite score further comprises: computing at least one other score, each computed for one of the plurality of items according to the plurality of user attribute values and the respective plurality of item properties of the respective item (Zhao, sec. 1: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” Examiner notes that Zhao teaches computing an intermediate score (i.e. at least one other score) according to the interplay between the user and the item). 
aggregating the at least one composite score with the at least one other score (Zhao, sec. 1: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” Examiner notes that Zhao teaches aggregating the score by taking such score into consideration to maximize the long term reward).

Regarding Claim 13, Zhao and Wilson teach the method of claim 12 (above). Zhao further teaches: 
wherein computing the at least one other score comprises applying the content-based filtering method to the plurality of user attribute values and the plurality of item properties of the plurality of items (Zhao, sec. 4: “Another common approach is content based filtering[15], which tries to recommend items with similar properties to those that a user ordered in the past.”) 

Regarding Claim 14, Zhao and Wilson teach the method of claim 2 (above). Zhao further teaches: 
computing the at least one score further comprises: identifying at least one highest score of the at least one composite score (Zhao, sec. 1 and 1.2: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” “The Actor inputs the current state and aims to output the parameters of a state-specific scoring function. Then the RA scores all items and selects an item with the highest score.”)
outputting the at least one highest score (Zhao, sec. 2.3: “For each weight vector, the RA scores all items in the item space (line 3), selects the item with highest score (line 4), and then adds this item at the end of the recommendation list.” Examiner notes that the broadest reasonable interpretation of “outputting” means to produce, deliver, or supply such as here where the item with the highest score is produced for selection).
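
For illustration only, the list-building step quoted from Zhao sec. 2.3 above (for each weight vector, score all items, select the highest-scoring item, and append it to the recommendation list) might be sketched as follows. The dot-product scoring function, the removal of a selected item from the candidate pool, and all names are assumptions made for this sketch.

```python
def build_recommendation_list(weight_vectors, item_embeddings):
    """Greedily build a recommendation list, one item per weight vector.

    weight_vectors: list of weight vectors (assumed to come from the Actor).
    item_embeddings: dict mapping item id -> embedding vector.
    Returns a list of (item_id, score) in recommendation order.
    """
    remaining = dict(item_embeddings)  # copy so the caller's dict is untouched
    recommended = []
    for weights in weight_vectors:
        if not remaining:
            break
        # Assumed scoring function: dot product of weight vector and embedding.
        scores = {
            item_id: sum(w * e for w, e in zip(weights, embedding))
            for item_id, embedding in remaining.items()
        }
        best = max(scores, key=scores.get)
        recommended.append((best, scores[best]))
        # Assumption: an item is recommended at most once per list.
        del remaining[best]
    return recommended
```

Returning the score together with the item identifier mirrors the claim language, in which both a highest score and its respective item are output.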

Regarding Claim 15, Zhao and Wilson teach the method of claim 2 (above). Wilson further teaches: 
computing the at least one composite score further comprises: computing at least one filtered score by applying at least one test to the at least one composite score (Wilson, pg. 20, para. 0204: “if there is a large number of recommendations available based on the filter set and the only issue is that none of the recommendations of the set exceed the recommendation threshold, a minimal normalization factor may be utilized to normalize the recommendation set such that a limited amount of recommendations exceed the recommendation threshold”. Examiner notes that a filter set comprises at least one filtered score and that the presence of a recommendation threshold implies that a score or other running accounting is computed to ensure such threshold is met)
outputting the at least one filtered score. (Wilson, pg. 20, para. 0204: “if there is a large number of recommendations available based on the filter set and the only issue is that none of the recommendations of the set exceed the recommendation threshold, a minimal normalization factor may be utilized to normalize the recommendation set such that a limited amount of recommendations exceed the recommendation threshold”. Examiner notes that the broadest reasonable interpretation of outputting means to produce, deliver, or supply such as supplying a recommendation score to the system to compare against a recommendation threshold). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wilson into Zhao as set forth above with respect to claim 1. 

Regarding Claim 16, Zhao and Wilson teach the method of claim 2 (above). Wilson further teaches: 
computing the at least one composite score further comprises: computing at least one collaborative filtering score, each computed for one of the plurality of items according to another similarity between the plurality of user attribute values and the other plurality of user attribute values of the plurality of other user profiles, by applying at least one matrix factorization method to the plurality of item properties, the plurality of user attribute values and the other plurality of user attribute values (Wilson, pg. 3, paras. 0060, 0065, 0066: “The system may continuously analyze the data to add positive or negative collaborative links, content links, or content-collaborative links” “The recommendation engine 112 accesses the matrices of interrelationships and generates the recommendations according to the techniques described herein.” “The matrix builder also incorporates venue, reviewer and user data 124 collected from users 108, venues 104 and other web pages”). 
aggregating the at least one composite score with the at least one collaborative filtering score. (Wilson, pg. 3, para. 0061: “The system may provide a plurality of recommendations based overall link strengths that factor in collaborative and content-based interrelationships” Examiner notes that the broadest reasonable interpretation of aggregating means bringing together such as here in Wilson where the overall link strengths factor in collaborative filtering). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wilson into Zhao as set forth above with respect to claim 1.

Regarding Claim 17, Zhao teaches a system: 
modifying at least one model value of the prediction model to maximize a reward score, (Zhao sec. 2.1, 2.5 and algorithm 3: “We study the recommendation task in which a recommender agent (RA) interacts with environment E (or users) by sequentially choosing recommendation items over a sequence of time steps, so as to maximize its cumulative reward.” “In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28).” Examiner notes that Zhao refers to algorithm 3 where the code calls for modification of the parameters (i.e. model values) to arrive at a maximum cumulative reward). 
the reward score computed using a plurality of expected content-based scores computed by applying a content-based filtering method to a plurality of training user attribute values and a plurality of training item properties of a plurality of training items; and (Zhao, sec. 1: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” Examiner notes that Zhao teaches arrival at a maximum reward score using a user profile i.e. dynamic preferences of a changing profile and content-based filtering i.e. prediction based on a user profile as well as items.)
a plurality of training predictive collaborative-based scores computed by the prediction model for each one of the plurality of training items; and (Zhao, sec. 1 and 1.2: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” “The Actor inputs the current state and aims to output the parameters of a state-specific scoring function. Then the RA scores all items and selects an item with the highest score.”)
outputting the at least one composite score (Zhao, sec. 1: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” Examiner notes that Zhao teaches maximizing an overall cumulative reward wherein the reward is a composite score)
Zhao does not explicitly disclose: 
comprising at least one hardware processor adapted to perform operations comprising 
receiving a user profile having a plurality of user attribute values; 
computing at least one composite score according to a similarity between the user profile and a plurality of other user profiles by inputting the user profile and a plurality of items into a prediction model trained by 
However, Wilson teaches: 
comprising at least one hardware processor adapted to perform operations comprising (Wilson, para. 0366: “The apparatus is optionally implemented in a computer program product tangibly embodied in an information carrier, e.g., in a machine readable storage device or in a propagated signal, for execution by a programmable processor; and method steps are performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output.”)
receiving a user profile having a plurality of user attribute values; (Wilson, pg. 8, para. 0106: “The system 100 may access a user profile to collect data from the user profile such as other venues liked, gender, profession, or age.”).
computing at least one composite score according to a similarity between the user profile and a plurality of other user profiles by inputting the user profile and a plurality of items into a prediction model trained by (Wilson, pg. 6, para. 0087: “Nodes in the data network represent venues, venue properties, users, user properties, reviewers, reviewer properties, and the like. Links or links represent relations between those nodes. The number of links between two items might therefore grow as data on two items grows. The strength of each link denotes the affinity between the two connected items, such as similarity of star rating (in a review of a venue), number of attributes held in common. Links can be either positive or negative in sign.” Examiner notes that the broadest reasonable interpretation of “computing [a] score” means to calculate and keep a running account such as calculating the strength of a link between two users).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wilson into Zhao as set forth above with respect to claim 1.

Regarding Claim 18, Zhao and Wilson teach the system of claim 17 (above). Wilson further teaches: 
wherein the at least one hardware processor is adapted to perform the outputting of the at least one composite score via at least one digital communication network interface connected to the at least one hardware processor (Wilson, pgs. 42-43, paras. 0366, 0370: “The apparatus is optionally implemented in a computer program product tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by a programmable processor; and method steps are performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output” “The server functionality described above is optionally implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them. The components of the system are connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, e.g., a LAN, a WAN, and the computers and networks forming the Internet”. Examiner notes that the broadest reasonable interpretation of “adapted to” means to make suitable for use, irrespective of whether actually used)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wilson into Zhao as set forth above with respect to claim 1.

Regarding Claim 19, Zhao and Wilson teach the system of claim 17 (above). Wilson further teaches: 
wherein the at least one hardware processor is adapted to perform the receiving the user profile by at least one of: (Wilson, para. 0366: “The apparatus is optionally implemented in a computer program product tangibly embodied in an information carrier, e.g., in a machine readable storage device or in a propagated signal, for execution by a programmable processor; and method steps are performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output.”)
receiving the user profile via at least one digital communication network interface connected to the at least one hardware processor, and retrieving the user profile from at least one non-volatile digital storage connected to the at least one hardware processor (Wilson, pg. 42-43, paras. 0366, 0370: “The apparatus is optionally implemented in a computer program product tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by a programmable processor; and method steps are performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output” “The server functionality described above is optionally implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them. The components of the system are connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, e.g., a LAN, a WAN, and the computers and networks forming the Internet”. Examiner notes that the broadest reasonable interpretation of “adapted to” means to make suitable for use, irrespective of whether actually used)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wilson into Zhao as set forth above with respect to claim 1.

Regarding Claim 20, Zhao teaches a system:  
modifying at least one model value of the prediction model to maximize a reward score, (Zhao, sec. 2.1 and 2.2: “Figure 2 illustrates the agent-user interactions in MDP. By interacting with the environment (users), recommender agent takes actions (recommends items) to users in such a way that maximizes the expected return, which includes the delayed rewards. We follow the standard assumption that delayed rewards are discounted by a factor of γ per time-step”. “With this intuition, we match the current state and action to existing historical state-action pairs, and stochastically generate a simulated reward…then we calculated the similarity of the current state-action pair, say $p_t(s_t, a_t)$, to each existing historical state-action pair in the memory…then we can map the current state-action pair $p_t$ to a reward according the above probability.” Examiner notes that Zhao provides a discount factor in its model which changes per time-step i.e. modified).
the reward score computed using a plurality of expected content-based scores computed by applying a content-based filtering method to a plurality of training user attribute values and a plurality of training item properties of a plurality of training items; and (Zhao, sec. 1: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” Examiner notes that Zhao teaches arrival at a maximum reward score using a user profile i.e. dynamic preferences of a changing profile and content-based filtering i.e. prediction based on a user profile as well as items.)
a plurality of training predictive collaborative-based scores computed by the prediction model for each one of the plurality of training items; and (Zhao, sec. 1 and 1.2: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” “The Actor inputs the current state and aims to output the parameters of a state-specific scoring function. Then the RA scores all items and selects an item with the highest score.”)
output the at least one composite score. (Zhao, sec. 1: “First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.” Examiner notes that Zhao teaches maximizing an overall cumulative reward wherein the reward is a composite score)
Zhao does not explicitly disclose: 
A system comprising at least one hardware processor comprising: 
receive a user profile having a plurality of user attribute values; 
compute at least one composite score according to a similarity between the user profile and a plurality of other user profiles by inputting the user profile and a plurality of items into a prediction model trained by 
However, Wilson teaches: 
A system comprising at least one hardware processor comprising: (Wilson, para. 0366: “The apparatus is optionally implemented in a computer program product tangibly embodied in an information carrier, e.g., in a machine readable storage device or in a propagated signal, for execution by a programmable processor; and method steps are performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output.”)
receive a user profile having a plurality of user attribute values; (Wilson, pg. 8, para. 0106: “The system 100 may access a user profile to collect data from the user profile such as other venues liked, gender, profession, or age.”).
compute at least one composite score according to a similarity between the user profile and a plurality of other user profiles by inputting the user profile and a plurality of items into a prediction model trained by (Wilson, pg. 6, para. 0087: “Nodes in the data network represent venues, venue properties, users, user properties, reviewers, reviewer properties, and the like. Links or links represent relations between those nodes. The number of links between two items might therefore grow as data on two items grows. The strength of each link denotes the affinity between the two connected items, such as similarity of star rating (in a review of a venue), number of attributes held in common. Links can be either positive or negative in sign.” Examiner notes that the broadest reasonable interpretation of “computing [a] score” means to calculate and keep a running account such as calculating the strength of a link between two users).
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wilson into Zhao as set forth above with respect to claim 1.

Regarding Claim 21, Zhao, as modified, teaches claim 17 above. Zhao further teaches:
the operations further comprising, in each of a plurality of training iterations: (Zhao, sec. 2.5: “The training algorithm for the proposed framework DEV is presented in Algorithm 3. In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28)”)
receiving a training user profile of a plurality of training user profiles, (Zhao, sec. 1.3: “More specifically, we build the simulator by users’ historical records. The intuition is no matter what algorithms a recommender system adopt, given the same state (or a user’s historical records) and the same action (recommending the same items to the user)”), the training user profile having training user attribute values of the plurality of training user attribute values; (Wilson, para. 0106: “The system 100 may access a user profile to collect data from the user profile such as other venues liked, gender, profession, or age”)
computing by the prediction model a plurality of predicted scores, each for one of the plurality of training items, in response to inputting to the prediction model the training user profile and the plurality of training items, (Zhao, sec. 2.3: “The positive items represent key information about users’ preferences, i.e., which items the users prefer to. Thus, we only consider them for state-specific scoring function… Note that it is straightforward to extend it with non-linear relations.” Examiner notes that predicted scores are scores for training items calculated by a prediction model such as taught by Zhao).
where each of the plurality of training items is described by training item properties of the plurality of training item properties, said plurality of training item properties are selected according to relevancy of a property type of said plurality of training item properties with respect to a respective domain of the respective training item, (Zhao, sec. 2.3, 2.5 and sec. 4: “In the previous section, we have defined the state s as the whole browsing history, which can be infinite and inefficient. A better way is to only consider the positive items, e.g., previous 10 clicked/ordered items. A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to.” “The training algorithm for the proposed framework DEV is presented in Algorithm 3. In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28). For transition generating stage (line 8): given the current state $s_t$, the RA first recommends a list of items $a_t = \{a_t^1, \dots, a_t^K\}$ according to Algorithm 2 (line 9); then the agent observes the reward $r_t$ from simulator (line 10) and updates the state to $s_{t+1}$ (lines 11-17) following the same strategy in Algorithm 1;” “Another common approach is content based filtering[15], which tries to recommend items with similar properties to those that a user ordered in the past”. Examiner notes that the broadest definition of “domain” means a particular field of thought, activity, or interest such as the particular field of interest being the browsing history of a user as in Zhao; Examiner additionally notes that “training item properties” means a quality, attribute, or feature of a training item such as proposed in section 2.5 of Zhao, where the inclusion and removal of an item to be recommended during training is based on the item and browsing history. Zhao also teaches the use of content-based filtering where items with similar properties are selected i.e. where the broadest definition of “relevant” means closely connected such that content based filtering will leave only those items that are closely connected to that selected)
wherein, for at least one of the training iterations, the method is configured to process the user attribute values using a first set of values corresponding to a first domain and wherein, for at least one of the training iterations, the method is configured to process the user attribute values using a second set of values corresponding to a second domain; (Zhao, sec. 2.1 and 2.5: “State space S: A state $s_t = \{s_t^1, \dots, s_t^N\} \in S$ is defined as the browsing history of a user, i.e., previous $N$ items that a user browsed before time $t$. The items in $s_t$ are sorted in chronological order… we present a state-specific scoring function” “The training algorithm for the proposed framework DEV is presented in Algorithm 3. In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28). For transition generating stage (line 8): given the current state $s_t$, the RA first recommends a list of items $a_t = \{a_t^1, \dots, a_t^K\}$ according to Algorithm 2 (line 9); then the agent observes the reward $r_t$ from simulator (line 10) and updates the state to $s_{t+1}$ (lines 11-17) following the same strategy in Algorithm 1;” Examiner notes that Zhao teaches discrete states, $s_t^1$, which are defined as the “previous $N$ items that a user browsed before time $t$.” where each $s_t^1$ is a domain and Zhao teaches at least a first and second domain where the items browsed are the user attributes of the time domain at each time $t$.).
reducing training time of a corresponding system and increasing a prediction accuracy of the prediction model by: (Zhao, sec. 3.2 and 5: “This result suggests that introducing reinforcement learning can improve the performance of recommendations…the training speed of LIRD is much faster than DQN” “we propose a list-wise recommendation framework, which can be applied in scenarios with large and dynamic item space and can reduce redundant computation significantly.” Examiner notes that Zhao teaches LIRD which results in better performance i.e. prediction accuracy than other known methods, as well as a faster training speed)
computing a plurality of expected scores for the training user profile, by applying a content based filtering method to the plurality of training user attribute values and the plurality of training item properties of the plurality of training items, (Zhao, sec. 2.3 and sec. 4: “In the previous section, we have defined the state s as the whole browsing history, which can be infinite and inefficient. A better way is to only consider the positive items, e.g., previous 10 clicked/ordered items. A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to.” “Another common approach is content based filtering[15], which tries to recommend items with similar properties to those that a user ordered in the past. Knowledge-based systems[1] recommend items based on specific domain knowledge about how certain item features meet users’ needs and preferences and how the item is useful for the user”. Examiner notes that Zhao teaches content-based filtering using item features i.e. properties as well as users’ needs i.e. user attributes)
each of the plurality of expected scores is computed for one of the plurality of training items according to the plurality of training user attribute values and the plurality of training item properties of the training item; and (Zhao, sec. 1.3 and 2.3: “Thus we cannot get the feedbacks (rewards) of items that are not in users’ historical records. This may result in inconsistent results between offline and online measurements. Our proposed online environment simulator can also mitigate this challenge by producing simulated online rewards given any state-action pair, so that the recommender system can rate items from the whole item space.” “In the previous section, we have defined the state s as the whole browsing history, which can be infinite and inefficient. A better way is to only consider the positive items, e.g., previous 10 clicked/ordered items. A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to.” Examiner notes that the broadest reasonable interpretation of “scores” means any accounting including quantifying and rating items; examiner additionally notes that “training item properties” means a quality, attribute, or feature of a training item such as whether the items were clicked or ordered)
collecting at least one feedback value from at least one training user associated with at least one of the plurality of training user profiles, (Zhao, sec. 2.1 and 3.2: “After the recommender agent takes an action $a_t$ at the state $s_t$, i.e., recommending a list of items to a user, the user browses these items and provides her feedback. She can skip (not click), click, or order these items, and the agent receives immediate reward $r(s_t, a_t)$ according to the user’s feedback.”)
where the at least one feedback value is indicative of a level of agreement of the at least one training user with at least some of the plurality of predicted scores computed by the prediction model in response to the respective training user profile and the plurality of training items and updating at least one training user attribute value of the respective at least one training user profile according to the at least one feedback value. (Zhao, sec. 2.1 and 2.3: “A state $s_t = \{s_t^1, \dots, s_t^N\} \in S$ is defined as the browsing history of a user, i.e., previous N items that a user browsed before time …An action state $a_t = \{a_t^1, \dots, a_t^K\} \in A$ is to recommend a list of items to a user at time t based on current state $s_t$, where K is the number of items the RA recommends to user each time…After the recommender agent takes an action $a_t$ at the state $s_t$, i.e., recommending a list of items to a user, the user browses these items and provides her feedback. She can skip (not click), click, or order these items, and the agent receives immediate reward $r(s_t, a_t)$ according to the user’s feedback…If user skips all the recommended items, then the next state $s_{t+1} = s_t$; while if the user clicks/orders part of items, then the next state $s_{t+1}$ updates” “A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to. Thus, we only consider them for state-specific scoring function…”)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wilson into Zhao as set forth above with respect to claim 1.
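For reference, the state transition described in the passages of Zhao quoted for claim 21 (the state is unchanged when the user skips every recommended item and is updated when the user clicks or orders some of them) can be illustrated with a minimal sketch. The function name `next_state`, the feedback labels, and the sliding-window rule (append each clicked or ordered item and drop the oldest so that at most N items remain) are assumptions made for illustration only; the quoted text states only that the next state "updates" and does not spell out the rule.

```python
from collections import deque


def next_state(state, recommended, feedback, n):
    """Return s_{t+1} given s_t, the recommended list a_t, and per-item feedback.

    Hypothetical helper: the window rule below is an assumption, not a quote
    from Zhao. `feedback` holds one of "skip", "click", or "order" per item.
    """
    # Keep at most n items; appending to a full deque drops the oldest item.
    window = deque(state, maxlen=n)
    for item, fb in zip(recommended, feedback):
        if fb in ("click", "order"):
            window.append(item)
    # If every item was skipped, nothing was appended and s_{t+1} == s_t.
    return list(window)
```

Under these assumptions, a call with all feedback equal to "skip" returns the input state unchanged, while a click on one recommended item shifts the window by one position.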

Regarding Claim 22, Zhao, as modified, teaches claim 20 above. Zhao further teaches:
the at least one hardware processor further adapted to, in each of a plurality of training iterations: (Wilson, para. 0366: “The apparatus is optionally implemented in a computer program product tangibly embodied in an information carrier, e.g., in a machine readable storage device or in a propagated signal, for execution by a programmable processor; and method steps are performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output.”)
receive a training user profile of a plurality of training user profiles, (Zhao, sec. 1.3: “More specifically, we build the simulator by users’ historical records. The intuition is no matter what algorithms a recommender system adopt, given the same state (or a user’s historical records) and the same action (recommending the same items to the user)”), the training user profile having training user attribute values of the plurality of training user attribute values; (Wilson, para. 0106: “The system 100 may access a user profile to collect data from the user profile such as other venues liked, gender, profession, or age”)
compute by the prediction model a plurality of predicted scores, each for one of the plurality of training items, in response to inputting to the prediction model the training user profile and the plurality of training items, (Zhao, sec. 2.3: “The positive items represent key information about users’ preferences, i.e., which items the users prefer to. Thus, we only consider them for state-specific scoring function… Note that it is straightforward to extend it with non-linear relations.” Examiner notes that predicted scores are scores for training items calculated by a prediction model such as taught by Zhao).
where each of the plurality of training items is described by training item properties of the plurality of training item properties, said plurality of training item properties are selected according to relevancy of a property type of said plurality of training item properties with respect to a respective domain of the respective training item, (Zhao, sec. 2.3, 2.5 and sec. 4: “In the previous section, we have defined the state s as the whole browsing history, which can be infinite and inefficient. A better way is to only consider the positive items, e.g., previous 10 clicked/ordered items. A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to.” “The training algorithm for the proposed framework DEV is presented in Algorithm 3. In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28). For transition generating stage (line 8): given the current state $s_t$, the RA first recommends a list of items $a_t = \{a_t^1, \dots, a_t^K\}$ according to Algorithm 2 (line 9); then the agent observes the reward $r_t$ from simulator (line 10) and updates the state to $s_{t+1}$ (lines 11-17) following the same strategy in Algorithm 1;” “Another common approach is content based filtering[15], which tries to recommend items with similar properties to those that a user ordered in the past”. Examiner notes that the broadest definition of “domain” means a particular field of thought, activity, or interest such as the particular field of interest being the browsing history of a user as in Zhao; Examiner additionally notes that “training item properties” means a quality, attribute, or feature of a training item such as proposed in section 2.5 of Zhao, where the inclusion and removal of an item to be recommended during training is based on the item and browsing history. Zhao also teaches the use of content-based filtering where items with similar properties are selected i.e. where the broadest definition of “relevant” means closely connected such that content based filtering will leave only those items that are closely connected to that selected)
wherein, for at least one of the training iterations, the method is configured to process the user attribute values using a first set of values corresponding to a first domain and wherein, for at least one of the training iterations, the method is configured to process the user attribute values using a second set of values corresponding to a second domain; (Zhao, sec. 2.1 and 2.5: “State space S: A state s_t = {s_t^1, …, s_t^N} ∈ S is defined as the browsing history of a user, i.e., previous N items that a user browsed before time t. The items in s_t are sorted in chronological order we present a state-specific scoring function” “The training algorithm for the proposed framework DEV is presented in Algorithm 3. In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28). For transition generating stage (line 8): given the current state s_t, the RA first recommends a list of items a_t = {a_t^1, …, a_t^K} according to Algorithm 2 (line 9); then the agent observes the reward r_t from simulator (line 10) and updates the state to s_{t+1} (lines 11-17) following the same strategy in Algorithm 1;” Examiner notes that Zhao teaches discrete states, s_t^1, which are defined as the “previous N items that a user browsed before time t,” where each s_t^1 is a domain, and Zhao teaches at least a first and second domain where the items browsed are the user attributes of the time domain at each time t.).
reduce training time of a corresponding system and increasing a prediction accuracy of the prediction model by: (Zhao, sec. 3.2 and 5: “This result suggests that introducing reinforcement learning can improve the performance of recommendations…the training speed of LIRD is much faster than DQN” “we propose a list-wise recommendation framework, which can be applied in scenarios with large and dynamic item space and can reduce redundant computation significantly.” Examiner notes that Zhao teaches LIRD, which results in better performance, i.e., prediction accuracy, than other known methods, as well as a faster training speed).
computing a plurality of expected scores for the training user profile, by applying a content based filtering method to the plurality of training user attribute values and the plurality of training item properties of the plurality of training items, (Zhao, sec. 2.3 and sec. 4: “In the previous section, we have defined the state s as the whole browsing history, which can be infinite and inefficient. A better way is to only consider the positive items, e.g., previous 10 clicked/ordered items. A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to.” “Another common approach is content based filtering[15], which tries to recommend items with similar properties to those that a user ordered in the past. Knowledge-based systems[1] recommend items based on specific domain knowledge about how certain item features meet users’ needs and preferences and how the item is useful for the user”. Examiner notes that Zhao teaches content-based filtering using item features i.e. properties as well as users’ needs i.e. user attributes)
each of the plurality of expected scores is computed for one of the plurality of training items according to the plurality of training user attribute values and the plurality of training item properties of the training item; and (Zhao, sec. 1.3 and 2.3: “Thus we cannot get the feedbacks (rewards) of items that are not in users’ historical records. This may result in inconsistent results between offline and online measurements. Our proposed online environment simulator can also mitigate this challenge by producing simulated online rewards given any state-action pair, so that the recommender system can rate items from the whole item space.” “In the previous section, we have defined the state s as the whole browsing history, which can be infinite and inefficient. A better way is to only consider the positive items, e.g., previous 10 clicked/ordered items. A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to.” Examiner notes that the broadest reasonable interpretation of “scores” means any accounting, including quantifying and rating items; Examiner additionally notes that “training item properties” means a quality, attribute, or feature of a training item, such as whether the items were clicked or ordered).
collecting at least one feedback value from at least one training user associated with at least one of the plurality of training user profiles, (Zhao sec. 2.1 and 3.2: “After the recommender agent takes an action a_t at the state s_t, i.e., recommending a list of items to a user, the user browses these items and provides her feedback. She can skip (not click), click, or order these items, and the agent receives immediate reward r(s_t, a_t) according to the user’s feedback.”)
where the at least one feedback value is indicative of a level of agreement of the at least one training user with at least some of the plurality of predicted scores computed by the prediction model in response to the respective training user profile and the plurality of training items and updating at least one training user attribute value of the respective at least one training user profile according to the at least one feedback value. (Zhao, sec. 2.1 and 2.3: “A state s_t = {s_t^1, ..., s_t^N} ∈ S is defined as the browsing history of a user, i.e., previous N items that a user browsed before time …An action state a_t = {a_t^1, ..., a_t^K} ∈ A is to recommend a list of items to a user at time t based on current state s_t, where K is the number of items the RA recommends to user each time…After the recommender agent takes an action a_t at the state s_t, i.e., recommending a list of items to a user, the user browses these items and provides her feedback. She can skip (not click), click, or order these items, and the agent receives immediate reward r(s_t, a_t) according to the user’s feedback…If user skips all the recommended items, then the next state s_{t+1} = s_t; while if the user clicks/orders part of items, then the next state s_{t+1} updates” “A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to. Thus, we only consider them for state-specific scoring function…”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wilson into Zhao as set forth above with respect to claim 1.
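
For readers following the Zhao mapping, the state transition quoted in the rejection (the state is the last N browsed items; skipping leaves the state unchanged; clicking or ordering updates it) can be illustrated with a minimal Python sketch. This is only an illustration and is not part of the Office Action or of Zhao's published algorithms: the function name `step`, the reward values, and the choice to append clicked or ordered items to a fixed-length window are all assumptions made here for clarity.

```python
from collections import deque
from typing import List, Tuple

# Hypothetical reward values; the quoted passages give no numbers.
REWARD = {"skip": 0.0, "click": 1.0, "order": 5.0}


def step(state: List[str], recommended: List[str],
         feedback: List[str], n: int) -> Tuple[List[str], float]:
    """Return (next_state, reward) for one recommendation round.

    state       -- the user's last n items, oldest first (s_t)
    recommended -- the K recommended items (a_t)
    feedback    -- one of "skip", "click", "order" per recommended item
    """
    # A bounded deque drops the oldest item when a new one is appended.
    next_state = deque(state, maxlen=n)
    reward = 0.0
    for item, fb in zip(recommended, feedback):
        reward += REWARD[fb]
        if fb != "skip":
            # Only clicked/ordered items change the state.
            next_state.append(item)
    return list(next_state), reward


# If every item is skipped, s_{t+1} equals s_t and the reward is zero.
print(step(["i1", "i2", "i3"], ["i4", "i5"], ["skip", "skip"], 3))
# A click pushes the clicked item into the window.
print(step(["i1", "i2", "i3"], ["i4", "i5"], ["click", "skip"], 3))
```

The sketch shows why the Examiner reads the feedback as updating the user-side values: the only thing that changes between iterations is the browsing window that serves as the state.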

Response to Applicant Remarks/Argument

Applicant's arguments filed 02 Sep 2025 have been fully considered but they are not persuasive.

35 U.S.C. §103
On page 11 of applicant’s remarks/arguments, applicant argues that Zhao does not teach use of all four of the following to compute the reward score: (1) expected content-based scores; (2) training user attribute values; (3) training items; and (4) training item properties of a plurality of training items. Examiner referenced Zhao sec 1 which recites: 
First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.

(emphasis added). The rejection included a note that Zhao teaches arriving at a maximum reward score using a user profile, i.e., dynamic preferences of a changing profile, and content-based filtering, i.e., prediction based on a user profile as well as items. All four are taught by Zhao, as bolded above, where: (1) expected content-based scores = recommendations best fitting; (2) training user attribute values = users’ dynamic preferences; (3) training items = system converges to…identify the item (i.e., of a plurality of items rather than “an item”); (4) training item properties = a small immediate reward but making big contribution to the rewards for future (i.e., the properties of the item resulting in different rewards).
Applicant further argues that Zhao does not disclose a reward score that is maximized using both (1) content-based scores applied to training user attributes and to training item properties; and (2) a plurality of expected content-based scores and a plurality of predictive collaborative-based scores. However, Zhao teaches both collaborative and content-based filtering techniques as well as hybrid recommender systems based on a combination of the two (or more).
A person of ordinary skill in the art of recommender systems would know that collaborative filtering necessarily relies on user attributes (i.e., it is a method that relies on multiple users having similar attributes to make recommendations) and content-based filtering necessarily relies on item properties (i.e., it is a method that relies on users’ attributes and historical items to make recommendations). Therefore, applicant’s limitation specifying that content-based filtering is applied to user attributes and item properties simply describes content-based filtering. Moreover, maximizing a reward score is also taught by Zhao and is generally known in the art (as evidenced by the various papers cited in Zhao), where a higher reward score would correlate with a better recommendation.
On page 14 of applicant’s remarks, applicant argues that Wilson does not disclose or suggest applying a matrix factorization method to a plurality of item properties and user attribute values. However, as set forth in the rejection, Wilson teaches matrices of interrelationships and generated recommendations which teaches a matrix factorization method (where the interrelationships such as between users and user-items are taught by Zhao). 
The additional limitations added in claims 2 and 20, as well as newly added claims 21 and 22, are rejected under 35 U.S.C. § 103 for the reasons set forth in the rejection above.
The remaining dependent claims remain rejected for at least the reasons set forth in the rejection above.
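
To make the disputed term concrete, the following is a minimal Python sketch of content-based filtering as characterized in the quoted Zhao passage ("recommend items with similar properties to those that a user ordered in the past"). It is an illustration only, not the claimed method and not code from Zhao or Wilson: the function names, the use of cosine similarity, and the averaging of past items into a profile vector are assumptions chosen for simplicity.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def content_based_scores(ordered: List[List[float]],
                         candidates: Dict[str, List[float]]) -> Dict[str, float]:
    """Score each candidate item by similarity to the user's past orders.

    ordered    -- property vectors of items the user ordered before
                  (assumed non-empty and of equal length)
    candidates -- item name -> property vector
    """
    dims = len(ordered[0])
    # Hypothetical user profile: the mean property vector of past orders.
    profile = [sum(item[d] for item in ordered) / len(ordered) for d in range(dims)]
    return {name: cosine(profile, props) for name, props in candidates.items()}


scores = content_based_scores(
    ordered=[[1.0, 0.0], [1.0, 0.0]],
    candidates={"a": [1.0, 0.0], "b": [0.0, 1.0]},
)
print(scores)  # item "a" matches the profile; item "b" is orthogonal to it
```

In this sketch the expected score for each item depends on both a user-side vector and that item's properties, which is the combination the parties dispute.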

Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sally T. Ley whose telephone number is (571)272-3406. The examiner can normally be reached Monday - Thursday, 10:00am - 6:00pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Viker Lamardo, can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/STL/Examiner, Art Unit 2147                                                                                                                                                                                                        
/ERIC NILSSON/Primary Examiner, Art Unit 2151
Prosecution Timeline

(14 earlier events not shown)
Sep 13, 2024: Examiner Interview Summary
Feb 18, 2025: Response Filed
Apr 02, 2025: Non-Final Rejection mailed (§103)
Aug 26, 2025: Examiner Interview Summary
Aug 26, 2025: Applicant Interview (Telephonic)
Sep 02, 2025: Response Filed
Sep 25, 2025: Final Rejection mailed (§103)
Nov 25, 2025: Response after Non-Final Action
Precedent Cases

Applications granted by this same examiner with similar technology

17/981,796 (Patent 12632746): A METHOD AND APPARATUS FOR DISPLAYING CATEGORIZED CARBON EMISSIONS; 3y 6m to grant; granted May 19, 2026
16/733,393 (Patent 12443830): COMPRESSED WEIGHT DISTRIBUTION IN NETWORKS OF NEURAL PROCESSORS; 5y 9m to grant; granted Oct 14, 2025
16/835,892 (Patent 12135927): EXPERT-IN-THE-LOOP AI FOR MATERIALS DISCOVERY; 4y 7m to grant; granted Nov 05, 2024
17/992,958 (Patent 11880776): GRAPH NEURAL NETWORK (GNN)-BASED PREDICTION SYSTEM FOR TOTAL ORGANIC CARBON (TOC) IN SHALE; 1y 2m to grant; granted Jan 23, 2024

Study what changed to get past this examiner. Based on 4 most recent grants.
Prosecution Projections

Expected OA Rounds: 6-7
Grant Probability: 19%
With Interview (+33.3%): 53%
Median Time to Grant: 4y 8m (~0m remaining)
PTA Risk: High

Based on 36 resolved cases by this examiner. Grant probability derived from career allowance rate.
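
The projection figures can be checked against the examiner statistics reported on this page (7 granted out of 36 resolved cases, and a +33.3% interview lift). The short Python sketch below is only a plausibility check. It assumes the "With Interview" figure is the career allowance rate plus the lift in percentage points; the page does not state the formula, and the function names are hypothetical.

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a fraction of resolved cases."""
    return granted / resolved


def with_interview(base: float, lift: float) -> float:
    """Assumed additive model: base rate plus interview lift, capped at 1.0."""
    return min(base + lift, 1.0)


base = allowance_rate(7, 36)            # 7 granted / 36 resolved
boosted = with_interview(base, 0.333)   # +33.3 percentage points (assumption)

print(f"Grant probability: {base:.1%}")
print(f"With interview:    {boosted:.1%}")
```

Under that additive assumption, the two results round to the 19% and 53% shown in the projections.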