Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1-2, 6, and 9-15 are rejected under 35 U.S.C. 103 as being unpatentable over Haug et al. (NPL from IDS: Teaching inverse reinforcement learners via features and demonstrations, published March 2019, hereinafter “Haug”) in view of Najar et al. (NPL: Imitation as a model-free process in human reinforcement learning, published Oct. 2019, hereinafter “Najar”).
Regarding claim 1, Haug teaches a learning device comprising:
a memory storing instructions, expert decision-making history data comprising trajectories, and a plurality of candidate features for an objective function; and one or more processors configured to execute the instructions to (Haug, Section 3 Paragraph 2 – “The performance measure for policies we are interested in is the expected discounted reward R(π) := E(Σ_{t=0}^∞ γ^t R(s_t)), where the expectation is taken with respect to the distribution over trajectories (s0, s1, ...) induced by π together with the transition probabilities T and the initial-state distribution D”, and in Section 3 Paragraph 1 – “We assume that there exists a feature map φ: S → R^k such that the reward function is linear in the features given by φ, i.e., R(s) = ⟨w∗, φ(s)⟩ for some w∗ ∈ R^k which we assume to satisfy ‖w∗‖ = 1.” – teaches storing expert decision-making history data comprising trajectories (the trajectories (s0, s1, ...) are sequences of states st) and a plurality of candidate features for an objective function (the feature map φ supplies the candidate features for the reward function R)):
initialize a first feature list to comprise the plurality of candidate features and initialize a second feature list as an empty set (Haug, Section 3 Paragraph 1 – “We assume that there exists a feature map φ: S → R^k such that the reward function is linear in the features given by φ, i.e. …” and Section 3 Paragraph 4 – “The challenge we address in this paper is that of achieving this objective under the assumption that there is a mismatch between the worldviews of L and T, by which we mean the following: Instead of the “true” feature vectors φ(s), L observes feature vectors ALφ(s) ∈ R^l, where AL: R^k → R^l is a linear map (i.e., a matrix) that we interpret as L’s worldview. The simplest case is that AL selects a subset of the features given by φ(s), thus modelling the situation where L only has access to a subset of the features relevant for the true reward, which is a reasonable assumption for many real-world situations.” – teaches a first feature list (φ(s)) comprising the plurality of candidate features and a second feature list initialized as an empty set (AL selects a subset of the features given by φ(s)));
derive, by inverse reinforcement learning using all candidate features in the first feature list, a respective weight for each of the plurality of candidate features of a first objective function representing an ideal reward result (Haug, Section 3 Paragraph 1 – “We assume that there exists a feature map φ: S → R^k such that the reward function is linear in the features given by φ, i.e., R(s) = ⟨w∗, φ(s)⟩ for some w∗ ∈ R^k which we assume to satisfy ‖w∗‖ = 1” – teaches deriving, by inverse reinforcement learning using all candidate features in the first feature list, a respective weight w∗ for each of the plurality of candidate features of a first objective function (the reward function R(s) = ⟨w∗, φ(s)⟩) representing an ideal reward result (w∗));
generate a worldview representation that identifies which features are included in the second feature list, the worldview representation comprising a matrix having diagonal elements indicating included features and excluded features (Haug, Section 3 Paragraph 4 – “The challenge we address in this paper is that of achieving this objective under the assumption that there is a mismatch between the worldviews of L and T, by which we mean the following: Instead of the “true” feature vectors φ(s), L observes feature vectors ALφ(s) ∈ R^l, where AL: R^k → R^l is a linear map (i.e., a matrix) that we interpret as L’s worldview. The simplest case is that AL selects a subset of the features given by φ(s), thus modelling the situation where L only has access to a subset of the features relevant for the true reward, which is a reasonable assumption for many real-world situations.” – teaches generating a worldview representation that identifies which features are included in the second feature list (L’s worldview), the worldview representation comprising a matrix having diagonal elements indicating included and excluded features (the linear map, or matrix, is interpreted as L’s worldview, where its elements indicate included and excluded features));
calculate, for each candidate feature not yet included in the second feature list, a teaching-risk value based on weights derived for the first objective function and a null space of the matrix, the teaching-risk value indicating a degree of potential partial optimality of a second objective function when the candidate feature is added to the second feature list (Haug, Section 4 Paragraph 1 – “The teaching risk for a given worldview AL with respect to reward weights w∗ is Eq. (1)” – teaches calculating, for each candidate feature not yet included in the second feature list, a teaching-risk value (in Eq. (1), ρ(AL;w∗) is the teaching-risk value) based on weights derived for the first objective function (w*) and a null space of the matrix, the teaching-risk value indicating a degree of potential partial optimality of a second objective function when the candidate feature is added to the second feature list (calculates teaching risk when feature is added to the second feature list, as in Section 5 Paragraph 3));
select, from the first feature list, at least one candidate feature having a teaching-risk value that is among smallest teaching-risk values, remove the selected at least one candidate feature from the first feature list, and add the selected at least one candidate feature to the second feature list (Haug, Section 4 Paragraph 7 (Last Paragraph of Page 5) – “Theorem 1 shows the following: If L imitates T’s behaviour well in her worldview (meaning that ε can be chosen small) and if the teaching risk ρ(AL;w∗) is sufficiently small, then L will perform nearly as well as T with respect to the true reward. In particular, if T’s policy is optimal, πT = π∗, then L’s policy πL is guaranteed to be near-optimal.”, Section 5 Paragraph 2 – “The simplest way by which the teacher T can change L’s worldview is by informing her about features f ∈ R^k that are relevant to performing well in the task, thus causing her to update her worldview AL: R^k → R^l to …”, and in Section 5 Paragraph 3 – “Viewing AL as a matrix, this operation appends f as a row to AL. (Strictly speaking, the feature that is thus provided is s → ⟨f, φ(s)⟩; we identify this map with the vector f in the following and thus keep calling f a ‘feature’.)” – teaches selecting from the first feature list (provided by teacher T) at least one candidate feature having a teaching-risk value that is among the smallest teaching-risk values (if the teaching risk is sufficiently small, the learner will perform nearly as well as the teacher with respect to the true reward), removing the selected feature from the first feature list, and adding it to the second feature list (appends feature f as a row to AL));
generate, by inverse reinforcement learning using only features included in the second feature list, the second objective function and derive a weight for each feature included in the second objective function (Haug, Section 4 Paragraph 13 (Theorem 2) – “Theorem 2 therefore implies that, if ρ(AL;w∗) is small, a truly optimal policy π∗ is near-optimal for some choice of reward function linear in the features L observes, namely, the reward function s → ⟨w∗L, φ(s)⟩ with w∗L ∈ R^l the vector whose existence is claimed by the theorem.” – teaches generating, by inverse reinforcement learning using only features included in the second feature list (AL), the second objective function (L observes the reward function s → ⟨w∗L, φ(s)⟩) and deriving a weight for each feature included in the second objective function (w∗L ∈ R^l));
iteratively repeat generating the worldview representation, calculating the teaching-risk value, selecting the at least one candidate feature, and generating the second objective function (Haug, Section 5 Paragraph 6 – “Our basic teaching algorithm TRGREEDY (Algorithm 1) works as follows: T and L interact in rounds, in each of which T provides L with the feature f ∈ F which reduces the teaching risk of L’s worldview with respect to w∗ by the largest amount. L then trains a policy πL with the goal of imitating her current view ALµ(πT) of the feature expectations of the teacher’s policy;” – teaches iteratively repeating generating the worldview representation (T and L interact in rounds, generating the worldview representation in each round), calculating the teaching-risk value, selecting the at least one candidate feature (provides L with the feature f which reduces the teaching risk of L’s worldview), and generating the second objective function (as in Section 4 Paragraph 13 (Theorem 2), where L observes the reward function s → ⟨w∗L, φ(s)⟩ with w∗L ∈ R^l));
output a set of features included in the second feature list and corresponding weights (Haug, Section 4 Paragraph 13 (Theorem 2) – “Theorem 2 therefore implies that, if ρ(AL;w∗) is small, a truly optimal policy π∗ is near-optimal for some choice of reward function linear in the features L observes, namely, the reward function s → ⟨w∗L, φ(s)⟩ with w∗L ∈ R^l the vector whose existence is claimed by the theorem.” and in Section 5 Paragraph 6 – “Our basic teaching algorithm TRGREEDY (Algorithm 1) works as follows: T and L interact in rounds, in each of which T provides L with the feature f ∈ F which reduces the teaching risk of L’s worldview with respect to w∗ by the largest amount. L then trains a policy πL with the goal of imitating her current view ALµ(πT) of the feature expectations of the teacher’s policy;” – teaches outputting a set of features included in the second feature list (L is provided features f; AL contains the set of features included in the second feature list) and corresponding weights (w∗L ∈ R^l)); and
use the second objective function to control operation of a physical system or to determine a sensor configuration subject to a mounting constraint (Haug, Section 5 Paragraph 4 – “This operation has simple interpretations in the settings we are interested in: If L is a human learner, “teaching a feature” could mean making L aware that a certain quantity, which she might not have taken into account so far, is crucial to achieving high performance. If L is a machine, such as an autonomous car or a robot, it could mean installing an additional sensor.” and in Section 5 Paragraph 5 – “We assume that this is not possible, and that instead only the elements of a fixed finite set of teachable features F = {fi | i ∈ I} ⊂ R^k can be taught. In real-world applications, such constraints could come from the limited availability of sensors and their costs;” – teaches using the second objective function (learner L based on taught features) to control operation of a physical system or to determine a sensor configuration subject to a mounting constraint (teaching features could mean installing an additional sensor, and teachable features are constrained by, e.g., the limited availability of sensors, thus determining a sensor configuration subject to a mounting constraint)).
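For clarity of the greedy teaching-risk loop mapped above (Haug's TRGREEDY), the following Python fragment sketches one possible reading. The projection-onto-null-space form of `teaching_risk` and all function names are illustrative assumptions consistent with the limitation's language (a risk "based on weights derived for the first objective function and a null space of the matrix"), not Haug's exact Eq. (1):

```python
import numpy as np

def teaching_risk(A, w_star):
    # Teaching-risk proxy: norm of the component of the true reward
    # weights w* lying in the null space of the worldview matrix A
    # (assumption: features invisible to the worldview drive the risk).
    if A.shape[0] == 0:
        return float(np.linalg.norm(w_star))   # empty worldview: all of w* unobserved
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > 1e-10))
    null_basis = vt[rank:]                     # rows: orthonormal basis of null(A)
    return float(np.linalg.norm(null_basis @ w_star))

def greedy_teach(candidates, w_star, rounds):
    # TRGREEDY-style loop: each round moves the candidate feature that
    # most reduces the teaching risk from the first list to the second.
    first = [np.asarray(f, dtype=float) for f in candidates]
    second = []                                # second feature list starts empty
    for _ in range(min(rounds, len(candidates))):
        A = np.array(second) if second else np.empty((0, len(w_star)))
        risks = [teaching_risk(np.vstack([A, f]), w_star) for f in first]
        second.append(first.pop(int(np.argmin(risks))))
    return second
```

With candidate features e1, e2, e3 and w∗ = (0.8, 0.6, 0), the loop selects e1 first (largest risk reduction), then e2, after which the residual risk is zero.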
Haug fails to explicitly teach a memory storing instructions; and one or more processors configured to execute the instructions to: calculate an information criterion comprising at least one of Akaike’s information criterion (AIC), Bayesian information criterion (BIC), or focused information criterion (FIC) for the second objective function; iteratively repeat calculating the information criterion while the information criterion is monotonically increasing across successive iterations; and output corresponding weights when the information criterion reaches a maximum.
However, analogous to the field of the claimed invention, Najar teaches:
calculate an information criterion comprising at least one of Akaike’s information criterion (AIC), Bayesian information criterion (BIC), or focused information criterion (FIC) for the second objective function (Najar, Pg. 8, Left Column, Last Paragraph – “Then, from the negative log-likelihood we derive the Akaike Information Criterion (AIC) (44) defined as AIC = −2∗LL − 2∗log(ntrials).” – teaches calculating an information criterion comprising at least one of Akaike’s information criterion (AIC));
iteratively repeat calculating the information criterion while the information criterion is monotonically increasing across successive iterations (Najar, Pg. 8, Left Column, Last Paragraph – “Then, from the negative log-likelihood we derive the Akaike Information Criterion (AIC) (44) defined as AIC = −2∗LL − 2∗log(ntrials)”, Pg. 8, Right Column, Paragraph 1 – “The AIC is - with other metrics - a commonly used metric for estimating the quality of fit of a model while accounting for its complexity. As such, it provides an approximation of the out-of-sample prediction performance (45). We selected the AIC as a metric after comparing its performance in model recovery along with other metrics such as the BIC (see the paragraph about model recovery)”, and in Pg. 8, Right Column, Paragraph 2 – “This procedure estimates the model expected frequencies and the exceedance probability for each model within a set of models, given the data gathered from all participants. The expected frequency is the probability of the model to generate the data obtained from any randomly selected participant– it is a quantification of the posterior probability of the model (PP). It must be compared to chance level, which is one over the number of models in the model space. The exceedance probability (XP), is the probability that a given model fits the data better than all other models in the model space. Theoretically, a model with the highest expected frequency and the highest exceedance probability is considered as ‘the winning model’.” – teaches iteratively repeating the calculation of the information criterion while the information criterion is increasing across successive iterations (ntrials within the AIC indicates that the calculation of the AIC is repeated across trials as the information criterion increases; the winning model is selected based on the highest expected frequency and highest exceedance probability, which are based on the increasing AIC));
output corresponding weights when the information criterion reaches a maximum (Najar, Fig. 2 & Pg. 8, Right Column, Paragraph 2 – “Finally, the individual AIC scores (actually AIC/2) were fed into the mbb-vb-toolbox (46). Contrary to fixed-effect analyses that average the criteria for each model, the random-effect model selection allows the investigation of inter-individual differences and to discard the hypothesis of the pooled evidence to be biased or driven by some individuals– i.e. outliers. This procedure estimates the model expected frequencies and the exceedance probability for each model within a set of models, given the data gathered from all participants. The expected frequency is the probability of the model to generate the data obtained from any randomly selected participant– it is a quantification of the posterior probability of the model (PP). It must be compared to chance level, which is one over the number of models in the model space. The exceedance probability (XP), is the probability that a given model fits the data better than all other models in the model space. Theoretically, a model with the highest expected frequency and the highest exceedance probability is considered as ‘the winning model’.” – teaches outputting corresponding weights when the information criterion reaches a maximum (AIC fed into toolbox to investigate inter-individual differences, procedure estimates model expected frequency and exceedance probability based on AIC, model with highest expected frequency and exceedance probability is the winning model, and thus outputs the selected model and corresponding weights when the AIC reaches a maximum));
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the information criterion of Najar into the inverse reinforcement learning method, feature lists, and iterative computation of Haug in order to measure the gap in performance between objective functions and output results when performance reaches a maximum. Doing so would estimate the quality of fit of a model while accounting for its complexity (Najar, Pg. 8, Right Column, Paragraph 1).
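The combined operation described above (iterating while an information criterion improves, outputting at its maximum) can be sketched as follows. The AIC form is the variant quoted from Najar (AIC = −2∗LL − 2∗log(ntrials), which differs from the classical parameter-count penalty); the stopping rule and function names are illustrative assumptions, not the procedure of either reference:

```python
import math

def aic(log_likelihood, n_trials):
    # AIC as defined in the Najar passage quoted above:
    # AIC = -2*LL - 2*log(ntrials).
    return -2.0 * log_likelihood - 2.0 * math.log(n_trials)

def index_of_maximum_while_increasing(scores):
    # Illustrative stopping rule for the claimed limitation: keep
    # iterating while the criterion is monotonically increasing, and
    # return the iteration index at which it reaches its maximum.
    best = 0
    for i in range(1, len(scores)):
        if scores[i] <= scores[i - 1]:   # stopped increasing: halt
            break
        best = i
    return best
```

For a score sequence [1, 3, 5, 4, 6] the loop halts at the first non-increase, so the weights from iteration 2 (criterion value 5) would be output.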
Claims 9 and 11 incorporate substantively all the limitations of Claim 1 in a method and non-transitory computer readable medium, and thus are rejected on the same grounds as above.
Regarding claim 2, the combination of Haug and Najar teaches the learning device according to claim 1, wherein the one or more processors are configured to execute the instructions to:
regard each derived weight of the plurality of candidate features as an optimal parameter to select a feature that minimizes a partial optimality of an objective function from among the plurality of candidate features (Haug, Section 5, Paragraph “Greedy minimization of teaching risk” – “Our basic teaching algorithm TRGREEDY (Algorithm 1) works as follows: T and L interact in rounds, in each of which T provides L with the feature f ∈ F which reduces the teaching risk of L’s worldview with respect to w∗ by the largest amount. L then trains a policy πL with the goal of imitating her current view ALµ(πT) of the feature expectations of the teacher’s policy;” – teaches regarding each derived weight of the candidate features (weights w∗ and features f) as an optimal parameter to select a feature that minimizes a partial optimality of an objective function from among the candidate features (minimizes the teaching risk of L’s worldview)).
Claims 10 and 12 are similar to claim 2, and are rejected on the same grounds as above.
Regarding claim 6, the combination of Haug and Najar teaches the learning device of claim 1, wherein the one or more processors are further configured to execute the instructions to:
output features included in the second objective function and corresponding weights of the features (Haug, Section 4 Paragraph 13 (Theorem 2) – “Theorem 2 therefore implies that, if ρ(AL;w∗) is small, a truly optimal policy π∗ is near-optimal for some choice of reward function linear in the features L observes, namely, the reward function s → ⟨w∗L, φ(s)⟩ with w∗L ∈ R^l the vector whose existence is claimed by the theorem.” and in Section 5 Paragraph 6 – “Our basic teaching algorithm TRGREEDY (Algorithm 1) works as follows: T and L interact in rounds, in each of which T provides L with the feature f ∈ F which reduces the teaching risk of L’s worldview with respect to w∗ by the largest amount. L then trains a policy πL with the goal of imitating her current view ALµ(πT) of the feature expectations of the teacher’s policy;” – teaches outputting a set of features included in the second feature list (L is provided features f; AL contains the set of features included in the second feature list) and corresponding weights (w∗L ∈ R^l)).
Haug fails to explicitly teach output corresponding weights of the features when the information criterion reaches a maximum.
However, analogous to the field of the claimed invention, Najar teaches:
output corresponding weights of the features when the information criterion reaches a maximum (Najar, Fig. 2 & Pg. 8, Right Column, Paragraph 2 – “Finally, the individual AIC scores (actually AIC/2) were fed into the mbb-vb-toolbox (46). Contrary to fixed-effect analyses that average the criteria for each model, the random-effect model selection allows the investigation of inter-individual differences and to discard the hypothesis of the pooled evidence to be biased or driven by some individuals– i.e. outliers. This procedure estimates the model expected frequencies and the exceedance probability for each model within a set of models, given the data gathered from all participants. The expected frequency is the probability of the model to generate the data obtained from any randomly selected participant– it is a quantification of the posterior probability of the model (PP). It must be compared to chance level, which is one over the number of models in the model space. The exceedance probability (XP), is the probability that a given model fits the data better than all other models in the model space. Theoretically, a model with the highest expected frequency and the highest exceedance probability is considered as ‘the winning model’.” – teaches outputting a set of features included in the second feature list and corresponding weights when the information criterion reaches a maximum (AIC fed into toolbox to investigate inter-individual differences, procedure estimates model expected frequency and exceedance probability based on AIC, model with highest expected frequency and exceedance probability is the winning model, and thus outputs the selected model when the AIC reaches a maximum));
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the information criterion of Najar into the feature lists and outputs of Haug in order to measure the gap in performance between objective functions and output results when performance reaches a maximum. Doing so would estimate the quality of fit of a model while accounting for its complexity (Najar, Pg. 8, Right Column, Paragraph 1).
Regarding claim 13, the combination of Haug and Najar teaches the learning device according to claim 1,
wherein the expert decision-making history data comprises trajectories indicating a sequence of states and actions of an expert (Haug, Section 3 Paragraph 2 – “By a policy we mean a family of distributions on A indexed by S, where πs(a) describes the probability of taking action a in state s. We denote by Π the set of all such policies. The performance measure for policies we are interested in is the expected discounted reward R(π) := E(Σ_{t=0}^∞ γ^t R(s_t)), where the expectation is taken with respect to the distribution over trajectories (s0, s1, ...) induced by π together with the transition probabilities T and the initial-state distribution D.” – teaches wherein the expert decision-making history data comprises trajectories indicating a sequence of states and actions of an expert (a policy is a family of distributions on A indexed by S, where πs(a) describes the probability of taking action a in state s, and the expectation is taken with respect to the distribution over trajectories (s0, s1, ...) induced by π)).
Claims 14 and 15 are similar to claim 13, hence similarly rejected.
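The trajectory data structure at issue in claims 13-15 (sequences of state-action pairs with a feature map φ) can be illustrated by the empirical discounted feature expectations that Haug's learner imitates (the quantity µ(πT) quoted above). The function and all names are an illustrative sketch, not code from any cited reference:

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma):
    # Empirical discounted feature expectations over expert trajectories,
    # mirroring Haug's R(pi) = E(sum_t gamma^t R(s_t)) with
    # R(s) = <w*, phi(s)>: averaging phi(s_t) instead of R(s_t) yields
    # the feature-expectation vector mu(pi).
    total = np.zeros_like(np.asarray(phi(trajectories[0][0][0]), dtype=float))
    for traj in trajectories:                    # each traj: [(state, action), ...]
        for t, (state, action) in enumerate(traj):
            total += (gamma ** t) * np.asarray(phi(state), dtype=float)
    return total / len(trajectories)
```

For a single trajectory [(0, 'a'), (1, 'b')], φ(s) = (s, 1) and γ = 0.5, the result is (0, 1) + 0.5·(1, 1) = (0.5, 1.5).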
Claim(s) 7-8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Haug and Najar as applied to claims 1, 9, and 11 above, and further in view of Gomez et al. (US Pub. No. 2023/0173683, effective filing date of June 2020, hereinafter “Gomez”).
Regarding claim 7, the combination of Haug and Najar teaches the learning device according to claim 6.
The combination of Haug and Najar fails to explicitly teach wherein the one or more processors are configured to execute the instructions to output the features in selected order.
However, analogous to the field of the claimed invention, Gomez teaches wherein the one or more processors are configured to execute the instructions to
output the features in selected order (Gomez, [0278] – “The TAMER agent 2300 uses the behavior selection unit 2304 to select a behavior a (angle command) having a reward function R_H(s, a). The behavior selection unit 2304 selects an action with the largest human reward prediction value, to maximize the reward of the human Hu according to an immediate behavior of the robot 2001.” and in [0280] – “The allocation evaluation unit 2303 receives an evaluation feedback h given by the trainer and calculates a probability (credit) h of the previously selected behavior. The allocation evaluation unit 2303 is used to deal with a temporal delay in the reward of the human due to evaluating and rewarding the behavior of the robot 2001. The allocation evaluation unit 2303 learns a prediction model R̂_H of the reward of the person and provides a reward to the agent within a Markov Decision Process (MDP) designated as {S, A, T, R̂_H, γ}.” – teaches inputting features in a selected order and outputting a resulting reward function based on the selected features, thus outputting features in a selected order (the behavior selection unit selects a behavior and action; the allocation evaluation unit receives feedback and learns the prediction model R̂_H)).
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the outputting in selected order of Gomez into the feature lists, objective functions, and inverse reinforcement learning method of Haug and Najar in order to output results in a selected order. Doing so would select features predicted to directly elicit a maximum reward without considering influence on future states (Gomez, [0287]) and maximize rewards according to immediate features (Gomez, [0278]).
Regarding claim 8, the combination of Haug and Najar teaches the learning device according to claim 1.
The combination of Haug and Najar fails to explicitly teach wherein the one or more processors are configured to execute the instructions to: present the selected features to a user; accept a selection instruction from the user for the presented features; select one or more top features in a predetermined number of features to be estimated to get closer to the ideal reward result; present the selected one or more features to the user; and generate the second objective function by inverse reinforcement learning using a feature selected by the user.
However, analogous to the field of the claimed invention, Gomez teaches:
present the selected features to a user (Gomez, [0279] – “The trainer (human Hu) observes the state s of the robot 2001 and the selected operation φ_a (environment 2312), evaluates quality thereof, and performs feedback.” – teaches presenting selected features to a user (observes the state of the robot and the selected operation));
accept a selection instruction from the user for the presented features (Gomez, [0280] – “The allocation evaluation unit 2303 receives an evaluation feedback h given by the trainer and calculates a probability (credit) h of the previously selected behavior.” – teaches accepting a selection instruction (evaluation feedback h given by the trainer) for the presented features, as in [0279] – “The trainer (human Hu) observes the state s of the robot 2001 and the selected operation φ_a (environment 2312), evaluates quality thereof, and performs feedback. The allocation evaluation unit 2303 acquires an evaluation h fed back in this way.” – teaches a selection instruction from the user (evaluation feedback) for the presented features (the observed state and selected operation));
select one or more top features in a predetermined number of features to be estimated to get closer to the ideal reward result (Gomez, [0283] – “The TAMER agent 2300 learns a reward function of the human user and tries to maximize the reward of the human using argmax_a R̂_H(s, a).” – teaches selecting one or more top features (the state-action pair (s, a)) in a predetermined number of features estimated to get closer to the ideal reward result (learns a reward function of the human user and tries to maximize the reward of the human));
present the selected one or more features to the user (Gomez, [0279] – “The trainer (human Hu) observes the state s of the robot 2001 and the selected operation φ_a (environment 2312), evaluates quality thereof, and performs feedback.” – teaches presenting the selected features to a user (observes the state of the robot and the selected operation)); and
generate the second objective function by inverse reinforcement learning using a feature selected by the user (Gomez, [0296] – “The behavior selection unit 2304 selects another behavior (suggested angle command) φ_a using the updated reward function R_H(s, a). A trajectory generated through the demonstration of the human and planning consist of a sequence of pairs of state and operation {(s_0, a_0), ..., (s_n, a_n)}, which are supplied into an inverse RL algorithm. The agent 2300 learns the reward function of the human and selects a behavior of the robot 2001 that maximizes the reward of the human using argmax_a R̂_H(s, a)” – teaches generating the second objective function (the updated reward function) by inverse reinforcement learning using a feature selected by the user (a trajectory through the demonstration of the human and planning, consisting of a sequence of pairs of state and operation, which is supplied into an inverse RL algorithm)).
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the presentation of features to a user for feature selection of Gomez into the inverse reinforcement learning of Haug and Najar in order to incorporate user feature selection into the system. Doing so would maximize rewards according to immediate features (Gomez, [0278]) and enable evaluation feedback from users to provide faster learning and teaching of the robot to perform tasks (Gomez, [0006]).
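The claim-8 user-in-the-loop flow mapped above (present top features, accept a selection instruction, pass the choice to the inverse-RL step) can be sketched as follows; the ranking assumption and all names are illustrative, not Gomez's implementation:

```python
def present_and_select(ranked_features, k, selection):
    # Claim-8-style flow: present the top-k candidate features (assumed
    # pre-ranked by how close they are estimated to bring the objective
    # to the ideal reward result), accept the user's selection
    # instruction (indices), and return the chosen features for the
    # subsequent inverse-RL generation of the second objective function.
    top = ranked_features[:k]                 # one or more top features
    print("Presented to user:", top)          # presentation step
    return [f for i, f in enumerate(top) if i in selection]
```

For example, presenting the top two of ["speed", "angle", "mass"] and receiving a selection instruction for index 1 returns ["angle"], which would then seed the inverse-RL step.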
Response to Arguments
Applicant’s arguments, see pp. 8-10, filed 29 December 2025, with respect to the rejection(s) of claim(s) 1, 9, and 11 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made over Haug in view of Najar et al. (NPL: Imitation as a model-free process in human reinforcement learning, published Oct. 2019). Haug teaches “a memory storing instructions, expert decision-making history data…”, “initialize a first feature list to comprise the plurality of candidate features…” , “derive, by inverse reinforcement learning using all candidate features in the first feature list, a respective weight…”, “generate a worldview representation that identifies…”, “calculate, for each candidate feature not yet included in the second feature list, a teaching-risk value…”, “select, from the first feature list, at least one candidate feature having a teaching-risk value that is among smallest teaching-risk values…”, “generate, by inverse reinforcement learning using only features included in the second feature list, the second objective function…”, “iteratively repeat generating the worldview representation, calculating the teaching-risk value…”, “output a set of features included in the second feature list…”, and “use the second objective function to control operation of a physical system…”. Najar teaches “calculate an information criterion…”, “output… when information criterion reaches…” and “calculating the information criterion while the information criterion is…”
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Choi et al. (NPL: Hierarchical Bayesian Inverse Reinforcement Learning, published April 2015) teaches an inverse reinforcement learning method for inferring the underlying reward function from an expert’s behavior data, and further teaches using a Bayesian likelihood probability that represents the compatibility of the reward function with the features.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LOUIS C NYE whose telephone number is 571-272-0636. The examiner can normally be reached Monday - Friday 9:00AM - 5:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MATT ELL can be reached at 571-270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LOUIS CHRISTOPHER NYE/Examiner, Art Unit 2141
/MATTHEW ELL/Supervisory Patent Examiner, Art Unit 2141