DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR
1.17(e), was filed in this application after final rejection. Since this application is eligible for continued
examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the
finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's
submission filed on 9 January 2026 has been entered.
Response to Amendment
The amendment filed on 17 November 2025 has been entered.
Claims 1-19 were previously pending.
Claims 5, 7, and 10 have been cancelled.
Claims 1, 6, 8, 11, and 15-19 have been amended.
Claims 1-4, 6, 8-9, and 11-19 remain pending.
Response to Arguments
Applicant’s arguments, filed 17 November 2025, with respect to 35 USC 101 have been fully considered and are persuasive. The rejections of Claims 1-4, 6, 8-9, and 11-19 under 35 USC 101 have been withdrawn.
Applicant’s remarks, regarding the rejections of claims under 35 USC 103, have been fully considered.
Applicant respectfully submits that the rejection of Claim 1 under 35 USC 103 is rendered moot by the present amendment to Claim 1. However, since Claim 1 is amended to incorporate features recited in previous Claim 5, Applicant will address the rejection of Claim 5 set forth in the Office Action.
Applicant submits that the '145 application is completely silent regarding the performing of a first re-learning process of updating the learning model in response to determining that a change in the determined reward amount exceeds a predetermined threshold, and the performing of a second re-learning process, different from the first re-learning process, of updating the learning model in response to determining that the change in the determined reward amount does not exceed the predetermined threshold, as recited in amended Claim 1. In particular, the '145 application is completely silent regarding the predetermined threshold and the two different learning processes based on the reward amount being above or below the threshold, as claimed.
Applicant respectfully submits that, no matter how the teachings of the '465, '145, and '381 applications are combined, the combination does not teach or suggest the functionality of the processing circuitry recited in amended Claim 1.
Examiner respectfully disagrees. In response to Applicant's argument that the references fail to show certain features of the invention, it is noted that the features upon which Applicant relies (i.e., “performing of a first re-learning process of updating the learning model in response to determining that a change in the determined reward amount exceeds a predetermined threshold, and the performance of a second re-learning process, different from the first re-learning process, of updating the learning model, in response to determining that the change in the determined reward amount does not exceed the predetermined threshold”), which were previously recited in Claim 5 and are now recited in amended Claim 1, are given their plain meaning under the broadest reasonable interpretation (BRI), unless such meaning is inconsistent with the Specification; see MPEP § 2111.01(I). As clarified in the present Office Action below, Examiner submits that Jain teaches a learning and re-learning process in the same manner as the claimed invention (cf. Jain, ([0067] perform a first re-learning process of updating the learning model FIG. 6 depicts an example of a model-based approach 600 of the experience analysis system 110 for determining user experience values based on interaction data received from the host system 108. Interaction data for a user session is received from a remote system, and the state prediction system 202 perform a second re-learning process of updating the learning model, different from the first re-learning process of updating the learning model determines probabilities of transitioning from a current state to multiple next states based on the interaction data.). Jain discusses the determination of user reaction and experience values, based on collected user behavior information and interaction data, using model-based and model-free reinforcement learning approaches. The model-based learning of Jain (learning and re-learning) iteratively updates the model through reinforcement learning, using the determined user reaction and experience values as reward values for the learning process and determining probabilities of transitioning from a current state to multiple next states.
As further clarified in the present Office Action below, model-based reinforcement learning performs model learning and re-learning of the environment through simulation and prediction of future states and rewards, continually updating the model to convergence, similar to the claimed "in response to determining that a change in the determined reward amount exceeds a predetermined threshold" and "in response to determining that the change in the determined reward amount does not exceed the predetermined threshold" limitations (cf. Jain, [0066] User experience values may be determined at every action, where in response to determining that a change in the determined reward amount rewards and penalties are tied to suitable achievement and non-achievement of goals. For assigning user experience values to all other states, relative to the reward states, the techniques represented by FIGS. 6 and 7 implement a Bellman equation and fixed-point iteration technique. These dynamic programing-based techniques utilize the transition probabilities learned from the pre-trained state prediction system 202 to iteratively improve the estimate of state values. exceeds a predetermined threshold, in response to determining that the change in the determined reward amount does not exceed the predetermined threshold At every iteration, the value of a state is updated to its immediate reward added to the expected sum of values of following states. The techniques converge when the values of all states in subsequent iterations stop changing.). The model-based reinforcement learning taught by Jain uses a reward function to determine reward values associated with transitioning from current states to next states in each iteration, adjusting the learning model according to the environment and the (predetermined threshold) achievement and non-achievement of goals, as claimed in the instant application.
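For illustration only, the Bellman-equation and fixed-point iteration behavior cited above from Jain [0066] (each state's value is updated to its immediate reward plus the expected sum of values of following states, and iteration stops once the values of all states stop changing) may be sketched as follows. This is a minimal sketch of generic value iteration, not the implementation of Jain or of the claimed invention; the states, transition probabilities, discount factor, and convergence tolerance are hypothetical.

```python
# Minimal value-iteration sketch (hypothetical values; not Jain's implementation).
# Each state's value is updated to its immediate reward plus the expected,
# discounted sum of values of following states; iteration stops (converges)
# when no state value changes by more than a small tolerance.

rewards = {"s0": 0.0, "s1": 0.0, "purchase": 1.0}   # rewards tied to goal achievement
transitions = {                                     # P(next_state | state), learned by the model
    "s0": {"s1": 0.7, "s0": 0.3},
    "s1": {"purchase": 0.5, "s0": 0.5},
    "purchase": {"purchase": 1.0},
}
gamma, tol = 0.9, 1e-6                              # hypothetical discount and tolerance

values = {s: 0.0 for s in rewards}
while True:
    delta = 0.0
    for s, nexts in transitions.items():
        new_v = rewards[s] + gamma * sum(p * values[n] for n, p in nexts.items())
        delta = max(delta, abs(new_v - values[s]))
        values[s] = new_v
    if delta < tol:                                 # values of all states stop changing
        break
```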
The rejection of Claim 1 under 35 USC 103 is maintained. Similarly, the rejections of Claims 17 and 18 under 35 USC 103 are maintained. The rejections of Claims 2-4, 6, 8-9, 11-16, and 19, which depend directly or indirectly from Claim 1, under 35 USC 103 are also maintained.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-4, 6, 13-14, 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Kimura et al. (U.S. Pre-Grant Publication No. 20190272465, hereinafter ‘Kimura'), in view of Jain et al. (U.S. Pre-Grant Publication No. 20200118145, hereinafter 'Jain'), and further in view of Lee et al. (U.S. Pre-Grant Publication No. 20050054381, hereinafter 'Lee').
Regarding claim 1 and analogous claims 17, 18, Kimura teaches An information processing device, comprising: processing circuitry configured to determine an action in response to input information input by a user ([0139] As shown in FIG. 10, the computer system 10 is shown in the form of a general-purpose computing device. The components of the computer system 10 may include, but are not limited to, a processor (or processing circuitry configured processing circuitry) 12 and a memory 16 coupled to the processor 12 by a bus including a memory bus or memory controller, and a processor or local bus using any of a variety of bus architectures.; [0033] The reinforcement learning system 110 performs reinforcement learning with the novel reward estimation. During a phase of inverse reinforcement learning (IRL), the reinforcement learning system 110 learns a reward function appropriate for the environment 102 by using the expert demonstrations that are actually performed by the expert 104. During runtime of the reinforcement learning (RL), the reinforcement learning system 110 estimates a reward by using the learned reward function, for each action the agent takes, and subsequently learns an action policy for to determine an action in response to input information the agent to perform a given task, using the estimated rewards.; [0050] The state prediction model 130 is configured to predict, for input by a user an inputted state, a state similar to the expert demonstrations that has been used to train the state prediction model 130. By inputting an actual state observed by the agent 120 into the state prediction model 130, the state prediction model 130 calculates a predicted state for the inputted actual state.; [0062] In a particular embodiment, the modules used for the IRL phase (130, 150, 160, 170) and the modules used for the RL phase (120, 130, 140) may be implemented on respective computer systems separately. For example, the modules 130, 150, 160, 170 for the IRL phase are implemented on a vender-side computer system and the modules 120, 130, 140 for the RL phase are implemented on a user-side (edge) device.),
by applying the input information to a predetermined learning model, which outputs the determined action ([0034] As shown in FIG. 1, the reinforcement learning system 110 includes an agent 120 that executes an action and observes a state in the environment 102; a state prediction model 130 that is trained using the expert demonstrations; and a reward estimation module 140 that estimates a reward signal based on a state predicted by the by applying the input information to a predetermined learning model state prediction model 130 and an actual state observed by the agent 120.; [0050] The state prediction model 130 is which outputs the determined action configured to predict, for an inputted state, a state similar to the expert demonstrations that has been used to train the state prediction model 130.);
cause an agent to perform the determined action ([0035] The agent 120 is the aforementioned reinforcement learning agent that interacts with the environment 102 in time steps and updates the action policy. At each time, the agent 120 observes a state (s) of the environment 102. The cause an agent to perform the determined action agent 120 selects an action (a) from the set of available actions according to the current action policy and executes the selected action (a).);
Kimura fails to teach determine a reward amount based on a reaction of the user to the action performed by the agent, the reaction being obtained by a sensor that detects the reaction of the user; perform a first re-learning process of updating the learning model, in response to determining that a change in the determined reward amount exceeds a predetermined threshold; and perform a second re-learning process of updating the learning model, different from the first re-learning process of updating the learning model, in response to determining that the change in the determined reward amount does not exceed the predetermined threshold.
Jain teaches perform a first re-learning process of updating the learning model, in response to determining that a change in the determined reward amount exceeds a predetermined threshold; and perform a second re-learning process of updating the learning model, different from the first re-learning process of updating the learning model, in response to determining that the change in the determined reward amount does not exceed the predetermined threshold ([0067] perform a first re-learning process of updating the learning model FIG. 6 depicts an example of a model-based approach 600 of the experience analysis system 110 for determining user experience values based on interaction data received from the host system 108. Interaction data for a user session is received from a remote system, and the state prediction system 202 perform a second re-learning process of updating the learning model, different from the first re-learning process of updating the learning model determines probabilities of transitioning from a current state to multiple next states based on the interaction data.; [0066] As stated above, user experience values may be determined by one of two different techniques: a model-based approach that utilizes a value interaction model as represented by FIG. 6, and a model-free approach that utilizes temporal difference learning as represented by FIG. 7. The behavior of users is simulated by learning a model of the environment. For example, a recurrent neural network (RNN) with a long short-term memory (LSTM) module may be used to model the interactive computing environment. Multi-dimensional and continuous historical information may be encoded in the LSTM memory cell along with the current event to characterize states. User experience values may be determined at every action, where in response to determining that a change in the determined reward amount rewards and penalties are tied to suitable achievement and non-achievement of goals. For assigning user experience values to all other states, relative to the reward states, the techniques represented by FIGS. 6 and 7 implement a Bellman equation and fixed-point iteration technique. These dynamic programing-based techniques utilize the transition probabilities learned from the pre-trained state prediction system 202 to iteratively improve the estimate of state values. exceeds a predetermined threshold, in response to determining that the change in the determined reward amount does not exceed the predetermined threshold At every iteration, the value of a state is updated to its immediate reward added to the expected sum of values of following states. The techniques converge when the values of all states in subsequent iterations stop changing.; Model-based reinforcement learning, as taught by Jain above, processes model learning and relearning of the environment by simulation and prediction of future states and rewards, continually updating the model to convergence.).
Kimura and Jain are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Kimura, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Jain to Kimura before the effective filing date of the claimed invention in order to provide a flexible framework that may extend to different types of reward structures and multiple goals of users (cf. Jain, [0005] The user experience evaluation techniques provide advantages over conventional techniques by measuring user experience from interaction data. The resulting user experience values represent user behavior on online platforms, which is more reliable than survey responses and likely more accurate. For one advantage, the user experience values are determined by events or actions at an individual user level, consistent with long-view online behaviors. In addition, a decision theoretic framework, Partially Observable Markov Decision Process (POMDP), is used to represent browsing behaviors, thus maximizing the overall reward from each entire journey. The decisions by each user are conditional on rewards of past actions and expectations of future rewards, recognizing that the user learns from current actions and may change future actions in view of what has been learned. The POMDP is also used for representing partially observable states and measuring latent experiences in the journeys of users. The user experience evaluation techniques further provide a flexible framework that may extend to different types of reward structures and multiple goals of users. The above advantages distinguish the user experience evaluation techniques from conventional approaches.; [0106] The user experience evaluation techniques for the interactive computing environment described herein have the advantage of determining experienced values based on unsupervised learning of long-view online behaviors without the need for user responses to surveys. The user experience evaluation techniques have the further capability of being compared to user responses to surveys as a final step for the purpose of evaluating the performance of the techniques.).
Lee teaches determine a reward amount based on a reaction of the user to the action performed by the agent, the reaction being obtained by a sensor that detects the reaction of the user ([0081] According to another optional but preferred embodiment of the present invention, there is provided performed by the agent one or more intelligent agents for use with a mobile information device over a mobile information device network, preferably including an avatar through which the agent may communicate with the human user.; [0030] Adaptiveness and/or emotions are optionally and preferably assisted through the use of rewards for learning by the proactive user interface. based on a reaction of the user to the action Suggestions or actions of which the user approves preferably determine a reward amount provide a reward, or a positive incentive, to the proactive interface to continue with such suggestions or actions; disapproval by the user preferably causes a disincentive to the proactive user interface to continue such behavior(s).; [0190] Grade=(weighting_factor*feedback_reward)+((1−weighting_factor)*world_reward), in which the feedback_reward results from the feedback provided by the user and the world_reward is the aggregated total reward from the virtual environment as described above; weighting_factor is optionally and preferably a value between 0 and 1, which indicates the weight of the user feedback as opposed to the virtual environment (world) feedback.; [0107] According to an optional embodiment of the present invention, the computational device may feature the reaction being obtained by a sensor one or more biological sensors, for sensing various types of biological information about the user, such as emotional state, physical state, movement, etc. This information may then be fed to sensors 104 for assisting perception unit 106 to that detects the reaction of the user determine the state of the user, and hence to determine the proper state for the device. Such biological sensors may include but are not limited to sensors for body temperature, heart rate, oxygen saturation or any other type of sensor which measures biological parameters of the user.; [0269] Optionally and preferably, this system is in communication with one or more biological sensors, which may optionally and preferably sense the biological state of the user, and/or sense movement of the user, etc. These additional sensors preferably provide information which enables the adaptive system to determine the correct state for the mobile information device, without receiving specific input from the user and/or querying the user about the user's current state. Images captured by a device camera could also optionally be used for this purpose.); and
Kimura, Jain, and Lee are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Kimura and Jain, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Lee to Kimura before the effective filing date of the claimed invention in order to provide an enhanced user experience and interaction with the computational device (cf. Lee, [0034] The present invention is preferably implemented in order to provide an enhanced user experience and interaction with the computational device, as well as to change the current generic, non-flexible user interface of such devices into a flexible, truly user friendly interface. More preferably, the present invention is implemented to provide an enhanced emotional experience of the user with the computational device, for example according to the optional but preferred embodiment of constructing the user interface in the form of an avatar which would interact with the user.).
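For illustration only, the weighted reward combination quoted above from Lee [0190] (Grade = weighting_factor * feedback_reward + (1 − weighting_factor) * world_reward) may be sketched as follows; the numeric values are hypothetical, and the sketch is not asserted to be Lee's implementation.

```python
# Sketch of the Grade computation quoted from Lee [0190] (hypothetical numbers).
def grade(feedback_reward: float, world_reward: float, weighting_factor: float = 0.5) -> float:
    # weighting_factor lies between 0 and 1 and weighs user feedback against
    # the aggregated reward from the virtual environment (the "world" reward).
    return weighting_factor * feedback_reward + (1.0 - weighting_factor) * world_reward

# Example: approval sensed from the user (positive feedback) combined with a
# modest environment reward.
print(grade(feedback_reward=1.0, world_reward=0.2, weighting_factor=0.7))  # 0.76
```

A larger weighting_factor gives more influence to the user's sensed reaction relative to the aggregated environment reward, consistent with the quoted passage.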
Regarding claim 2, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Kimura teaches wherein the learning model is a learning model generated or updated through reinforcement learning ([0033] The wherein the learning model is a learning model generated or updated through reinforcement learning reinforcement learning system 110 performs reinforcement learning with the novel reward estimation. During a phase of inverse reinforcement learning (IRL), the reinforcement learning system 110 learns a reward function appropriate for the environment 102 by using the expert demonstrations that are actually performed by the expert 104. During runtime of the reinforcement learning (RL), the reinforcement learning system 110 estimates a reward by using the learned reward function, for each action the agent takes, and subsequently learns an action policy for the agent to perform a given task, using the estimated rewards.).
Kimura, Jain, and Lee are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 3, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 2.
Kimura teaches wherein the reinforcement learning is reinforcement learning that uses long short-term memory (LSTM) ([0084] FIG. 4B illustrates a schematic of other example of the temporal sequence prediction model that can be used as the state prediction model 130. The example of the temporal sequence wherein the reinforcement learning is reinforcement learning that uses long short-term memory (LSTM) prediction model shown in FIG. 4B is a long short term memory (LSTM) based model 320. The LSTM based model 320 shown in FIG. 4B may have an input layer 322, one or more (two in the case shown in FIG. 4B) LSTM layers 324, 326 with certain activation function, one fully-connected layer 328 with certain activation units and a fully connected final layer 330 with same dimension to the input layer 322.).
Kimura, Jain, and Lee are combinable for the same rationale as set forth above with respect to claim 1.
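For illustration only, the LSTM-based temporal sequence prediction model cited above from Kimura [0084] (an input layer, two LSTM layers, one fully-connected layer, and a final layer with the same dimension as the input layer) may be sketched as follows; the layer sizes and framework choice are assumptions, and this is not Kimura's actual model.

```python
import torch
import torch.nn as nn

class StatePredictionLSTM(nn.Module):
    """Sketch of an LSTM-based temporal sequence prediction model
    (two LSTM layers, one fully-connected layer, and a final layer
    with the same dimension as the input); sizes are hypothetical."""
    def __init__(self, state_dim: int = 16, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, state_dim)  # same dimension as the input layer

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, time, state_dim) -> predicted next state (batch, state_dim)
        h, _ = self.lstm(states)
        h = torch.relu(self.fc(h[:, -1]))
        return self.out(h)

# Example: predict the next state from a window of observed states.
model = StatePredictionLSTM()
predicted = model(torch.randn(8, 10, 16))  # (batch=8, steps=10, state_dim=16)
```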
Regarding claim 4, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Kimura teaches wherein the processing circuitry is further configured to determine whether or not a change in an environment has occurred by determining whether or not the determined reward amount has varied over time ([0054] If an wherein the processing circuitry is further configured to determine whether or not a change in an environment has occurred actual state observed by the agent 120 is similar to the state predicted by the state prediction model 130, the estimated reward value becomes higher. On the other hand, if an actual state observed by the agent 120 is different from the state predicted by the state prediction model 130, the estimated reward value becomes lower.; [0081] If the agent's policy takes an action that by determining whether or not the determined reward amount has varied changes the environment towards states far away from the expert state trajectories τ, the reward is estimated to be low. If the action of the agent 120 brings it close to the expert state trajectories τ, thereby making the predicted next state match with the actual state, the reward is estimated to be high.; [0082] Further referring to FIGS. 4A and 4B and FIGS. 5A and 5B, examples of the over time temporal sequence prediction models that can be used in the IRL according to one or more embodiments of the present invention are schematically described.).
Kimura, Jain, and Lee are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 6, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Jain teaches wherein the first re-learning process changes the learning model to a greater extent than the second re-learning process ([0066] As stated above, user experience values may be determined by one of two different techniques: a model-based approach that utilizes a value interaction model as represented by FIG. 6, and a model-free approach that utilizes temporal difference learning as represented by FIG. 7. The behavior of users is simulated by learning a model of the environment. For example, a recurrent neural network (RNN) with a long short-term memory (LSTM) module may be used to model the interactive computing environment. Multi-dimensional and continuous historical information may be encoded in the LSTM memory cell along with the current event to characterize states. User experience values may be determined at every action, where rewards and penalties are tied to suitable achievement and non-achievement of goals. For assigning user experience values to all other states, relative to the reward states, the techniques represented by FIGS. 6 and 7 implement a Bellman equation and fixed-point iteration technique. These dynamic programing-based techniques utilize the transition probabilities learned from the pre-trained state prediction system 202 to iteratively improve the estimate of state values. wherein the first re-learning process changes the learning model to a greater extent than the second re-learning process At every iteration, the value of a state is updated to its immediate reward added to the expected sum of values of following states. The techniques converge when the values of all states in subsequent iterations stop changing.).
Kimura, Jain, and Lee are combinable for the same rationale as set forth above with respect to claim 1.
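For illustration only, the claim 6 limitation as interpreted above (a first re-learning process that changes the learning model to a greater extent than the second re-learning process, gated by whether the change in the determined reward amount exceeds a predetermined threshold) may be sketched as follows; the threshold value and the particular update procedures are hypothetical and are not asserted to be the method of any cited reference.

```python
# Hypothetical sketch of threshold-gated re-learning (illustrative only).
def select_relearning(prev_reward: float, new_reward: float, threshold: float = 0.5) -> str:
    """Return which re-learning process would run for a given reward change."""
    change = abs(new_reward - prev_reward)
    if change > threshold:
        return "first re-learning process (larger update, e.g. full re-fit)"
    return "second re-learning process (smaller update, e.g. incremental fine-tune)"

print(select_relearning(prev_reward=0.2, new_reward=0.9))  # change 0.7 > 0.5 -> first process
print(select_relearning(prev_reward=0.2, new_reward=0.4))  # change 0.2 <= 0.5 -> second process
```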
Regarding claim 13, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Jain teaches wherein when the change in the determined reward amount is a change exceeding the predetermined threshold, a cause of the change is inferred and a re-learning is performed based on the inferred cause ([0058] The user experience evaluation techniques incorporate domain knowledge in the form of a reward function, r. The cause of the change is inferred and a re-learning is performed based on the inferred cause reward function includes multiple reward values associated with transitioning from a current state to one or more next states. The rewards may be formulated to capture the meaning of a user experience per the needs of the host system 108, including the online platform 122. The rewards may be formulated in a variety of ways, and example formulations are described below.; [0059] For one formulation, the rewards may focus on a goal of a “Purchase” event by a user, and all other events may be assigned a small penalty to reflect a lack of accomplishing the goal of making a purchase.; [0060] where, −ε represents a small penalty. In other words, for a purchase action, the reward may be assigned is “1” and, for all other states, the reward may be a penalty. when the change in the determined reward amount is a change exceeding the predetermined threshold, a cause of the change is inferred Rewards may include significant penalties for certain events considered to be important. Also, it is to be understood that this specific formulation directed to a purchase action does not imply that every user having a goal of purchasing will actually make a purchase. It is expected that the interaction data to be collected will include purchase and non-purchase events.; [0061] For another formulation, the rewards may still focus on the “Purchase” goal and the small penalty while adding a negative effect for ending a session before making a purchase.).
Kimura, Jain, and Lee are combinable for the same rationale as set forth above with respect to claim 1.
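For illustration only, the reward formulation cited above from Jain [0059]-[0061] (a reward of 1 for a “Purchase” event, a small penalty −ε for all other events, and a larger negative effect for ending a session before purchasing) may be sketched as follows; the event names and penalty magnitudes are hypothetical.

```python
# Sketch of the reward formulation described in Jain [0059]-[0061] (hypothetical values).
EPSILON = 0.01              # small penalty for events that do not accomplish the goal
SESSION_END_PENALTY = 1.0   # larger penalty, per the alternative formulation

def reward(event: str) -> float:
    if event == "purchase":
        return 1.0
    if event == "session_end":
        return -SESSION_END_PENALTY
    return -EPSILON

print(reward("purchase"), reward("view_item"), reward("session_end"))  # 1.0 -0.01 -1.0
```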
Regarding claim 14, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Kimura teaches wherein when a time period in which the determined reward amount does not vary extends for a predetermined time period, a re-learning for generating a new learning model is performed ([0035] The agent 120 is the aforementioned reinforcement learning agent that interacts with the environment 102 in time steps and updates the action policy. At each time, the agent 120 observes a state (s) of the environment 102. The agent 120 selects an action (a) from the set of available actions according to the current action policy and executes the selected action (a). The environment 102 may transit from the current state to a new state in response to the execution of the selected action (a). The agent 120 observes the new state and receives a reward signal (r) from the environment 102, which is associated with the transition. In the reinforcement learning, a well-designed reward function may be required to learn a good action policy for performing the task.; [0071] At step S105, the processing circuitry may observe an initial actual state s1 by the agent 120. The when a time period in which the determined reward amount does not vary extends for a predetermined time period, a re-learning for generating a new learning model is performed loop from step S106 to step S111 may be repeatedly performed for every time steps t (=1, 2, . . . ) until a given termination condition is satisfied (e.g., max number of steps, convergence determination condition, etc.).).
Kimura, Jain, and Lee are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 19, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Lee teaches wherein the sensor is a camera configured to capture at least one of an emotion, a degree of stress, and a level of satisfaction, as the reaction of the user to the performed action ([0107] According to an optional embodiment of the present invention, the computational device may feature one or more biological sensors, for configured to capture at least sensing various types of biological information about the user, such as one of an emotion emotional state, degree of stress, and a level of satisfaction physical state, movement, etc. This information may then be fed to sensors 104 for assisting perception unit 106 to determine the state of the user, and hence to determine the proper state for the device. Such biological sensors may include but are not limited to sensors for body temperature, heart rate, oxygen saturation or any other type of sensor which measures biological parameters of the user.; [0269] Optionally and preferably, this system is in communication with one or more biological sensors, which may optionally and preferably sense the biological state of the user, and/or sense movement of the user, etc. These additional sensors preferably provide information which enables the adaptive system to as the reaction of the user to the performed action determine the correct state for the mobile information device, without receiving specific input from the user and/or querying the user about the user's current state. Images captured by a wherein the sensor is a camera device camera could also optionally be used for this purpose.).
Kimura, Jain, and Lee are combinable for the same rationale as set forth above with respect to claim 1.
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Kimura, in view of Jain, Lee, and further in view of Chalmers et al. (NPL: "Context-Switching and Adaptation: Brain-Inspired Mechanisms for Handling Environmental Changes", hereinafter ‘Chalmers').
Regarding claim 8, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Kimura, as modified by Jain and Lee, fails to teach wherein a new learning model obtained as a result of the first re-learning process is newly generated based on the predetermined learning model.
Chalmers teaches wherein a new learning model obtained as a result of the first re-learning process is newly generated based on the predetermined learning model ([B. Context switching: detecting changes in the environment, pg. 3525] The agent maintains a history of its recent state transitions, and a library of saved models. It constantly checks its state transition history against its current model: wherein a new learning model obtained as a result of the first re-learning process is newly generated based on the predetermined learning model if the model no longer explains it satisfactorily, the agent assumes the environment has changed and saves its current model for future use. It can then retrieve from the library a previously saved model that better predicts the recent state transitions.).
Kimura, Jain, Lee, and Chalmers are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Kimura, Jain, and Lee, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Chalmers to Kimura before the effective filing date of the claimed invention in order to allow a previously-learned model to be efficiently adapted for use in a new task, while context switching allows learned models to be saved and recalled at the appropriate times (cf. Chalmers, [Abstract, pg. 3522] Reinforcement learning (RL) allows an intelligent agent to learn optimal behavior as it interacts with its environment. Conventional model-based RL algorithms learn rapidly, but can be slow to adapt to sudden changes in the environment. Animals’ brains, however, are thought to employ model-based RL mechanisms for learning, but are able to adapt to changes with relative ease. By employing “transfer learning”, they can recycle previously learned information to solve new problems with minimal new learning. We developed two brain-inspired methods that can allow model-based RL to cope with changes to the underlying process being learned: hierarchical state abstraction, and context-switching. Hierarchical state abstraction allows a previously-learned model to be efficiently adapted for use in a new task, while context switching allows learned models to be saved and recalled at the appropriate times. We test these mechanisms using grid-world simulations in which the goal remains constant, but contingencies for reaching it frequently change. These mechanisms allow an agent to significantly outperform a conventional model-based RL algorithm in the task.).
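For illustration only, the context-switching mechanism quoted above from Chalmers (maintain a history of recent state transitions and a library of saved models; if the current model no longer explains the history, save it for future use and retrieve a previously saved model that better predicts the recent transitions) may be sketched as follows; the fit metric, threshold, and data structures are hypothetical.

```python
# Hypothetical sketch of context switching over a library of saved models
# (illustrative only; not Chalmers' implementation).
def prediction_error(model, history):
    """Mean error of a model's next-state predictions over recent transitions."""
    return sum(abs(model(s) - s_next) for s, s_next in history) / len(history)

def maybe_switch(current_model, library, history, fit_threshold: float = 0.2):
    if prediction_error(current_model, history) <= fit_threshold:
        return current_model, library        # current model still explains the history
    library.append(current_model)            # save the current model for future use
    # Retrieve the previously saved model that best predicts recent transitions.
    best = min(library, key=lambda m: prediction_error(m, history))
    return best, library

# Example with trivial "models" over scalar states.
history = [(1.0, 2.0), (2.0, 4.0)]           # observed (state, next_state) pairs
model_a = lambda s: s + 1.0                  # explains an additive environment
model_b = lambda s: 2.0 * s                  # explains a multiplicative environment
model, library = maybe_switch(model_a, [model_b], history)  # switches to model_b
```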
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Kimura, in view of Jain, Lee, and further in view of Gao et al. (U.S. Pre-Grant Publication No. 20220044454, hereinafter ‘Gao').
Regarding claim 9, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Kimura, as modified by Jain and Lee, fails to teach wherein when a change exceeding the predetermined threshold occurs, the predetermined learning model is switched to another learning model different from the predetermined learning model, the another learning model being one of a plurality of learning models included in the information processing device or being obtainable from outside by the information processing device.
Gao teaches wherein when a change exceeding the predetermined threshold occurs, the predetermined learning model is switched to another learning model different from the predetermined learning model, the another learning model being one of a plurality of learning models included in the information processing device or being obtainable from outside by the information processing device ([0026] In some embodiments, the modification policy may be generated by a Markov decision process, dynamic programming, Q-learning, and/or any other suitable deep-reinforcement learning process. The when a change exceeding the predetermined threshold occurs, the predetermined learning model is switched to another learning model different from the predetermined learning model deep-reinforcement learning process applies a feedback learning procedure that is used to generate an optimal (or preferred) model or model modification, referred to herein as a “policy.” The model includes a plurality of states and state transitions that identify, for any given state, the optimal action (e.g., adjustment to the model) to be performed. Modifications that improve a model (e.g., beneficial modifications) are rewarded using a reward-feedback system implemented by the deep-reinforcement learning process.; [0027] The feedback data set 162 a may further include data regarding artifact reductions, acquisition protocols, reconstruction protocols, customer-specific requirements, changes, cross-sections, scanner specific data or adjustments incorporated during generation of the reconstructed image, other system specific feedback, a score or threshold indication related to the reconstructed image and/or the received model, modifications to the reconstructed image or received model made by the human expert, and/or other suitable feedback based on human expert interaction with the reconstructed image, image model, and/or image processing element 158.; [0032] During each iteration of the deep-reinforcement learning process 168, the another learning model being one of a plurality of learning models included in the information processing device or being obtainable from outside by the information processing device model modification element 166 modifies a preexisting image model, such as a predefined model 152 and/or refined image model generated during a prior iteration of the deep-reinforcement learning process 168, to incorporate the beneficial modifications identified using the feedback data sets 162 a, 162 b.).
Kimura, Jain, Lee, and Gao are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Kimura, Jain, and Lee, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Gao to Kimura before the effective filing date of the claimed invention in order to employ deep reinforcement learning to maximize utility of image model based on feedback data (cf. Gao, [0023] In some embodiments, the deep-reinforcement learning process 168 includes a goal-oriented process configured to maximize the utility of each image model based on the feedback data 162 a, 162 b received from the image processing element 158. For example, in some embodiments, one or more agents are tasked with maximizing an optimal policy related to the changes, modifications, or selections made by the image processing element 158 based on the received scan data and/or adjustments made by a human expert. The adjustments made by the human expert may be related to planning considerations (e.g., surgical planning, primary care, post-surgery follow-up, etc.), model deficiencies, patient scan data deficiencies, reconstruction, artifact correction, etc. The deep-reinforcement learning process 168 considers each choice or modification made by the image processing element 158 and/or the human expert and generates modifications for one or more models that account for and predict user changes, uses, and/or preferences.).
Claims 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Kimura, in view of Jain, Lee, and further in view of Aggarwal et al. (U.S. Pre-Grant Publication No. 20200341976, hereinafter ‘Aggarwal').
Regarding claim 11, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Kimura, as modified by Jain and Lee, fails to teach wherein the performed action includes generating text and presenting the text to a user, the determined reward amount includes the reaction of the user to whom the text is presented, and the re-learning includes a re-learning of the learning model, which is a model for generating the text.
Aggarwal teaches wherein the performed action includes generating text and presenting the text to a user, the determined reward amount includes the reaction of the user to whom the text is presented, and the re-learning includes a re-learning of the learning model, which is a model for generating the text ([0014] To initiate the search, a user submits a search query to the search system, typically via a browser-based search interface accessible by the user's computing device, based on the informational need of the user. The search query is in the form of text, e.g., one or more query terms or a question. The search system traverses through the search database, selects and scores resources based on their relevance to the search query, and the performed action includes generating text and presenting the text to a user provides the search results. The search results usually link to the selected resources. The search results can be ordered according to the scores and presented according to this order. Unfortunately, existing search interfaces do not allow for the search system to have a meaningful interaction with the user, and therefore such search systems are unable to obtain useful contextual cues, which are often missed or not provided in the initial search query provided by the user.; [0016] In an embodiment, the agent that facilitates an interactive determined reward amount includes the reaction of the user to whom the text is presented search experience is implemented the re-learning includes a re-learning of the learning model, which is a model for generating the text using an appropriate machine learning algorithm, such as reinforcement learning (RL) or a comparable technique. Reinforcement learning is type of machine learning that deals with how a software agent should take actions in a given environment so as to maximize some notion of cumulative reward.).
Kimura, Jain, Lee, and Aggarwal are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Kimura, Jain, and Lee, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Aggarwal to Kimura before the effective filing date of the claimed invention in order to improve search results (cf. Aggarwal, [0019] In contrast, an agent as variously discussed in the present disclosure provides assistance in subjective search tasks, wherein the nature of the search problem at hand is fundamentally different from such slot filling exercises. In particular, in subjective search, simple search modalities and slots cannot be defined in advance and need to be discovered. To this end, an agent as variously described herein engages the user directly into the search which comprises a sequence of alternate turns between user and agent with more degrees of freedom (in terms of different actions the agent can take). For example, assume a scenario where a designer is searching for digital assets (e.g., over a repository of images, or videos) to be used in a movie poster. The user would start with a broad idea or concept, and her initial search criteria would be refined as the interactive search progresses. The modified search criteria involve an implicit cognitive feedback (such as conversationally acquired context), which can be used to improve the search results. The agent is trained for this type of subjective search task.).
Regarding claim 12, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Kimura, as modified by Jain and Lee, fails to teach wherein the performed action includes making a recommendation to a user, the determined reward amount includes the reaction of the user to whom the recommendation is presented, and the re-learning includes a re-learning for making a new recommendation dependent on a change in a state of the user.
Aggarwal teaches wherein the performed action includes making a recommendation to a user, the determined reward amount includes the reaction of the user to whom the recommendation is presented ([0071] As second type of rewards of equation 1 is an extrinsic reward rextrinsic. This reward may be awarded at individual conversation turns, and hence, this reward is a summation of the extrinsic rewards at various conversation turns. This reward may be provided based on a response that the user 101 provides subsequent to an agent action. User actions may be categorized into two or more feedback categories, such as good, average, bad, etc. (or may be scaled in a scale or 1 to 5, with 5 being best or as intended by the agent 229, and 1 being worst). For example, if the the performed action includes making a recommendation to a user agent 229 prompts the user 101 to refine a search query and the user does follow the prompt, then the determined reward amount includes the reaction of the user to whom the recommendation is presented agent 229 receives a relatively high extrinsic reward rextrinsic, e.g., because the user 101 played along with the agent 229. On the other hand, if the user 101 refuses to refine the search query, a relatively low (or zero, or even negative) extrinsic reward rextrinsic is awarded to the agent 229.; [0119] As discussed herein, the search agent 229 assistant can be used to interact with the user 101, for helping the user 101 to search through the search database, while providing personalized recommendations, thereby making the environment an interactive recommendation plus search system.), and
the re-learning includes a re-learning for making a new recommendation dependent on a change in a state of the user ([0017] This information acquired by the RL-based agent is generally referred to herein as the state of the interactive search. So, at any given point in time during a conversational search session between an agent and a user, the interactive search has a known state, and the state can change or otherwise evolve in response to each cycle of the conversation. Thus, the the re-learning includes a re-learning for making a new recommendation dependent on a change in a state of the user agent action policy changes and evolves with each interactive cycle, based on the state of the interactive search. In this manner, the search results are updated based on what the RL-based agent has learned from its interactions with the user.; [0119] As discussed herein, the search agent 229 assistant can be used to interact with the user 101, for helping the user 101 to search through the search database, while providing personalized recommendations, thereby making the environment an interactive recommendation plus search system. In an example, the user 101 may possibly make an open-ended query, which may result in a diverse set of results, even though none of the results may be a good match. In such scenarios, the agent 229 prompts the user to refine the search query, or add additional details (e.g., such as where the search results would be used), in addition to providing recommendations.).
Kimura, Jain, Lee, and Aggarwal are combinable for the same rationale as set forth above with respect to claim 11.
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Kimura, in view of Jain, Lee, and further in view of Zhi et al. (U.S. Pre-Grant Publication No. 20190317472, hereinafter ‘Zhi').
Regarding claim 15, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Kimura, as modified by Jain and Lee, fails to teach wherein the performed action includes control of a moving object, the determined reward amount includes environment information relating to the moving object, and the re-learning includes a re-learning of the learning model, which is for controlling the moving object.
Zhi teaches wherein the performed action includes control of a moving object, the determined reward amount includes environment information relating to the moving object, and the re-learning includes a re-learning of the learning model, which is for controlling the moving object ([0055] The learning unit 83 performs machine learning using learning data created by the preprocessing unit 90. The learning unit 83 generates a learning model by using a well-known machine learning method, such as unsupervised learning, supervised learning, or reinforcement learning and stores the generated learning model in the learning model storage 84.; [0056] FIG. 5 illustrates an internal functional configuration of the learning unit 83 that performs reinforcement learning, as an example of learning methods. performed action includes control of a moving object, the determined reward amount includes environment information relating to the moving object Reinforcement learning is a method in which a cycle of observing a current state (i.e. an input) of an environment in which an object to be learned exists, performing a given action (i.e. an output) in the current state, and giving some reward for the action is repeated in a try-and-error manner and a policy (setting of coefficients of the Lugre model in the present embodiment) that maximizes the sum of rewards is learned as the optimal solution.; [0057] The re-learning includes a re-learning of the learning model, which is for controlling the moving object learning unit 83 includes a state observation unit 831, a determination data acquisition unit 832, and a reinforcement learning unit 833. The functional blocks illustrated in FIG. 5 are implemented by the CPU 11 of the controller 1 and the processor 101 of the machine learning device 100 illustrated in FIG. 3 executing their respective system programs and controlling operations of components of the controller 1 and the machine learning device 100.).
Kimura, Jain, Lee, and Zhi are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Kimura, Jain, and Lee, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Zhi to Kimura before the effective filing date of the claimed invention in order to utilize reinforcement learning unit to generate a learning model that is capable of outputting the optimal solution of an action responsive to the current state (cf. Zhi, [0090] By repeating the learning cycle as described above, the reinforcement learning unit 833 becomes able to automatically identify features that imply correlation of coefficients S1 of the Lugre model with a position command S2, a speed command S3 and a position feedback S4. At the beginning of a learning algorithm, correlation of the coefficients S1 of the Lugre model with the position command S2, the speed command S3 and the position feedback S4 is practically unknown. However, as the learning proceeds, the reinforcement learning unit 833 gradually becomes able to identify features and understand correlation. When the correlation of the coefficients S1 of the Lugre model with the position command S2, the speed command S3 and the position feedback S4 is understood to a certain reliable level, results of the learning that are repeatedly output from the reinforcement learning unit 833 become usable for performing selection (decision-making) of the action of determining what coefficients S1 of the Lugre model are to be set in response to the current state, namely, a speed command S3 and a position feedback S4. In this way, the reinforcement learning unit 833 generates a learning model that is capable of outputting the optimal solution of an action responsive to the current state.).
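For illustration only, the trial-and-error reinforcement learning cycle described in Zhi [0056] (observe the current state, perform an action, receive a reward, and repeat so as to learn a policy that maximizes the sum of rewards) may be sketched as a minimal tabular Q-learning loop; the toy environment, learning rate, and other values are hypothetical and this is not Zhi's implementation.

```python
import random

# Minimal tabular Q-learning sketch of the observe/act/reward cycle (toy values).
states, actions = range(3), range(2)
Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2                 # hypothetical hyperparameters

def step(state, action):
    """Toy environment: action 1 taken in state 2 achieves the goal (reward 1)."""
    next_state = min(state + action, 2)
    reward = 1.0 if (state == 2 and action == 1) else 0.0
    return next_state, reward

state = 0
for _ in range(1000):
    # Observe the current state and choose an action (epsilon-greedy).
    if random.random() < epsilon:
        action = random.choice(list(actions))
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    # Update the policy estimate toward reward plus discounted future value.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state
```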
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Kimura, in view of Jain, Lee, and further in view of Han (U.S. Pre-Grant Publication No. 20200034524) and Nazari et al. (U.S. Pre-Grant Publication No. 20190102676, hereinafter 'Nazari').
Regarding claim 16, Kimura, as modified by Jain and Lee, teaches The information processing device of claim 1.
Kimura, as modified by Jain and Lee, fails to teach wherein the performed action includes an attempt to authenticate a user, the determined reward amount includes evaluation information regarding authentication accuracy based on a result of the attempt to authenticate the user, and when the change in the determined reward amount exceeds the predetermined threshold, the processing circuitry is further configured to determine that the user is in a predetermined specific state and a re-learning suitable for the specific state is performed.
Han teaches wherein the performed action includes an attempt to authenticate a user, the determined reward amount includes evaluation information regarding authentication accuracy based on a result of the attempt to authenticate the user ([0054] In addition, voice emotion-related data 314 may be used by emotion recognizer 204 to the performed action includes an attempt to authenticate a user, the determined reward amount includes evaluation information regarding authentication accuracy based on a result of the attempt to authenticate the user authenticate user's identity and accumulate user's personal information and habit data in order to help the system to more accurately recognize user's voice and understand user's emotion in the voice. Text converted from voice emotion-related data 314 may be stored as history data and used by user intention computing processor 206 to derive interactive context in future interaction. Also, text converted from voice emotion-related data 314 may be used to derive scenario content. Furthermore, visual data, such as image, video, etc., containing facial expression emotion-related data 316 and gesture emotion-related data 318 may be used by emotion recognizer 204 to record and authenticate user's identity, for example, face ID unlock.; [0073] In some other embodiments, affective strategy formulator 208 may be implemented to build a Markov decision process (MDP) model through reinforcement learning based on a collection of status data (emotion-related data, emotion state, and/or semantic data), a collection of actions (normally referring to instructions), state conversion distribution function (the probability of user' emotion state to change after a certain action), reward function (to determine the ultimate purpose of an affective interaction session, e.g., when chatting with a robot, the longer the conversation is, the higher the reward function is).), and
Kimura, Jain, Lee, and Han are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Kimura, Jain, and Lee, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Han to Kimura before the effective filing date of the claimed invention in order to utilize reinforcement learning to determine what action to choose at each time instance, finding an optimal path to a solution (cf. Han, [0092] Semi-supervised learning is machine learning method that makes use of both labeled training data and unlabeled training data; [0093] One semi-supervised learning technique involves reasoning the label of unlabeled training data, and then using this reasoned label for learning. This technique may be used advantageously when the cost associated with the labeling process is high.; [0094] Reinforcement learning may be based on a theory that given the condition under which a reinforcement learning agent can determine what action to choose at each time instance, the agent can find an optimal path to a solution solely based on experience without reference to data.).
Nazari teaches when the change in the determined reward amount exceeds the predetermined threshold, the processing circuitry is further configured to determine that the user is in a predetermined specific state and a re-learning suitable for the specific state is performed ([0167] In response, the agent receives some value rt∈Rt i as a reward and observes the next state st+1 (denoted in FIG. 14 by the available next states 1406-1, 1406-2, . . . 1406-n in the next state space 1406). Note that, in some embodiments, a given action in a particular environment might probabilistically cause the environment to transition to different states (e.g., action 1404-1 might transition to next state 1406-1 with probability p1 and to next state 1406-2 with probability p2). These probabilities may be given by Pt i.; [0172] By decoupling the output of these two networks, the reward amounts can be identified. when the change in the determined reward amount exceeds the predetermined threshold Based on the reward amounts, policy weights are initialized at block 1502. The policy weights may be initialized in a greedy manner (e.g., by assigning the most weight to the policy that achieves the highest amount of value immediately), or may be initialized in other manners (e.g., random).; [0174] τt, t=0, 1, . . . represents the sequence of time points when the processing circuitry is further configured to determine that the user is in a predetermined specific state and a re-learning suitable for the specific state is performed either the agent needs to take an action for a user or update the policy (i.e., τt is the time point where Ct u ∪Ct a≠Ø). At each time step t, the agent observes the state of each customer i∈Ct a denoted by st i (block 1504) and chooses an action at i from the set of available actions A (block 1506). After executing the actions, the agent observes the value of the users (block 1508) and updates their next states (block 1510). At block 1510, the agent may optionally update its policy weights if one course of action yields more or less value than expected. The process then repeats (block 1512) for each available time step. At the end of the training process, the policy weights yielding the highest value over time are chosen to be used in the online process (block 1514).).
Kimura, Jain, Lee, Han, and Nazari are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Kimura, Jain, Lee, and Han, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Nazari to Kimura before the effective filing date of the claimed invention in order to capture changes of dynamics in the environment and used for predictive maintenance (cf. Nazari, [0019] The online process may be configured to capture changes of dynamics in the environment caused by changing one or more of the reward for taking the action or the probability of transitioning to a next state given that the action is taken. In one embodiment, the online process may be a deep concurrent temporal difference (DCTD) algorithm that applies a deep neural network to a model-free reinforcement learning method.; [0021] Exemplary embodiments may be used for, among other things, predictive maintenance in a network of connected devices. For example, the above-described actions may involve taking a device offline for maintenance with the goal of reducing future downtime (e.g., due to system failures). The above-described value may be an uptime, bandwidth, or other usage metric for the network. Thus, by removing a device in the short-term (thereby potentially incurring a cost or penalty), the long-term usage of the network may be improved, although usage improvements are not guaranteed—for instance, the predictive maintenance may be unnecessary because the likelihood of device failure is low. By applying the combination of the offline and online training process, the system can make better decisions to improve long-term network usage.).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Doya et al. (NPL: “Multiple Model-based Reinforcement Learning”) teaches organization of multiple model-based reinforcement learning (MMRL) architecture.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAGGIE MAIDO whose telephone number is (703) 756-1953. The examiner can normally be reached M-Th: 6am - 4pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MM/Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129