Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Continued Examination Under 37 CFR 1.114
2. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/20/2025 has been entered. Claims 1, 5, 11, and 14 have been amended. Claim 21 has been added. Claims 1, 5-7, 9-12, 14-16, 18, and 21 remain pending in the application.
Response to Amendments
3. Applicant’s amendments to claims 1 and 11 have been fully considered and are persuasive. The amendments provided to overcome the 35 U.S.C. § 101 rejection (abstract idea) issued in the last Office action are sufficient. The 35 U.S.C. § 101 rejection of the claims is respectfully withdrawn.
Response to Arguments
Applicant argues that Yoon addresses a different problem and applies a fundamentally different learning structure. Examiner respectfully disagrees and notes that Yoon trains a policy agent to imitate the behavior of an expert (the actual delivery worker), generates trajectories from that policy, trains a reward network to evaluate those trajectories against expert trajectories, and updates the policy via PPO reinforcement learning. Thus, Yoon’s learning structure is directly relevant to the claimed behavior-learning process.
Applicant argues that Yoon does not disclose generating an initial behavior policy through supervised learning for the first behavior trajectory information. Examiner respectfully disagrees and notes that Yoon teaches generating a policy agent from the first trajectory. Specifically, Yoon discloses that the reward network generation unit may use the state of the first trajectory as training data to generate a policy agent configured to output an action for an input state. Yoon further discloses generating a second trajectory through the policy agent, where the second trajectory is obtained by imitating the action from the state of the first trajectory. Claussen teaches that imitation learning aims to mimic behavior in a task, and states that behavioral cloning learns a policy as a supervised learning problem over state-action pairs from expert trajectories. Accordingly, the combination of Yoon and Claussen teaches the argued features.
Applicant argues that Yoon does not use a learned policy to autonomously control a device and generate real-world trajectory data, and that Yoon only generates synthetic trajectories to train the reward model. Examiner respectfully disagrees and notes that claim 1 contains no limitation requiring trajectories to be generated through real-world device deployment. The claim recites only “obtaining second behavior trajectory information based on the initial behavior policy, the second behavior trajectory information comprising a pair of the state data and autonomous control data derived by processing the state data based on the initial behavior policy.” Nothing in this language restricts the second trajectory to real-world data. The claim language is fully satisfied by any process that derives autonomous control data from state data using the initial behavior policy. Yoon teaches that the second trajectory is a trajectory obtained by imitating the action from the state of the first trajectory and includes the state of the first trajectory and the action imitated based on the state of the first trajectory. Yoon further states that the reward network generation unit acquires an action for the state of the first trajectory through the policy agent and generates the second trajectory based on the state of the first trajectory and the acquired action.
Applicant argues that Yoon does not disclose or suggest, "training an evaluation model for classifying the first behavior trajectory information and the second behavior trajectory information." Examiner respectfully disagrees and notes that Yoon teaches the reward network is trained on both the first trajectory and second trajectory and learns to assign different reward values to each. This is exactly the training of an evaluation model for classifying the two trajectory types.
Applicant argues that Yoon does not disclose, "updating the initial behavior policy through reinforcement learning using the trained evaluation model directly as a reward function." Examiner respectfully disagrees and notes that Yoon teaches updating the policy agent based on rewards output by the reward network. Yoon states that the reward network generation unit may update the weight of the policy agent through a proximal policy optimization (PPO) algorithm on the basis of a second reward for the second trajectory acquired through the reward network.
Applicant argues that Yoon does not disclose or suggest, "wherein the method further comprises iteratively generating additional behavior trajectory information based on an updated behavior policy." Examiner respectfully disagrees and notes that the claim does not require that each iteration be performed through physical real-world deployment of the device. It recites iteratively generating additional behavior trajectory information based on an updated behavior policy. Yoon teaches that the reward network and policy agent are generated through an iterative learning process. Yoon further states that “by repeatedly performing this process, the policy agent and the reward network are simultaneously trained through interaction. As the training is repeated, the policy agent adopts an action pattern similar to that of the motorcycle delivery worker.” Accordingly, Yoon teaches the claimed iterative generation of additional behavior trajectory information based on an updated behavior policy.
Claim Rejections – 35 USC § 103
4. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
5. Claims 1, 5, 11-12, 14, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Yoon et al. (U.S. Patent Application Pub. No. US 20220405682 A1) in view of Claussen et al. (U.S. Patent Application Pub. No. US 20200293013 A1).
Claim 1: Yoon teaches a method of operating an electronic device for learning a behavior of a user (i.e. FIG. 2, the inverse reinforcement learning-based delivery means detection apparatus (hereinafter referred to as a delivery means detection apparatus) 100 according to a desirable embodiment of the present invention may train an artificial neural network model using a driving record of an actual delivery worker and an imitated driving record; para. [0041]), the method comprising:
receiving input data related to the behavior of the user, the input data comprising state data associated with a current state of the electronic device and control data input by the user to control the electronic device (i.e. fig. 6, the first trajectory is a trajectory acquired from the driving record of the actual delivery worker and may include a state-action pair. The state indicates the current static state and may include information on latitude, longitude, interval, distance, speed, cumulative distance, and cumulative time. The action indicates an action taken dynamically in the corresponding state and may include information on velocity in the x-axis direction, velocity in the y-axis direction, and acceleration; para. [0113]);
obtaining first behavior trajectory information by processing the input data (i.e. the first trajectory is a trajectory acquired from the driving record of the actual delivery worker and may include a state-action pair. The state indicates the current static state and may include information on latitude, longitude, interval, distance, speed, cumulative distance, and cumulative time. The action indicates an action taken dynamically in the corresponding state and may include information on velocity in the x-axis direction, velocity in the y-axis direction, and acceleration; para. [0113]), the first behavior trajectory information comprising a pair of the state data and the control data generated by matching the state data and the control data (i.e. the first trajectory is a trajectory acquired from the driving record of the actual delivery worker and may include a state-action pair. The state indicates the current static state of the delivery worker and may include information on latitude, longitude, interval, distance, speed, cumulative distance, and cumulative time. The action indicates an action taken dynamically by the delivery worker in the corresponding state and may include information on velocity in the x-axis direction, velocity in the y-axis direction, and acceleration. For example, when the state is “interval=3 seconds & speed=20 m/s,” an action that can be taken in the state in order to increase the speed may be “acceleration=30 m/s2” or “acceleration=10 m/s2.”; para. [0045]);
generating an initial behavior policy through learning for the first behavior trajectory information (i.e. Then, the delivery means detection apparatus 100 may initialize a policy agent and a reward network (S112). That is, the delivery means detection apparatus 100 may initialize the weight of the reward network and the weight of the policy agent using a Gaussian distribution; para. [0046, 0114, 0115]);
obtaining second behavior trajectory information based on the initial behavior policy, the second behavior trajectory information comprising a pair of the state data and autonomous control data derived by processing the state data based on the initial behavior policy (i.e. the delivery means detection apparatus 100 may generate a second trajectory through the policy agent (S113). Here, the second trajectory is a trajectory obtained by imitating the action from the state of the first trajectory and may include a pair of the state of the first trajectory and the action imitated based on the state of the first trajectory. In this case, the delivery means detection apparatus 100 may generate the policy agent configured to output an action for an input state using the state of the first trajectory as training data. The delivery means detection apparatus 100 may acquire an action for the state of the first trajectory through the policy agent and may generate the second trajectory on the basis of the state of the first trajectory and the acquired action; para. [0046, 0115]);
sampling the first behavior trajectory information and the second behavior trajectory information (i.e. Also, the delivery means detection apparatus 100 may select a sample from the first trajectory and the second trajectory (S114). That is, the delivery means detection apparatus 100 may select a portion of the second trajectory as a sample through an importance sampling algorithm and may acquire, from the first trajectory, a sample corresponding to the portion of the second trajectory selected as the sample; para. [0047, 0116]);
training an evaluation model for classifying the first behavior trajectory information and the second behavior trajectory information (i.e. the reward network generation unit 110 may acquire a distributional difference between rewards on the basis of a first reward for the first trajectory acquired through the reward network and a second reward for the second trajectory acquired through the reward network and may update the weight of the reward network; para. [0049, 0066]); and
updating the initial behavior policy through reinforcement learning using the trained evaluation model directly as a reward function (i.e. the reward network generation unit 110 may update the weight of the policy agent on the basis of the second reward for the second trajectory acquired through the reward network. Also, the reward network generation unit 110 may update the weight of the policy agent on the basis of the second reward through a proximal policy optimization (PPO) algorithm; para. [0049, 0050, 0054]; Also, the delivery means detection apparatus 100 may update the weight of the policy agent through the proximal policy optimization (PPO) algorithm on the basis of the second reward (S117); para. [0118, 0119]),
wherein the method further comprises iteratively generating additional behavior trajectory information based on an updated behavior policy (i.e. the reward network generation unit 110 may initialize the weight of the reward network and the weight of the policy agent using a Gaussian distribution and may generate the reward network and the policy agent through an iterative learning process; para. [0048, 0088]).
Yoon does not explicitly teach supervised learning.
However, Claussen teaches supervised learning (i.e. As shown in the Figures, the learning approach taken by the ML-PLC 152 may be, as examples, reinforcement learning or imitation learning. Imitation learning techniques aim to mimic behavior in a given task. That is, the goal of the ML-PLC 152 is to copy the behavior of the legacy PLC 142. If no reward is define or available, techniques such as behavioral cloning, which learns a policy as a supervised learning problem over state-action pairs from expert trajectories (e.g., the legacy PLC 142 outputs) may be applied. Alternatively, inverse reinforcement learning may be applied to the ML-PLC 152. In inverse reinforcement learning, the ML-PLC 152 determines the legacy PLC 142 objectives, values, or rewards by observing the behavior of the legacy PLC 142 and tries to determine the reward function that leads to the behavior of the legacy PLC 142; para. [0034]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Yoon to include the feature of Claussen. One would have been motivated to make this modification because it ensures the generated policy is rooted in expert knowledge and minimizes ambiguity during the initial stages of learning.
Claim 5: Yoon and Claussen teach the method of claim 1. Yoon further teaches wherein the obtaining of the second behavior trajectory information further includes generating the second behavior trajectory information comprising the pair of the state data and autonomous control data by matching the state data and the autonomous control data (i.e. the delivery means detection apparatus 100 may generate a second trajectory through the policy agent (S113). Here, the second trajectory is a trajectory obtained by imitating the action from the state of the first trajectory and may include a pair of the state of the first trajectory and the action imitated based on the state of the first trajectory. In this case, the delivery means detection apparatus 100 may generate the policy agent configured to output an action for an input state using the state of the first trajectory as training data. The delivery means detection apparatus 100 may acquire an action for the state of the first trajectory through the policy agent and may generate the second trajectory on the basis of the state of the first trajectory and the acquired action; para. [0115]).
Claim 11 is similar in scope to Claim 1 and is rejected under a similar rationale.
Claim 12: Yoon and Claussen teach the electronic device of claim 11. Yoon further teaches wherein the first behavior trajectory information includes information on a behavior feature vector composed of a pair of the state data and the control data (i.e. the first trajectory is a trajectory acquired from the driving record of the actual delivery worker and may include a state-action pair. The state indicates the current static state and may include information on latitude, longitude, interval, distance, speed, cumulative distance, and cumulative time. The action indicates an action taken dynamically in the corresponding state and may include information on velocity in the x-axis direction, velocity in the y-axis direction, and acceleration; para. [0113]).
Claim 14 is similar in scope to Claim 5 and is rejected under a similar rationale.
Claim 21: Yoon and Claussen teach the method of claim 1. Yoon further teaches wherein iteratively generating the additional behavior trajectory information (i.e. the reward network generation unit 110 may initialize the weight of the reward network and the weight of the policy agent using a Gaussian distribution and may generate the reward network and the policy agent through an iterative learning process; para. [0048, 0088]) comprises executing the updated behavior policy on the electronic device (i.e. the reward network generation unit 110 may update the weight of the policy agent on the basis of the second reward for the second trajectory acquired through the reward network. Also, the reward network generation unit 110 may update the weight of the policy agent on the basis of the second reward through a proximal policy optimization (PPO) algorithm; para. [0050]) to autonomously control the electronic device (i.e. The delivery means detection apparatus 100 may acquire an action for the state of the first trajectory through the policy agent and may generate the second trajectory on the basis of the state of the first trajectory and the acquired action; para. [0115]) and collect new state data and autonomous control data generated based on the updated behavior policy (i.e. the second trajectory is a trajectory obtained by imitating the action from the state of the first trajectory and may include a pair of the state of the first trajectory and the action imitated based on the state of the first trajectory; para. [0113, 0115]).
Claussen further teaches executing the updated behavior policy on the electronic device to autonomously control the electronic device (i.e. In the second phase, after convergence, the ML-PLC 152 may be incorporated into the automation environment 100 (222). That is, the ML-PLC 152, after training, may take over the control operations of the legacy PLC 142. After replacement by the ML-PLC, the legacy PLC 142 may be removed from the automation environment 100; para. [0013, 0020, 0041]) and collect new state data (i.e. The ML-PLC 152 may store and retrieve data from a local or remote process data repository 316. The data may train the ML-PLC. For instance, the process data repository 316 may store batch data 318 for training the ML-PLC 152 with sets of sampled or captured PLC inputs 150 and PLC outputs 146. In other implementations, the process data repository 316 may store real-time process data (including the sampled or captured PLC inputs 150 and PLC outputs 146) as it streams in from the automation environment 100; para. [0029]) and autonomous control data generated based on the updated behavior policy (i.e. the ML-PLC 152 refines its operational model over time (218) until the operational model successfully represents the behavior of the legacy PLC 142 (220); para. [0018, 0019, 0041]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Yoon to include the feature of Claussen. One would have been motivated to make this modification because it ensures the generated policy is rooted in expert knowledge and minimizes ambiguity during the initial stages of learning.
6. Claims 6-7, 9-10, and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Yoon in view of Claussen, and further in view of Lyu et al. (U.S. Patent Application Pub. No. US 20240160945 A1).
Claim 6: Yoon and Claussen teach the method of claim 1. Yoon further teaches wherein the sampling further includes: generating a first data set by tracking the first behavior trajectory information; generating first sample data by sampling the first data set with a batch size; generating a second data set by tracking the second behavior trajectory information (i.e. A delivery means detection method performed by an inverse reinforcement learning-based delivery means detection apparatus according to a desirable embodiment of the present invention for achieving the above object includes steps of generating a reward network that outputs a reward for an input trajectory using, as training data, a first trajectory including a pair of a state, which indicates a current static state, and an action, which indicates an action that is dynamically taken in the state, and a second trajectory including a pair of the state of the first trajectory and an action imitated based on the state of the first trajectory; and acquiring a reward for a trajectory to be detected from the trajectory to be detected using the reward network and detecting a delivery means for the trajectory to be detected on the basis of the reward for the trajectory to be detected; para. [0016]); and generating second sample data by sampling the second data set with the batch size (i.e. In this case, the reward network generation unit 110 may select a portion of the second trajectory as a sample through an importance sampling algorithm, acquire, from the first trajectory, a sample corresponding to the portion of the second trajectory selected as the sample, and generate the reward network using, as training data, the portion of the first trajectory acquired as the sample and the portion of the second trajectory acquired as the sample. Here, the importance sampling is a scheme of giving a higher probability of sampling to a less learned sample and may be calculated as a reward for an action equal to the probability of the policy agent selecting the action. For example, assuming one action is “a,” the probability that “a” will be sampled becomes the probability of choosing (reward for “a”)/“a.”; para. [0047]).
Yoon does not explicitly teach a specified batch size.
However, Lyu teaches a specified batch size (i.e. Based on Eq. (9), the batch gradient of the policy network 202 can be given by, where N is the batch size sample from experience replay buffer [symbol image omitted]; para. [0063, 0072]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Yoon and Claussen to include the feature of Lyu. One would have been motivated to make this modification because it helps optimize memory usage during training.
Claim 7: Yoon, Claussen, and Lyu teach the method of claim 6. Yoon further teaches wherein the training of the evaluation model further includes: distinguishing a source of a behavior policy and whether a task is successful with respect to the first sample data and the second sample data; and training the evaluation model to distinguish the first sample data and the second sample data using learning (i.e. the delivery means detection apparatus 100 may acquire a distributional difference on the basis of the first reward and the second reward and update the weight of the reward network (S116). For example, the delivery means detection apparatus 100 may acquire a distributional difference between rewards through an evidence of lower bound (ELBO) optimization algorithm on the basis of the first reward and the second reward and may update the weight of the reward network; para. [0118]).
Yoon does not explicitly teach adding a label for distinguishing a source and whether a task is successful, or training using supervised learning.
Lyu further teaches wherein the training of the evaluation model further includes: adding a label for distinguishing a source of a behavior policy (i.e. where C denotes the CNN-based policy network, θC denotes network parameters, st denotes input data at the time step t, yt denotes labels, i.e., actions from human participants, and DIL denotes the entire dataset; para. [0160]) and whether a task is successful with respect to the first sample data and the second sample data (i.e. The success rate of task completion and the vehicle dynamic states (i.e., the yaw rate and lateral acceleration) are selected as evaluation parameters to assess the control performance of the autonomous driving agent. The heat map shown in FIG. 14 a shows that the agent trained by Hug-DRL successfully completed tasks in all untrained scenarios, while agents under all baseline methods could complete only parts of the testing scenarios. Specifically, the success rates of the baseline methods are 84.6% for vanilla-DRL and DAgger, 76.9% for HI-RL, 73.1% for vanilla-IL, and 65.3% for IA-RL. In addition, the yaw rate and lateral acceleration of the agent for each method under scenario 1 are recorded and assessed, as shown in FIG. 14 b . Hug-DRL led to the smoothest driving behaviour, with an acceleration of 0.37 m/s2, and HI-RL resulted in the most unstable driving behaviour (1.85 m/s2). The performances of the other baseline methods are roughly similar; para. [0127]); and training the evaluation model to distinguish the first sample data and the second sample data based on the label using supervised learning (i.e. They either directly replace the output actions of DRL by using human actions or use supervised learning (SL) with human demonstrations to pre-train the DRL agent, while the learning algorithm architecture remains unchanged; para. [0048]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Yoon and Claussen to include the feature of Lyu. One would have been motivated to make this modification because it enhances the model’s ability to assess behavior policies with precision.
Claim 9: Yoon, Claussen, and Lyu teach the method of claim 7. Yoon further teaches wherein the updating of the initial behavior policy further includes: obtaining third behavior trajectory information based on the trained behavior policy; and generating third sample data by sampling the third behavior trajectory information (i.e. fig. 6, When the learning is not finished (S118-N), the delivery means detection apparatus 100 may perform steps S113 to S117 again; para. [0120]).
Claim 10: Yoon, Claussen, and Lyu teach the method of claim 9. Yoon further teaches comprising: training the evaluation model based on the first sample data and the third sample data and updating the trained behavior policy (i.e. fig. 6, When the learning is not finished (S118-N), the delivery means detection apparatus 100 may perform steps S113 to S117 again; para. [0120]).
Claim 15: Yoon and Claussen teach the electronic device of claim 11. Yoon further teaches wherein the behavior policy learning circuit is further configured to: generate first sample data by sampling the first behavior trajectory information (i.e. A delivery means detection method performed by an inverse reinforcement learning-based delivery means detection apparatus according to a desirable embodiment of the present invention for achieving the above object includes steps of generating a reward network that outputs a reward for an input trajectory using, as training data, a first trajectory including a pair of a state, which indicates a current static state, and an action, which indicates an action that is dynamically taken in the state, and a second trajectory including a pair of the state of the first trajectory and an action imitated based on the state of the first trajectory; and acquiring a reward for a trajectory to be detected from the trajectory to be detected using the reward network and detecting a delivery means for the trajectory to be detected on the basis of the reward for the trajectory to be detected; para. [0016]); generate second sample data by sampling the second behavior trajectory information (i.e. In this case, the reward network generation unit 110 may select a portion of the second trajectory as a sample through an importance sampling algorithm, acquire, from the first trajectory, a sample corresponding to the portion of the second trajectory selected as the sample, and generate the reward network using, as training data, the portion of the first trajectory acquired as the sample and the portion of the second trajectory acquired as the sample. Here, the importance sampling is a scheme of giving a higher probability of sampling to a less learned sample and may be calculated as a reward for an action equal to the probability of the policy agent selecting the action. For example, assuming one action is “a,” the probability that “a” will be sampled becomes the probability of choosing (reward for “a”)/“a.”; para. [0047]); and distinguish a source of a behavior policy and whether a task is successful with respect to the first sample data and the second sample data (i.e. the delivery means detection apparatus 100 may acquire a distributional difference on the basis of the first reward and the second reward and update the weight of the reward network (S116). For example, the delivery means detection apparatus 100 may acquire a distributional difference between rewards through an evidence of lower bound (ELBO) optimization algorithm on the basis of the first reward and the second reward and may update the weight of the reward network; para. [0118]).
Yoon does not explicitly teach adding a label for distinguishing a source and whether a task is successful.
Lyu teaches adding a label for distinguishing a source of a behavior policy (i.e. where C denotes the CNN-based policy network, θC denotes network parameters, st denotes input data at the time step t, yt denotes labels, i.e., actions from human participants, and DIL denotes the entire dataset; para. [0160]) and whether a task is successful with respect to the first sample data and the second sample data (i.e. The success rate of task completion and the vehicle dynamic states (i.e., the yaw rate and lateral acceleration) are selected as evaluation parameters to assess the control performance of the autonomous driving agent. The heat map shown in FIG. 14 a shows that the agent trained by Hug-DRL successfully completed tasks in all untrained scenarios, while agents under all baseline methods could complete only parts of the testing scenarios. Specifically, the success rates of the baseline methods are 84.6% for vanilla-DRL and DAgger, 76.9% for HI-RL, 73.1% for vanilla-IL, and 65.3% for IA-RL. In addition, the yaw rate and lateral acceleration of the agent for each method under scenario 1 are recorded and assessed, as shown in FIG. 14 b . Hug-DRL led to the smoothest driving behaviour, with an acceleration of 0.37 m/s2, and HI-RL resulted in the most unstable driving behaviour (1.85 m/s2). The performances of the other baseline methods are roughly similar; para. [0127]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Yoon and Claussen to include the feature of Lyu. One would have been motivated to make this modification because it enhances the model’s ability to assess behavior policies with precision.
Claim 16: Yoon, Claussen, and Lyu teach the electronic device of claim 15. Yoon further teaches wherein the behavior policy learning circuit is further configured to train the evaluation model to distinguish the first sample data and the second sample data using learning (i.e. the delivery means detection apparatus 100 may acquire a distributional difference on the basis of the first reward and the second reward and update the weight of the reward network (S116). For example, the delivery means detection apparatus 100 may acquire a distributional difference between rewards through an evidence of lower bound (ELBO) optimization algorithm on the basis of the first reward and the second reward and may update the weight of the reward network; para. [0118]).
Yoon does not explicitly teach distinguishing the sample data based on the label using supervised learning.
Lyu further teaches training the evaluation model to distinguish the first sample data and the second sample data based on the label using supervised learning (i.e. They either directly replace the output actions of DRL by using human actions or use supervised learning (SL) with human demonstrations to pre-train the DRL agent, while the learning algorithm architecture remains unchanged; para. [0048]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Yoon and Claussen to include the feature of Lyu. One would have been motivated to make this modification because it enhances the model’s ability to assess behavior policies with precision.
7. Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Yoon in view of Claussen, Lyu, and further in view of Candido et al. (U.S. Patent Application Pub. No. US 20210181768 A1).
Claim 18: Yoon, Claussen, and Lyu teach the electronic device of claim 16. Yoon further teaches wherein the behavior policy learning circuit is further configured to evaluate the trained behavior policy and store a final behavior policy (i.e. A computer program according to a desirable embodiment of the present invention for achieving the above object is stored in a computer-readable recording to execute, in a computer, the inverse reinforcement learning-based delivery means detection method; para. [0021]).
Yoon does not explicitly teach evaluating the trained behavior policy when a performance of the trained behavior policy meets a criteria.
Lyu further teaches evaluating the trained behavior policy when a performance of the trained behavior policy meets a criteria (i.e. FIG. 4 illustrates the experimental set-up. The experimental platform is a human-in-the-loop driving simulator 400. Key components used in the platform include a steering wheel 402 controlled by a human participant 408, a real-time computation platform 404, three monitors, and simulated driving scenarios 406 in the monitors. As shown in FIG. 4, there are two different model initial conditions of the DRL agent during training, namely the ‘cold-start’ (i.e., 410) and ‘pre-trained’ (i.e., 412). The condition of cold-start is used in the initial training of the DRL agent, and the condition of the pre-trained policy is used for evaluating the fine-tuning performance of the DRL agent; para. [0077]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Yoon and Claussen to include the feature of Lyu. One would have been motivated to make this modification because it enhances the model’s ability to assess behavior policies with precision.
The combination of Yoon, Claussen, and Lyu does not explicitly teach storing a final behavior policy when a performance of the trained behavior policy meets a criteria.
However, Candido teaches to evaluate the trained behavior policy and store a final behavior policy when a performance of the trained behavior policy meets a criteria (i.e. FIG. 7B, method 750 is an exemplary method for evaluating learned flight policies and learning systems, in accordance with one or more embodiments. A learned flight policy may be obtained from the policy server at step 716, and a reward for the learned flight policy may be evaluated at step 718. As discussed above, an evaluation server (e.g., evaluation server 512) may run a set of simulations on a learned flight policy to evaluate its performance. At 720, a determination may be made as to whether the reward for the learned flight policy meets or exceeds a performance threshold (e.g., a threshold for operation-ready flight policies). If yes, the learned flight policy that was evaluated may be stored in an operation-ready policies server at step 722; para. [0064]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Yoon, Claussen, and Lyu to include the feature of Candido. One would have been motivated to make this modification because it saves processing time and storage space.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Kimura et al. (Pub. No. US 20190272465 A1): Inverse reinforcement learning (IRL) is one of methods collectively referred to as “imitation learning”. In the IRL, an optimal reward function is tried to be recovered as a best description behind given expert demonstrations obtained from humans or other experts. In the conventional IRL, it is typically assumed that the expert demonstrations contain both the state and action information to solve the imitation learning problem.
It is noted that any citation to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way. A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art. In re Heck, 699 F.2d 1331, 1332-33, 216 U.S.P.Q. 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 U.S.P.Q. 275, 277 (C.C.P.A. 1968)).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TAN TRAN, whose telephone number is (303) 297-4266. The examiner can normally be reached Monday through Thursday, 8:00 am to 5:00 pm MT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matt Ell, can be reached at 571-270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TAN H TRAN/Primary Examiner, Art Unit 2141