Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
Claims 1 – 14 are pending and examined herein.
Claims 1 – 14 are rejected under 35 U.S.C. 112(b).
Claims 1 – 14 are rejected under 35 U.S.C. 101.
Claims 1, 9, 12 are rejected under 35 U.S.C. 103.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: “P” in Fig. 4. Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(4) because reference character “100” has been used to designate both “a method” and “the autonomous vehicle/ the vehicle” in Fig. 2, Fig. 5. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The disclosure is objected to because of the following informalities:
Reference number (100) is used to refer to “a method” in [0082], [0086], [0088], [0089], and [0090], and to “the autonomous vehicle/the vehicle” in [0104] – [0106].
There is a typo in the reference characters in [0098] and [0104]: “sate signal (s)” should read “state signal (s)”.
Reference number (13) is not explained in the specification.
Reference number (12) is recited in the specification, but only reference characters 12a and 12b appear in Fig. 6.
Appropriate correction is required.
Claim Objections
Claim 12 is objected to because of the following informalities:
Claim 12 recites the limitation “potentially unsafe actions provided by the policy module are mapped to safe actions, by the safety module of the machine learning system, to safe actions.” Removing the second occurrence of “to safe actions” would make the claim clearer.
Appropriate correction is required.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1 - 14 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
MPEP § 2106(III) sets out steps for evaluating whether a claim is drawn to patent-eligible subject matter. The analysis of claims 1 – 14, in accordance with these steps, follows.
Step 1 Analysis:
Step 1 is to determine whether the claim is directed to a statutory category (process, machine, manufacture, or composition of matter).
Claims 1 – 10 and 12 are directed to a computer-implemented method, which falls within the statutory category of a process. Claims 11 and 13 are directed to a machine learning system and a training system, respectively, which fall within the statutory category of a machine. Claim 14 is directed to a non-transitory machine-readable storage medium, which can be an article of manufacture.
Step 2A Prong One, Step 2A Prong Two, and Step 2B Analysis:
Step 2A Prong One asks if the claim recites a judicial exception (abstract idea, law of nature, or natural phenomenon). If the claim recites a judicial exception, analysis proceeds to Step 2A Prong Two, which asks if the claim recites additional elements that integrate the abstract idea into a practical application. If the claim does not integrate the judicial exception, analysis proceeds to Step 2B, which asks if the claim amounts to significantly more than the judicial exception. If the claim does not amount to significantly more than the judicial exception, the claim is not eligible subject matter under 35 U.S.C. 101.
Regarding claim 1, the following claim elements are abstract ideas:
determining, …, a distribution of potentially unsafe actions that could be executed by the technical system, (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
wherein the policy module determines the distribution based on the obtained state signal; (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
sampling a potentially unsafe action from the distribution; (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
determining a loss value based on the state signal and the safe action, (Determining a loss value based on state and action is merely reciting mathematical calculation, which is mathematical concept.)
updating parameters of the policy module according to a gradient of the loss value with respect to the parameters. (Updating parameters according to a loss value is merely reciting mathematical calculation, which is mathematical concept.)
The following claim elements are additional elements which, taken alone or in combination with the other additional elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
wherein the machine learning system is configured to determine a control signal characterizing an action to be executed by a technical system, (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
obtaining a safe action to be executed by the technical system, including (This is mere data gathering, an insignificant extra solution activity, which is a well-understood, routine conventional activity. It does not integrate the judicial exception into a practical application. See MPEP § 2106.05(d). Therefore, this does not amount to significantly more than the judicial exception.)
obtaining a state signal, (This is mere data gathering, an insignificant extra solution activity, which is a well-understood, routine conventional activity. It does not integrate the judicial exception into a practical application. See MPEP § 2106.05(d). Therefore, this does not amount to significantly more than the judicial exception.)
wherein the state signal characterizes a state of an environment; (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
by a parametrized policy module of the machine learning system, (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
obtaining, … ,the safe action, (This is mere data gathering, an insignificant extra solution activity, which is a well-understood, routine conventional activity. It does not integrate the judicial exception into a practical application. See MPEP § 2106.05(d). Therefore, this does not amount to significantly more than the judicial exception.)
, by a safety module of the machine learning system, (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
wherein the safe action is obtained based on the sampled potentially unsafe action and a set of safe actions with respect to a current environment of the technical system; (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
wherein the loss value characterizes a reward obtained based on the safe action; (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
training the machine learning system by (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
Regarding claim 2, the rejection of claim 1 is incorporated herein. Further, claim 2 recites the following abstract ideas:
wherein the obtaining of the safe action by the safety module includes mapping the potentially unsafe action to an action from the set of safe actions when the potentially unsafe action is not in the set of safe actions, (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
wherein the mapping is performed using a piecewise diffeomorphism. (Mapping by using a piecewise diffeomorphism is merely reciting mathematical formula, which is mathematical concept.)
Claim 2 does not recite additional elements.
Regarding claim 3, the rejection of claim 2 is incorporated herein. Further, claim 3 recites the following abstract ideas:
determining a countable partition of the space of actions; (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
determining, … , whether the set is safe set or an unsafe set, (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
wherein a set is determined as safe set when the set includes only actions from the set of safe actions and when there exists a trajectory of actions for future states that includes only safe actions and wherein a set is determined as unsafe set otherwise; (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
when the potentially unsafe action is in an unsafe set: (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
determining a safe set from the partition based on the distribution of the potentially unsafe actions; (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
mapping the potentially unsafe action to an action from the safe set; (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
Otherwise, when the potentially unsafe action is not in an unsafe set, (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
Claim 3 further recites the following additional elements:
wherein the mapping of the potentially unsafe action to an action from the set of safe actions includes: (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
for each set of the countable partition, (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
providing the action as the safe action; (This is merely insignificant extra-solution activity corresponding to mere data gathering or presenting information (e.g. presenting offer). See MPEP § 2106.05(g). Therefore, this does not amount to significantly more than the judicial exception.)
providing the potentially unsafe action as the safe action. (This is merely insignificant extra-solution activity corresponding to mere data gathering or presenting information (e.g. presenting offer). See MPEP § 2106.05(g). Therefore, this does not amount to significantly more than the judicial exception.)
Regarding claim 4, the rejection of claim 3 is incorporated herein. Further, claim 4 recites the following abstract ideas:
wherein the determining the safe set includes determining, …, a probability density of a representative action of the safe set of the partition with respect to the distribution of potentially unsafe actions, (Determining a probability distribution is merely reciting mathematical calculation, which is mathematical concept.)
wherein the safe set including the representative action with a highest probability density value is provided as determined safe set. (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
Claim 4 further recites the following additional element:
for each safe set in the partition, (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
Regarding claim 5, the rejection of claim 3 is incorporated herein. Further, claim 5 recites the following abstract ideas:
wherein the determining of the safe set includes determining, …, a probability density of a representative action of the safe set of the partition with respect to the distribution of potentially unsafe actions, (Determining a probability distribution is merely reciting mathematical calculation, which is mathematical concept.)
Claim 5 further recites the following additional elements:
for each safe set in the partition, (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
wherein the safe set is sampled based on the determined probability densities and the sampled safe set is provided as determined safe set. (This is merely insignificant extra-solution activity corresponding to mere data gathering or presenting information (e.g. presenting offer). See MPEP § 2106.05(g). Therefore, this does not amount to significantly more than the judicial exception.)
Regarding claim 6, the rejection of claim 3 is incorporated herein. Further, claim 6 recites the following abstract ideas:
wherein the safe set is determined by choosing the set from the partition that is deemed safe and has a minimal distance to the potentially unsafe action. (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
Claim 6 does not recite additional elements.
Regarding claim 7, the rejection of claim 3 is incorporated herein. Further, claim 7 recites the following abstract ideas:
wherein the mapping of the potentially unsafe action to an action from the safe set and the providing of the action as the safe action includes determining a relative position of the potentially unsafe action in the unsafe set (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
Claim 7 further recites the following additional element:
and providing the action at the relative position in the safe set as the safe action. (This is merely insignificant extra-solution activity corresponding to mere data gathering or presenting information (e.g. presenting offer). See MPEP § 2106.05(g). Therefore, this does not amount to significantly more than the judicial exception.)
Regarding claim 8, the rejection of claim 3 is incorporated herein. Further, claim 8 recites the following abstract ideas:
wherein the mapping of the potentially unsafe action to an action from the safe set and the providing of the action as the safe action includes determining an action from the safe set that has a minimal distance to the potentially unsafe action (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
Claim 8 further recites the following additional element:
and providing the action as the safe action. (This is merely insignificant extra-solution activity corresponding to mere data gathering or presenting information (e.g. presenting offer). See MPEP § 2106.05(g). Therefore, this does not amount to significantly more than the judicial exception.)
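For illustration only, the following is a minimal, hypothetical sketch of the partition-based mapping analyzed for claims 3 – 8 above: a one-dimensional action space is split into a countable partition of intervals, each marked safe or unsafe; a potentially unsafe action that falls in an unsafe set is redirected to the safe set whose representative action has the highest probability density under the policy distribution (as in claim 4) while preserving its relative position within the set (as in claim 7). The partition, the Gaussian policy distribution, and all names are assumptions made for the sketch, not the applicant's implementation.

```python
import math

# Countable partition of a one-dimensional action space: (low, high, is_safe).
PARTITION = [(-2.0, -1.0, False), (-1.0, 0.0, True), (0.0, 1.0, True), (1.0, 2.0, False)]

def gaussian_pdf(x, mean, std):
    # Density of the (assumed Gaussian) policy distribution over actions.
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def map_to_safe(action, mean, std):
    """Map a potentially unsafe action to an action from a safe set of the partition."""
    for low, high, is_safe in PARTITION:
        if low <= action < high:
            if is_safe:
                return action  # already in a safe set: provide it unchanged
            # Choose the safe set whose representative (midpoint) has the highest
            # probability density under the policy distribution (claim 4 style).
            safe_sets = [(l, h) for l, h, safe in PARTITION if safe]
            best_low, best_high = max(
                safe_sets,
                key=lambda s: gaussian_pdf((s[0] + s[1]) / 2.0, mean, std))
            # Preserve the action's relative position within its set (claim 7 style).
            rel = (action - low) / (high - low)
            return best_low + rel * (best_high - best_low)
    return action  # outside the modeled partition: left unchanged in this sketch

print(map_to_safe(1.5, mean=0.2, std=0.5))  # an unsafe action is mapped into (0.0, 1.0)
```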
Regarding claim 9, the rejection of claim 3 is incorporated herein. Further, claim 9 recites the following additional elements:
wherein the loss value is determined by a discriminator, and training the machine learning system includes training the policy module and the discriminator according to generative adversarial imitation learning. (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
Regarding claim 12, the rejection of claim 1 is incorporated herein. Further, claim 12 recites the following abstract ideas:
potentially unsafe actions provided by the policy module are mapped to safe actions, (This is practical to perform in the human mind under its broadest reasonable interpretation aside from the recitation of generic computer components or by a human using a pen and paper.)
Claim 12 further recites the following additional elements:
wherein the policy module is trained according to a reinforcement learning paradigm or an imitation learning paradigm, wherein during inference of the machine learning system, (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
by the safety module of the machine learning system, to safe actions. (These are mere instructions to apply abstract idea on a generic computer. See MPEP § 2106.05(f). Therefore, this does not amount to significantly more than the judicial exception.)
Claims 10 – 11, 13, and 14 recite subject matter substantially similar to that of claim 1 and are rejected under the same rationale, mutatis mutandis.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Kolter et al. (Foreign Pub. EP 3838505 A1) in view of Pham et al. (U.S. Pub. 11468310 B2).
Regarding Claim 1, Kolter teaches
obtaining a state signal, wherein the state signal characterizes a state of an environment; ([0035] of Kolter states “Optionally, the data indicating the current state of the physical environment may comprise sensor data of a computer-controlled device. In such cases, control data may be determined for letting the computer-controlled device effect the selected action in the physical environment, e.g., through one or more actuators. This way, the compute-controlled device may be configured to interact with a physical environment in a way that is safe, e.g., avoids being forced to take actions that are labelled as unsafe.” [0037] of Kolter states “Various types of sensor data may be comprised in the state of the physical system, for example, video data, radar data, LiDAR data, ultrasonic data, motion data, etcetera. The state of the physical system is preferably represented by a discrete variable”)
by a parametrized policy module of the machine learning system, …, wherein the policy module determines the distribution based on the obtained state signal; ([0020] of Kolter states “This determined interaction may be selected for being performed only if the action in the current state of the physical environment is comprised in the safe set of state-action pairs. This way, exploring towards a goal may be balanced with ensuring safety during the exploration. Accordingly, safety may be assured while still achieving a relatively high sample efficiency. The function selecting an action to be performed based on the reward function is commonly known as a policy.” [0070] of Kolter states “An unknown, stochastic dynamics function T:S×A↦RS, which maps a state-action pair to a probability distribution over next states, and a discount factor γ. In various embodiments, S and/or A is finite. An interaction system may interact with the physical environment according to the reward function, e.g., by maximizing an expected accumulated reward as is known from reinforcement learning, etc. The reward function may also provide the labelling of state-action pairs as unsafe, e.g., a state-action pair may be labelled as unsafe if its reward is negative.” )
and a set of safe actions with respect to a current environment of the technical system; ([0020] of Kolter states “In an iteration of the interaction, a goal-oriented action to be performed in the current state of the physical environment may be determined based on the reward function. Interestingly, this determined interaction may be selected for being performed only if the action in the current state of the physical environment is comprised in the safe set of state-action pairs.” [0034] of Kolter states “To keep updating the safe set of state-action pairs also during use, for example, the configuration of the interaction system may be based on a predefined set of safe state-action pairs and a predefined policy for selecting action of the system, the configuration comprising updating the safe set of state-action pairs as described herein, selecting actions based on the policy, and only executing the actions if they are in the safe set.”)
determining a loss value based on the state signal and the safe action, wherein the loss value characterizes a reward obtained based on the safe action; ([0020] of Kolter states “Optionally, the interaction with the physical system that takes place may be for interacting according to a reward function. For example, an action may be selected for maximizing an expected reward, e.g., an expected accumulated reward. The reward function may be known to the system selecting the actions, e.g., predefined. The reward function may also define which state-action pairs are labelled as unsafe, e.g., a state-action pair may be labelled unsafe if it has a negative reward.”)
training the machine learning system by updating parameters of the policy module according to a gradient of the loss value with respect to the parameters. ([0020] of Kolter states “Accordingly, safety may be assured while still achieving a relatively high sample efficiency. The function selecting an action to be performed based on the reward function is commonly known as a policy.” [0022] of Kolter states “In various embodiments for interaction according to a reward function, it can even be shown that the policy for selecting an action to be performed is a so-called optimal policy in the PAC-MDP sense, while still guaranteeing safety.”)
Kolter does not explicitly teach:
determining a distribution of potentially unsafe actions that could be executed by the technical system
sampling a potentially unsafe action from the distribution;
obtaining, by a safety module of the machine learning system, the safe action, wherein the safe action is obtained based on the sampled potentially unsafe action
However, Pham teaches:
determining a distribution of potentially unsafe actions that could be executed by the technical system (Column 3 Lines 9 – 16 of Pham states “In an embodiment, a reinforcement learning architecture, interchangeably referred to herein as “OptLayer”, takes as inputs possibly unsafe actions predicted by a neural network and outputs the closest actions (to the possibly unsafe actions) that satisfy chosen constraints (e.g., safety constraints). The architecture ensures that only safe actions are actually executed and unsafe actions are penalized during training.” In Pham’s embodiment, neural network outputs these possibly unsafe actions to be used by the technical system. )
sampling a potentially unsafe action from the distribution; (Column 3 Lines 9 – 16 of Pham states “In an embodiment, a reinforcement learning architecture, interchangeably referred to herein as “OptLayer”, takes as inputs possibly unsafe actions predicted by a neural network and outputs the closest actions (to the possibly unsafe actions) that satisfy chosen constraints (e.g., safety constraints). The architecture ensures that only safe actions are actually executed and unsafe actions are penalized during training.” Pham treats these actions predicted by the policy as possibly unsafe before they are corrected to safer actions.)
obtaining, by a safety module of the machine learning system, the safe action, wherein the safe action is obtained based on the sampled potentially unsafe action (Column 3 Lines 9 – 16 of Pham states “In an embodiment, a reinforcement learning architecture, interchangeably referred to herein as “OptLayer”, takes as inputs possibly unsafe actions predicted by a neural network and outputs the closest actions (to the possibly unsafe actions) that satisfy chosen constraints (e.g., safety constraints). The architecture ensures that only safe actions are actually executed and unsafe actions are penalized during training.” Pham's embodiment directly converts unsafe actions into safe ones that satisfy the constraints.)
It would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to combine the teachings of Kolter and Pham. Pham teaches generating potentially unsafe actions using a neural network and selecting actions based on a policy distribution. Kolter teaches determining a safe set of state-action pairs and using reinforcement learning to select actions that maximize reward while ensuring safety. Kolter also determines an unsafe set of state-action pairs resulting in a negative reward. One with ordinary skill in the art would be motivated to incorporate the teachings of Pham into Kolter to improve the training of the policy and safety modules of the system and to optimize the modules to accurately map potentially unsafe actions to the closest safe actions. The results of the combination would have been predictable: potentially unsafe actions sampled from the policy distribution are executed safely according to the safe set, while the policy is reinforced based on the reward.
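For illustration only, the following minimal sketch, assuming a Gaussian policy and a simple clamping safety module, shows the kind of training loop onto which the combined teachings are mapped above: the parametrized policy outputs a distribution of potentially unsafe actions conditioned on the state signal, an action is sampled and mapped to a safe action, and the policy parameters are updated according to the gradient of a loss that characterizes the obtained reward. The names (PolicyNet, safety_filter) and the REINFORCE-style loss are assumptions made for the sketch, not code from Kolter or Pham.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Parametrized policy: maps a state signal to a distribution over actions."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mean = nn.Linear(state_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        # Distribution of potentially unsafe actions, conditioned on the state signal.
        return torch.distributions.Normal(self.mean(state), self.log_std.exp())

def safety_filter(action, safe_low, safe_high):
    # Stand-in safety module: map the sampled action into a box of safe actions
    # (the claims and references describe richer safe-set constructions).
    return torch.clamp(action, safe_low, safe_high)

policy = PolicyNet(state_dim=4, action_dim=2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(4)                                 # obtained state signal (placeholder)
dist = policy(state)                                   # distribution of potentially unsafe actions
unsafe_action = dist.sample()                          # sample a potentially unsafe action
safe_action = safety_filter(unsafe_action, -1.0, 1.0)  # obtain the safe action

reward = -safe_action.pow(2).sum()                     # placeholder reward based on the safe action
loss = -dist.log_prob(unsafe_action).sum() * reward    # loss characterizes the obtained reward

optimizer.zero_grad()
loss.backward()                                        # gradient of the loss w.r.t. the parameters
optimizer.step()                                       # update the policy parameters
```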
Claims 9, 12 are rejected under 35 U.S.C. 103 as being unpatentable over Kolter et al. (Foreign Pub. EP 3838505 A1) in view of Pham et al. (U.S. Pub. 11468310 B2), further in view of Bhattacharyya et al. (NPL: “Modeling Human Driving Behavior through Generative Adversarial Imitation Learning”).
Regarding claim 9, the rejection of claim 1 is incorporated herein. Furthermore, the combination of Kolter and Pham does not explicitly teach:
wherein the loss value is determined by a discriminator, and training the machine learning system includes training the policy module and the discriminator according to generative adversarial imitation learning.
However, Bhattacharyya teaches:
wherein the loss value is determined by a discriminator, and training the machine learning system includes training the policy module and the discriminator according to generative adversarial imitation learning. (Pg. 8 5.2 Connection to Generative Adversarial Networks section of Bhattacharyya states “This optimization objective established in eq. (19) provides a connection to GANs (Goodfellow et al., 2014). In GANs, the goal is to model the distribution pdata(x). The generative modeling objective is formulated as
[equation image: the GAN minimax objective of Goodfellow et al. (2014)]
Here, G is the generator that maps input noise variables z to the data space as G(z) and D is the discriminator which outputs a single scalar D(x) that represents the probability that x came from the data rather than pg, a binary classification task. This objective is solved using simultaneous gradient descent wherein the parameters of G and D are updated. Optimization over the GAIL objective is performed by alternating between a gradient step to increase eq. (19) with respect to the discriminator parameters D, and a Trust Region Policy Optimization (TRPO) step (Schulman et al., 2015) to decrease eq. (19) with respect to the parameters θ of the policy πθ.” Pg. 11 Algorithm 1 PS-GAIL of Bhattacharyya states “Rewards are then assigned to each state action pair in these trajectories by the critic. Subsequently, observed trajectories are used to perform a TRPO (Schulman et al., 2015) update for the policy, and an Adam (Kingma & Ba, 2014) update for the critic.
[image: Algorithm 1 (PS-GAIL) pseudocode]
”)
It would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to combine the teachings of Pham, Kolter, and Bhattacharyya. Bhattacharyya teaches generating and sampling actions from a policy distribution and updating the policy via reinforcement learning using a discriminator in generative adversarial imitation learning (GAIL). One with ordinary skill in the art would be motivated to incorporate the teachings of Bhattacharyya into the combination of Pham and Kolter because GAIL provides an effective approach to imitation learning in problems with large or continuous state and action spaces. The results of the combination would have been predictable: safety is handled using GAIL and reinforcement learning across multiple environments while leveraging policy optimization and imitation learning.
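For illustration only, the following minimal sketch, under the same assumptions as the training-loop sketch above, shows a GAIL-style alternation as described by Bhattacharyya: a discriminator is trained to distinguish expert (state, action) pairs from policy-generated pairs, and its output supplies the surrogate reward (the loss signal) used to update the policy. Bhattacharyya performs the policy step with TRPO; the sketch stops at computing the surrogate reward, which would feed a policy-gradient update such as the one sketched earlier. All names, shapes, and the simple discriminator architecture are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores (state, action) pairs; higher output means more expert-like."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, state, action):
        return torch.sigmoid(self.net(torch.cat([state, action], dim=-1)))

disc = Discriminator(state_dim=4, action_dim=2)
disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCELoss()

# Placeholder batches of expert demonstrations and policy-generated rollouts.
expert_s, expert_a = torch.randn(32, 4), torch.randn(32, 2)
policy_s, policy_a = torch.randn(32, 4), torch.randn(32, 2)

# Discriminator step: expert pairs labeled 1, policy-generated pairs labeled 0.
d_loss = bce(disc(expert_s, expert_a), torch.ones(32, 1)) + \
         bce(disc(policy_s, policy_a), torch.zeros(32, 1))
disc_opt.zero_grad()
d_loss.backward()
disc_opt.step()

# Policy step: the discriminator output defines the surrogate reward that a
# policy-optimization step (TRPO in the reference) would then use to update
# the policy parameters.
surrogate_reward = torch.log(disc(policy_s, policy_a) + 1e-8).detach()
```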
Regarding claim 12, the rejection of claim 1 is incorporated herein. Furthermore, the combination of Kolter, Pham, and Bhattacharyya teaches:
wherein the policy module is trained according to a reinforcement learning paradigm or an imitation learning paradigm, wherein during inference of the machine learning system, potentially unsafe actions provided by the policy module are mapped to safe actions, by the safety module of the machine learning system, to safe actions. (Pg. 8 5.2 Connection to Generative Adversarial Networks section of Bhattacharyya states “optimization over the GAIL objective is performed by alternating between a gradient step to increase eq. (19) with respect to the discriminator parameters D, and a Trust Region Policy Optimization (TRPO) step (Schulman et al., 2015) to decrease eq. (19) with respect to the parameters θ of the policy πθ.” Column 3 Lines 9 – 14 of Pham states “In an embodiment, a reinforcement learning architecture, interchangeably referred to herein as “OptLayer”, takes as inputs possibly unsafe actions predicted by a neural network and outputs the closest actions (to the possibly unsafe actions) that satisfy chosen constraints (e.g., safety constraints).” [0034] of Kolter states “To keep updating the safe set of state-action pairs also during use, for example, the configuration of the interaction system may be based on a predefined set of safe state-action pairs and a predefined policy for selecting action of the system, the configuration comprising updating the safe set of state-action pairs as described herein, selecting actions based on the policy, and only executing the actions if they are in the safe set.” Bhattacharyya uses a policy trained using GAIL, which is a type of imitation learning, and Pham maps potentially unsafe actions to safe actions.)
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BYUNGKWON HAN whose telephone number is (571)272-5294. The examiner can normally be reached M-F: 9:00AM-6PM PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached at (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/BYUNGKWON HAN/Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121