DETAILED ACTION
This action is responsive to the amendment filed on 12/16/2025. Claims 1-17 and 23-25 are pending and have been examined. This action is Non-Final.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Applicant’s claim for the benefit of a prior-filed application under 35 U.S.C. 119(e) or under 35 U.S.C. 120, 121, 365(c), or 386(c) is acknowledged.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/16/2025 has been entered.
Response to Arguments
Argument 1: The applicant argues on pages 1-3 that the rejection under 35 U.S.C. 101 is improper because the Office action allegedly evaluates the claims at an overly high level of generality and improperly equates machine learning with an unpatentable algorithm. The applicant contends that the amended claims recite a specific technical improvement in the way a discriminator network is trained, particularly by classifying first and second tuple datasets and subsets thereof that include portions identified as having task-irrelevant characteristics, and by training the discriminator on an objective that both encourages accurate classification and penalizes classification of such subsets above a threshold accuracy. The applicant asserts that this constrained training reduces the likelihood that the discriminator learns from task-irrelevant features, thereby improving imitation learning performance, reducing computational time, and improving generalization. The applicant further argues that these features integrate any alleged judicial exception into a practical application and provide significantly more than an abstract idea.
Examiner Response to Argument 1: The examiner has considered the applicant’s arguments and finds them persuasive. In view of the amendments and arguments, the claims, when considered as a whole, are directed to a specific improvement in the training of a discriminator network that is supported by the specification, including at least paragraphs [0067]-[0070], [0073]-[0074], [0077]-[0081], and [0082]-[0084]. In particular, the claimed limitations directed to classifying portions of the first and second tuple datasets as having task-irrelevant characteristics, and to training the discriminator network on an objective that both encourages accurate classification and penalizes classification of those portions beyond a threshold accuracy, recite a specific manner of improving how the discriminator network is trained rather than merely a desired result. Accordingly, the claimed invention is directed to a specific technological improvement and is analogous to improvements found eligible in Ex parte Desjardins. Therefore, the rejection under 35 U.S.C. 101 is withdrawn.
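For clarity of the record, the constrained training objective described above can be summarized with the following minimal sketch. This is an illustrative sketch only, not the applicant's disclosed implementation; the function, the tensor names, and the soft-accuracy surrogate for the threshold penalty are assumptions.

    # Illustrative sketch only; all names and shapes are hypothetical.
    # `disc` is assumed to map a batch of tuple datasets to probabilities
    # of shape (N, 1), with 1 = "expert" and 0 = "agent".
    import torch
    import torch.nn.functional as F

    def discriminator_objective(disc, expert_batch, agent_batch,
                                expert_irrel, agent_irrel,
                                acc_threshold=0.6, penalty_weight=1.0):
        # (i) Encourage accurate classification of the full tuple datasets
        # (expert tuples labeled 1, agent tuples labeled 0).
        loss_cls = F.binary_cross_entropy(disc(expert_batch),
                                          torch.ones(len(expert_batch), 1))
        loss_cls = loss_cls + F.binary_cross_entropy(
            disc(agent_batch), torch.zeros(len(agent_batch), 1))

        # (ii) Penalize classifying the task-irrelevant subsets with more
        # than a threshold accuracy (soft accuracy keeps it differentiable).
        soft_correct_expert = disc(expert_irrel)       # votes for "expert"
        soft_correct_agent = 1.0 - disc(agent_irrel)   # votes for "agent"
        soft_acc = torch.cat([soft_correct_expert, soft_correct_agent]).mean()
        penalty = penalty_weight * F.relu(soft_acc - acc_threshold)

        return loss_cls + penalty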
Argument 2: The applicant argues that the cited combination of Ho, Blonde, and Peng fails to teach or suggest the amended limitations of claim 1. In particular, the applicant contends that Blonde does not disclose obtaining expert and action trajectory subsets that include portions classified as having task-irrelevant characteristics, nor does Blonde disclose penalizing a discriminator network based on a threshold accuracy. The applicant acknowledges that the Office relies on Peng for these features but argues that Peng does not teach the claimed functionality. Specifically, the applicant asserts that Peng’s variational discriminator bottleneck limits mutual information between inputs and internal representations to exclude irrelevant or redundant information, rather than classifying portions of datasets as having task-irrelevant characteristics or using such classifications during training. The applicant further argues that Peng’s approach does not involve classifying expert and action trajectory subsets that include task-irrelevant portions, nor does it evaluate or penalize discriminator accuracy based on a threshold. Instead, Peng prevents irrelevant information from being processed at all, and therefore does not generate outputs or perform classification decisions on such data. Accordingly, the applicant concludes that the cited references, alone or in combination, fail to teach or suggest the claimed discriminator training objective and requests withdrawal of the rejection.
Examiner Response to Argument 2: The examiner has considered the applicant’s arguments but is not persuaded. While the applicant argues that Peng does not classify task-irrelevant portions or apply a threshold-based penalty, the rejection set forth below no longer relies on Peng for these features; rather, the combination of Ho, Blonde, Bousmalis, and Ganin teaches or suggests the claimed limitations, particularly as amended. Blonde teaches training a discriminator to classify between expert and agent-generated data using state-action pairs, which corresponds to classifying the pluralities of first and second tuple datasets. Bousmalis teaches separating task-relevant and task-irrelevant characteristics using shared and private feature representations, which corresponds to identifying portions of data that have task-irrelevant characteristics. The examiner interprets these private feature representations as the claimed task-irrelevant portions within trajectory subsets. Ganin teaches training a discriminator to make features indistinguishable between domains, which discourages accurate classification and corresponds to penalizing the discriminator for achieving high classification accuracy. The claim does not require a specific method of processing the task-irrelevant portions; it requires only that the training objective penalizes classification beyond a threshold. The combined teachings of Bousmalis and Ganin would have suggested modifying the discriminator training to reduce reliance on irrelevant characteristics and to limit classification performance where appropriate. A person of ordinary skill in the art would have been motivated to combine these teachings to improve generalization and reduce overfitting to irrelevant features. Therefore, the cited combination teaches or suggests the claimed limitations, and the rejection under 35 U.S.C. 103 is maintained.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 4, 6, 8, and 23-25 are rejected under 35 U.S.C. 103 as being unpatentable over NPL reference “Generative Adversarial Imitation Learning” by Ho et al. (referred to herein as Ho) in view of NPL reference “Sample-Efficient Imitation Learning via Generative Adversarial Nets” by Blonde et al. (referred to herein as Blonde) in view of US 10,970,589 B2 by Bousmalis et al. (referred to herein as Bousmalis), further in view of NPL reference “Domain-adversarial training of neural networks” by Ganin et al. (referred to herein as Ganin).
Regarding claim 1, Ho teaches:
A method of training a neural network to generate action data for controlling an agent to perform a task in an environment, the method comprising: obtaining, for each of a plurality of expert performances of the task, one or more first tuple datasets, each first tuple dataset comprising state data characterizing a state of the environment at a corresponding time during the corresponding expert performance of the task; and ([Ho, page 2 and 7], “In practice, πE will only be provided as a set of trajectories sampled by executing πE in the environment, so the expected cost of πE in Eq. (1) is estimated using these samples …
[media_image1.png: equation and Algorithm 1 excerpt reproduced from Ho]
, wherein the examiner interprets the collection of trajectories, together with the use of a discriminator as detailed in Algorithm 1, to be the same as “obtaining, for each of a plurality of expert performances of the task, one or more first tuple datasets, each first tuple dataset comprising state data characterizing a state of the environment at a corresponding time during the corresponding expert performance of the task”).
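The mapping above concerns assembling per-timestep tuple datasets from expert rollouts. A minimal sketch of that data structure follows, assuming a hypothetical gym-style environment interface and expert policy; the names are illustrative and are not taken from Ho.

    # Illustrative sketch; `env` (gym-style) and `expert_policy` are assumed.
    def collect_expert_tuples(env, expert_policy, num_episodes):
        trajectories = []
        for _ in range(num_episodes):
            tuples, state = [], env.reset()
            done = False
            while not done:
                action = expert_policy(state)
                next_state, _, done, _ = env.step(action)
                # Each first tuple dataset holds state data characterizing
                # the environment at the corresponding time of the expert
                # performance (action data is optional; see claim 2).
                tuples.append({"state": state, "action": action})
                state = next_state
            trajectories.append(tuples)
        return trajectories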
Ho does not teach and performing a concurrent process of training the neural network and a discriminator network, the process comprising: (i) a plurality of neural network update steps, each of which comprises: receiving state data characterizing a current state of the environment; using the neural network and the state data to generate action data indicative of an action to be performed by the agent; forming a second tuple dataset comprising the state data; using the second tuple dataset to generate a reward value, wherein the reward value comprises an imitation value generated by the discriminator network based on the second tuple dataset; and training the neural network based on the reward value; (ii) a plurality of discriminator network update steps, each of which comprises: obtaining one or more expert trajectories; obtaining one or more action trajectories; obtaining a plurality of expert trajectory subsets, wherein each expert trajectory subset is a subset of a respective plurality of first tuple datasets corresponding to a respective expert trajectory and includes a portion of the respective plurality of first tuple datasets that has been classified as having task-irrelevant characteristics; obtaining a plurality of action trajectory subsets, wherein each action trajectory subset is a subset of a respective plurality of second tuple datasets corresponding to a respective action trajectory and includes a portion of the respective plurality of second tuple datasets that has been classified as having task-irrelevant characteristics; and training the discriminator network, comprising: classifying, by the discriminator network, the pluralities of first and second tuple datasets corresponding to the expert and action trajectory subsets that include, within each action or expert trajectory subset, the portion of the respective plurality of first or second tuple datasets that has been classified as having task-irrelevant characteristics; classifying, by the discriminator network, the pluralities of first and second tuple datasets; and training the discriminator network on an objective that (i) encourages the discriminator network to accurately classify the pluralities of first and second tuple datasets while (ii) penalizing the discriminator network for classifying the expert and action trajectory subsets with more than a threshold accuracy.
Blonde teaches performing a concurrent process of training the neural network and a discriminator network, the process comprising: (i) a plurality of neural network update steps, each of which comprises: receiving state data characterizing a current state of the environment; using the neural network and the state data to generate action data indicative of an action to be performed by the agent; forming a second tuple dataset comprising the state data; using the second tuple dataset to generate a reward value, wherein the reward value comprises an imitation value generated by the discriminator network based on the second tuple dataset; and training the neural network based on the reward value; ([Blonde, sec 4] “As an off-policy method, SAM cycles through the following steps: i) the agent uses πθ to interact with M, ii) stores the experienced transitions C in a replay buffer R, iii) updates the reward module φ with an equal mixture of uniformly sampled state-action pairs from C and τe, iv) updates the reward module φ with an equal mixture of uniformly sampled state-action pairs from R and τe, and v) updates the policy module θ and critic module ψ with transitions sampled from R. Note that while sampling uniformly from C (iii)) gives states and actions distributed as ρπθ and πθ respectively (on-policy), sampling uniformly from R (iv)) gives states and actions distributed as ρβ and β respectively, where β denotes the off-policy sampling mixture distribution corresponding to sampling transitions uniformly from the replay buffer. A more detailed description of the training procedure is laid out in the algorithm pseudo-code (Algorithm 1)” (shown below), wherein the examiner interprets steps iv) and v), updating the reward module from replay buffer (R) samples and updating the policy module θ and critic module ψ with transitions sampled from R, to be describing how the policy module (which corresponds to the neural network that generates actions) is updated from replay buffer samples, updating θ (the policy parameters) accordingly and thereby generating state-action pairs. This is the same as the instant application's claim that “receiving state data characterizing a current state of the environment; using the neural network [a.k.a. policy module] and the state data to generate action data” because both describe how the NN (policy module) is trained and captures state data (a.k.a. the “second tuple”) and action data. That tuple is fed to the discriminator to obtain an imitation-based reward, which is used to update the NN.)
[media_image2.png: Blonde, Algorithm 1 pseudo-code]
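The SAM cycle quoted above (steps i through v) can be summarized in the following sketch. All components (environment, agent, reward module, critic, replay buffer) are assumed interfaces supplied by the caller; this is a paraphrase of the quoted procedure, not Blonde's code.

    # Illustrative paraphrase of the quoted SAM cycle; interfaces assumed.
    import random

    def sam_cycle(env, agent, reward_module, critic, replay_buffer,
                  expert_pairs, num_iters, batch_size):
        for _ in range(num_iters):
            # i)-ii) interact with the MDP and store transitions in R
            state = env.reset()
            recent, done = [], False
            while not done:
                action = agent.act(state)
                next_state, _, done, _ = env.step(action)
                recent.append((state, action, next_state))
                replay_buffer.append((state, action, next_state))
                state = next_state
            # iii) update reward module on recently collected + expert pairs
            reward_module.update(recent[:batch_size],
                                 random.sample(expert_pairs, batch_size))
            # iv) update reward module on replay-buffer + expert pairs
            reward_module.update(random.sample(replay_buffer, batch_size),
                                 random.sample(expert_pairs, batch_size))
            # v) update policy and critic from replay-buffer transitions
            batch = random.sample(replay_buffer, batch_size)
            agent.update(batch, critic)
            critic.update(batch, reward_module)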
(ii) a plurality of discriminator network update steps, each of which comprises: obtaining one or more expert trajectories; obtaining one or more action trajectories; ([Blonde, Sec 4] “We introduce a reward network…the cross-entropy loss used to train the reward network…The reward network is trained each iteration first on the mini-batch most recently collected by π, then on mini-batches sampled from the replay buffer”, wherein the examiner interprets “mini-batches most recently collected by π” to be the same as “obtaining one or more action trajectories,” because both describe agent-generated state-action data. The examiner further interprets the mini-batches sampled from the replay buffer, together with the expert dataset τe, to be the same as “obtaining one or more expert trajectories,” because both provide state-action samples from expert demonstrations for discriminator updates.)
obtaining a plurality of expert trajectory subsets, wherein each expert trajectory subset is a subset of a respective plurality of first tuple datasets corresponding to a respective expert trajectory ([Blonde, page 3, sec 3], “Trajectories are traces of interaction between an agent and an MDP [Markov decision process]. Specifically, we model trajectories as sequences of transitions (st, at, rt, st+1), atomic units of interaction. Demonstrations are provided to the agent through a set of expert trajectories τe, generated by an expert policy πe in M.”, wherein the examiner interprets “provided to the agent through a set of expert trajectories” to be the same as “obtaining a plurality of expert trajectory subsets,” because they are both describing multiple expert trajectories being available as demonstration data (a plurality of expert trajectory data units available for training). The examiner further interprets “trajectories are traces of interaction between an agent and an MDP” to be the same as “a respective plurality of first tuple datasets corresponding to a respective expert trajectory,” because they are both describing that a trajectory is made up of multiple per-timestep items (transitions), which corresponds to a plurality of data items within a trajectory.)
obtaining a plurality of action trajectory subsets, wherein each action trajectory subset is a subset of a respective plurality of second tuple datasets corresponding to a respective action trajectory and ([Blonde, page 3, sec 3], “Trajectories are traces of interaction between an agent and an MDP [Markov decision process]. Specifically, we model trajectories as sequences of transitions (st, at, rt, st+1), atomic units of interaction. Demonstrations are provided to the agent through a set of expert trajectories τe, generated by an expert policy πe in M.” and [Blonde, page 7, Algorithm 1], “Sample uniformly a minibatch Bc of state-action pairs from C. Sample uniformly a minibatch Be of state-action pairs from the expert dataset τe, with |Bc| = |Be|.” wherein the examiner interprets “Sample uniformly a minibatch Bc of state-action pairs from C” to be the same as obtaining a plurality of action trajectory subsets because they are both directed to selecting a subset of agent-generated state-action data from a larger collection of agent interaction data. The examiner further interprets “we model trajectories as sequences of transitions (st, at, rt, st+1)” to be the same as a respective plurality of second tuple datasets corresponding to a respective action trajectory because they are both directed to a trajectory being made up of multiple per-time-step data items, and wherein the examiner interprets “Sample uniformly a minibatch Bc of state-action pairs from C” to be the same as each action trajectory subset is a subset of a respective plurality of second tuple datasets corresponding to a respective action trajectory because they are both directed to selecting a subset from a larger plurality of trajectory data.)
training the discriminator network comprising classifying, by the discriminator network, ([Blonde, page 3, sec 3], “Generative Adversarial Imitation Learning (Ho and Ermon, 2016) introduces an extra neural network Dφ to play the role of discriminator, while the role of generator is carried out by the agent’s policy πθ. Dφ tries to assert whether a given state-action pair originates from trajectories of πθ or πe, while πθ attempts to fool Dφ into believing her state-action pairs come from πe.”, wherein the examiner interprets “to assert whether a given state-action pair originates from trajectories” to be the same as training the discriminator network comprising classifying, by the discriminator network, because they are both directed to the discriminator performing a classification decision about whether data comes from the agent policy or the expert.)
the pluralities of first and second tuple datasets corresponding to the expert and action trajectory subsets ([Blonde, page 3, sec 3], “Demonstrations are provided to the agent through a set of expert trajectories τe, generated by an expert policy πe in M … The situation can be described as a minimax problem minθ maxφ V(θ, φ), where the value of the two-player game is V(θ, φ) = Eπθ[log(1 − Dφ(s, a))] + Eπe[log Dφ(s, a)].”, wherein the examiner interprets “two player game” and “Eπθ[log(1 − Dφ(s, a))] + Eπe[log Dφ(s, a)]” to be the same as pluralities of first and second tuple datasets, because they are both directed to using two collections of state-action data, one associated with the agent policy and one associated with the expert, as the discriminator inputs for classification.)
classifying, by the discriminator network, the pluralities of first and second tuple datasets, and ([Blonde, page 3, sec 3], “Generative Adversarial Imitation Learning introduces an extra neural network Dφ to play the role of discriminator, while the role of generator is carried out by the agent’s policy πθ. Dφ tries to assert whether a given state-action pair originates from trajectories of πθ or πe, while πθ attempts to fool Dφ into believing her state-action pairs come from πe.”, wherein the examiner interprets "tries to assert whether a given state-action pair originates from trajectories of πθ or πe" to be the same as classifying, by the discriminator network, the pluralities of first and second tuple datasets because they are both directed to a discriminator performing a classification decision between two sets of data, one associated with the agent policy and one associated with the expert.)
training the discriminator network on an objective that (i) encourages the discriminator network to accurately classify the pluralities of first and second tuple datasets ([Blonde, page 5, sec 5] “Reward We introduce a reward network with parameter vector φ, operating as the discriminator. The cross-entropy loss used to train the reward network is: Eπθ[−log(1 − Dφ(s, a))] + Eπe[−log Dφ(s, a)]”, wherein the examiner interprets “cross-entropy loss used to train the reward network” to be the same as “training the discriminator network on an objective” because they are both directed to an objective function used to train the discriminator. The examiner further interprets “Eπθ[−log(1 − Dφ(s, a))] + Eπe[−log Dφ(s, a)]” to be the same as encourages the discriminator network to accurately classify the pluralities of first and second tuple datasets because they are both directed to training the discriminator to distinguish between agent policy data and expert data.)
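The quoted cross-entropy objective can be written directly in code. The sketch below assumes a hypothetical disc callable that outputs Dφ(s, a) in (0, 1) with shape (N, 1) for a batch of state-action pairs; it is illustrative only.

    # Sketch of the quoted objective:
    #   E_{πθ}[-log(1 - Dφ(s, a))] + E_{πe}[-log Dφ(s, a)]
    import torch

    def reward_network_loss(disc, agent_sa, expert_sa, eps=1e-8):
        d_agent = disc(agent_sa)    # minimizing -log(1 - D) drives D -> 0
        d_expert = disc(expert_sa)  # minimizing -log(D) drives D -> 1
        return (-torch.log(1.0 - d_agent + eps).mean()
                - torch.log(d_expert + eps).mean())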
Blonde does not teach and includes a portion of the respective plurality of first tuple datasets that has been classified as having task-irrelevant characteristics; … includes a portion of the respective plurality of second tuple datasets that has been classified as having task-irrelevant characteristics, and … while (ii) penalizing the discriminator network for classifying the expert and action trajectory subsets with more than a threshold accuracy.
Bousmalis teaches:
and includes a portion of the respective plurality of first tuple datasets that has been classified as having task-irrelevant characteristics; ([Bousmalis, col 4, lines 27-31], “The private target encoder neural network 210 is specific to the target domain and is configured to receive images from the target domain and to generate, for each received image, a private feature representation of the image.” and [Bousmalis, col 7-8, lines 65-68, 1-7], “The difference loss trains the shared encoder neural network to (i) generate shared feature representations for input images from the target domain that are different from private feature representations for the same input images from the target domain generated by the private target encoder neural network and (ii) generate shared feature representations for input images from the source domain that are different from private feature representations for the same input images from the source domain generated by the private source encoder neural network.” wherein the examiner interprets the private feature representations generated by the private target encoder neural network to correspond to task-irrelevant characteristics of the input images because they are domain-specific representations separated from the shared feature representations used for the task. The examiner further interprets the first tuple datasets including a portion classified as having task-irrelevant characteristics to be the same as the input images having associated private feature representations that capture domain-specific information separated out by the difference loss, because they are both describing identifying or isolating non-task-shared characteristics within the input data.)
includes a portion of the respective plurality of second tuple datasets that has been classified as having task-irrelevant characteristics; and ([Bousmalis, col 4, lines 27-31], “The private target encoder neural network 210 is specific to the target domain and is configured to receive images from the target domain and to generate, for each received image, a private feature representation of the image.” and [Bousmalis, col 7-8, lines 65-68, 1-7], “The difference loss trains the shared encoder neural network to (i) generate shared feature representations for input images from the target domain that are different from private feature representations for the same input images from the target domain generated by the private target encoder neural network and (ii) generate shared feature representations for input images from the source domain that are different from private feature representations for the same input images from the source domain generated by the private source encoder neural network.” wherein the examiner interprets the private feature representations generated by the private target encoder neural network to correspond to task-irrelevant characteristics of the input images because they are domain-specific representations separated from the shared feature representations used for the task. The examiner further interprets the second tuple datasets including a portion classified as having task-irrelevant characteristics to be the same as the input images having associated private feature representations that capture domain-specific information separated out by the difference loss, because they are both describing identifying or isolating non-task-shared characteristics within the input data.)
that include, within each action or expert trajectory subset, the portion of the respective plurality of first or second tuple datasets that have been classified as having task-irrelevant characteristics; ([Bousmalis, col 4, lines 27-31], “The private target encoder neural network 210 is specific to the target domain and is configured to receive images from the target domain and to generate, for each received image, a private feature representation of the image.” and [Bousmalis, col 7-8, lines 65-68, 1-7], “The difference loss trains the shared encoder neural network to (i) generate shared feature representations for input images from the target domain that are different from private feature representations for the same input images from the target domain generated by the private target encoder neural network and (ii) generate shared feature representations for input images from the source domain that are different from private feature representations for the same input images from the source domain generated by the private source encoder neural network.” wherein the examiner interprets “private target encoder neural network 210 is specific to the target domain … difference loss trains the shared encoder neural network” to be the same as "the portion of the respective plurality of first or second tuple datasets that have been classified as having task-irrelevant characteristics" because they are both describing how data from each trajectory/domain contains a portion that has been identified as carrying task-irrelevant (domain-specific/private) characteristics, separated from the task-relevant (shared) portion. This is consistent with the existing mapping where private feature representations correspond to task-irrelevant characteristics.)
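The difference loss cited above, which drives the shared and private feature representations apart, is commonly formulated as a soft subspace orthogonality penalty. The sketch below uses that common formulation; the tensor names are assumptions and this is not Bousmalis's code.

    # Sketch of a difference loss pushing shared and private features apart.
    import torch

    def difference_loss(shared_feats, private_feats):
        # shared_feats, private_feats: (batch, dim) representations of the
        # same inputs from the shared and private (domain-specific) encoders.
        s = shared_feats - shared_feats.mean(dim=0, keepdim=True)
        p = private_feats - private_feats.mean(dim=0, keepdim=True)
        # Penalize correlation between the two subspaces: squared Frobenius
        # norm of S^T P.
        return torch.norm(s.t() @ p, p="fro") ** 2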
Bousmalis does not teach while (ii) penalizing the discriminator network for classifying the expert and action trajectory subsets with more than a threshold accuracy.
Ganin teaches while (ii) penalizing the discriminator network for classifying the expert and action trajectory subsets with more than a threshold accuracy. ([Ganin, page 12, Fig 1] “Gradient reversal ensures that the feature distributions over the two domains are made similar (as indistinguishable as possible for the domain classifier), thus resulting in the domain-invariant features.”, wherein the examiner interprets “as indistinguishable as possible for the domain classifier” to be the same as “penalizing the discriminator network for classifying” because they are both directed to discouraging a classifier from successfully distinguishing between two categories by using an objective that reduces its ability to classify accurately.)
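Gradient reversal as cited from Ganin is conventionally implemented as an identity map in the forward pass whose gradient is negated in the backward pass, so the feature extractor is trained to make the domain classifier inaccurate. A standard sketch follows; it is illustrative and not Ganin's code.

    # Sketch of a gradient reversal layer (identity forward, sign-flipped
    # gradient backward).
    import torch

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lam=1.0):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # Reverse (and scale) the gradient flowing to the feature extractor,
            # discouraging accurate domain classification upstream.
            return -ctx.lam * grad_output, None

    def grad_reverse(x, lam=1.0):
        return GradReverse.apply(x, lam)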
Ho, Blonde, Bousmalis, Ganin, and the instant application are analogous art because they are all directed to training neural networks using adversarial or discriminator-based learning frameworks to improve performance.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the Generative adversarial imitation learning approach disclosed by Ho to include the reward network disclosed by Blonde. One would be motivated to do so to effectively train a discriminator network to distinguish between expert and agent-generated data and provide meaningful reward signals for policy learning, as suggested by Blonde ([Blonde, page 5, sec 5] “The cross-entropy loss used to train the reward network is: Eπθ [- log(1 - Dφ(s, a))] + Eπe [- log Dφ(s, a)].”).
It would have also been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the Generative adversarial imitation learning approach disclosed by Ho to include the private and shared feature representation approach disclosed by Bousmalis. One would be motivated to do so to efficiently separate domain-specific information from task-relevant information within the input data, thereby improving generalization and robustness of the trained model across different domains, as suggested by Bousmalis ([Bousmalis, col 4, lines 27-31] “to generate, for each received image, a private feature representation of the image.”).
It would have also been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the Generative adversarial imitation learning approach disclosed by Ho to include the domain classifier disclosed by Ganin. One would be motivated to do so to effectively discourage the discriminator from overfitting to domain-specific distinctions and instead learn domain-invariant features, thereby improving the robustness and transferability of the learned representations, as suggested by Ganin ([Ganin, page 12, Fig 1] “as indistinguishable as possible for the domain classifier”).
Claims 23 and 24 are analogous to claim 1 and are rejected on the same grounds set forth above.
Regarding claim 2, Ho, Blonde, Bousmalis, and Ganin teach A method according to claim 1 (see rejection of claim 1).
Ho further teaches in which the first tuples further include corresponding action data generated based on the state of the environment for controlling the agent. ([Ho, page 2] “the unnormalized distribution of state-action pairs that an agent encounters when navigating the environment with the policy π, and it allows us to write Eπ[c(s, a)] = Σs,a ρπ(s, a)c(s, a) for any cost function c”, wherein the examiner interprets the inclusion of corresponding action data in the first tuples, which capture state-action pairs encountered by an agent, to be the same as “corresponding action data generated based on the state of the environment for controlling the agent” because both record the state of the environment at a given time (or index) during the expert's performance, and each first tuple captures the action data generated based on that state for controlling the agent.)
Regarding claim 4, Ho, Blonde, Bousmalis, and Ganin teach A method according to claim 1 (see rejection of claim 1).
Ho further teaches in which the training constrains the discriminator network such that, upon receiving any of at least a specified proportion of tuple datasets included in the expert and action trajectory subsets, the discriminator network generates (i) an imitation value below an imitation value threshold if the received tuple dataset is a first tuple dataset, and (ii) an imitation value above the imitation value threshold if the received tuple dataset is a second tuple dataset; ([Ho, page 6, Sec 5] “GAIL solves Eq. (15) by finding a saddle point (π, D) of the expression Eπ[log(D(s, a))] + EπE [log(1 − D(s, a))] … with both π and D represented using function approximators: GAIL fits a parameterized policy πθ, with weights θ, and a discriminator network Dw : S × A → (0, 1), with weights w”, wherein the examiner interprets “the discriminator network Dw : S × A → (0,1)” to be the same as the claimed generation of imitation values on either side of an imitation value threshold because Dw outputs values near 0 for agent (action) data and near 1 for expert data, with 0.5 serving as the natural threshold for discrimination). Claim 25 is analogous to claim 4 and is rejected on the same grounds set forth above.
Regarding claim 6, Ho, Blonde, Bousmalis, and Ganin teach A method according to claim 1 (see rejection of claim 1).
Ho further teaches (a) in which, for each performance of the task, the corresponding first tuple datasets form an expert sequence of first tuple datasets labelled by a time index which is zero for the first tuple dataset of the expert sequence, and one higher for each successive first tuple dataset than for the preceding one of the expert sequence, and ([Ho, page 2, sec 2] “successor states are drawn from the dynamics model P(s′|s, a). We work in the γ-discounted infinite horizon setting, and we will use an expectation with respect to a policy π ∈ Π to denote an expectation with respect to the trajectory it generates: Eπ[c(s, a)] = E[Σ_{t=0}^{∞} γ^t c(st, at)],
where s0 ∼ p0, at ∼ π(·|st), and st+1 ∼ P(·|st, at) for t ≥ 0. We will use Êτ to denote an empirical expectation with respect to trajectory samples τ, and we will always use πE to refer to the expert policy.”, wherein the examiner interprets the value t in the above summation, which indexes the state-action pair (st, at), to be a “time index which is zero for the first tuple dataset of the expert sequence, and one higher for each successive first tuple dataset than for the preceding one” as claimed).
Ho does not teach (b) the neural network update steps are based on one or more action sequences of second tuples, wherein for each action sequence of second tuples: a first second tuple of the action sequence has a time index of zero, and is performed for state data describing the environment in a corresponding initial state, and each of the other second tuples of the action sequence has a time index one greater than the preceding second tuple of the action sequence, and is performed for state data describing the environment upon the performance by the agent of the action data generated in the preceding time step.
Blonde teaches (b) the neural network update steps are based on one or more action sequences of second tuples, wherein for each action sequence of second tuples: a first second tuple of the action sequence has a time index of zero, and is performed for state data describing the environment in a corresponding initial state, and each of the other second tuples of the action sequence has a time index one greater than the preceding second tuple of the action sequence, and is performed for state data describing the environment upon the performance by the agent of the action data generated in the preceding time step. ([Blonde, page 3, sec 3] “Trajectories are traces of interaction between an agent and an MDP [Markov decision process]. Specifically, we model trajectories as sequences of transitions (st, at, rt, st+1), atomic units of interaction. Demonstrations are provided to the agent through a set of expert trajectories τe, generated by an expert policy πe in M.”, wherein the examiner interprets the ordering of transitions, where the first transition corresponds to an initial state with a time index of zero and each subsequent transition occurs at the next time step, to be the same as the “action sequences of second tuples…has a time index of zero… the action sequence has a time index one greater than the preceding second tuple of the action sequence” as in the claim).
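The time indexing described in this limitation, an index of zero for the initial tuple that increments by one per step, can be sketched as follows; the scheme is the same for expert and action sequences, and the sketch is purely illustrative.

    # Illustrative sketch of per-step time indexing of tuple datasets.
    def index_sequence(tuples):
        # tuples: list of dicts, one per time step, in temporal order
        return [{"t": t, **tup} for t, tup in enumerate(tuples)]

    # e.g. index_sequence([{"state": "s0"}, {"state": "s1"}]) yields
    #   [{"t": 0, "state": "s0"}, {"t": 1, "state": "s1"}]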
Ho, Blonde, Bousmalis, Ganin, and the instant application are analogous art because they are all directed to expert sequences of datasets labelled by a sequential time index.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Ho, Blonde, Bousmalis, and Ganin to include the time-indexed tuple dataset processing disclosed by Ho. One would be motivated to do so to effectively ensure accurate temporal ordering of the expert sequence, as suggested by Ho's time-indexed trajectory formulation ([Ho, page 2, sec 2], in which each state-action pair (st, at) is indexed by t beginning at t = 0). It would have further been obvious to one of ordinary skill in the art before the effective filing date to include the ordering of transitions, in which each transition corresponds to a state at a time step followed by subsequent transitions, to serve as the action sequences of tuples, as suggested by Blonde ([Blonde, page 3, sec 3] “Trajectories are traces of interaction between an agent and an MDP. Specifically, we model trajectories as sequences of transitions (st, at, rt, st+1), atomic units of interaction. Demonstrations are provided to the agent through a set of expert trajectories τe, generated by an expert policy πe in M.”).
Regarding claim 8, Ho, Blonde, Bousmalis, and Ganin teach A method according to claim 6 (see rejection of claim 6).
Ho further teaches A method according to claim 6, in which all the second tuple datasets employed in each discriminator network update step are tuple datasets for which the corresponding time index is below a third time threshold, the expert sequences employed in the discriminator network update including first tuples having a time index above the third time threshold. ([Ho, page 2, sec 2] “We work in the γ-discounted infinite horizon setting, and we will use an expectation with respect to a policy π ∈ Π to denote an expectation with respect to the trajectory it generates: Eπ[c(s, a)] = E[Σ_{t=0}^{∞} γ^t c(st, at)].
We will use Êτ to denote an empirical expectation with respect to trajectory samples τ, and we will always use πE to refer to the expert policy” and [Ho, page 7, sec 6], “a given dataset of state-action pairs is split into 70% training data and 30% validation data. The policy is trained with supervised learning” wherein the examiner interprets the state-action pairs (s, a) in the above equation being updated at each time step as part of the GAIL framework to be the same as “network update step are tuple datasets for which the corresponding time index”. The examiner further interprets “dataset of state-action pairs is split into 70% training data and 30% validation data” to be the same as “second tuple datasets” because they are both being used as input to the discriminator network.)
Ho, Blonde, Bousmalis, Ganin, and the instant application are analogous art because they are all directed to methods involving updating discriminator networks based on datasets with specified time index thresholds.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 6 disclosed by Ho, Blonde, Bousmalis, and Ganin to include the “policy is trained with supervised learning” approach disclosed by Ho. One would be motivated to do so to effectively improve the training stability and predictive accuracy of the discriminator network updates using state-action pair datasets, as suggested by Ho ([Ho, page 7, sec 6] “a given dataset of state-action pairs is split into 70% training data and 30% validation data”).
Claims 5 and 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Ho in view of Blonde in view of Bousmalis in view of Ganin, further in view of NPL reference “Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow” by Peng et al. (referred to herein as Peng).
Regarding claim 5, Ho, Blonde, Bousmalis, and Ganin teach A method according to claim 4 (see rejection of claim 4).
Ho, Blonde, Bousmalis, and Ganin do not teach the objective includes a term which varies inversely dependent with an accuracy parameter, the accuracy parameter (i) taking a higher value if, upon receiving one of the expert trajectory subset of first tuple datasets, the discriminator network generates with a probability above a probability threshold an imitation value above the imitation value threshold, and (ii) taking a higher value if, upon receiving one of the action trajectory subset of second tuple datasets, the discriminator network generates with a probability above the probability threshold an imitation value below the imitation value threshold.
Peng teaches the objective includes a term which varies inversely dependent with an accuracy parameter, the accuracy parameter (i) taking a higher value if, upon receiving one of the expert trajectory subset of first tuple datasets, the discriminator network generates with a probability above a probability threshold an imitation value above the imitation value threshold, and (ii) taking a higher value if, upon receiving one of the action trajectory subset of second tuple datasets, the discriminator network generates with a probability above the probability threshold an imitation value below the imitation value threshold. ([Peng, page 9, Sec 5.1] “Adaptive Constraint:…When β is too small, performance reverts to that achieved by GAIL… Policies trained using dual gradient descent to adaptively update β consistently achieves the best performance overall”, wherein the examiner interprets the adaptive update of β to enforce a desired information constraint to be the same as “an accuracy parameter that varies inversely” because they are both directed to adjusting a parameter such that discriminator confidence (accuracy) increases when correctly classifying expert versus agent data, and the constraint smooths the discriminator landscape when accuracy is too high, thereby functioning as an inverse dependence).
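The adaptive update of β by dual gradient descent, as cited from Peng, enforces an information constraint of the form E[KL] ≤ Ic by raising β when the constraint is violated and lowering it otherwise. A minimal sketch follows; the variable names and step size are assumptions.

    # Sketch of a dual-gradient-descent update of the bottleneck weight β.
    def update_beta(beta, kl_estimate, info_constraint_ic, step_size=1e-5):
        # Constraint: E[KL] <= Ic; ascend the dual variable on the violation.
        beta = beta + step_size * (kl_estimate - info_constraint_ic)
        return max(0.0, beta)  # β stays non-negative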
Ho, Blonde, Bousmalis, Ganin, Peng, and the instant application are analogous art because they are all directed to modifying parameters of a discriminator network.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 4 disclosed by Ho, Blonde, Bousmalis, and Ganin to include the “adaptive constraint” disclosed by Peng. One would be motivated to do so to effectively enhance the adaptive constraint on the discriminator network, as suggested by Peng ([Peng, page 9, sec 5.1] “Policies trained using dual gradient descent to adaptively update β consistently achieves the best performance overall.”).
Regarding claim 13, Ho, Blonde, Bousmalis, and Ganin teach A method according to claim 1 (see rejection of claim 1).
Ho, Blonde, Bousmalis, and Ganin do not teach in which the state data for each tuple dataset comprises image data defining at least one image of the environment.
Peng teaches in which the state data for each tuple dataset comprises image data defining at least one image of the environment. ([Peng, page 7, sec 5], “We evaluate our method on adversarial learning problems in imitation learning, inverse reinforcement learning, and image generation. In the case of imitation learning, we show that the VDB enables agents to learn complex motion skills from a single demonstration, including visual demonstrations provided in the form of video clips. We also show that the VDB improves the performance of inverse RL methods. Inverse RL aims to reconstruct a reward function from a set of demonstrations, which can then be used to perform the task in new environments, in contrast to imitation learning, which aims to recover a policy directly. Our method is also not limited to control tasks, and we demonstrate its effectiveness for unconditional image generation.”, wherein the examiner interprets “visual demonstrations provided in the form of video clips” to be the same as “image data defining at least one image of the environment,” as both terms are directed to visual representations of the environment, since a video clip consists of multiple images of the environment.)
Ho, Blonde, Bousmalis, Ganin, Peng, and the instant application are analogous art because they are all directed to methods for generating and using image data representing environments in learning or decision-making tasks.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Ho, Blonde, Bousmalis, and Ganin to include the “method on adversarial learning problems in imitation learning” involving imagery disclosed by Peng. One would be motivated to do so to effectively enhance adversarial learning using the Variational Discriminator Bottleneck (VDB) method, as suggested by Peng ([Peng, page 7, sec 5] “We also show that the VDB improves the performance of inverse RL methods.”).
Regarding claim 14, Ho, Blonde, Bousmalis, Ganin, and Peng teach A method according to claim 13 (see rejection of claim 13).
Peng further teaches in which the state data for each tuple dataset comprises image data defining a plurality of images of the environment. ([Peng, page 7, sec 5], “We evaluate our method on adversarial learning problems in imitation learning, inverse reinforcement learning, and image generation. In the case of imitation learning, we show that the VDB enables agents to learn complex motion skills from a single demonstration, including visual demonstrations provided in the form of video clips.” wherein the examiner interprets visual demonstrations provided as video clips to be the same as “image data defining a plurality of images of the environment,” as a video is a sequence or collection of multiple images. The examiner further interprets a method on “adversarial learning problems in imitation learning and inverse reinforcement learning” to be the same as collecting “state data for each tuple dataset” as both capture the state of the environment).
Ho, Blonde, Bousmalis, Ganin, Peng, and the instant application are analogous art because they are all directed to methods for capturing environmental state data comprising multiple images for training models or for decision-making tasks.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 13 disclosed by Ho, Blonde, Bousmalis, Ganin, and Peng to include the “method on adversarial learning problems in imitation learning” involving imagery disclosed by Peng. One would be motivated to do so to effectively enhance adversarial learning using the Variational Discriminator Bottleneck (VDB) method with video (i.e., multiple images), as suggested by Peng ([Peng, page 7, sec 5] “we show that the VDB enables agents to learn complex motion skills from a single demonstration, including visual demonstrations provided in the form of video clips.”).
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Ho in view of Blonde in view of Bousmalis in view of Ganin, further in view of NPL reference “Learning Robust Rewards with Adversarial Inverse Reinforcement Learning” by Fu et al. (referred to herein as Fu).
Regarding claim 3, Ho, Blonde, Bousmalis, and Ganin teach A method according to claim 1 (see rejection of claim 1).
Ho, Blonde, Bousmalis, and Ganin do not teach in which the first tuples do not further include action data, and the second tuples do not further include action data generated by the neural network.
Fu teaches in which the first tuples do not further include action data, and the second tuples do not further include action data generated by the neural network. ([Fu, page 5, sec 2] “Theorem 5.2. If a reward function r′(s, a, s′) is disentangled for all dynamics functions, then it must be state-only, i.e., if for all dynamics T, Q*_{r′,T}(s, a) …, then r′ is only a function of state.”, wherein the examiner interprets the absence of action data in the first tuples and the absence of action data in the second tuples to be the same as a state-only reward function, as both are directed to exclusively state-based information.)
Ho, Blonde, Bousmalis, Ganin, Fu, and the instant application are analogous art because they are all directed to a method in which the tuples are state-only.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Ho, Blonde, Bousmalis, and Ganin to include the reward function disclosed by Fu. One would be motivated to do so to effectively enable the use of a reward function that is disentangled for all dynamics functions, as suggested by Fu ([Fu, page 5, sec 2] “If a reward function r′(s, a, s′) is disentangled for all dynamics functions, then it must be state-only”).
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Ho in view of Blonde in view of Bousmalis in view of Ganin, further in view of NPL reference “Goal-conditioned Imitation Learning” by Ding et al. (referred to herein as Ding).
Regarding claim 7, Ho, Blonde, Bousmalis, and Ganin teach A method according to claim 6 (see rejection of claim 6).
Ho, Blonde, Bousmalis, and Ganin do not teach in which, in each of the discriminator network update steps, the first subset of first tuple datasets are first tuple datasets for which the corresponding time index is below a first time threshold, and the second subset of second tuple datasets are tuple datasets for which the corresponding time index is below a second time threshold.
Ding teaches in which, in each of the discriminator network update steps, the first subset of first tuple datasets are first tuple datasets for which the corresponding time index is below a first time threshold, and the second subset of second tuple datasets are tuple datasets for which the corresponding time index is below a second time threshold. ([Ding, page 4], “The expert trajectories have been collected by asking the expert to reach a specific goal gj. But they are also valid trajectories to reach any other state visited within the demonstration! This is the key motivating insight to propose a new type of relabeling: if we have the transitions [media_image5.png: transition expression] in a demonstration, we can also consider the transition [media_image6.png, media_image7.png: relabeled transition expressions] as also coming from the expert! Indeed that demonstration also went through the state...so if that was the goal, the expert would also have generated this transition. This can be understood as a type of data augmentation leveraging the assumption that the tasks we work on are quasi-static.” wherein the examiner interprets “considering transitions within a demonstration as separate valid expert transitions” to be the same as “using subsets of tuple datasets whose time indices fall below designated thresholds” because both are directed to relabeling or selecting portions of a trajectory based on temporal progression through visited states).
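The relabeling insight quoted from Ding, that any state actually visited later in a demonstration can serve as the goal for the earlier portion of that trajectory, can be sketched as follows; the tuple fields are illustrative assumptions, not Ding's code.

    # Sketch of goal relabeling within a single expert demonstration.
    def relabel_demonstration(transitions):
        # transitions: list of (state, action, next_state) from one demonstration
        relabeled = []
        for k in range(1, len(transitions) + 1):
            goal = transitions[k - 1][2]  # a state actually visited at step k
            # Every transition up to step k is a valid expert transition
            # for reaching that goal.
            for (s, a, s_next) in transitions[:k]:
                relabeled.append((s, a, s_next, goal))
        return relabeled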
Ho, Blonde, Bousmalis, Ganin, Ding, and the instant application are analogous art because they are all directed to methods for updating discriminator networks based on subsets of datasets selected according to defined temporal index thresholds.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method according to claim 6 disclosed by Ho, Blonde, Bousmalis, and Ganin to include the “new type of relabeling” disclosed by Ding. One would be motivated to do so to efficiently augment training data, as suggested by Ding ([Ding, page 4] “data augmentation leveraging the assumption that the tasks we work on are quasi-static.”).
Claims 9-12 are rejected under 35 U.S.C. 103 as being unpatentable over Ho in view of Blonde in view of Bousmalis in view of Ganin further in view of NPL reference “Model-based Adversarial Imitation Learning” by Baram et al. (referred to herein as Baram), and, as to claim 11, further in view of Peng.
Regarding claim 9, Ho, Blonde, Bousmalis, and Ganin teach A method according to claim 6 (see rejection of claim 6).
Ho, Blonde, Bousmalis, and Ganin do not teach in which, for each action sequence, a corresponding third time threshold is determined, and the second tuple datasets of the action sequence employed in each discriminator network update step only include second tuple datasets for which the corresponding time index is below a corresponding third time threshold.
Baram teaches in which, for each action sequence, a corresponding third time threshold is determined, and the second tuple datasets of the action sequence employed in each discriminator network update step only include second tuple datasets for which the corresponding time index is below a corresponding third time threshold. ([Baram, page 5, sec 3.3], “We showed that effective imitation learning requires a) to use a model, and b) to process multistep transitions instead of individual state-action pairs. This setup was previously suggested by Shalev-Shwartz et al. [2016] and Heess et al. [2015], who tried to maximize R(π) by expressing it as a multi-step differentiable graph. Our method can be viewed as a variant of their idea when setting: r(s, a) = −D(s, a). This way, instead of maximizing the total reward, we minimize the total discriminator beliefs along a trajectory … Define J(θ) as the discounted sum of discriminator probabilities along a trajectory… J(θ) is calculated by applying Eq. 10 and 11 recursively, starting from t = T all the way down to t = 0.”, wherein the examiner interprets recursively calculating discriminator probabilities starting from time T, where T is a threshold, and moving downward, to be the same as “for each action sequence, a corresponding third time threshold is determined,” and using discriminator values only for time indices below that threshold in the discriminator update step).
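The quantity J(θ) quoted from Baram, the discounted sum of discriminator probabilities computed recursively from t = T down to t = 0, can be sketched as follows; d_values[t] stands for D(st, at) and is an assumed input.

    # Sketch of the recursive backward computation of J(θ).
    def total_discriminator_belief(d_values, gamma=0.99):
        j = 0.0
        for t in reversed(range(len(d_values))):  # t = T, ..., 0
            j = d_values[t] + gamma * j
        return j  # the total discriminator belief minimized along the trajectory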
Ho, Blonde, Bousmalis, Ganin, Baram, and the instant application are analogous art because they are all directed to methods for enhancing imitation learning by selectively utilizing discriminator updates based on threshold-based filtering of action sequences.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method according to claim 6 disclosed by Ho, Blonde, Bousmalis, and Ganin to include the “process multistep transitions instead of individual state-action pairs” disclosed by Baram. One would be motivated to do so to effectively reduce discriminator uncertainty along action trajectories, as suggested by Baram ([Baram, page 5, sec 3.3] “instead of maximizing the total reward, we minimize the total discriminator beliefs along a trajectory”).
Regarding claim 10, Ho, Blonde, Bousmalis, Ganin, and Baram teach A method according to claim 9 (see rejection of claim 9).
Ho further teaches comprising a step of, for each action sequence, selecting the third time threshold for the action sequence based on imitation values for at least a plurality of the second tuple datasets of the action sequence. ([Ho, page 6, sec 5], “The discriminator network can be interpreted as a local cost function providing learning signal to the policy—specifically, taking a policy step that decreases expected cost with respect to the cost function c(s, a) = log D(s, a) will move toward expert-like regions of state-action space, as classified by the discriminator. “ and “GAIL solves Eq. (15) by finding a saddle point (π, D) of the expression
Eπ[log(D(s, a))] + EπE[log(1 − D(s, a))] − λH(π)
with both π and D represented”, wherein the examiner interprets the “discriminator value” and “saddle point” to be the same as the claimed “imitation value” and time threshold selection, as both are related to finding the similarity between tuples of data and a threshold for the action sequence, where D(s, a) represents the discriminator output for a state-action pair (s, a)).
Regarding claim 11, Ho, Blonde, Bousmalis, Ganin, and Baram teach A method according to claim 10 (see rejection of claim 10).
Peng further teaches in which the third time threshold is set as the smallest time index such that a certain number Tpatience of the most recent imitation values is above an imitation quality threshold. ([Peng, page 18, sec C], “Gradient descent with momentum 0.9 is used for all models. The PPO clipping threshold is set to 0.2. When evaluating the performance of the policies, each episode is simulated for a maximum horizon of 20s. Early termination is triggered whenever the character’s torso contacts the ground, leaving the policy with a maximum error of π radians for all remaining timesteps.”, wherein the examiner interprets “the smallest time index such that a certain number of recent imitation values is above an imitation quality threshold” to be the same as the stopping condition associated with Tpatience that ensures the imitation values are above an imitation quality threshold, as both check performance at each time step and terminate once a threshold condition is met.)
Ho, Blonde, Bousmalis, Ganin, Baram, Peng, and the instant application are analogous art because they are all directed to evaluating and improving the imitation quality of a learned policy through trajectory analysis over time.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 11 disclosed by Ho, Blonde, Bousmalis, Ganin, and Baram to include the “early termination” trigger disclosed by Peng. One would be motivated to do so to efficiently establish a clear performance-based stopping point during trajectory evaluation, as suggested by Peng ([Peng, page 18, sec C] “leaving the policy with a maximum error of π radians for all remaining timesteps.”).
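For illustration, a minimal Python sketch of a patience-style rule for selecting the third time threshold, assuming “above” means that every one of the Tpatience most recent imitation values exceeds the quality threshold (this reading, and all values below, are illustrative assumptions):

    def select_third_time_threshold(imitation_values, t_patience, quality_threshold):
        # Return the smallest time index t such that the t_patience most recent
        # imitation values (indices t - t_patience + 1 .. t) all exceed the
        # imitation quality threshold; None if no such index exists.
        for t in range(t_patience - 1, len(imitation_values)):
            window = imitation_values[t - t_patience + 1 : t + 1]
            if all(v > quality_threshold for v in window):
                return t
        return None

    values = [0.2, 0.5, 0.7, 0.8, 0.9, 0.85, 0.4]
    print(select_third_time_threshold(values, t_patience=3, quality_threshold=0.6))  # -> 4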
Regarding claim 12, Ho, Blonde, Bousmalis, Ganin, and Baram teach A method according to claim 11 (see rejection of claim 11).
Blonde further teaches in which the imitation quality threshold is based on the imitation values of a plurality of second tuples of that action sequence having a time index below the third time threshold. ([Blonde, page 3, sec 3], “We now introduce additional concepts and notations that will be used in the remainder of this work. The return is the total discounted reward from timestep t, onwards: ... The state-action value, or Q-value, is the expected return after picking action at in state st, and thereafter following policy πθ: Qπθ(st, at) ... E>tπθ[·] denotes the expectation taken along trajectories generated by πθ in M+ (respectively E>tπe[·] for πe in M), looking onwards from state st and action at. We want our agent to find a policy πθ that maximizes the expected return from the start state, which constitutes our performance objective...”, wherein the examiner interprets the expectation of returns taken along trajectories generated from state-action pairs occurring after timestep t, where timestep t is chosen to maximize the expected return and thereby becomes the threshold, to be the same as the “imitation quality threshold” being based on the imitation values of multiple second tuples having “a time index below the third time threshold”.)
Ho, Blonde, Bousmalis, Ganin, Baram, and the instant application are analogous art because they are all directed to methods for determining imitation thresholds based on evaluating state-action pairs in action sequences along trajectories.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 11 disclosed by Ho, Blonde, Bousmalis, Ganin, and Baram to include the calculation of the “state-action value, or Q-value, [which] is the expected return after picking action at in state st, and thereafter following policy πθ” disclosed by Blonde. One would be motivated to do so to effectively maximize the expected performance of imitation policies, as suggested by Blonde ([Blonde, page 3, sec 3] “We want our agent to find a policy πθ that maximizes the expected return from the start state, which constitutes our performance objective.”).
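For illustration, a minimal Python sketch combining the two ideas mapped above: a Blonde-style discounted return over a trajectory segment, and a quality threshold derived from the imitation values of tuples below the third time threshold (deriving the threshold via the mean is an illustrative assumption, not taken from Blonde):

    import numpy as np

    def discounted_return(values, gamma=0.99):
        # Total discounted value from the first timestep onward, accumulated
        # backward in the usual way: G_t = v_t + gamma * G_{t+1}.
        G = 0.0
        for v in reversed(values):
            G = v + gamma * G
        return G

    # Imitation values for the second tuples of one action sequence.
    imitation_values = [0.3, 0.5, 0.6, 0.7, 0.75, 0.2, 0.1]

    # Only tuples with time index below the third time threshold T contribute.
    T = 5
    below_T = imitation_values[:T]

    # One plausible reading of claim 12: base the imitation quality threshold
    # on those values, here via their mean.
    quality_threshold = float(np.mean(below_T))
    print(f"quality threshold = {quality_threshold:.3f}, "
          f"discounted return below T = {discounted_return(below_T):.3f}")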
Claims 15 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Ho in view of Blonde in view of Bousmalis in view of Ganin in view of Peng further in view of NPL reference “Model-based Adversarial Imitation Learning” by Baram et al. (referred to herein as Baram).
Regarding claim 15, Ho, Blonde, Bousmalis, Ganin, and Peng teach A method according to claim 13 (see rejection of claim 13).
Ho, Blonde, Bousmalis, Ganin, and Peng do not teach in which, during at least one of (i) one or more of the neural network update steps, a modified form of the second tuple datasets is generated by making a modification to the state data of the second tuple datasets, and (ii) one or more of the discriminator network update steps, a modified form of the first and/or second tuple datasets is generated by making a modification to the state data of one or more of the first and/or second tuple datasets.
Baram teaches in which, during at least one of (i) one or more of the neural network update steps, a modified form of the second tuple datasets is generated by making a modification to the state data of the second tuple datasets, and (ii) one or more of the discriminator network update steps, a modified form of the first and/or second tuple datasets is generated by making a modification to the state data of one or more of the first and/or second tuple datasets. ([Baram, page 6, sec 3], “Define J(θ) as the discounted sum of discriminator probabilities along a trajectory. Following the results of Heess et al. [2015], we write the derivatives of J over a (s, a, s') transition in a recursive manner:
[Equations 10 and 11 of Baram (media_image9.png): recursive expressions for the derivatives of J over a (s, a, s′) transition]
The final gradient Jθ is calculated by applying Eq. 10 and 11 recursively, starting from t = T all the way down to t = 0. The full algorithm is presented in Algorithm 1.”, wherein the examiner interprets recursively adjusting gradients and state information (using the forward-model derivative terms fs and fa) during policy and discriminator updates to be the same as generating a modified form of tuple datasets by making “a modification to the state data” during neural network and discriminator network update steps. Note: D in the equations above is the discriminator neural network.)
Ho, Blonde, Bousmalis, Ganin, Peng, Baram, and the instant application are analogous art because they are all directed to methods for updating neural networks and discriminator networks by modifying datasets based on state data during learning.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 13 disclosed by Ho, Blonde, Bousmalis, Ganin, and Peng to include the discriminator update process disclosed by Baram. One would be motivated to do so to efficiently calculate the discounted sum of discriminator probabilities, as suggested by Baram ([Baram, page 6, sec 3] “The final gradient Jθ is calculated by applying Eq. 10 and 11 recursively, starting from t = T all the way down to t = 0.”).
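For illustration, a minimal Python sketch of the claimed modification step itself: during an update step, a modified form of the tuple datasets is produced by altering only their state data. The additive-noise perturbation is an illustrative assumption; Baram's actual mechanism is the forward-model recursion quoted above.

    import numpy as np

    rng = np.random.default_rng(2)

    def modify_state(state, noise_scale=0.1):
        # Produce a modified form of a tuple's state data; the perturbation
        # model here is purely illustrative.
        return state + rng.normal(scale=noise_scale, size=state.shape)

    # Second tuple datasets as (state, action, time_index) triples.
    tuples = [(rng.normal(size=4), rng.normal(size=2), t) for t in range(8)]

    # During a discriminator (or policy) network update step, train on copies
    # whose state data has been modified; actions and time indices are untouched.
    modified_tuples = [(modify_state(s), a, t) for (s, a, t) in tuples]
    print(modified_tuples[0][0])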
Regarding claim 17, Ho, Blonde, Bousmalis, Ganin, Peng, and Baram teach A method according to claim 15 (see rejection of claim 15).
Peng further teaches in which the state data for each tuple dataset comprises image data defining a plurality of images of the environment and in which the modification comprises removing the image data for one or more of the images of the state data. ([Peng, page 2, sec 1], “In this work, we propose a simple regularization technique for adversarial learning, which constrains the information flow from the inputs to the discriminator using a variational approximation to the information bottleneck. By enforcing a constraint on the mutual information between the input observations and the discriminator’s internal representation, we can encourage the discriminator to learn a representation that has heavy overlap between the data and the generator’s distribution, thereby effectively modulating the discriminator’s accuracy and maintaining useful and informative gradients for the generator. Our approach to stabilizing adversarial learning can be viewed as an adaptive variant of instance noise (Salimans et al., 2016; Sønderby et al., 2016; Arjovsky & Bottou, 2017). However, we show that the adaptive nature of this method is critical. Constraining the mutual information between the discriminator’s internal representation and the input allows the regularizer to directly limit the discriminator’s accuracy, which automates the choice of noise magnitude and applies this noise to a compressed representation of the input that is specifically optimized to model the most discerning differences between the generator and data distributions. The main contribution of this work is the variational discriminator bottleneck (VDB), an adaptive stochastic regularization method for adversarial learning that substantially improves performance across a range of different application domains, examples of which are available in Figure 1. Our method can be easily applied to a variety of tasks and architectures. First, we evaluate our method on a suite of challenging imitation tasks, including learning highly acrobatic skills from mocap data with a simulated humanoid character. Our method also enables characters to learn dynamic continuous control skills directly from raw video demonstrations, and drastically improves upon previous work that uses adversarial imitation learning.” wherein the examiner interprets applying noise to a compressed representation of input observations to limit discriminator accuracy, thereby reducing available input information, to be the same as “the modification comprises removing the image data for one or more of the images of the state data,” as both are directed to selectively reducing input information provided to the discriminator to ensure it will focus on the specific tasks desired.)
Ho, Blonde, Bousmalis, Ganin, Peng, Baram, and the instant application are analogous art because they are all directed to selectively modifying state data to improve adversarial training in imitation learning tasks.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 15 disclosed by Ho, Blonde, Bousmalis, Ganin, Peng, and Baram to include the “variational discriminator bottleneck (VDB)” disclosed by Peng. One would be motivated to do so to effectively improve performance across a range of different application domains, as suggested by Peng ([Peng, page 2, sec 1] “substantially improves performance across a range of different application domains”).
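For illustration, a minimal Python sketch of the claim 17 limitation as the examiner maps it: state data comprising a plurality of images, with the modification removing the image data for one or more of those images. Zeroing the dropped frames, rather than deleting them from the stack, is an illustrative implementation choice, not a detail from Peng.

    import numpy as np

    rng = np.random.default_rng(3)

    # State data as a stack of frames: (num_images, height, width).
    state_images = rng.random((4, 8, 8))

    def drop_frames(images, drop_indices):
        # Remove the image data for the selected frames by zeroing them out.
        out = images.copy()
        out[sorted(drop_indices)] = 0.0
        return out

    modified_state = drop_frames(state_images, drop_indices={1, 3})
    print(modified_state.sum(axis=(1, 2)))  # dropped frames sum to 0.0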
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Ho in view of Blonde in view of Bousmalis in view of Ganin in view of Peng in view of Baram further in view of NPL reference “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World” by Tobin et al. (referred to herein as Tobin).
Regarding claim 16, Ho, Blonde, Bousmalis, Ganin, Peng, and Baram teach A method according to claim 15 (see rejection of claim 15).
Ho, Blonde, Bousmalis, Ganin, Peng, and Baram do not teach in which the modification comprises applying to the image data one or more modifications selected from the set comprising: brightness changes; contrast changes; saturation changes; cropping; rotation; and addition of noise.
Tobin teaches in which the modification comprises applying to the image data one or more modifications selected from the set comprising: brightness changes; contrast changes; saturation changes; cropping; rotation; and addition of noise. ([Tobin, page 3, sec III], “The purpose of domain randomization is to provide enough simulated variability at training time such that at test time the model is able to generalize to real-world data. We randomize the following aspects of the domain for each sample used during training: • Number and shape of distractor objects on the table • Position and texture of all objects on the table • Textures of the table, floor, skybox, and robot • Position, orientation, and field of view of the camera • Number of lights in the scene • Position, orientation, and specular characteristics of the lights • Type and amount of random noise added to images Since we use a single monocular camera image from an uncalibrated camera to estimate object positions, we fix the height of the table in simulation, effectively creating a 2D pose estimation task. Random textures are chosen among the following: (a) A random RGB value (b) A gradient between two random RGB values (c) A checker pattern between two random RGB values.” wherein the examiner interprets randomizing textures and positions, camera orientation, lighting characteristics, and particularly adding random noise to images to be the same as applying modifications comprising brightness changes, contrast changes, saturation changes, cropping, rotation, and addition of noise.).
Ho, Blonde, Bousmalis, Ganin, Peng, Baram, Tobin, and the instant application are analogous art because they are all directed to modifying image data to enhance robustness of machine learning models by applying transformations such as brightness, contrast, rotation, and noise addition.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 15 disclosed by Ho, Blonde, Bousmalis, Ganin, Peng, and Baram to include the teaching that “domain randomization is to provide enough simulated variability at training time such that at test time the model is able to generalize to real-world data” disclosed by Tobin. One would be motivated to do so to effectively improve the generalization capabilities of machine learning models, as suggested by Tobin ([Tobin, page 3, sec III] “provide enough simulated variability at training time such that at test time the model is able to generalize to real-world data.”).
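For illustration, a minimal Python sketch applying several of the claimed modifications (brightness change, contrast change, rotation, cropping, addition of noise) to a toy image; the parameter ranges are illustrative assumptions and are not taken from Tobin:

    import numpy as np

    rng = np.random.default_rng(4)

    def augment(img):
        # img: HxW array with values in [0, 1].
        out = img * rng.uniform(0.8, 1.2)                              # brightness change
        out = (out - out.mean()) * rng.uniform(0.8, 1.2) + out.mean() # contrast change
        out = np.rot90(out, k=int(rng.integers(0, 4)))                # rotation (90-degree steps)
        h, w = out.shape
        ch, cw = h - 2, w - 2                                          # crop target size
        top = int(rng.integers(0, h - ch + 1))
        left = int(rng.integers(0, w - cw + 1))
        out = out[top : top + ch, left : left + cw]                    # random crop
        out = out + rng.normal(scale=0.05, size=out.shape)             # addition of noise
        return np.clip(out, 0.0, 1.0)

    image = rng.random((16, 16))
    print(augment(image).shape)  # (14, 14)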
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DEVAN KAPOOR whose telephone number is (703)756-1434. The examiner can normally be reached Monday - Friday: 9:00AM - 5:00 PM EST (times may vary).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DEVAN KAPOOR/Examiner, Art Unit 2126
/DAVID YI/Supervisory Patent Examiner, Art Unit 2126