Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
Figure 1 should be designated by a legend such as --Prior Art-- because only that which is old is illustrated. See MPEP § 608.02(g). Corrected drawings in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. The replacement sheet(s) should be labeled “Replacement Sheet” in the page header (as per 37 CFR 1.84(c)) so as not to obstruct any portion of the drawing figures. If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
The following title is suggested: ‘Reinforcement Learning Device and Method for Unsupervised Skill Discovery via State Transition Linearization’.
The disclosure is objected to because of the following informalities:
¶3: “classifying input patterns into specific group…” should be ‘classifying input patterns into a specific group…’ or ‘classifying input patterns into specific groups…’
¶3: “neural networks may use an algorithm that with a learning ability” should be ‘neural networks may use an algorithm that has a learning ability’
¶25: “using goal determining model” should be ‘using a goal determining model’
Appropriate correction is required.
Claim Rejections - 35 USC § 112(a) or 35 USC § 112, first paragraph
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 1-23 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the enablement requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to enable one skilled in the art to which it pertains, or with which it is most nearly connected, to make and/or use the invention.
Claim 1 recites the limitations:
determine, based on the state and the determined goal, an action causing a linear state transition of the electronic device in a direction toward the determined goal in a state…
Claim 11 recites the limitations:
determining an action causing a linear state transition of the electronic device in a direction toward the determined goal in a state space based on the state and the determined goal…
Claim 21 recites the limitations:
determine, using an action determining model, an action causing a state transition of the electronic device based on the state and the determined goal
These limitations, when tied to the broad scope of the other claimed limitation of "an environment interactable with the electronic device" (which includes complex physical-world dynamics), may be physically impossible to satisfy or may require undue experimentation to achieve.
To satisfy the enablement requirement of 35 U.S.C. § 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, the specification must teach those skilled in the art how to make and use the full scope of the claimed invention without "undue experimentation" (see MPEP § 2161.01(III)). Given how broad the claim limitations are, the full scope of the claimed invention would necessarily cover determining skills and actions in a myriad of scenarios in a complex physical world.
A showing of undue experimentation is given below for the recited claim limitations based on some of the factors cited in MPEP §2164.01(a) as pertaining to In re Wands.
(A) ‘The breadth of the claims’…The claims are exceedingly broad. The independent claims recite an "electronic device" operating in any "environment interactable with the electronic device." This encompasses everything from a simple 2D computer simulation to a complex, chaotic physical environment (e.g., a drone flying in a hurricane, a robot walking on ice). The claims require the device to determine an action that causes a "linear state transition." In many physical environments covered by the claims, the dynamics are inherently non-linear due to gravity, friction, Coriolis forces, and external perturbations. A linear transition in the "state space" (which the specification defines as including physical locations) may be physically impossible to execute regardless of the action chosen, as illustrated by the non-limiting sketch following this factor analysis. The claims do not limit the environment to one where linearization is feasible.
(B) ‘The nature of the invention’… The invention lies in the field of Deep Reinforcement Learning (DRL) and Robotics Control. The claims, however, recite causing a linear transition in a "state space" based on an "observed state." If interpreted as the physical observation space (e.g., GPS coordinates, joint angles), the prior art suggests this is generally not achievable for complex robots without transformation (encoding), which is not explicitly required in the independent claim.
(E) ‘The level of predictability in the art’… Reinforcement Learning is considered an unpredictable art. Training stability is notoriously sensitive to hyperparameters, reward functions, and environmental seeds. Whether a "Linearizer" model (as described in the spec) can successfully learn to force a linear trajectory in a novel, untested environment is highly unpredictable. There is no guarantee that a mathematical "objective function" (Eq. 3 in the specification) will converge to a policy that actually achieves linearity in practice.
(F) ‘The amount of direction provided by the inventor’…The specification provides the mathematical objective function (minimizing the difference between the actual transition and a linear path) but lacks guidance on handling failure cases. The specification describes training a model to maximize a reward based on linearity. It does not teach how to ensure the action causes a linear transition, which is what the claim requires. In many cases, the robot might try its best (maximize reward) but still fail to move linearly due to physics. The disclosure does not provide direction on how to enable the invention in these "failure" scenarios which are encompassed by the broad claims.
(G) ‘The existence of working examples’… The specification appears to rely on simulated environments (likely MuJoCo tasks like 'Ant', 'HalfCheetah' mentioned in the analogous IBOL paper) which are simplified physics engines. There is no evidence of working examples in real-world physical robots. Simulation to real-world transfer is a major hurdle in robotics. A method that works in a deterministic simulation (where physics can be idealized) often fails in the stochastic real world. The lack of real-world examples weakens the enablement for the broad "electronic device" claim that implies physical hardware (sensors/actuators).
(H) ‘The quantity of experimentation needed to make or use the invention based on the content of the disclosure’… Undue Experimentation is required. To practice the full scope of Claim 1 (e.g., making a robot dog walk in a perfectly straight line in state space across rough terrain), a PHOSITA would need to: 1) Design a state observer that maps complex physics to a space where linearity is possible (not taught in the claim), 2) Heavily tune the "Action Determining Model" to fight natural dynamics, 3) Likely discover that for many environments covered by the claim, the invention is inoperable (the robot falls over or cannot move linearly). Finding the subset of environments and hyperparameters where this "linearization" actually works would require extensive, iterative research, amounting to undue experimentation.
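For clarity of the record, the following non-limiting sketch (the examiner's own Python illustration, using a hypothetical toy pendulum rather than any environment of record) shows why, under non-linear dynamics with a bounded action range, no available action produces an exactly linear state transition along a commanded direction in state space:

import numpy as np

def pendulum_step(theta, omega, torque, dt=0.05, g=9.81, length=1.0):
    # Toy non-linear dynamics (a pendulum under gravity); the state is (theta, omega).
    omega_next = omega + dt * (-(g / length) * np.sin(theta) + torque)
    theta_next = theta + dt * omega_next
    return np.array([theta_next, omega_next])

state = np.array([1.0, 0.0])        # current state (theta = 1 rad, at rest)
goal_dir = np.array([1.0, 0.0])     # commanded "linear" direction in state space

deviations = []
for torque in np.linspace(-2.0, 2.0, 401):                     # bounded action range
    delta = pendulum_step(state[0], state[1], torque) - state  # actual transition
    off_axis = delta - np.dot(delta, goal_dir) * goal_dir      # non-linear residue
    deviations.append(float(np.linalg.norm(off_axis)))

print(min(deviations) > 0)  # True: every available action leaves an off-axis component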
Therefore, practicing the full scope of the independent claims would require undue experimentation, and the independent claims fail to comply with the enablement requirement.
The dependent claims are likewise rejected under the enablement requirement by virtue of their dependency from the rejected independent claims.
Claim Rejections - 35 USC § 112(b) or 35 USC § 112, second paragraph
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-23 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Per claims 1, 11 and 21, “determine a skill based on the observed state” (claims 1 and 11) and “determine…a skill based on a state of the electronic device observed according to an environment interactable with the electronic device” (claim 21) are claimed. The term "skill" is abstract and arguably subjective. In the specification at ¶53, it is defined mathematically as a latent vector z. However, in claims 1, 11 and 21, without the "vector" limitation, it could be interpreted as a human capability, e.g., "cooking". Hence, the metes and bounds of "skill" are unclear, i.e., is a “skill” a data structure, a program, or a physical motion? To expedite prosecution, the Examiner interprets the claims to read, ‘determine a skill vector representing a skill’.
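For further clarity of the interpretation above, a non-limiting Python sketch follows; the names and the dimensionality are the examiner's own illustration and are not drawn from the applicant's disclosure:

import numpy as np

# Under the examiner's interpretation, a "skill" is a concrete data structure:
# a continuous latent vector z in R^d (spec. ¶53), not a human capability such
# as "cooking". The dimensionality below is illustrative only.
d = 2                                  # skill latent dimensionality (hypothetical)
rng = np.random.default_rng(0)
skill_vector = rng.standard_normal(d)  # z ~ N(0, I): a d-dimensional vector
print(skill_vector.shape)              # (2,)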
Per claims 5, 7 and 15, “determining a goal state vector…using a goal determining model based on machine learning” is claimed. However, the specification appears to state that the goal determining model is used to determine a ‘goal skill vector’ rather than a goal state vector. For instance, ¶79 recites, “each dimension of the goal skill vector in the goal latent space 310, which is determined using the goal determining model, may have a value between -1 and 1, inclusive”. It is unclear and indefinite whether the ‘goal skill vector’ of the specification is the same as the claimed ‘goal state vector’.
Per claim 21, this claim is a hybrid method and apparatus claim. A single claim that recites both a method and the apparatus for performing that method is indefinite under § 112(b). MPEP § 2173.05(p)(II) states: “A single claim which claims both an apparatus and the method steps of using the apparatus is indefinite under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph”. The claim should be rewritten as either a pure method claim (with active method steps) or a pure apparatus claim.
Claims 22-23 recite the limitation "The electronic device of claim…" in the preamble. There is insufficient antecedent basis for this limitation in the claims. To expedite prosecution, the Examiner interprets this to mean ‘The processor-implemented method of claim…’.
Dependent claims 2-10, 12-19, 22 and 23 are rejected as being dependent upon a rejected base claim.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-23 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kim et al., "Unsupervised Skill Discovery with Bottleneck Option Learning" (hereinafter Kim).
Per claim 1, Kim discloses An electronic device (Section I.2…experimental environment contains electronic devices (computers/processors) running MuJoCo robot simulations (Ant, HalfCheetah, Hopper, Humanoid, D'Kitty), “We experiment with robot simulation environments in Mu-JoCo (Todorov et al., 2012): Ant, HalfCheetah, Hopper and Humanoid from OpenAI Gym (Brockman et al., 2016) adopting the configurations by Sharma et al. (2020b) and D’Kitty with random dynamics from ROBEL (Ahn et al., 2020) with the setups provided by Sharma et al. (2020a)”), comprising:
a state observer configured to observe a state of the electronic device according to an environment interactable with the electronic device (Section 2…observing states st at time t of the device/agent from an environment, “We consider a Markov Decision Process M = (S, A, p)…S and A respectively denote the state and action space…st, st+1 ∈ S”; Section 3.1-3.2…the agent observes its state at each time step st according to the environment it interacts with, “We define the linearizer πlin(at | st, gt) as a goal-conditioned policy (Schaul et al., 2015), which takes both a state st ∈ S and a goal gt ∈ G as input and outputs a probability distribution over actions at ∈ A”. Section 4…D'Kitty uses physical sensors to observe state, “We use D’Kitty with the random dynamics setting; in each episode, multiple properties of the environment, such as its joint dynamics, friction and height field, are randomized, which provides an additional challenge to agents”);
one or more processors (Section I.3…the skill is determined from observed states via the trajectory encoder, a neural network inherently running on processors, “For the trajectory encoder of IBOL and VALOR, we use a bidirectional LSTM with a 512-dimensional hidden layer followed by two 512-dimensional FC layers”) configured to:
determine a skill based on the observed state (Section 3.2… determining a skill (skill latent z) based on observed states, where the trajectory encoder pФ(z|s0:T) encodes observed state trajectories into skill latent z ∈ ℝd, “We encode the state trajectory s0:T... because an outside observer can only see the agent’s state”; Section 2…”we formulate the unsupervised skill discovery problem as learning a latent-conditioned skill policy π(at | st, z) where z ∈ Z represents the skill latent. We consider continuous skill latents z ∈ ℝd”);
determine a goal based on the determined skill and the observed state (Section 3.2…the skill policy πθz(gt | st, z) determines a goal gt based on the observed state st and the determined skill latent z, outputting goals conditioned on both state and skill, “The skill policy πθz(gt | st, z) learns to imitate the skills given their latents”); and
determine, based on the state and the determined goal, an action causing a linear state transition of the electronic device in a direction toward the determined goal in a state space (Section 3.1…the linearizer πlin(at | st, gt) determines an action based on the state and goal, causing a linear state transition toward the goal, 'The linearizer πlin is a pre-trained low-level policy that aims to "linearize" the environment dynamics'; Equation 1…the reward Rlin is the inner product of the goal and the state difference, i.e., the action causes a state change in the direction of the goal; see also the non-limiting illustrative sketch following this claim mapping); and
a controller configured to control an operation of the electronic device based on the determined action (Section 3.1…a controller inherently exists to execute actions to control the electronic device based on the linearizer outputting raw actions at ∈ A applied to the MuJoCo environment actuators, “The linearizer… takes as input goals produced by IBOL’s policies for skill discovery (will be discussed in Section 3.2), and translates them into raw actions in the direction of a given goal while interacting with the environment…”; Abstract…“We empirically demonstrate that IBOL outperforms multiple state-of-the-art unsupervised skill discovery methods on the information-theoretic evaluations and downstream tasks in MuJoCo environments, including Ant, HalfCheetah, Hopper and D’Kitty”; Section 4.2…for D'Kitty, physical actuators control robot movement, “We compare the skill policies trained by IBOL, DIAYN-L, VALOR-L and DADS-L with d = 2…We use the three evaluation metrics…keeping only the state dimensions for the agent’s locomotion (i.e. x-y coordinates for Ant and D’Kitty and x for the rest) denoted as (loc). One rationale behind it is that the algorithms on the linearized environments successfully discover the locomotion skills (e.g. Figure 1)”).
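For clarity of the record, the examiner's understanding of the cited Kim Equation (1) reward may be illustrated by the following non-limiting Python sketch; the function name, shapes, and values are the examiner's own and are not taken from Kim's released code:

import numpy as np

def linearizer_reward(s_start, s_end, goal, ell):
    """Examiner's illustration of the cited Kim Equation (1):
    R_lin = (1/ell) * (s_{(c+1)*ell} - s_{c*ell})^T g_t,
    i.e., the state change over one macro step projected onto the goal
    direction. The function name and shapes are the examiner's own."""
    return float((s_end - s_start) @ goal) / ell

# Toy usage: a macro-step transition aligned with the goal earns a positive
# reward; a transition opposing the goal earns a negative reward.
ell = 10
goal = np.array([1.0, 0.0])
print(linearizer_reward(np.zeros(2), np.array([2.0, 0.0]), goal, ell))   # 0.2
print(linearizer_reward(np.zeros(2), np.array([-2.0, 0.0]), goal, ell))  # -0.2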
Per claim 2, Kim discloses claim 1, further disclosing for the observing, the state observer is configured to perform either one or both of sensing a change in a physical environment for the electronic device and collection of a data change related to a virtual environment (Section 4…application in both virtual and physical environments, where, for the virtual environment, the MuJoCo simulation environments (Ant, HalfCheetah, Hopper, Humanoid) are applied and data changes are collected as state observations, and, for the physical environment, D'Kitty from ROBEL, a physical robot benchmark (Section 4, Ahn et al., 2020), and 'Emergent real-world robotic skills' (Sharma et al., 2020a) are used; states are observed/sensed at each time step from these environments).
Per claim 3, Kim discloses claim 1, further disclosing for the determining of the skill, the one or more processors are configured to determine a skill vector representing the skill to be applied to the observed state, based on a state vector representing the observed state, using a skill determining model based on machine learning (Section 3.2…the trajectory encoder pФ(z|s0:T) which determines a skill vector z ∈ ℝd based on state vectors representing the observed states, “the trajectory encoder pФ(z|s0:T): ST+1 → P(Z) that maps a state trajectory to a probability distribution over skill latents z”; Section I.3…skill determining model for determining the skill is implemented as a bidirectional LSTM neural network (machine learning model)).
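By way of non-limiting illustration only, the cited bidirectional-LSTM trajectory encoder may be sketched as follows (PyTorch is used for convenience; the layer sizes echo Kim Section I.3, but the class, the time-pooling choice, and all names are the examiner's own, not Kim's implementation):

import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Maps a state trajectory s_0:T to a distribution over skill latents z
    (a mean and log-variance), consistent with the cited p(z | s_0:T)."""
    def __init__(self, state_dim, skill_dim, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 2 * skill_dim),        # mean and log-variance of z
        )

    def forward(self, states):                    # states: (batch, T+1, state_dim)
        out, _ = self.lstm(states)
        summary = out.mean(dim=1)                 # pool over time (illustrative choice)
        mean, log_var = self.head(summary).chunk(2, dim=-1)
        return mean, log_var

# Toy usage: encode a batch of 4 trajectories of length 50 with 29-dim states.
enc = TrajectoryEncoder(state_dim=29, skill_dim=2)
mean, log_var = enc(torch.randn(4, 50, 29))
print(mean.shape, log_var.shape)                  # torch.Size([4, 2]) twice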
Per claim 4, Kim discloses claim 1, further disclosing control the controller with an action determined using an action determining model and a goal determining model based on a temporary skill determined using a skill determining model for an observed state (Section 3.3 and Algorithm 2…controlling the agent with actions determined using the linearizer (action determining model) and skill policy (goal determining model) based on a temporary skill, where during training (Algorithm 2), the sampling policy πθs(gt|st, u) uses a temporary context parameter u sampled from the prior p(u), producing goals that the linearizer converts to actions, “we introduce a new context parameter u ∈ U with its prior p(u) to the sampling policy, redefining it as πθs(gt | st, u) : S × U → P(G). The new parameter u for πθs plays a similar role to the skill latent z for πθz”); determine a reward according to a state transition by the action performed by the controller (Section 3.1 and Algorithm 1…determining a reward based on state transitions per reward function in Equation (1): Rlin(st, gt, at, st+1) = (1/ℓ)(s(c+1)·ℓ − sc·ℓ)ᵀgt, “Algorithm 1…Compute linearizer reward Rlin using Equation (1)”, where the reward is computed from the state transition caused by the action); and update a parameter of the skill determining model based on the determined reward (Section 3.1 and Algorithm 1… “Update πlin using collected samples from replay buffer with SAC”; Algorithm 2…all components (πθs, pФ, πθz) are jointly updated with gradients derived from the objective defined in Equation (5), which incorporates the reward signal through the training pipeline; thus, the skill determining model parameters are updated based on the training signal that derives from the reward-driven linearizer and the IB objective).
Per claim 5, Kim discloses claim 1, further disclosing for the determining of the goal, the one or more processors are configured to determine a goal state vector representing the goal (Section 3.1…gt is a goal state vector representing the goal, "goal gt ∈ G as input and outputs a probability distribution over actions at ∈ A. The goal space G is defined as G = [-1, 1]dim(S), which has the same dimensionality as the state space (up to 47 in our experiments). Each goal dimension provides a signal for the direction in the corresponding state dimension"), based on a state vector representing the observed state and a skill vector representing the determined skill, using a goal determining model based on machine learning (Section 3.2…the skill policy πθz(gt|st, z) which determines the goal state vector gt ∈ G = [−1, 1]dim(S) based on state vector st and skill vector z, using a neural network. Section I.3…implemented as an MLP with two hidden layers of 512 dimensions).
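By way of non-limiting illustration, the cited skill policy (goal determining model) may be sketched as follows; the tanh output matches the cited goal space G = [-1, 1]dim(S), while the class, hidden sizes, and all names are the examiner's own sketch rather than Kim's implementation:

import torch
import torch.nn as nn

class GoalPolicy(nn.Module):
    """Illustrative skill policy pi(g_t | s_t, z): concatenates the state
    vector and the skill vector and outputs a goal vector bounded to [-1, 1]
    per dimension. Hidden sizes echo Kim Section I.3; the code is a sketch."""
    def __init__(self, state_dim, skill_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim), nn.Tanh(),  # bounds each goal dim to [-1, 1]
        )

    def forward(self, state, skill):
        return self.net(torch.cat([state, skill], dim=-1))

# Toy usage: one 29-dim state and one 2-dim skill vector yield a bounded goal.
goal = GoalPolicy(state_dim=29, skill_dim=2)(torch.randn(1, 29), torch.randn(1, 2))
print(goal.shape, float(goal.abs().max()) <= 1.0)     # torch.Size([1, 29]) True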
Per claim 6, Kim discloses claim 1, further disclosing determine a goal state trajectory using the controller and an action determining model for sample goals extracted from randomly extracted sample skills using a goal sampling model (Section 3 and Algorithm 2…determining goal state trajectories via the sampling policy and linearizer, "Algorithm 2…Sample trajectory using πθs on top of πlin"; Section 3.3…sampling policy πθs(gt|st, u) produces goal trajectories for randomly extracted sample skills, where p(u) = N(0, I), "Since the sampling policy πθs(gt|st, u) lacks a variable about the context (e.g. z) compared to the skill policy πθz(gt|st, z), πθs is less expressive than πθz, which could end up with a suboptimal convergence. To solve this issue, we introduce a new context parameter u ∈ U with its prior p(u) to the sampling policy, redefining it as πθs(gt|st, u) : S × U → P(G). The new parameter u for πθs plays a similar role to the skill latent z for πθz. We also fix p(u) = N(0, I) as in r(z). To obtain roll-outs from the sampling policy, we first sample u from its prior, and then keep sampling goals with the fixed u"); determine a value of an objective function for each goal state trajectory (Section 3.3 and Equation (5)…computing an objective function value for each trajectory, namely the final objective JP (prediction) and JC (compression) are computed for each sampled trajectory, "Algorithm 2…Compute objective from Equation (5)"); and update a parameter of the goal determining model based on the determined objective function (Section 3.3…updating goal model parameters, "we optimize those two terms in Equation (4) with respect to their parameters, θz and Ф"; Algorithm 2…the skill policy πθz (goal determining model) parameters θz are updated based on the objective function, "Jointly update πθs, pФ, πθz with gradients").
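For clarity of the record, the cited sampling scheme (a context u drawn once from N(0, I) and held fixed while goals are repeatedly sampled) may be illustrated by the following non-limiting sketch; the linear sampling policy, the toy environment step, and the placeholder objective are the examiner's own stand-ins and do not reproduce Kim Equation (5):

import numpy as np

rng = np.random.default_rng(0)
state_dim, ctx_dim, horizon = 4, 2, 20

def sampling_policy(state, u, W):
    """Hypothetical stand-in for the cited sampling policy pi(g_t | s_t, u):
    a linear map of [state, u] squashed to the goal space [-1, 1]^dim(S)."""
    return np.tanh(W @ np.concatenate([state, u]))

def environment_step(state, goal):
    """Hypothetical linearized environment: the state drifts toward the goal
    with some noise (a stand-in for the linearizer plus simulator)."""
    return state + 0.1 * goal + 0.01 * rng.standard_normal(state_dim)

W = 0.1 * rng.standard_normal((state_dim, state_dim + ctx_dim))

# Roll out one trajectory: u is sampled once from N(0, I) and kept fixed while
# goals are repeatedly sampled with that same u, per the cited Section 3.3.
u = rng.standard_normal(ctx_dim)
state, trajectory = np.zeros(state_dim), []
for _ in range(horizon):
    goal = sampling_policy(state, u, W)
    state = environment_step(state, goal)
    trajectory.append(state)

# A per-trajectory objective would then be computed (Kim Equation (5) combines a
# prediction term and a compression term); the scalar below is only a placeholder,
# after which the policy parameters W would be updated by gradient steps.
objective_value = float(np.linalg.norm(trajectory[-1] - trajectory[0]))
print(round(objective_value, 3))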
Per claim 7, Kim discloses claim 1, further disclosing for the determining of the action based on the state and the determined goal, the one or more processors are configured to determine an action vector representing the action, based on a state vector representing the observed state and a goal state vector representing the determined goal, using an action determining model based on machine learning (Section 3.1…the linearizer πlin(at|st, gt) determines an action vector at ∈ A based on state vector st and goal vector gt, using a neural network trained with SAC, “We define the linearizer πlin(at|st, gt) as a goal-conditioned policy... which takes both a state st ∈ S and a goal gt ∈ G as input and outputs a probability distribution over actions at ∈ A”; Section I.3…action determining model is MLP with two hidden layers of 1024 dimensions).
Per claim 8, Kim discloses claim 1, further disclosing determine an action trajectory by determining an action using the action determining model for each sampled goal (Section 3.1 and Algorithm 1…determining action trajectories by running the linearizer (action determining model) for sampled goals, "Sample goals g0(i), gℓ(i), g2ℓ(i),…). Sample trajectory using πlin and goals"; Section 2…the linearizer determines actions for each goal in the trajectory, producing an action trajectory τ = (s0, a0, ..., sT)); determine an objective function value for each determined action trajectory (Algorithm 1…computing objective values for action trajectories, "Compute linearizer reward Rlin using Equation (1)", which provides the reward signal that drives the objective); store the action trajectory and the objective function value in a replay buffer (Algorithm 1…storing the action trajectory and objective function value in a replay buffer, "Add trajectory to replay buffer…Update πlin using collected samples from replay buffer with SAC"; Section I.4…trajectories and their associated rewards/values are stored in the replay buffer of size 1e6, "We apply 4 gradient steps and consider training with and without a replay buffer, where rewards are normalized with their exponential moving average without a buffer and 2048-sized mini-batches are used with a buffer of 1e6"); and update a parameter of the action determining model based on the stored action trajectory and the objective function value (Algorithm 1…SAC updates the linearizer parameters from replay buffer data using stored trajectories and their associated objective values stored in the buffer, "Update πlin using collected samples from replay buffer with SAC"; Section I.4, "2048-sized mini-batches are used with a buffer of 1e6").
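By way of non-limiting illustration, the cited storage of trajectories and their associated values in a replay buffer for off-policy (SAC) updates may be sketched as follows; the class is the examiner's own, with the capacity and batch size merely echoing the cited Section I.4 values:

import random
from collections import deque

class ReplayBuffer:
    """Minimal illustrative replay buffer (the examiner's own sketch, not Kim's
    code): transitions and their rewards, i.e., the values that drive the
    training objective, are stored and later sampled in mini-batches for
    off-policy updates such as SAC."""
    def __init__(self, capacity=int(1e6)):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size=2048):
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))

buffer = ReplayBuffer()
for t in range(10):                 # toy transitions standing in for collected rollouts
    buffer.add(t, 0.0, 0.1 * t, t + 1)
batch = buffer.sample(4)            # such a mini-batch would feed the SAC update
print(len(batch))                   # 4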
Per claim 9, Kim discloses claim 1, further disclosing for the determining of the goal, the one or more processors are configured to determine the goal based on the determined skill and the observed state while maintaining the determined skill for a predetermined number of times using a skill determining model (Section 1…maintaining the skill latent z for a predetermined number of time steps, “a skill policy with a form of π(a|s, z) keeping the same z for multiple steps or the full episode horizon”; Section 3.3…in IBOL, the sampling policy samples u from the prior and keeps u fixed throughout the episode, “we first sample u from its prior, and then keep sampling goals with the fixed u”; Section 4.3…the skill policy uses the same z for the entire episode or a fixed number of steps, “we employ a meta-controller on top of each skill policy learned by skill discovery methods. The meta-controller iterates observing a state from the environment and picking a skill with its own meta-policy, which invokes the pre-trained skill policy with the same skill latent value z for ℓm time steps”, e.g., for ℓm = 5 or 20 (predetermined number)).
Per claim 10, Kim discloses claim 1, further disclosing determine the action based on the determined goal and the observed state while maintaining the determined goal for a predetermined number of times using a goal determining model (Section 3.1…maintaining the goal for a predetermined number of time steps (ℓ = macro step size), "We assume that a goal gt ∈ G is given at every ℓ-th time step such that t ≡ 0 (mod ℓ) (called a macro time step), and otherwise kept fixed, i.e. gt = gt−1 for t ≢ 0 (mod ℓ)").
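For clarity of the record, the cited macro-step scheme (a goal refreshed only every ℓ-th time step and otherwise held fixed) may be illustrated by the following non-limiting sketch; all functions and values are the examiner's own hypothetical stand-ins:

import numpy as np

def rollout_with_macro_goals(horizon, ell, new_goal, act, step):
    """Examiner's sketch of the cited macro-step scheme: a goal is drawn only
    at time steps t with t % ell == 0 (macro time steps) and is otherwise kept
    fixed (g_t = g_{t-1}), while an action is determined at every step from
    the current state and the currently held goal."""
    state, goal, states = np.zeros(2), None, []
    for t in range(horizon):
        if t % ell == 0:                # macro time step: refresh the goal
            goal = new_goal(state)
        state = step(state, act(state, goal))
        states.append(state)
    return states

# Toy usage with hypothetical stand-ins for the goal and action models.
rng = np.random.default_rng(0)
traj = rollout_with_macro_goals(
    horizon=30, ell=10,
    new_goal=lambda s: np.clip(rng.standard_normal(2), -1.0, 1.0),
    act=lambda s, g: 0.1 * g,           # the action nudges the state toward the goal
    step=lambda s, a: s + a,
)
print(len(traj))                        # 30 states; the goal was refreshed only 3 times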
Claims 11-19 are substantially similar in scope and spirit to claims 1-9. Therefore, the rejections of claims 1-9 are applied accordingly.
Claim 20 is substantially similar in scope and spirit to claim 1. Therefore, the rejection of claim 1 is applied accordingly.
Claims 21-23 are substantially similar in scope and spirit to claims 1, 4, 6 and 8. Therefore, the rejections of claims 1, 4, 6 and 8 are applied accordingly.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Patents and/or related publications are cited in the Notice of References Cited (Form PTO-892) attached to this action to further show the state of the art with respect to skill-based and goal-conditioned reinforcement learning.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALAN CHEN whose telephone number is (571)272-4143. The examiner can normally be reached M-F 10-7.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ALAN CHEN/Primary Examiner, Art Unit 2125