Last updated: May 29, 2026
Application No. 17/970,886
METHOD AND SYSTEM FOR ATTENTIVE ONE SHOT META IMITATION LEARNING FROM VISUAL DEMONSTRATION

Non-Final OA §103
Filed: Oct 21, 2022
Priority: Nov 17, 2021 (IN 202121052813)
Examiner: HAN, KYU HYUNG
Art Unit: 2123
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Tata Consultancy Services Limited
OA Round: 1 (Non-Final)
Interview Optional: the interview lift (+7.1%) is below the 15.0% threshold, so a written response is recommended.
Based on 11 resolved cases, 2023–2026
Examiner Intelligence

Examiner: HAN, KYU HYUNG
Career Allowance Rate: 46% of resolved cases granted (5 granted / 11 resolved), -9.5% vs TC avg
Interview Lift: +7.1% (moderate), comparing resolved cases without and with an interview
Avg Prosecution (typical timeline): 4y 1m
Currently pending: 15
Statute-Specific Performance

§101: 4.6% (-35.4% vs TC avg)
§103: 95.4% (+55.4% vs TC avg)
Based on career data from 11 resolved cases
Office Action

§103
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections – 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-18 are rejected under 35 U.S.C. 103 as being unpatentable over Kalakrishnan et al. (US 12083678 B2), hereinafter known as Kalakrishnan, in view of Zhou et al. (US 20230330846 A1), hereinafter known as Zhou, and further in view of James et al. (US 20210205988 A1).

Regarding independent claim 1, Kalakrishnan teaches:
A processor implemented method, the method comprising: receiving, by one or more hardware processors, a plurality of images pertaining to a visual demonstration for a robot, wherein the plurality of images are sequential; (Kalakrishnan [Col. 13, Line 39-42]: “environment state 202 can include one or more images of the environment captured using a RGB camera and/or other vision data captured using vision component(s)” Kalakrishnan [Col. 14, Line 23-24]: “The video can be sampled (e.g., a sequence of still images can be captured from the demonstration video)” Kalakrishnan teaches that sequential images of the environment are received.)
computing, by the one or more hardware processors, a plurality of vector embeddings based on the plurality of images using a pretrained attentive embedding network, wherein the pretrained attentive embedding network comprises a first Convolutional Neural Network (CNN), … and wherein computing the plurality of vector embeddings comprises: (Kalakrishnan [Col. 17, Line 37-40]: “the system receives a trained meta-learning model trained using a plurality of training tasks” Kalakrishnan teaches that the model is pretrained, as it receives an already trained model. Kalakrishnan [Col. 10, Line 57-60]: “In some implementations, the embedding network can consist of a convolutional neural network followed by 1-D temporal convolutions” Kalakrishnan teaches that the trained model may consist of a convolutional neural network with 1D convolutions.)
…
computing a plurality of attention vectors based on the plurality of local contextual feature vectors and a current global feature vector using a corresponding spatial attention module among the plurality of spatial attention modules; (Kalakrishnan [Col. 3, Lines 20-23]: “the control network can be used to autonomously generate, based on a task embedding of a human guided demonstration, a trial of the robot performing the task (also referred to here in a rollout)” Kalakrishnan teaches that the intermediate vector embedding is used to generate a task vector, also known as a rollout, which maps to the attention vectors that are generated.)
computing a plurality of elementwise dot products comprising an elementwise dot product between each of the plurality of attention vectors and each a corresponding plurality of local contextual feature vector; (Kalakrishnan [Col. 11, Lines 37-39]: “a dot product (in the embedding space) can be calculated for all possible pairs of rollouts in R, and pairs with a high dot product (measured by some predefined threshold α) can be labeled as belonging to the same task” Kalakrishnan teaches that the rollouts are multiplied together using a dot product operator, which operates element by element. Since each rollout is a task and is an embedding, the rollouts are vectors that contain attention and contextual data.)
generating a new global vector by concatenating the plurality of elementwise dot products; (Kalakrishnan [Col. 10, Lines 22-26]: “If a pair of rollouts is determined to belong to the same task, this pair can be added to the original dataset of optimal demonstrations. In some implementations, a new meta-policy can be learned from this expanded dataset.” Kalakrishnan teaches that the concatenation of the dot products, or the pairs of the rollouts, is part of a new group of vectors known as the expanded dataset. This is essentially a new global vector, as it is the collection of pairwise rollouts.)
…
…
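For readers following the claim 1 mapping, the two limitations above can be sketched in a few lines of Python. This is an illustrative reading only, not the applicant's or any reference's implementation: it assumes that "elementwise dot product" means an elementwise (Hadamard) product of two equal-length vectors, and the names `elementwise_product`, `new_global_vector`, and `dot` are hypothetical. A conventional dot product is included for comparison, since it returns a single scalar rather than a vector.

```python
from typing import List

Vector = List[float]


def elementwise_product(a: Vector, b: Vector) -> Vector:
    """Multiply two equal-length vectors entry by entry (result is a vector)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def dot(a: Vector, b: Vector) -> float:
    """Conventional dot product (result is a scalar), shown for comparison."""
    return sum(elementwise_product(a, b))


def new_global_vector(attention: List[Vector], local_features: List[Vector]) -> Vector:
    """Concatenate the elementwise products of paired attention/feature vectors."""
    if len(attention) != len(local_features):
        raise ValueError("need one attention vector per local feature vector")
    out: Vector = []
    for att, feat in zip(attention, local_features):
        out.extend(elementwise_product(att, feat))
    return out


# Toy values, chosen only to make the arithmetic easy to follow.
attention = [[0.5, 1.0], [0.0, 0.25]]
local_features = [[2.0, 4.0], [8.0, 16.0]]
g_new = new_global_vector(attention, local_features)
print(g_new)  # [1.0, 4.0, 0.0, 4.0]
```

Under this reading the new global vector has one entry per feature entry, whereas a pairwise dot product collapses each pair to one number.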

Kalakrishnan does not explicitly teach:
…

…

computing a plurality of local contextual feature vectors based on the plurality of images by the first CNN;
…
…
…
and computing the plurality of vector embeddings based on the new global vector using the fully connected layer of the attentive embedding network;
and computing, by the one or more hardware processors, a control action based on the plurality of vector embeddings, an image from the plurality of images, a robot joint state vector and a robot joint velocity vector using a control network, wherein the control network comprises a second CNN and a plurality of fully connected layers, and wherein the control network is connected to the attentive embedding using multiplicative spatial skip connections.

However, Zhou teaches:
…

…

computing a plurality of local contextual feature vectors based on the plurality of images by the first CNN; (Zhou [0024]: “For example, the current observation 116 can be one or more first-person, ego-centric images of the environment, that is images captured by one or more cameras (or other image generation unit(s)) of the robot.” Zhou teaches that the images of the robots are captured by the cameras. Zhou [0025]: “generates an embedding 114 of the current observation 116 and an embedding 118 of the goal observation” Zhou teaches that embedding vectors are generated based on the input images.)
…
…
…

Kalakrishnan and Zhou are in the same field of endeavor as the present invention, as the references are directed to training robots to perform actions using imitation learning. It would have been obvious, before the effective filing date of the claimed invention, to a person of ordinary skill in the art, to combine using embeddings of images for training a CNN as taught in Kalakrishnan with calculating control actions of the robot using the layers of the neural network as taught in Zhou. Zhou provides this additional functionality. As such, it would have been obvious to one of ordinary skill in the art to modify the teachings of Kalakrishnan to include teachings of Zhou because the combination would allow for the training of the robot using images, using the layers of a CNN, to compute control actions of the robot. This has the potential benefit of enabling the robot to move more efficiently in problem solving tasks.

Kalakrishnan and Zhou do not explicitly teach:
… a fully connected layer and a plurality of spatial attention modules, …
and computing the plurality of vector embeddings based on the new global vector using the fully connected layer of the attentive embedding network;
and computing, by the one or more hardware processors, a control action based on the plurality of vector embeddings, an image from the plurality of images, a robot joint state vector and a robot joint velocity vector using a control network, wherein the control network comprises a second CNN and a plurality of fully connected layers, and wherein the control network is connected to the attentive embedding using multiplicative spatial skip connections.

However, James teaches:
… a fully connected layer and a plurality of spatial attention modules, … (James [Paragraph 0081]: “The concatenated vector may then be processed by a remainder of the CNN, such as at least one fully connected layer of the CNN, to generate the action a to be performed by the robotic device” James teaches that the model that generates the action to be performed, which is necessarily one that takes into account spatial attention, is one that has a fully connected layer. Furthermore, the CNN that comprises this has layers that are modules that take into account spatial attention.)
and computing the plurality of vector embeddings based on the new global vector using the fully connected layer of the attentive embedding network; (James [Paragraph 0081]: “The concatenated vector may then be processed by a remainder of the CNN, such as at least one fully connected layer of the CNN, to generate the action a to be performed by the robotic device” James teaches that the model that generates the action to be performed, which is necessarily one that takes into account spatial attention, is one that has a fully connected layer. The action to be performed is one that is an embedding as it is a task embedding.)
and computing, by the one or more hardware processors, a control action based on the plurality of vector embeddings, an image from the plurality of images, a robot joint state vector and a robot joint velocity vector using a control network, wherein the control network comprises a second CNN and a plurality of fully connected layers, and wherein the control network is connected to the attentive embedding using multiplicative spatial skip connections. (James [Figure 8]: James teaches that the control action is based on the vector embeddings from the input data as well as the image sample training data. As mentioned before, this CNN is based on, amongst other things, fully connected layers.)


James is in the same field as the present invention, since it is directed to task embedding for robotic device control. It would have been obvious, before the effective filing date of the claimed invention, to a person of ordinary skill in the art, to combine using a neural network to generate a task for the robot as taught in Kalakrishnan as modified by Zhou with using a fully connected layer in the model as taught in James. James provides this additional functionality. As such, it would have been obvious to one of ordinary skill in the art to modify the teachings of Kalakrishnan as modified by Zhou to include teachings of James because the combination would allow for the robot to learn complex, non-linear mappings. This has the potential benefit of being better able to find the most effective task/action to take.


Regarding dependent claim 2, Kalakrishnan and Zhou teach:
The processor implemented method of claim 1, wherein each of the plurality of spatial attention modules comprises an addition unit, a convolution unit and a sigmoid activation function, wherein the new global vector is obtained from the final layer of the first CNN. (Kalakrishnan [Col. 7, Lines 13-17]: “the normalized advantage functions (NAF) technique can used, as it provides a simple way to represent and optimize both the actor and the critic with a single computation graph and objective” Kalakrishnan teaches normalized advantage functions, which are used to train the robot.)

The reasons to combine are substantially similar to those of claim 1. 
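The three components recited in claim 2 (addition unit, convolution unit, sigmoid activation) can be illustrated with a minimal sketch. The wiring is an assumption, since the office action does not quote how the units are connected: here the local feature vector and the global vector are added, a 1x1 convolution at a single spatial position is modeled as a matrix-vector product, and a sigmoid maps each score into (0, 1). The function `spatial_attention` and the weight values are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(local: Vector, global_vec: Vector, weights: Matrix) -> Vector:
    """Assumed wiring: add -> 1x1 convolution -> sigmoid, for one spatial position."""
    if len(local) != len(global_vec):
        raise ValueError("local and global vectors must have the same length")
    # Addition unit: combine local context with the current global feature vector.
    summed = [l + g for l, g in zip(local, global_vec)]
    # Convolution unit: a 1x1 convolution at one position is a matrix-vector product.
    scores = [sum(w * s for w, s in zip(row, summed)) for row in weights]
    # Sigmoid activation: squash each score into an attention weight.
    return [sigmoid(s) for s in scores]


# Toy inputs; the identity weights are placeholders, not learned values.
local = [1.0, -1.0]
global_vec = [-1.0, 1.0]
weights = [[1.0, 0.0], [0.0, 1.0]]
attention = spatial_attention(local, global_vec, weights)
print(attention)  # [0.5, 0.5]
```

The output has the same length as the local feature vector, so it can be multiplied elementwise with that vector as recited in claim 1.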

Regarding dependent claim 3, Kalakrishnan and Zhou teach:
The processor implemented method of claim 1, wherein the pretrained attentive embedding network and the control network are tested with a custom dataset further comprising complex background. (Kalakrishnan [Col. 10, Lines 22-26]: “If a pair of rollouts is determined to belong to the same task, this pair can be added to the original dataset of optimal demonstrations. In some implementations, a new meta-policy can be learned from this expanded dataset.” Kalakrishnan teaches that the concatenation of the dot product, or the pairs of the rollouts, are part of a new group of vectors known as the expanded dataset. This is a custom dataset.)

The reasons to combine are substantially similar to those of claim 1. 

Regarding dependent claim 4, Kalakrishnan and Zhou teach:
The processor implemented method of claim 1, wherein the method of training the attentive embedding network comprises: receiving a support dataset and a query dataset, wherein the support dataset comprises the plurality of sequential images corresponding a plurality of tasks, and wherein the query dataset comprises the plurality of sequential images pertaining to a specific task; (Zhou [0024]: “For example, the current observation 116 can be one or more first-person, ego-centric images of the environment, that is images captured by one or more cameras (or other image generation unit(s)) of the robot.” Zhou teaches that the images of the robots are captured by the cameras. These images are ego-centric, showing that they capture the robot itself doing tasks at a time.)
computing a first plurality of vector embeddings corresponding to each of the plurality of tasks associated with the support dataset using the attentive embedding network; (Zhou [0025]: “generates an embedding 114 of the current observation 116 and an embedding 118 of the goal observation” Zhou teaches that embedding vectors are generated based on the input images.)
normalizing the first plurality of vector embeddings associated with the support dataset using a normalization technique; (Kalakrishnan [Col. 7, Lines 13-17]: “the normalized advantage functions (NAF) technique can used, as it provides a simple way to represent and optimize both the actor and the critic with a single computation graph and objective” Kalakrishnan teaches normalized advantage functions, which is used to train the robot.)
computing a second plurality of vector embeddings corresponding to each of the plurality of tasks associated with the query dataset using the attentive embedding network; (Kalakrishnan [Col. 3, Lines 20-23]: “the control network can be used to autonomously generate, based on a task embedding of a human guided demonstration, a trial of the robot performing the task (also referred to here in a rollout)” Kalakrishnan teaches that the intermediate vector embedding is used to generate a task vector, also known as a rollout.)
computing the dot product similarity between the normalized first plurality of vector embeddings and the second plurality of vector embeddings; (Kalakrishnan [Col. 11, Lines 37-39]: “a dot product (in the embedding space) can be calculated for all possible pairs of rollouts in R, and pairs with a high dot product (measured by some predefined threshold α) can be labeled as belonging to the same task” Kalakrishnan teaches that the rollouts are multiplied together using a dot product operator.)
and training the attentive embedding network based on the computed dot product similarity, wherein the training is continued until the dot product similarity between the first plurality of normalized vector embeddings and the second plurality of vector embeddings are greater than a predefined similarity threshold. (Kalakrishnan [Col. 11, Lines 37-39]: “a dot product (in the embedding space) can be calculated for all possible pairs of rollouts in R, and pairs with a high dot product (measured by some predefined threshold α) can be labeled as belonging to the same task” Kalakrishnan teaches that the rollouts are multiplied together using a dot product operator. Kalakrishnan also teaches a threshold that needs to be met for the pairs to make the dataset that will eventually be used for training.)

The reasons to combine are substantially similar to those of claim 1. 
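The training loop criterion recited in claim 4 can be sketched as follows. The claim names only "a normalization technique" and "a predefined similarity threshold", so the L2 normalization, the one-to-one pairing of support and query embeddings, and the threshold value below are all assumptions made for illustration; the helper names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit Euclidean length (assumed normalization technique)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    """Dot product similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def similarities(support: List[Vector], query: List[Vector]) -> List[float]:
    """Similarity of each normalized support embedding with its paired query embedding."""
    return [dot(l2_normalize(s), q) for s, q in zip(support, query)]


def training_done(support: List[Vector], query: List[Vector], threshold: float) -> bool:
    """Stop criterion: every similarity exceeds the predefined threshold."""
    return all(sim > threshold for sim in similarities(support, query))


# Toy embeddings: one support/query pair per task.
support = [[3.0, 4.0], [0.0, 2.0]]
query = [[0.6, 0.8], [0.0, 1.0]]
sims = similarities(support, query)
print(training_done(support, query, threshold=0.9))  # True
```

Following the claim wording, only the support embeddings are normalized here; the query embeddings are used as given.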

Regarding dependent claim 5, Kalakrishnan and Zhou teach:
The processor implemented method of claim 1, wherein the method of computing the plurality of control actions based on the plurality of vector embeddings and the plurality of images using the pretrained control network comprises: receiving the plurality of vector embeddings, the image from the plurality of images, the robot joint state vector and the robot joint velocity vector; (Zhou [0024]: “For example, the current observation 116 can be one or more first-person, ego-centric images of the environment, that is images captured by one or more cameras (or other image generation unit(s)) of the robot.” Zhou teaches that the images of the robots are captured by the cameras. Zhou [0025]: “generates an embedding 114 of the current observation 116 and an embedding 118 of the goal observation” Zhou teaches that embedding vectors are generated based on the input images. These include the robot joint state vector and the velocity vector, as they are derived from the images.)
tiling the plurality of vector embeddings corresponding to a size associated with the image, wherein the image is randomly selected from the plurality of images; (Zhou [0025]: “generates an embedding 114 of the current observation 116 and an embedding 118 of the goal observation” Zhou teaches that embedding vectors are generated based on the input images. The vectors are proportional, or tiled, based on the image that it was based on.)
obtaining a concatenated data by concatenating the plurality of tiled vector embeddings and the image; (Kalakrishnan [Col. 11, Lines 37-39]: “a dot product (in the embedding space) can be calculated for all possible pairs of rollouts in R, and pairs with a high dot product (measured by some predefined threshold α) can be labeled as belonging to the same task” Kalakrishnan teaches that the rollouts are multiplied together using a dot product operator. Kalakrishnan teaches a concatenation of the pairs of rollouts.)
computing a plurality of element-wise feature maps based on the concatenated data using the second CNN; (Kalakrishnan [Col. 13, Lines 50-53]: “environment state embedding 206 represents one or more visual features of the environment of the robot.” Kalakrishnan teaches that the computed embedding represents features, thus being a feature map.)
computing a plurality of fusion feature maps by multiplying each of the plurality of element-wise feature maps with a corresponding feature maps of the attentive embedding network; (Kalakrishnan [Col. 11, Lines 37-39]: “a dot product (in the embedding space) can be calculated for all possible pairs of rollouts in R, and pairs with a high dot product (measured by some predefined threshold α) can be labeled as belonging to the same task” Kalakrishnan teaches that the rollouts are multiplied together using a dot product operator. Kalakrishnan also teaches a threshold that needs to be met for the pairs to make the dataset that will eventually be used for training.)
computing a flattened feature map based on the plurality of fusion feature maps using the second CNN; (Kalakrishnan [Col. 13, Lines 50-53]: “environment state embedding 206 represents one or more visual features of the environment of the robot.” Kalakrishnan teaches that the computed embedding represents features, thus being a feature map. This feature map may be flattened, or reduced in dimension as it is an embedding.)
generating a composite feature vector by concatenating the flattened feature map, the plurality of robot joint states and the plurality of robot joint velocities; (Kalakrishnan [Col. 13, Lines 50-53]: “environment state embedding 206 represents one or more visual features of the environment of the robot.” Kalakrishnan teaches that the computed embedding represents features, thus being a feature map. This may include the specifications of the robot, such as the state and velocities.)
and computing the control action based on the composite feature vector by using the plurality of fully connected layers of the control network. (Zhou [0094]: “For each subsequent goal demonstration observation in the goal demonstration sequence, the system starts generating the trajectory from the last state of the environment at the completion of the trajectory for the preceding goal demonstration observation in the goal demonstration sequence” Zhou teaches that the model outputs a trajectory, which is a collection of control actions that the robot is to perform in the future.)

The reasons to combine are substantially similar to those of claim 1. 
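The data flow recited in claim 5 (tile, concatenate, fuse multiplicatively, flatten, append joint data) can be traced with plain nested lists. This is a shape-level sketch under several assumptions: the image is stored as height x width x channels, the second CNN is omitted so the concatenated data stands in for its feature maps, and the feature map attributed to the attentive embedding network is a made-up placeholder. All function names are hypothetical.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as [row][column][channel]


def tile(embedding: Vector, height: int, width: int) -> FeatureMap:
    """Repeat the embedding at every pixel so it matches the image size."""
    return [[list(embedding) for _ in range(width)] for _ in range(height)]


def concat_channels(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Concatenate two maps of equal height and width along the channel axis."""
    return [[pa + pb for pa, pb in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fuse(control_map: FeatureMap, embed_map: FeatureMap) -> FeatureMap:
    """Multiplicative skip connection: elementwise product of same-shaped maps."""
    return [
        [[x * y for x, y in zip(pc, pe)] for pc, pe in zip(row_c, row_e)]
        for row_c, row_e in zip(control_map, embed_map)
    ]


def flatten(fmap: FeatureMap) -> Vector:
    """Flatten a feature map into a single vector."""
    return [x for row in fmap for pixel in row for x in pixel]


# Toy 1 x 2 image with one channel, and a one-dimensional embedding.
image: FeatureMap = [[[1.0], [2.0]]]
embedding: Vector = [10.0]
concatenated = concat_channels(image, tile(embedding, height=1, width=2))

# Placeholder for the attentive embedding network's feature map (assumed values).
embed_map: FeatureMap = [[[1.0, 0.5], [0.0, 1.0]]]
flat = flatten(fuse(concatenated, embed_map))

# Composite feature vector: flattened map + joint states + joint velocities.
joint_state, joint_velocity = [0.1, 0.2], [0.0, 0.3]
composite = flat + joint_state + joint_velocity
print(composite)  # [1.0, 5.0, 0.0, 10.0, 0.1, 0.2, 0.0, 0.3]
```

In the claimed method the composite vector would then pass through the fully connected layers of the control network to produce the control action; that step is not modeled here.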

Claim 6 is substantially similar to claim 5, but has the following additional elements:
Regarding dependent claim 6, Kalakrishnan and Zhou teach:
computing a second control action based on the composite feature vector by using Fully connected layers; (Zhou [0094]: “For each subsequent goal demonstration observation in the goal demonstration sequence, the system starts generating the trajectory from the last state of the environment at the completion of the trajectory for the preceding goal demonstration observation in the goal demonstration sequence” Zhou teaches that the model outputs a trajectory, which is a collection of control actions that the robot is to perform in the future.)
computing a total behavioral cloning loss function based on the first control action and the second control action; (Kalakrishnan [Col. 11, Lines 13-15]: “In order to learn to predict actions from this information, the maximum likelihood behavior cloning loss Lb can be used.” Kalakrishnan teaches computing the behavior cloning loss, used to train the neural network.)
and training the control network based on the total behavioral cloning loss function. (Kalakrishnan [Col. 11, Lines 13-15]: “In order to learn to predict actions from this information, the maximum likelihood behavior cloning loss Lb can be used.” Kalakrishnan teaches computing the behavior cloning loss, used to train the neural network.)

The reasons to combine are substantially similar to those of claim 1. 
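Claim 6 recites a "total behavioral cloning loss" based on a first and a second control action, but the office action does not quote a formula. The sketch below assumes one plausible form: the sum of two mean squared errors, each comparing a predicted action with the same demonstrated (expert) action. Both the loss form and the names `mse` and `total_bc_loss` are assumptions for illustration.

```python
from typing import List

Vector = List[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between a predicted and a demonstrated action."""
    if len(pred) != len(target) or not pred:
        raise ValueError("actions must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_bc_loss(first: Vector, second: Vector, expert: Vector) -> float:
    """Assumed total loss: sum of the per-action behavioral cloning losses."""
    return mse(first, expert) + mse(second, expert)


# Toy actions with two degrees of freedom.
expert = [1.0, 0.0]
first_action = [1.0, 2.0]   # squared errors 0 and 4, mean 2.0
second_action = [0.0, 0.0]  # squared errors 1 and 0, mean 0.5
loss = total_bc_loss(first_action, second_action, expert)
print(loss)  # 2.5
```

The control network would be trained by minimizing this quantity; other loss forms, such as the maximum likelihood loss quoted from Kalakrishnan, would slot into the same structure.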

Claim 7 is substantially similar to claim 1, but has the following additional elements:
Regarding independent claim 7, Kalakrishnan and Zhou teach:
A system comprising: at least one memory storing programmed instructions; (Zhou [0115]: “The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them” Zhou teaches a memory capable of storing instructions.)
one or more Input /Output (I/O) interfaces; (Zhou [0120]: “The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output” Zhou teaches an input/output interface.)
and one or more hardware processors operatively coupled to the at least one memory, wherein the one or more hardware processors are configured by the programmed instructions to: (Zhou [0116]: “including by way of example a programmable processor, a computer, or multiple processors or computers.” Zhou teaches a processor.)

The reasons to combine are substantially similar to those of claim 1. 


Claims 8-12 are rejected on the same grounds under 35 U.S.C. 103 as claims 2-6 as they are substantially similar, respectively. Mutatis mutandis.

Claim 13 is substantially similar to claim 1, but has the following additional elements:
Regarding independent claim 13, Kalakrishnan and Zhou teach:
One or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause: (Zhou [0116]: “including by way of example a programmable processor, a computer, or multiple processors or computers.” Zhou teaches a processor.)

The reasons to combine are substantially similar to those of claim 1. 

Claims 14-18 are rejected on the same grounds under 35 U.S.C. 103 as claims 2-6 as they are substantially similar, respectively. Mutatis mutandis.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KYU HYUNG HAN, whose telephone number is (703) 756-5529. The examiner can normally be reached M-F 9-5.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov, can be reached at (571) 270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 

/Kyu Hyung Han/
Examiner
Art Unit 2123

/ALEXEY SHMATOV/
Supervisory Patent Examiner, Art Unit 2123
Prosecution Timeline

Oct 21, 2022: Application Filed
May 08, 2026: Non-Final Rejection mailed, §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/332,295 (Patent 12585928): HARDWARE ARCHITECTURE FOR INTRODUCING ACTIVATION SPARSITY IN NEURAL NETWORK. Granted Mar 24, 2026; 4y 10m to grant.
17/317,300 (Patent 12387101): SYSTEMS AND METHODS FOR PRUNING BINARY NEURAL NETWORKS GUIDED BY WEIGHT FLIPPING FREQUENCY. Granted Aug 12, 2025; 4y 3m to grant.
Based on 2 most recent grants.
Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 46%
With Interview (+7.1%): 53%
Median Time to Grant: 4y 1m (~6m remaining)
PTA Risk: Low
Based on 11 resolved cases by this examiner. Grant probability derived from career allowance rate.