Prosecution Insights
Last updated: April 19, 2026
Application No. 18/539,590

OBJECT-CENTRIC VIDEO REPRESENTATION FOR ACTION PREDICTION

Status: Non-Final Office Action (§103)
Filed: Dec 14, 2023
Examiner: NASHER, AHMED ABDULLALIM-M
Art Unit: 2675
Tech Center: 2600 — Communications
Assignee: Brown University
OA Round: 1 (Non-Final)

Grant Probability: 81% (Favorable)
Expected OA Rounds: 1-2
Median Time to Grant: 2y 9m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 81% — above average (80 granted / 99 resolved; +18.8% vs TC avg)
Interview Lift: +34.4% — strong (grant rate among resolved cases with vs. without interview)
Typical Timeline: 2y 9m average prosecution; 17 applications currently pending
Career History: 116 total applications across all art units

Statute-Specific Performance

§101: 9.0% (-31.0% vs TC avg)
§103: 63.1% (+23.1% vs TC avg)
§102: 14.5% (-25.5% vs TC avg)
§112: 10.7% (-29.3% vs TC avg)

Black line = Tech Center average estimate • Based on career data from 99 resolved cases
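The four statute-specific figures are each reported against a Tech Center average. As a quick consistency check, the baseline each delta implies can be recovered by subtraction. This is a sketch with hypothetical variable names; only the figures copied from the table above are given, and the subtraction is an inference, not something the dashboard states it computes:

```python
# Hypothetical sketch: each statute-specific rate above is reported with a
# delta "vs TC avg". Subtracting the delta from the rate recovers the
# implied Tech Center baseline behind each comparison.
rates_and_deltas = {
    "§101": (9.0, -31.0),
    "§103": (63.1, +23.1),
    "§102": (14.5, -25.5),
    "§112": (10.7, -29.3),
}
implied_tc_avg = {
    statute: round(rate - delta, 1)
    for statute, (rate, delta) in rates_and_deltas.items()
}
# All four statutes resolve to the same 40.0% baseline, suggesting a single
# flat Tech Center estimate rather than per-statute averages.
```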

Office Action (§103)
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on 01/31/2024 and 02/01/2024 are being considered by the examiner.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-6, 13-15, and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Girdhar et al., "Anticipative Video Transformer," and further in view of Kilickaya (US 20220414371 A1).

Regarding claims 1, 16, and 20, Girdhar discloses: extract a first sequence of video segments from video content associated with a domain (abstract: "We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features"; page 1, col 2: "For instance, the presence of a plate of food with a fork may be sufficient to indicate the action of eating, whereas anticipating that same action would require recognizing and reasoning over the sequence of actions that precede it, such as chopping, cooking, serving, etc."); detect a set of objects in the first sequence of video segments (page 1, col 2, quoted above); generate a set of embeddings based on the extracted first sequence of video segments and the detected set of objects (page 3, col 2: "While we do not need to classify each frame individually, we still prepend a learnable [class] token embedding to the patch features, whose output will be used as a frame-level embedding input to the head."); apply a predictive transformer encoder (PTE) model on the generated set of embeddings (abstract, quoted above; a video transformer encoder that predicts future actions); predict, based on the application of the PTE model, a set of object-action pairs associated with a second sequence of video segments of the video content (abstract: "Compared to existing temporal aggregation strategies, AVT has the advantage of both maintaining the sequential progression of observed actions while still capturing long-range dependencies-both critical for the anticipation task."), wherein each object-action pair of the predicted set of object-action pairs includes an action that is to be executed using an object of the detected set of objects included in a video segment of the second sequence of video segments (figure 1: "Anticipating future actions using AVT involves encoding video frames with a spatial-attention backbone, followed by a temporal-attention head that attends only to frames before the current one to predict future actions. In this example, it spontaneously learns to attend to hands and objects without being supervised to do so. Moreover, it attends to frames most relevant to predict the next action. For example, to predict 'wash tomato' it attends equally to all previous frames as they determine if any more tomatoes need to be washed, whereas for 'turn-off tap' it focuses most on the current frame for cues whether the person might be done."), and the second sequence of video segments succeeds the first sequence of video segments in a playback timeline of the video content (abstract: "We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features."); and render information associated with the predicted set of object-action pairs and the second sequence of video segments (page 2, col 1: "Similar to recurrent models, AVT can be rolled out indefinitely to predict further into the future (i.e. generate future predictions), yet it does so while processing the input in parallel with long-range attention, which is often lost in recurrent architectures.").

Girdhar does not explicitly teach, but in a similar field of endeavor of action/object detection, Kilickaya teaches: circuitry configured to ([0126]: "The means may include various hardware and/or software component(s) and/or module(s), including, but not limited to, a circuit, an application specific integrated circuit (ASIC), or processor."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to combine Girdhar's disclosure of future action detection with Kilickaya's teaching of using a circuit, in order to implement Girdhar's disclosure in a physical system with a circuit.
Regarding claims 2 and 17, Girdhar discloses: train an action prediction model based on the predicted set of object-action pairs and on the second sequence of video segments (page 2, col 1: "We train the model to jointly predict the next action while also learning to predict future features that match the true future features and (when available) their intermediate action labels."), wherein the trained action prediction model is configured to predict human-object interactions from input video frames associated with the domain (page 2, col 1, quoted above; page 6, col 1: "Datasets and metrics. We test on four popular action anticipation datasets summarized in Table 1. EpicKitchens-100 (EK100) [14] is the largest egocentric (first-person) video dataset with 700 long unscripted videos of cooking activities totaling 100 hours.").

Regarding claim 3, Girdhar discloses: receive a user input indicative of a time interval including the first sequence of video segments and the second sequence of video segments (page 3, col 2: "For each action segment labeled in the dataset starting at time τs, the goal is to recognize it using a τo length video segment τa units before it, i.e. from τs − (τa + τo) to τs − τa. While methods are typically allowed to use any length of observed segments (τo), the anticipation time (τa) is usually fixed for each dataset."), wherein the prediction of the set of object-action pairs is further based on the time interval (fig. 3: "The resulting feature is trained to regress to the true future feature (Lfeat) and predict the action at that time point if labeled (Lcls), and the last prediction is trained to predict the future action (Lnext).").

Regarding claims 4 and 18, Girdhar discloses: receive a user input indicative of the domain (figure 8: "More Qualitative Results. The spatial and temporal attention visualization in EK100, similar to Figure 1. For each input frame, we visualize the effective spatial attention by AVT-b using attention rollout [1]. The red regions represent the regions of highest attention, which we find too often correspond to hands+objects in the egocentric EpicKitchens-100 videos."); and determine a set of domain objects based on the domain indicated in the received user input (fig. 8: as seen in Figure 1, spatial attention focuses on hands and objects), wherein the detected set of objects include domain objects of the determined set of domain objects (page 12, col 1: "We test on four data sets as described in the main paper. EpicKitchens-100 (EK100) [14] is the largest egocentric (first-person) video dataset with 700 long unscripted videos of cooking activities totaling 100 hours. It contains 89,977 segments labeled with one of 97 verbs, 300 nouns, and 3807 verb-noun combinations (or 'actions'), and uses τa=1s.").

Regarding claim 5, Girdhar discloses: receive a user input indicative of a set of domain objects (figure 8, quoted above), wherein the set of domain objects indicated in the received user input is based on the domain (fig. 8: spatial attention focuses on hands and objects), and the detected set of objects include domain objects of the set of domain objects indicated in the received user input (page 12, col 1, quoted above).

Regarding claims 6 and 19, Girdhar discloses: apply a visual-language model on each video segment of the first sequence of video segments and on information associated with the set of objects to be detected in the first sequence of video segments (page 2, col 1: "Figure 1 shows examples of how AVT's spatial and temporal attention spreads over previously observed frames for two of its future predictions (wash tomato and turn-off tap)."; fig. 8: "The text on the top show future predictions at 2 points in the video, along with the temporal attention (last layer of AVT-h averaged over heads) visualized using the width of the lines. The green color of text indicates that it matches the GT action at that future frame (or that nothing is labeled at that frame)."), wherein the detection of the set of objects in the first sequence of video segments is further based on the application of the visual-language model (page 2, col 1, quoted above; fig. 1: "For example, to predict 'wash tomato' it attends equally to all previous frames as they determine if any more tomatoes need to be washed, whereas for 'turn-off tap' it focuses most on the current frame for cues whether the person might be done.").

Regarding claim 13, Girdhar discloses: generate a set of features based on the application of the PTE model on the generated set of embeddings (page 2, col 1: "We train the model to jointly predict the next action while also learning to predict future features that match the true future features and (when available) their intermediate action labels."), wherein each feature of the set of features is associated with a video segment of the second sequence of video segments (fig. 3: patch features in every frame); and apply a transformer decoder on the generated set of features (page 4, col 2: "The predicted features are then decoded into a distribution over the semantic action classes using a linear classifier θ, i.e. ŷt = θ(ẑt). We implement D using a masked transformer decoder inspired from popular approaches in generative language modeling, such as GPT-2 [70]."), wherein the prediction of the set of object-action pairs is further based on the application of the transformer decoder (page 4, col 2, quoted above).

Regarding claim 14, Girdhar discloses: generate a third subset of embeddings based on timestamp information associated with each video segment of the second sequence of video segments (fig. 3: patch features and spatial position embeddings in every frame, with X1 being a start time; "The resulting feature is trained to regress to the true future feature (Lfeat) and predict the action at that time point if labeled (Lcls), and the last prediction is trained to predict the future action (Lnext)."); apply the PTE model on a first subset of embeddings associated with the first sequence of video segments, a second subset of embeddings associated with the set of objects, and the third subset of embeddings (fig. 3: transformer encoder applied at every frame, with Lnext being a predicted future); and generate an encoded sequence based on the application of the PTE model (abstract: "We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features."), wherein the generation of the set of features is further based on the generated encoded sequence (abstract, quoted above).

Regarding claim 15, Girdhar does not explicitly teach, but in a similar field of endeavor of action/object detection, Kilickaya teaches: determine, for each feature of the generated set of features, a set of candidate object-action pairs (abstract: "A human-object interaction may be determined based on a set of candidate interactions and the predicted human-object pairs."; [0064]: "The classifier layer 504 includes a classifier (C) 516 that receives human-object features at each position of the input 510 (e.g., image). In some aspects, the classifier layer 504 may include bounding box classifiers to predict human and object bounding boxes and a verb-noun classifier. For example, the bounding box classifiers may be three-layer multi-layer perceptrons (MLPs) that generate four-dimensional outputs representing top left corner coordinates and the width-height of the bounding box. The verb-noun classifier maps the input features to the set of verb and object categories separately."); and determine a confidence score associated with each candidate object-action pair of the determined set of candidate object-action pairs ([0073]: "The visual prior VP, on the other hand, may compute how well a given prediction-target pair matches in terms of appearance. The visual prior may be determined based on a verb and noun classification within the image. The prediction pair(s) with the highest confidence for the target interaction categories (available via image-level HOI annotations) may receive the highest score."), wherein the prediction of the set of object-action pairs is further based on the determination of the confidence score associated with each candidate object-action pair ([0030]: "The general-purpose processor 102 may also include code to predict human-object pairs based on the extracted first set of features. The general-purpose processor 102 may further include code to determine a human-object interaction based on a set of candidate interactions and the predicted human-object pairs."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to combine Girdhar's disclosure of future action detection with Kilickaya's teaching of object-action pairs and confidence scores, in order to predict a triplet of <human, interaction, object> where human (interactor) and objects (interactee) are represented by a bounding box, and the interaction is a <verb, noun> tuple, such as <ride, bicycle> ([0002]).

Claims 7-10 are rejected under 35 U.S.C. 103 as being unpatentable over Girdhar et al., "Anticipative Video Transformer," in view of Kilickaya (US 20220414371 A1), and further in view of Zhang et al., "Temporal Sentence Grounding in Videos: A Survey and Future Directions."

Regarding claim 7, Girdhar discloses: apply a transformer encoder on each video segment of the first sequence of video segments where an object of the set of objects is detected (page 2, col 2: "While the architecture described so far can be applied on top of various frame or clip encoders (as we will show in experiments), we further propose a purely attention-based video modeling architecture by replacing the backbone with an attention-based frame encoder from the recently introduced Vision Transformer [18].").
Girdhar and Kilickaya do not explicitly disclose, but in a similar field of endeavor of temporal sentence grounding in videos, Zhang teaches: generate a multimodal representation based on the first sequence of video segments and the set of objects (page 3, col 1: "The interactor module, an essential component in TSGV, learns the multimodal representations by modeling the cross-modal interaction between video and query. Finally, the answer predictor generates moment predictions based on the learned multimodal representations."), wherein the multimodal representation corresponds to the set of embeddings (page 5, col 2: "Answer predictor is responsible for predicting the position of a target moment based on the learned multimodal features."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to combine Girdhar's and Kilickaya's disclosure of object-action pairs and confidence scores with Zhang's teaching of multimodal representations, in order to develop anchor-based and proposal-free methods to address TSGV in an "end-to-end" manner (page 6, col 2).

Regarding claim 8, Girdhar discloses: wherein the transformer encoder includes a video encoder and an object encoder (abstract: "We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features." (features under BRI can be features of objects); page 2, col 2: "While the architecture described so far can be applied on top of various frame or clip encoders (as we will show in experiments), we further propose a purely attention-based video modeling architecture by replacing the backbone with an attention-based frame encoder from the recently introduced Vision Transformer [18].").

Regarding claim 9, Girdhar discloses: apply the video encoder on each video segment of the first sequence of video segments (fig. 3: transformer encoder on each frame, with a [class] token as well as patch features and spatial position embeddings); and generate, based on the application of the video encoder, a first subset of embeddings that includes an embedding associated with each video segment of the first sequence of video segments (fig. 3: transformer encoder on each frame, with a [class] token as well as patch features and spatial position embeddings relating to input frame X1), wherein the set of embeddings includes the generated first subset of embeddings (fig. 3, cited above).

Regarding claim 10, Girdhar discloses: an object of the set of objects detected in a corresponding video segment (fig. 3: Lcls - unwrap pizza), an action executed by use of the object (page 17, fig. 9: "Long-term anticipation. Additional results continued from Figure 5 on EK100. On top of each frame, we show the future prediction at that frame (not the action that is happening in the frame, but what the model predicts will happen next). The following text boxes show the future predictions made by the model by rolling out autoregressively, using the predicted future feature. The number next to the rolled out predictions denotes for how many time steps that specific action would repeat, according to the model. For example, 'wash spoon: 4' means the model anticipates the 'wash spoon' action to continue for next 4 time steps."), a first time-instance associated with a start of the corresponding video segment (page 3, col 2: "We now present the AVT model architecture, as illustrated in Figure 3. It is designed to predict future actions given a video clip as input."), or a second time-instance associated with an end of the corresponding video segment (figure 2: "Action anticipation problem setup. The goal is to use the observed video segment of length τo to anticipate the future action τa seconds before it happens.").

Claims 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Girdhar et al., "Anticipative Video Transformer," in view of Kilickaya (US 20220414371 A1), in view of Zhang et al., "Temporal Sentence Grounding in Videos: A Survey and Future Directions," and further in view of Jones (US 20190244028 A1).

Regarding claim 11, Girdhar discloses: apply the object encoder on each video segment of the first sequence of video segments (fig. 3: transformer encoder on each frame, with a [class] token as well as patch features and spatial position embeddings relating to input frame X1); and generate, based on the application of the object encoder, a second subset of embeddings (page 8, col 2: "We append the predicted feature and run the model on the resulting sequence, reusing features computed for past frames. As shown in Figure 5, AVT makes reasonable future predictions — 'wash spoon' after 'wash knife', followed by 'wash hand' and 'dry hand' — indicating the model has started to learn certain 'action schemas' [68], a core capability of our causal attention and anticipative training architecture. We show additional results in Appendix D.3." ("features computed for past frames" under BRI would be the same as "subset of embeddings" since the prior art is detecting a spoon and a fork); page 14, col 1: "The only additional compute in anticipative training as opposed to naive is for applying the linear layer to classify past frame features for Lcls, since Lfeat simply matches past features, which anyway need to be computed for self-attention to predict the next action.").
Girdhar, Kilickaya, and Zhang do not disclose, but in a similar field of endeavor of object detection in videos, Jones teaches: an embedding associated with each object of the set of objects ([0037]: "Step S4 takes the temporal feature maps 230 and applies a third subnetwork 133 which outputs a set of bounding boxes and class probabilities which encode spatial locations and likely object classes for each detected object in the current video frame."), wherein the set of embeddings includes the generated second subset of embeddings ([0037], quoted above). It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to combine Girdhar's, Kilickaya's, and Zhang's disclosure of multimodal representations with Jones's teaching of multiple object detection, in order to use multi-class detectors that take multiple frames of video as input ([0007]).

Regarding claim 12, Girdhar discloses: the coordinates are associated with a video frame of a video segment of the first sequence of video segments where the object is detected (fig. 3: "We add a learned [CLASS] token, along with spatial position embeddings, and the resulting features are passed through multiple layers of multi-head attention, with shared weights across the transformers applied to all frames.").
Girdhar, Kilickaya, and Zhang do not explicitly disclose, but in a similar field of endeavor of object detection in videos, Jones teaches: each embedding of the second subset of embeddings, associated with an object of the set of objects, is generated based on coordinates of a bounding box that includes the object ([0037]: "Step S4 takes the temporal feature maps 230 and applies a third subnetwork 133 which outputs a set of bounding boxes and class probabilities which encode spatial locations and likely object classes for each detected object in the current video frame."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to combine Girdhar's, Kilickaya's, and Zhang's disclosure of multimodal representations with Jones's teaching of multiple object detection, in order to use multi-class detectors that take multiple frames of video as input ([0007]).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AHMED A NASHER, whose telephone number is (571) 272-1885. The examiner can normally be reached Mon - Fri, 0800 - 1700.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Moyer, can be reached at (571) 272-9523. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AHMED A NASHER/
Examiner, Art Unit 2675

/ANDREW M MOYER/
Supervisory Patent Examiner, Art Unit 2675

Prosecution Timeline

Dec 14, 2023: Application Filed
Mar 07, 2026: Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology:

- Patent 12601840: TUNING PARAMETER DETERMINATION METHOD FOR TRACKING AN OBJECT, A GROUP DENSITY-BASED CLUSTERING METHOD, AN OBJECT TRACKING METHOD, AND AN OBJECT TRACKING APPARATUS USING A LIDAR SENSOR (2y 5m to grant; granted Apr 14, 2026)
- Patent 12586329: MODELING METHOD, DEVICE, AND SYSTEM FOR THREE-DIMENSIONAL HEAD MODEL, AND STORAGE MEDIUM (2y 5m to grant; granted Mar 24, 2026)
- Patent 12582373: GENERATING SYNTHETIC ELECTRON DENSITY IMAGES FROM MAGNETIC RESONANCE IMAGES (2y 5m to grant; granted Mar 24, 2026)
- Patent 12567255: FEW-SHOT VIDEO CLASSIFICATION (2y 5m to grant; granted Mar 03, 2026)
- Patent 12561965: NEURAL NETWORK CACHING FOR VIDEO (2y 5m to grant; granted Feb 24, 2026)

Study what changed to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 81%
With Interview: 99% (+34.4%)
Median Time to Grant: 2y 9m
PTA Risk: Low

Based on 99 resolved cases by this examiner. Grant probability derived from career allow rate.
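As a sanity check, the headline grant probability can be reproduced from the career figures shown above (80 granted of 99 resolved). The without-interview rate below is back-calculated from the reported lift, an inference rather than a figure shown on this page; variable names are hypothetical:

```python
# Hypothetical sketch: reproducing the dashboard's headline figures from the
# career data above. Only 80/99 and the reported lift are given; the
# without-interview rate is inferred, not reported directly.
granted, resolved = 80, 99
career_allow_rate = granted / resolved              # about 0.808
grant_probability = round(career_allow_rate * 100)  # rounds to the 81% shown

with_interview = 99.0   # reported grant probability with interview
interview_lift = 34.4   # reported lift, in percentage points
without_interview = round(with_interview - interview_lift, 1)  # implied baseline
```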
