Prosecution Insights
Last updated: April 19, 2026
Application No. 18/634,794

ACTION LOCALIZATION IN VIDEOS USING LEARNED QUERIES

Non-Final OA (§102, §103)

Filed: Apr 12, 2024
Examiner: DING, XIAOMAO
Art Unit: 2676
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 1 (Non-Final)
Grant Probability: Favorable
Expected OA Rounds: 1-2
Time to Grant: 2y 9m

Examiner Intelligence

Career Allow Rate: 0% (grants only 0% of cases; 0 granted / 0 resolved; -62.0% vs TC avg)
Interview Lift: +0.0% (minimal lift; resolved cases with vs. without interview)
Typical Timeline: 2y 9m average prosecution
Career History: 11 total applications across all art units, 11 currently pending

Statute-Specific Performance

§101: 24.1% (-15.9% vs TC avg)
§102: 17.2% (-22.8% vs TC avg)
§103: 48.3% (+8.3% vs TC avg)
§112: 10.3% (-29.7% vs TC avg)
Compared against Tech Center average estimates • Based on career data from 0 resolved cases

Office Action

Rejections under §102 and §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) was submitted on 5/23/2025. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Drawings

The drawings are objected to as failing to comply with 37 CFR 1.84(p)(4) because reference character 150 in Figure 3 has been used to designate both the set of query vectors and the localization head. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference signs mentioned in the description: 350 is mentioned on page 10, line 10; 412 is mentioned on page 11, line 9; 414 is mentioned on page 11, line 15. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character not mentioned in the description: 500 in Figure 5. Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character in the description in compliance with 37 CFR 1.121(b), are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Objections

Claims 10 and 17 are objected to because of the following informalities: Claim 10, line 4, “for each the one or more” appears to be a typo. Examiner suggests “for each of the one or more”.
Claim 17, line 1, “clam 16” appears to be a typo. Examiner suggests “claim 16”. Appropriate correction is required.

Positive Statement Regarding 35 USC § 101

The Examiner’s 35 U.S.C. 101 analysis recognizes that the claimed subject matter is directed to a practical application of a technical solution. The claimed elements, taken as a whole, improve the functioning of action localization methods by improving the computational efficiency, see page 1, lines 28-29 and Figure 6. Because the claims recite specific, claimed steps and structural elements that produce a tangible technical result, they are not directed to an abstract idea absent additional inventive concept limitations. Accordingly, the record supports a positive 101 determination for the present claims.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 3-6, 8-10, 13, and 16-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by applicant supplied prior art Zhao et al. (Zhao, Jiaojiao, et al. "Tuber: Tubelet transformer for video action detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022) (hereafter, “Zhao”).

[Figure 2 from Zhao.]

Regarding claim 1, Zhao discloses a method performed by one or more computers (Page 13593, Table 1. Testing and evaluating the model implies a computer), the method comprising: receiving an input video comprising a sequence of video frames (Figure 1; Figure 1 caption, TubeR takes as input a video clip. The bottom panel in Figure 1 indicates a sequence of frames as input); maintaining a set of query vectors (Figure 2 “tubelet queries”; Page 13590, §3.2 TubeR Decoder, The i-th tubelet query Qi={qi,1, ..., qi,Tout}); and processing the set of query vectors and the input video to perform action localization on the input video (Page 13590, §3. Action Detection by TubeR, we present our TubeR that takes as input a video clip and directly outputs a tubelet: a sequence of bounding boxes and the action label; Figure 2 caption, The decoder transforms a set of tubelet queries Q …to predict tubelet labels and coordinates), comprising: processing the input video using a video encoder neural network (Figure 2. The top left of Figure 2 illustrates a video encoder neural network) to generate a feature representation of the input video (Figure 2; Page 13590, §3.1 TubeR Encoder, Fen ∈ RT′H′W′×C′ denotes the C′ dimensional encoded feature embedding); and processing the set of query vectors and the feature representation using a decoder neural network to generate an action localization output for the video (Page 13590, §3. Action Detection by TubeR, we present our TubeR that takes as input a video clip and directly outputs a tubelet: a sequence of bounding boxes and the action label; Figure 2 caption, The decoder transforms a set of tubelet queries Q …to predict tubelet labels and coordinates. The decoder in the top right of Figure 2 depicts tubelet queries and feature encoding as input), wherein the action localization output comprises, for each of one or more agents depicted in the video, data specifying, for each of one or more video frames in the video, a respective bounding box in the video frame that depicts the agent (Figure 1; Figure 3; Page 13590, §3. Action Detection by TubeR, outputs a tubelet: a sequence of bounding boxes. Figures 1 and 3 illustrate bounding boxes, around individual agents, output by the model) and a respective action from a set of actions that is being performed by the agent in the video frame (Figure 1; Figure 3; Page 13590, §3. Action Detection by TubeR, outputs … and the action label. Figures 1 and 3 illustrate action labels output by the model).

Regarding claim 3, Zhao discloses the method of claim 1, wherein the feature representation includes a respective set of feature vectors for each of the video frames (Page 13590, §3. Action Detection by TubeR, Where Tin, H, W, C denote the number of frames, height, width, and colour channels; Page 13590, §3.1 TubeR Encoder, Fen ∈ RT′H′W′×C′ denotes the C′ dimensional encoded feature embedding. The dimension T represents the per frame features).

Regarding claim 4, Zhao discloses the method of claim 1, wherein the set of query vectors comprises a respective set of query vectors corresponding to each of the video frames (Page 13592, §Benefit of tubelet queries, Each query set is composed of Tout per-frame query embeddings).

Regarding claim 5, Zhao discloses the method of claim 4, wherein each query vector has a temporal index that identifies the corresponding video frame for the query vector (Page 13591, §Tubelet attention, box query embeddings across time within the same tubelet, i.e. {qi,1, ..., qi,Tout}, i={1, ...,N}) and a spatial index that identifies a spatial position of the query vector within the video frame (Page 13591, §Tubelet attention, processes the spatial relations between box query embeddings within a frame i.e. {q1,t, ..., qN,t}, t={1, ..., Tout}).

Regarding claim 6, Zhao discloses the method of claim 5, wherein each query vector is a combination of (i) a spatial query vector for the spatial index of the query vector (Page 13591, §Tubelet attention, processes the spatial relations between box query embeddings within a frame i.e. {q1,t, ..., qN,t}, t={1, ..., Tout}. The first index of q is the spatial index) and (ii) a temporal query vector for the temporal index of the query vector (Page 13591, §Tubelet attention, box query embeddings across time within the same tubelet, i.e. {qi,1, ..., qi,Tout}, i={1, ...,N}. The second index of q is the temporal index).

Regarding claim 8, Zhao discloses the method of claim 1, wherein processing the set of query vectors and the feature representation using the decoder neural network comprises: processing the set of query vectors and the feature representation using an attention neural network (Figure 2, top right; Page 13591, §Decoder, The decoder contains a tubelet-attention module and a cross-attention (CA) layer) to update each of the query vectors in the set (Eqn. 6 describes the function used to update the vectors); and after updating each of the query vectors in the set, processing the query vectors using one or more output heads to generate the action localization output (Figure 2, bottom right; Page 13591, §3.3 Task-Specific Heads, The bounding boxes and action classification for each tubelet can be done simultaneously with independent task specific heads).

Regarding claim 9, Zhao discloses the method of claim 8, wherein processing the query vectors using one or more output heads to generate the action localization output comprises (Figure 2, classification head and regression head boxes; Page 13590-13591, §3.2 TubeR Decoder-§3.3 Task-Specific Heads, The bounding boxes and action classification for each tubelet can be done simultaneously with independent task specific heads. Sections 3.2 and 3.3 describe the steps involved in processing tubelet query vectors. In summary, self- and cross-attention heads are used for feature extraction from the query vectors followed by task-specific heads that perform action localization), for each of at least a subset of the query vectors (Figure 2, tubelet queries Q): processing the query vector using a localization head to generate one or more bounding boxes that are predicted to correspond to an agent (Figure 1, bottom; Figure 3. §3.3 Task-Specific Heads, The bounding boxes and action classification for each tubelet can be done simultaneously with independent task specific heads. Figures 1 and 3 show the output of the model with bounding boxes around an agent and action labels in multiple frames); and processing the query vector using a classification head to generate, for each of the one or more bounding boxes, a respective score for each action in the set of actions (Page 13591, §Context aware classification head, where yclass ∈ RN×L denotes the classification score on L possible labels).

Regarding claim 10, Zhao discloses the method of claim 9, wherein processing the query vectors using one or more output heads to generate the action localization output comprises (Figure 2, classification head and regression head boxes; Page 13590-13591, §3.2 TubeR Decoder-§3.3 Task-Specific Heads, The bounding boxes and action classification for each tubelet can be done simultaneously with independent task specific heads. Sections 3.2 and 3.3 describe the steps involved in processing tubelet query vectors. In summary, self- and cross-attention heads are used for feature extraction from the query vectors followed by task-specific heads that perform action localization), for each of at least the subset of the query vectors (Figure 2, tubelet queries Q): for each the one or more bounding boxes, selecting an action from the set of actions using the respective scores (Page 13590, §Action Detection by TubeR, In this section, we present our TubeR that takes as input a video clip and directly outputs a tubelet: a sequence of bounding boxes and the action label. Page 13591, §Context aware classification head, where yclass ∈ RN×L denotes the classification score on L possible labels. Since the final output includes an action label and the trained model outputs action classification scores, Examiner considers this to indicate the use of the scores to determine the labels).
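For orientation, the following is a minimal PyTorch-style sketch of the kind of task-specific output heads discussed for claims 8-10 above: a localization head that regresses a bounding box per query and a classification head that scores every action (plus an optional background class, which becomes relevant to claims 11-12 below), with the action for each box selected from those scores. The module names, dimensions, and box parameterization are illustrative assumptions, not code from Zhao or from the application as filed.

```python
import torch
import torch.nn as nn

class ActionLocalizationHeads(nn.Module):
    """Illustrative task-specific heads: a localization head regressing one box
    per query and a classification head scoring each action (plus an assumed
    background / "no action" class)."""

    def __init__(self, d_model: int = 256, num_actions: int = 80):
        super().__init__()
        # Localization head: predict (cx, cy, w, h) in [0, 1] for each query.
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),
        )
        # Classification head: one score per action; the last index is background.
        self.cls_head = nn.Linear(d_model, num_actions + 1)

    def forward(self, queries: torch.Tensor):
        # queries: [batch, num_queries, d_model], already updated by the decoder.
        boxes = self.box_head(queries)                # [B, Q, 4]
        scores = self.cls_head(queries).softmax(-1)   # [B, Q, num_actions + 1]
        # Select the highest-scoring action for each predicted box.
        actions = scores.argmax(-1)                   # [B, Q]
        # Boxes whose best action is the background class could be dropped
        # from the final localization output (cf. the claim 11-12 discussion).
        keep = actions < scores.shape[-1] - 1
        return boxes, scores, actions, keep

# Example: 2 videos, 16 queries, 256-dim query vectors.
heads = ActionLocalizationHeads()
boxes, scores, actions, keep = heads(torch.randn(2, 16, 256))
```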
Regarding claim 13, Zhao discloses the method of claim 8, wherein the attention neural network comprises: one or more self-attention blocks that each update the set of query vectors by performing self-attention across the set of query vectors (Figure 2; Tubelet-attention layer (TA) in the decoder box top right); and one or more cross-attention blocks that each update the set of query vectors performing cross-attention into the feature representation of the video (Figure 2; Cross-attention layer (CA) in the decoder box top right).

Regarding claim 16, Zhao discloses the method of claim 1, wherein the video encoder neural network and the decoder neural network have been trained jointly on a loss that measures errors in action localization outputs generated for a set of training videos relative to ground truth localization outputs for the set of training videos (Eqn. 14; Page 13592, §3.4 Losses, where y is the model output and Y denotes the ground truth).

Regarding claim 17, Zhao discloses the method of claim 16, wherein the training comprises matching predicted bounding boxes generated using the decoder neural network to ground truth bounding boxes within each of one or more video frames of each training video (Eqn. 14; Page 13592, §3.4 Losses, The Lbox and Liou denote the per-frame bounding box matching error).

Regarding claim 18, Zhao discloses the method of claim 9, wherein the one or more bounding boxes are a single bounding box in a video frame corresponding to the query vector (Figure 3; Figure 3 caption, Yellow indicates our detected tubelets).

Regarding claim 19, Zhao discloses a system comprising: one or more computers (Page 13593, Table 1. Testing and evaluating the model implies a computer); and one or more storage devices storing instructions that (Page 13588, §Abstract, Code will be available on GluonCV. Code availability implies a non-transitory computer-readable storage), when executed by the one or more computers, cause the one or more computers to perform operations comprising: receiving an input video comprising a sequence of video frames (Figure 1; Figure 1 caption, TubeR takes as input a video clip. The bottom panel in Figure 1 indicates a sequence of frames as input); maintaining a set of query vectors (Figure 2 “tubelet queries”; Page 13590, §3.2 TubeR Decoder, The i-th tubelet query Qi={qi,1, ..., qi,Tout}); and processing the set of query vectors and the input video to perform action localization on the input video (Page 13590, §3. Action Detection by TubeR, we present our TubeR that takes as input a video clip and directly outputs a tubelet: a sequence of bounding boxes and the action label; Figure 2 caption, The decoder transforms a set of tubelet queries Q …to predict tubelet labels and coordinates), comprising: processing the input video using a video encoder neural network (Figure 2. The top left of Figure 2 illustrates a video encoder neural network) to generate a feature representation of the input video (Figure 2; Page 13590, §3.1 TubeR Encoder, Fen ∈ RT′H′W′×C′ denotes the C′ dimensional encoded feature embedding); and processing the set of query vectors and the feature representation using a decoder neural network to generate an action localization output for the video (Page 13590, §3. Action Detection by TubeR, we present our TubeR that takes as input a video clip and directly outputs a tubelet: a sequence of bounding boxes and the action label; Figure 2 caption, The decoder transforms a set of tubelet queries Q …to predict tubelet labels and coordinates. The decoder in the top right of Figure 2 depicts tubelet queries and feature encoding as input), wherein the action localization output comprises, for each of one or more agents depicted in the video, data specifying, for each of one or more video frames in the video, a respective bounding box in the video frame that depicts the agent (Figure 1; Figure 3; Page 13590, §3. Action Detection by TubeR, outputs a tubelet: a sequence of bounding boxes. Figures 1 and 3 illustrate bounding boxes, around individual agents, output by the model) and a respective action from a set of actions that is being performed by the agent in the video frame (Figure 1; Figure 3; Page 13590, §3. Action Detection by TubeR, outputs … and the action label. Figures 1 and 3 illustrate action labels output by the model).

Regarding claim 20, Zhao discloses one or more non-transitory computer-readable storage media storing instructions (Page 13588, §Abstract, Code will be available on GluonCV. Code availability implies a non-transitory computer-readable storage) that when executed by one or more computers cause the one or more computers to perform operations comprising: receiving an input video comprising a sequence of video frames (Figure 1; Figure 1 caption, TubeR takes as input a video clip. The bottom panel in Figure 1 indicates a sequence of frames as input); maintaining a set of query vectors (Figure 2 “tubelet queries”; Page 13590, §3.2 TubeR Decoder, The i-th tubelet query Qi={qi,1, ..., qi,Tout}); and processing the set of query vectors and the input video to perform action localization on the input video (Page 13590, §3. Action Detection by TubeR, we present our TubeR that takes as input a video clip and directly outputs a tubelet: a sequence of bounding boxes and the action label; Figure 2 caption, The decoder transforms a set of tubelet queries Q …to predict tubelet labels and coordinates), comprising: processing the input video using a video encoder neural network (Figure 2. The top left of Figure 2 illustrates a video encoder neural network) to generate a feature representation of the input video (Figure 2; Page 13590, §3.1 TubeR Encoder, Fen ∈ RT′H′W′×C′ denotes the C′ dimensional encoded feature embedding); and processing the set of query vectors and the feature representation using a decoder neural network to generate an action localization output for the video (Page 13590, §3. Action Detection by TubeR, we present our TubeR that takes as input a video clip and directly outputs a tubelet: a sequence of bounding boxes and the action label; Figure 2 caption, The decoder transforms a set of tubelet queries Q …to predict tubelet labels and coordinates. The decoder in the top right of Figure 2 depicts tubelet queries and feature encoding as input), wherein the action localization output comprises, for each of one or more agents depicted in the video, data specifying, for each of one or more video frames in the video, a respective bounding box in the video frame that depicts the agent (Figure 1; Figure 3; Page 13590, §3. Action Detection by TubeR, outputs a tubelet: a sequence of bounding boxes. Figures 1 and 3 illustrate bounding boxes, around individual agents, output by the model) and a respective action from a set of actions that is being performed by the agent in the video frame (Figure 1; Figure 3; Page 13590, §3. Action Detection by TubeR, outputs … and the action label. Figures 1 and 3 illustrate action labels output by the model).
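Claims 2 and 7, addressed under §103 below, concern query vectors that are learned during joint training and formed as a sum of a spatial and a temporal embedding (claims 4-6 above supply the per-frame indexing). The sketch below illustrates one assumed way such a query set could be constructed in PyTorch; the class name, dimensions, and initialization are hypothetical and are not taken from Zhao, Li, or Bertasius.

```python
import torch
import torch.nn as nn

class LearnedSpatioTemporalQueries(nn.Module):
    """Illustrative query set with one group of queries per video frame, where
    each query is the sum of a learned spatial embedding (its position within
    the frame) and a learned temporal embedding (its frame index)."""

    def __init__(self, num_spatial: int = 16, num_frames: int = 8, d_model: int = 256):
        super().__init__()
        # Learned jointly with the rest of the model (cf. claim 2).
        self.spatial = nn.Parameter(torch.randn(num_spatial, d_model) * 0.02)
        self.temporal = nn.Parameter(torch.randn(num_frames, d_model) * 0.02)

    def forward(self) -> torch.Tensor:
        # Broadcast sum: query[t, s] = temporal[t] + spatial[s] (cf. claims 6-7).
        return self.temporal[:, None, :] + self.spatial[None, :, :]

queries = LearnedSpatioTemporalQueries()()
print(queries.shape)  # torch.Size([8, 16, 256])
```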
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 2 and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over applicant supplied prior art Zhao et al. (Zhao, Jiaojiao, et al. "Tuber: Tubelet transformer for video action detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022) (hereafter, “Zhao”) in view of applicant supplied prior art Li et al. (Li, Zexian, et al. "STD-TR: End-to-end spatio-temporal action detection with transformers." 2021 China Automation Congress (CAC). IEEE, 2021) (hereafter, “Li”).

Regarding claim 2, Zhao discloses the method of claim 1. However, Zhao fails to disclose wherein the set of query vectors are learned during joint training of the video encoder neural network and the decoder neural network. Li teaches wherein the set of query vectors are learned during joint training of the video encoder neural network and the decoder neural network (Figure 3; page 7617, §Transformer decoder, we employ learnable positional embeddings as object queries). Both Zhao and Li are analogous to the claimed invention because they are both in the field of encoder decoder models applied towards action localization. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the learning of query vectors from Li into the action localization model of Zhao. The suggestion/motivation for doing so would have been for improved action localization performance as suggested by Li at Page 7620, demonstrates superior performance on spatio-temporal action detection. This method of improving Zhao was within the ordinary ability of one of ordinary skill in the art based on the teachings of Li. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Zhao with the teachings of Li to obtain the invention as specified in claim 2.

Regarding claim 11, Zhao discloses the method of claim 9. However, Zhao fails to disclose wherein the set of actions includes a background action. Li teaches wherein the set of actions includes a background action (Page 7617, §feed-forward networks (FFNs), Since N (fixed number of detections) is always bigger than the number of objects in certain image, for ensuring the number of elements in two sets is the same, an additional class ’None’ is also added as background to represent that there is no object in certain proposal). Both Zhao and Li are analogous to the claimed invention because they are both in the field of encoder decoder models applied towards action localization. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the learning of query vectors from Li into the action localization model of Zhao. The suggestion/motivation for doing so would have been for improved action localization performance as suggested by Li at Page 7620, demonstrates superior performance on spatio-temporal action detection. This method of improving Zhao was within the ordinary ability of one of ordinary skill in the art based on the teachings of Li. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Zhao with the teachings of Li to obtain the invention as specified in claim 11.

Regarding claim 12, in which claim 11 is incorporated, Zhao discloses wherein processing the set of query vectors and the feature representation using the decoder neural network comprises: in response to selecting the [background action] for a given bounding box (Page 13591, §Action switch regression head, FC layer for deciding whether a box prediction depicts the actor performing the action(s) of the tubelet), determining not to include the bounding box in the action localization output (Page 13591, §Action switch regression head, To remove non-action boxes in a tubelet). However, Zhao fails to disclose background action. Li teaches background action (Page 7617, §feed-forward networks (FFNs), Since N (fixed number of detections) is always bigger than the number of objects in certain image, for ensuring the number of elements in two sets is the same, an additional class ’None’ is also added as background to represent that there is no object in certain proposal). Both Zhao and Li are analogous to the claimed invention because they are both in the field of encoder decoder models applied towards action localization. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the learning of query vectors from Li into the action localization model of Zhao. The suggestion/motivation for doing so would have been for improved action localization performance as suggested by Li at Page 7620, demonstrates superior performance on spatio-temporal action detection. This method of improving Zhao was within the ordinary ability of one of ordinary skill in the art based on the teachings of Li. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Zhao with the teachings of Li to obtain the invention as specified in claim 12.

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over applicant supplied prior art Zhao et al. (Zhao, Jiaojiao, et al. "Tuber: Tubelet transformer for video action detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022) (hereafter, “Zhao”) in view of Bertasius et al. (Bertasius, Gedas, Heng Wang, and Lorenzo Torresani. "Is Space-Time Attention All You Need for Video Understanding?." arXiv preprint arXiv:2102.05095. 2021) (hereafter, “Bertasius”).

Regarding claim 7, Zhao discloses the method of claim 6. However, Zhao fails to disclose wherein each query vector is a sum of (i) the spatial query vector for the spatial index of the query vector and (ii) the temporal query vector for the temporal index of the query vector. Bertasius teaches wherein each query vector is a sum of (i) the spatial query vector for the spatial index of the query vector and (ii) the temporal query vector for the temporal index of the query vector (Eqn. 2-5; page 3, p = 1, … , N denoting spatial locations and t = 1, … , F depicting an index over frames; page 3, §Query-Key-Value computation, a query/key/value vector is computed for each patch. Subscripts p and t are present for the query/key/value variables in each of the equations in 2-5). Both Zhao and Bertasius are analogous to the claimed invention because they are both in the field of applying transformers for video analysis. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the query vector format of Bertasius into the action localization model of Zhao. The suggestion/motivation for doing so would have been to process long videos, as suggested by Bertasius at page 2, left column 2nd paragraph, We also show that our model can be used for long-range modeling of videos spanning many minutes. This method of improving Zhao was within the ordinary ability of one of ordinary skill in the art based on the teachings of Bertasius. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Zhao with the teachings of Bertasius to obtain the invention as specified in claim 7.

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over applicant supplied prior art Zhao et al. (Zhao, Jiaojiao, et al. "Tuber: Tubelet transformer for video action detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022) (hereafter, “Zhao”) in view of applicant supplied prior art Arnab et al. (Arnab, Anurag, et al. "Vivit: A video vision transformer." Proceedings of the IEEE/CVF international conference on computer vision. 2021) (hereafter, “Arnab”).

Regarding claim 14, Zhao discloses the method of claim 13, wherein the set of query vectors comprises a respective set of query vectors corresponding to each of the video frames (Page 13592, §Benefit of tubelet queries, Each query set is composed of Tout per-frame query embeddings), wherein each query vector has a temporal index that identifies the corresponding video frame for the query vector (Page 13591, §Tubelet attention, box query embeddings across time within the same tubelet, i.e. {qi,1, ..., qi,Tout}, i={1, ...,N}. The second index (1 – Tout) is the temporal index) and a spatial index that identifies a spatial position of the query vector within the video frame (Page 13590, §Tubelet query, The i-th tubelet query Qi={qi,1, ..., qi,Tout}; Page 13591, §Tubelet attention, processes the spatial relations between box query embeddings within a frame i.e. {q1,t, ..., qN,t}, t={1, ..., Tout}. The first index (1-N) is the spatial index), [wherein each self-attention block comprises one or more self-attention heads, and wherein each self-attention head is configured to: map the set of query vectors to a respective head query vector, head key vector, and head value vector for each of the query vectors, and perform a factorized self-attention mechanism that comprises: a first self-attention mechanism that updates the head query vectors by, for each video frame, self-attending only within the video frame using the head query vectors, the head key vectors, and head value vectors for the set of query vectors corresponding to the video frame, and a second self-attention mechanism that updates the head query vectors by, for each spatial index, self-attending only among the query vectors that have the spatial index using the head query vectors, the head key vectors, and head value vectors for the query vectors that have the spatial index].

However, Zhao fails to disclose wherein each self-attention block comprises one or more self-attention heads, and wherein each self-attention head is configured to: map the set of query vectors to a respective head query vector, head key vector, and head value vector for each of the query vectors, and perform a factorized self-attention mechanism that comprises: a first self-attention mechanism that updates the head query vectors by, for each video frame, self-attending only within the video frame using the head query vectors, the head key vectors, and head value vectors for the set of query vectors corresponding to the video frame, and a second self-attention mechanism that updates the head query vectors by, for each spatial index, self-attending only among the query vectors that have the spatial index using the head query vectors, the head key vectors, and head value vectors for the query vectors that have the spatial index.

Arnab teaches wherein each self-attention block comprises one or more self-attention heads (Figure 5, Figure 6; Figures 5 and 6 illustrate two different implementations of self-attention blocks with multi-heads), and wherein each self-attention head is configured to: map the set of query vectors to a respective head query vector, head key vector, and head value vector for each of the query vectors (Eqn. 7; Page 5, §Model 4: Factorised dot-product attention, In self-attention, the queries Q = XWq, keys K = XWk, and values V = XWv are linear projections of the input X with X, Q, K, V ∈ RN×d), and perform a factorized self-attention mechanism (Figure 6, Self-Attention block. Examiner considers the factorised dot-product attention a “factorized self-attention mechanism”) that comprises: a first self-attention mechanism (Figure 6, temporal heads. Examiner considers the temporal heads as the “first self-attention mechanism”) that updates the head query vectors by, for each video frame (Page 5, §Model 4: Factorised dot-product attention, we attend over the temporal dimension. The temporal dimension refers to the frame index), self-attending only within the video frame using the head query vectors (Page 5, §Model 4: Factorised dot-product attention, queries Q = XWq), the head key vectors (Page 5, §Model 4: Factorised dot-product attention, The main idea here is to modify the keys and values for each query to only attend over tokens from the … temporal index, … Kt ∈ RNt×d), and head value vectors (Page 5, §Model 4: Factorised dot-product attention, Vt ∈ RNt×d) for the set of query vectors corresponding to the video frame (Figure 6 Temporal heads; Page 5, §Model 4: Factorised dot-product attention, we attend over the temporal dimension by computing Yt = Attention(Q, Kt, Vt)), and a second self-attention mechanism (Figure 6, spatial heads. Examiner considers the spatial heads as the “second self-attention mechanism”) that updates the head query vectors by, for each spatial index (Page 5, §Model 4: Factorised dot-product attention, we attend over tokens from the spatial dimension), self-attending only among the query vectors that have the spatial index using the head query vectors (Page 5, §Model 4: Factorised dot-product attention, queries Q = XWq), the head key vectors (Page 5, §Model 4: Factorised dot-product attention, The main idea here is to modify the keys and values for each query to only attend over tokens from the same spatial- … index, … Ks ∈ RNh·Nw×d), and head value vectors (Page 5, §Model 4: Factorised dot-product attention, Vs ∈ RNh·Nw×d) for the query vectors that have the spatial index (Figure 6 Spatial heads; Page 5, §Model 4: Factorised dot-product attention, we attend over tokens from the spatial dimension by computing Ys = Attention(Q, Ks, Vs)). Both Zhao and Arnab are analogous to the claimed invention because they are both in the field of applying transformer models towards video processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the temporal and spatial attention heads of Arnab into the action localization model of Zhao. The suggestion/motivation for doing so would have been improved efficiency, as suggested by Arnab at Page 1, §1. Introduction, we present several methods of factorising our model along spatial and temporal dimensions to increase efficiency and scalability. This method of improving Zhao was within the ordinary ability of one of ordinary skill in the art based on the teachings of Arnab. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Zhao with the teachings of Arnab to obtain the invention as specified in claim 14.

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over applicant supplied prior art Zhao et al. (Zhao, Jiaojiao, et al. "Tuber: Tubelet transformer for video action detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022) (hereafter, “Zhao”) in view of Yang et al. (Yang, Antoine, et al. "TubeDETR: Spatio-Temporal Video Grounding with Transformers." arXiv preprint arXiv:2203.16434v1 (2022)) (hereafter, “Yang”).
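As a bridge between the claim 14 analysis above and the claim 15 analysis below, here is a minimal PyTorch-style sketch of (i) query self-attention factorized into a within-frame pass and an across-frame pass at a fixed spatial index, and (ii) per-frame ("time-aligned") cross-attention in which the queries for frame t attend only to the encoder features of frame t. The use of nn.MultiheadAttention, the module layout, and the dimensions are assumptions for illustration only, not the implementations of Zhao, Arnab, or Yang.

```python
import torch
import torch.nn as nn

class FactorizedDecoderBlock(nn.Module):
    """Illustrative decoder block with factorized self-attention over the query
    set and frame-aligned cross-attention into per-frame encoder features."""

    def __init__(self, d_model: int = 256, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, queries: torch.Tensor, frame_features: torch.Tensor):
        # queries:        [T, N, D]  (T frames, N queries per frame)
        # frame_features: [T, P, D]  (P feature vectors per frame)
        q = queries
        # (1) Spatial self-attention: the batch dimension is the frame index,
        #     so each query attends only within its own frame.
        q = q + self.spatial_attn(q, q, q)[0]
        # (2) Temporal self-attention: transpose so the batch dimension is the
        #     spatial index, so attention runs only across frames at a fixed
        #     spatial position.
        qt = q.transpose(0, 1)                    # [N, T, D]
        qt = qt + self.temporal_attn(qt, qt, qt)[0]
        q = qt.transpose(0, 1)                    # [T, N, D]
        # (3) Time-aligned cross-attention: frame t's queries attend only to
        #     frame t's features, never to features of the other frames.
        q = q + self.cross_attn(q, frame_features, frame_features)[0]
        return q

block = FactorizedDecoderBlock()
out = block(torch.randn(8, 16, 256), torch.randn(8, 196, 256))
print(out.shape)  # torch.Size([8, 16, 256])
```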
Regarding claim 15, Zhao discloses the method of claim 13, wherein the set of query vectors comprises a respective set of query vectors corresponding to each of the video frames (Page 13592, §Benefit of tubelet queries, Each query set is composed of Tout per-frame query embeddings), wherein each query vector has a temporal index that identifies the corresponding video frame for the query vector and a spatial index that identifies a spatial position of the query vector within the video frame (Page 13590, §Tubelet query, The i-th tubelet query Qi={qi,1, ..., qi,Tout}; Page 13591, §Tubelet attention, processes the spatial relations between box query embeddings within a frame i.e. {q1,t, ..., qN,t}, t={1, ..., Tout}… box query embeddings across time within the same tubelet, i.e. {qi,1, ..., qi,Tout}, i={1, ...,N}), [wherein each cross-attention block performs factorized cross-attention comprising, for each video frame, updating the set of query vectors corresponding to the video frame by cross-attending over only the feature vectors for the video frame and not the feature vectors for any of the other video frames in the feature representation].

However, Zhao fails to disclose wherein each cross-attention block performs factorized cross-attention comprising, for each video frame, updating the set of query vectors corresponding to the video frame by cross-attending over only the feature vectors for the video frame and not the feature vectors for any of the other video frames in the feature representation. Yang teaches wherein each cross-attention block performs factorized cross-attention comprising, for each video frame, updating the set of query vectors corresponding to the video frame by cross-attending over only the feature vectors for the video frame and not the feature vectors for any of the other video frames in the feature representation (Page 4, §Time-aligned cross-attention, Instead, in our cross-attention module, each time query qt only cross-attends to its temporally corresponding multi-modal features F(v,s)[t] at frame t). Both Zhao and Yang are analogous to the claimed invention because they are in the field of applying transformers to video processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the frame-specific cross-attention module of Yang into the action localization model of Zhao. The suggestion/motivation for doing so would have been improved performance, as suggested by Yang at Page 7, §4.3 Comparison to the state of the art, our TubeDETR outperforms by a large margin all previous methods. This method of improving Zhao was within the ordinary ability of one of ordinary skill in the art based on the teachings of Yang. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Zhao with the teachings of Yang to obtain the invention as specified in claim 15.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Doersch et al. (US 2025/0191194) discloses a method of tracking query points in videos (¶0004, This specification describes a system implemented as computer programs on one or more computers in one or more locations that processes an input that includes (i) a video sequence that includes a plurality of video frames and (ii) a set of one or more query points). Rawat et al. (US 2022/0222940) discloses a method of video action classification using tubelets (¶0041, The present invention includes methods of detecting and categorizing an action in an untrimmed video segment regardless of the scale of the action and the close proximity of other actions. … Instead, the methods utilize a plurality of tubelets). Kadav et al. (US 2022/0237884) discloses a method of using encoders and decoders for action localization (¶0005, The method also includes predicting, by a hierarchical transformer encoder of the computer using the keypoint embeddings, human actions and bounding box information of when and where the human actions occur in the one or more video frames).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIAOMAO DING, whose telephone number is (571) 272-7237. The examiner can normally be reached Mon-Fri 8:00-4:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Henok Shiferaw, can be reached at (571) 272-4637. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/X.D./
Examiner, Art Unit 2676

/Henok Shiferaw/
Supervisory Patent Examiner, Art Unit 2676

Prosecution Timeline

Apr 12, 2024: Application Filed
Feb 27, 2026: Non-Final Rejection — §102, §103 (current)

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: Favorable
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 0 resolved cases by this examiner. Grant probability derived from career allow rate.
