Prosecution Insights
Last updated: April 19, 2026
Application No. 18/504,968

DYNAMIC TEMPORAL FUSION FOR VIDEO RECOGNITION

Status: Non-Final Office Action (§103)
Filed: Nov 08, 2023
Examiner: AZIMA, SHAGHAYEGH
Art Unit: 2671
Tech Center: 2600 (Communications)
Assignee: Qualcomm Incorporated
OA Round: 1 (Non-Final)
Grant Probability: 82% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 7m
With Interview: 93%

Examiner Intelligence

Career Allow Rate: 82%, above average (286 granted / 350 resolved; +19.7% vs TC avg)
Interview Lift: +11.4% for resolved cases with an interview (moderate lift)
Typical Timeline: 2y 7m average prosecution; 36 applications currently pending
Career History: 386 total applications across all art units

Statute-Specific Performance

§101: 15.8% (-24.2% vs TC avg)
§103: 42.5% (+2.5% vs TC avg)
§102: 13.9% (-26.1% vs TC avg)
§112: 14.5% (-25.5% vs TC avg)
Comparisons are against the Tech Center average estimate. Based on career data from 350 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

This action is in response to the applicant's communication filed on 12/17/2025. By virtue of that communication, claims 1-28, as elected by the applicant in the reply filed on 12/17/2025, are currently pending in the instant application. Election was made without traverse in the reply filed on 12/17/2025. Non-elected claims 29-30 have been canceled by the applicant.

Information Disclosure Statement

The Information Disclosure Statement (IDS) on form PTO-1449, filed on 07/08/2024, is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosed therein was considered by the examiner.

Drawings

The drawings received on 11/08/2023 have been reviewed by the Examiner and are acceptable.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 2, 5, 16, 17, 20, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Yu et al. (US 2017/0083798) in view of Li, Bing, et al., "Representation learning for compressed video action recognition via attentive cross-modal interaction with motion enhancement," arXiv (2022).

As per claim 1, "An apparatus for performing video action classification, comprising: at least one memory; and at least one processor coupled to at least one memory and configured to:"

"generate, via a first network, frame-level features obtained from a set of input frames;" (Yu, ¶[0023] discloses extracting static visual features from the frames of the video; the computing devices use a deep neural network, which may include a convolutional neural network, to extract the static visual features from each frame.)

"generate, via a first multi-scale temporal feature fusion, first local temporal context features from a first neighboring sub-sequence of the set of input frames;" (Yu, ¶[0024] discloses performing temporal-pyramid pooling on the extracted static visual features.
The static visual features from adjacent frames in level 0 are pooled. ¶[0025] and ¶[0027] disclose higher-level feature sets that represent aggregated temporal segments of the video, describing frames at multiple temporal scales (level 0, level 1, level 2, ...). ¶[0030] discloses that each feature set is pooled with both of the adjacent feature sets, although these poolings are separate.)

"generate, via a multi-scale temporal feature fusion, second local temporal context features from a second neighboring sub-sequence of the set of input frames;" (Yu, ¶[0027] discloses higher-level feature sets that represent aggregated temporal segments of the video, describing frames at multiple temporal scales (level 0, level 1, level 2, ...). ¶[0030] discloses that the second feature set 230B in level 0 is pooled with the first feature set 230A in one pooling to produce feature set 231A and, in a separate pooling, is pooled with a third feature set 230C to produce feature set 231B. ¶[0035] discloses that during the pooling each feature set is merged with only one adjacent feature set, and that when more than two feature sets in a level are pooled to create a higher-level feature set, some feature sets are pooled with both adjacent feature sets in the temporal order and some feature sets are pooled with only one adjacent feature set. Further see ¶[0037]-[0038].)

"and classify the set of input frames based on the first local temporal context features and the second local temporal context features." (Yu, ¶[0033] discloses that after generating the feature sets in the temporal-pooling pyramid, in block B123 the computing devices 100 encode the features in the feature sets from the temporal-pooling pyramid; for example, some embodiments use VLAD (vector of locally-aggregated descriptors) encoding or Fisher Vector encoding to encode the features. ¶[0062] discloses that if in block B910 the video is to be classified, the flow moves to block B916, where the encoded features, which include the encoded temporal-pooling pyramid and the encoded trajectory features, are tested with previously trained classifiers. Next, in block B918, the classification results are stored, and the flow ends in block B920.)

However, Yu does not explicitly disclose the following, which would have been obvious in view of Li, from a similar field of endeavor: "via a first multi-scale temporal feature fusion engine and via a second multi-scale temporal feature fusion engine" (Li, figure 2, page 3, section 3.2, col. 2 discloses that the multi-scale block (MSB) has four separate branches with cascaded connections, and short/long-term dynamics are captured by varying kernel sizes (i.e., 1, 3, and 5), which efficiently extracts multi-scale motion patterns at multiple spatial granularities. Page 4, col. 1 discloses that to aggregate the multi-scale features, the output features from the four branches are concatenated and fused by a 1 x 1 x 1 3D convolution; by this means, MSB represents coarse motion cues in a more comprehensive way.)

Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine Li's video action recognition technique with Yu's technique to provide the known and expected uses and benefits of Li's technique over Yu's video event classification technique. The proposed combination would have constituted a mere arrangement of old elements, with each performing its known function, the combination yielding no more than one would expect from such an arrangement.
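For readers less familiar with the cited art, the multi-scale fusion pattern the examiner attributes to Li's multi-scale block (MSB) can be sketched roughly as follows. This is an illustrative approximation only, not Li's released code: Li describes four cascaded branches, while this sketch uses one branch per cited kernel size (1, 3, and 5) and fuses the concatenated outputs with a 1 x 1 x 1 3D convolution. The class name, channel count, and tensor shapes are assumptions.

```python
# Illustrative sketch of a multi-scale temporal fusion block (assumed structure,
# loosely modeled on the examiner's characterization of Li's MSB).
import torch
import torch.nn as nn

class MultiScaleTemporalFusion(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # One branch per temporal kernel size; padding keeps the temporal length fixed.
        self.branches = nn.ModuleList([
            nn.Conv3d(channels, channels, kernel_size=(k, 1, 1), padding=(k // 2, 0, 0))
            for k in (1, 3, 5)
        ])
        # Fuse the concatenated branch outputs with a 1 x 1 x 1 convolution.
        self.fuse = nn.Conv3d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width) frame-level feature maps.
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(multi_scale)

# Example: one clip of 8 frames with 7x7 feature maps.
features = torch.randn(1, 64, 8, 7, 7)
fused = MultiScaleTemporalFusion(64)(features)   # -> (1, 64, 8, 7, 7)
```

Each branch sees a different temporal extent, which is the behavior the examiner maps onto the claimed "first" and "second" fusion engines applying different kernel values (claim 2).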
Therefore, it would have been obvious to a person of ordinary skill in the art to incorporate Li into Yu in order to improve motion enhancement and multi-modal fusion. (Refer to Li, page 1, col. 2.)

Claim 16 has been analyzed and is rejected for the reasons indicated for claim 1 above.

As per claim 2, "The apparatus of claim 1," Yu as modified by Li further discloses "wherein the first multi-scale temporal feature fusion engine applies a first kernel value for generating the first local temporal context features and wherein the second multi-scale temporal feature fusion engine applies a second kernel value for generating the second local temporal context features." (Li, page 3, col. 2, section 3.2, the multi-scale block (MSB) section, discloses that the MSB has four separate branches with cascaded connections, and short/long-term dynamics are captured by varying kernel sizes (i.e., 1, 3, and 5), which efficiently extracts multi-scale motion patterns at multiple spatial granularities.)

Claim 17 has been analyzed and is rejected for the reasons indicated for claim 2 above.

As per claim 5, "The apparatus of claim 1," Yu as modified by Li further discloses "wherein the first network comprises a two-dimensional convolutional neural network." (Li, Figure 3, discloses using 2D CNNs.)

Claim 20 has been analyzed and is rejected for the reasons indicated for claim 5 above.

As per claim 23, "The method of claim 20, wherein the first neighboring sub-sequence of the set of input frames equals the second neighboring sub-sequence of the set of input frames." (Li, page 2, col. 2, section 3, discloses applying encapsulation to extract I-frame and P-frame clips from compressed videos. Page 3, section 3.1, discloses calculating accumulated residuals and motion vectors, which are iterated to I-frames. Page 4, section 3.3, discloses the feature maps from the RGB (I-frame) and MVR (P-frame) modalities; the basic idea of SMC is incorporating aligned motion cues from the MVR modality into the RGB modality. Further, page 5, col. 1 discloses uniformly sampling 8 frames to generate the input clip from each video.)

Claims 3 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Yu et al. (US 2017/0083798), in view of Li, Bing, et al., "Representation learning for compressed video action recognition via attentive cross-modal interaction with motion enhancement," arXiv (2022), further in view of Paik et al. (US 2021/0216822).

As per claim 3, "The apparatus of claim 1," Yu as modified by Li does not explicitly disclose the following, which would have been obvious in view of Paik, from a similar field of endeavor: "wherein at least one processor is further configured to: classify, via an auxiliary classifier, the first local temporal context features and the second local temporal context features during a training process." (Paik, ¶[0087] discloses auxiliary classifiers, which also consist of one or more artificial neural layers and which inject gradient values into the earlier modules where they would otherwise have been greatly diminished starting from the output. Paik further discloses that figure 25 shows the Inception neural network architecture, in which auxiliary classifiers are added at intermediate modules in order to increase the gradient signal that gets propagated from output back to input; during training, the targets used for computing the loss function are identical between the final classifier and the auxiliary classifiers.
Further, ¶[0088] discloses the use of auxiliary classifiers in Inception, where the intermediate outputs are used in a much narrower way, only to contribute to the loss function.)

Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine Paik's image data analysis technique with the technique of Yu as modified by Li to provide the known and expected uses and benefits of Paik's technique over the video event classification technique of Yu as modified by Li. The proposed combination would have constituted a mere arrangement of old elements, with each performing its known function, the combination yielding no more than one would expect from such an arrangement. Therefore, it would have been obvious to a person of ordinary skill in the art to incorporate Paik into Yu as modified by Li in order to improve the speed and efficiency of image interpretation. (Refer to Paik, ¶[0002].)

Claim 18 has been analyzed and is rejected for the reasons indicated for claim 3 above. Additionally, the rationale and motivation to combine the Yu, Li, and Paik references, presented in the rejection of claim 3, apply to this claim.

Claims 4 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Yu et al. (US 2017/0083798), in view of Li, Bing, et al., "Representation learning for compressed video action recognition via attentive cross-modal interaction with motion enhancement," arXiv (2022), in view of Paik et al. (US 2021/0216822), further in view of Sapoznik et al. (US 2018/0013699).

As per claim 4, "The apparatus of claim 3," Yu as modified by Li and Paik does not explicitly disclose the following, which would have been obvious in view of Sapoznik, from a similar field of endeavor: "wherein the auxiliary classifier comprises a two-layer multilayer perceptron (MLP)." (Sapoznik, ¶[0294] discloses that classifier component 1640 may be implemented using a multi-layer perceptron (MLP) classifier, such as a two-layer MLP with a sigmoid output.)

Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine Sapoznik's classifier technique with the technique of Yu as modified by Li and Paik to provide the known and expected uses and benefits of Sapoznik's technique over the video event classification technique of Yu as modified by Li and Paik. The proposed combination would have constituted a mere arrangement of old elements, with each performing its known function, the combination yielding no more than one would expect from such an arrangement. Therefore, it would have been obvious to a person of ordinary skill in the art to incorporate Sapoznik into Yu as modified by Li and Paik in order to navigate customers accurately using better classifiers. (Refer to Sapoznik, ¶[0004].)

Claim 19 has been analyzed and is rejected for the reasons indicated for claim 4 above. Additionally, the rationale and motivation to combine the Yu, Li, Paik, and Sapoznik references, presented in the rejection of claim 4, apply to this claim.

Claims 6, 8, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Yu et al. (US 2017/0083798), in view of Li, Bing, et al., "Representation learning for compressed video action recognition via attentive cross-modal interaction with motion enhancement," arXiv (2022), further in view of Ning, Zhiqing, et al., "Person-context cross attention for spatio-temporal action detection,"
Huawei Noah's Ark Lab and University of Science and Technology of China, Tech. Rep. (2021).

As per claim 6, "The apparatus of claim 1," Yu as modified by Li further discloses "wherein at least one processor is further configured to generate, via the first multi-scale temporal feature fusion engine, the first local temporal context features from the first neighboring sub-sequence of the set of input frames by: generating, via a first convolutional neural network, first local temporal context features from the set of input frames;" (Yu, ¶[0023] discloses using a deep neural network, which may include a convolutional neural network, to extract the static visual features from each frame. ¶[0024] discloses that the computing devices 100 perform temporal-pyramid pooling on the extracted static visual features; the static visual features from adjacent frames in level 0 are pooled. For example, the static visual features of the first feature set 230A can be pooled with the static visual features of the second feature set 230B to generate a feature set 231A in level 1. The static visual features may be pooled by mean pooling, maximum pooling, or other pooling techniques.)

However, Yu as modified by Li does not explicitly disclose the following, which would have been obvious in view of Ning, from a similar field of endeavor: "generating, via a first cross attention module, a first cross attended feature output based on the first local temporal context features; generating, via a first average pooling module, a first average pooling dataset from the set of input frames; and generating the first local temporal context features by adding the first cross attended feature output to the first average pooling dataset." (Ning, page 2, col. 1, section 2.1 discloses that a video backbone network extracts spatio-temporal features from the video clip; average pooling is performed along the temporal dimension on the video feature, which results in a feature map V. Each pooled person feature, along with the global feature V, is viewed as a person-context pair and fed into a cross attention transformer encoder for relation modeling. Section 2.2 discloses that in the first layer of the cross attention transformer, the query input is a person feature and the key/value input is the person's context feature; the scaled dot-product operation outputs an attention scores matrix, and the projected context feature is multiplied by the matrix. The multiplied feature serves as the inherent dependency for person-context relations and is further added to the person feature through a shortcut connection. Further see Equation 2.)

Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine Ning's image data analysis technique with the technique of Yu as modified by Li to provide the known and expected uses and benefits of Ning's technique over the video event classification technique of Yu as modified by Li. The proposed combination would have constituted a mere arrangement of old elements, with each performing its known function, the combination yielding no more than one would expect from such an arrangement. Therefore, it would have been obvious to a person of ordinary skill in the art to incorporate Ning into Yu as modified by Li in order to improve person-object-scene interaction and reasoning and the performance of spatio-temporal action detection. (Refer to Ning, page 1, col. 2, paragraph 1, and page 2, col. 1, paragraph 1.)
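The cross-attention limitation of claim 6, as the examiner reads it onto Ning, can be sketched roughly as follows. This is an illustrative approximation, not Ning's implementation: a query feature attends over context features via scaled dot-product attention, and the attended output is added back to a temporally average-pooled feature through a shortcut connection. The module name, dimensions, and the use of PyTorch's built-in multi-head attention are assumptions.

```python
# Illustrative sketch of "cross-attend, average-pool, and add" (assumed structure,
# loosely modeled on the examiner's reading of Ning for claim 6).
import torch
import torch.nn as nn

class CrossAttendAndPool(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, local_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # local_feats:   (batch, time, dim) features from a neighboring sub-sequence
        # context_feats: (batch, time, dim) features providing temporal context
        attended, _ = self.cross_attn(query=local_feats,
                                      key=context_feats,
                                      value=context_feats)
        # Average-pool the input over time and add the cross-attended output
        # (broadcast over the time axis) as a shortcut connection.
        pooled = local_feats.mean(dim=1, keepdim=True)
        return attended + pooled

# Example: two 8-frame sub-sequences of 64-dimensional features.
local = torch.randn(2, 8, 64)
context = torch.randn(2, 8, 64)
out = CrossAttendAndPool()(local, context)   # -> (2, 8, 64)
```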
As per claim 8, "The apparatus of claim 6," Yu as modified by Li and Ning further discloses "wherein the first neighboring sub-sequence of the set of input frames equals the second neighboring sub-sequence of the set of input frames." (Li, page 2, col. 2, section 3, discloses applying encapsulation to extract I-frame and P-frame clips from compressed videos. Page 3, section 3.1, discloses calculating accumulated residuals and motion vectors, which are iterated to I-frames. Page 4, section 3.3, discloses the feature maps from the RGB (I-frame) and MVR (P-frame) modalities; the basic idea of SMC is incorporating aligned motion cues from the MVR modality into the RGB modality. Further, page 5, col. 1 discloses uniformly sampling 8 frames to generate the input clip from each video.)

Allowable Subject Matter

Claims 7-15, 22, and 24-28 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, subject to the conditions on the rejected and objected-to matter set forth in this action. The following is a statement of reasons for the indication of allowable subject matter: the prior art of record, alone or in combination, fails to teach or suggest the limitations set forth by each of claims 7-15, 22, and 24-28.

Contact

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAGHAYEGH AZIMA, whose telephone number is (571) 272-1459. The examiner can normally be reached Monday-Friday, 9:30-6:30. Examiner interviews are available via telephone, in person, and by video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, the applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Vincent Rudolph, can be reached at (571) 272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/SHAGHAYEGH AZIMA/
Examiner, Art Unit 2671
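Pulling the pieces of the §103 mapping together, a minimal end-to-end sketch of the claim 1 architecture as the examiner characterizes it, with an auxiliary two-layer MLP head of the kind cited for claims 3-4, might look like the following. Everything here (names, dimensions, the toy 2D backbone, and attaching the auxiliary head to the fused features) is an assumption for illustration; it is not the applicant's implementation or any reference's code.

```python
# Illustrative end-to-end sketch: 2D CNN frame features, two temporal fusion
# branches with different kernel values, a final classifier over both, and an
# auxiliary two-layer MLP head used only to add a training-time loss.
import torch
import torch.nn as nn

class VideoActionClassifier(nn.Module):
    def __init__(self, feat_dim: int = 64, num_classes: int = 10):
        super().__init__()
        # First network: per-frame 2D convolutional feature extractor (toy backbone).
        self.frame_net = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Two temporal fusion branches applying different kernel values (3 and 5).
        self.fuse_a = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.fuse_b = nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2)
        # Final classifier over both sets of local temporal context features.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)
        # Auxiliary two-layer MLP classifier for a training-time auxiliary loss.
        self.aux_classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.frame_net(frames.flatten(0, 1)).view(b, t, -1)  # (b, t, feat_dim)
        feats = feats.transpose(1, 2)                                # (b, feat_dim, t)
        ctx_a = self.fuse_a(feats).mean(dim=2)  # first local temporal context features
        ctx_b = self.fuse_b(feats).mean(dim=2)  # second local temporal context features
        joint = torch.cat([ctx_a, ctx_b], dim=1)
        logits = self.classifier(joint)
        aux_logits = self.aux_classifier(joint)  # trained against the same targets
        return logits, aux_logits

# Example: a batch of two 8-frame clips of 32x32 RGB frames.
logits, aux_logits = VideoActionClassifier()(torch.randn(2, 8, 3, 32, 32))
```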

Prosecution Timeline

Nov 08, 2023: Application Filed
Feb 20, 2026: Non-Final Rejection under §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586350: DETERMINING AUDIO AND VIDEO REPRESENTATIONS USING SELF-SUPERVISED LEARNING (granted Mar 24, 2026; 2y 5m to grant)
Patent 12573209: ROBUST INTERSECTION RIGHT-OF-WAY DETECTION USING ADDITIONAL FRAMES OF REFERENCE (granted Mar 10, 2026; 2y 5m to grant)
Patent 12561989: VEHICLE LOCALIZATION BASED ON LANE TEMPLATES (granted Feb 24, 2026; 2y 5m to grant)
Patent 12530867: Action Recognition System (granted Jan 20, 2026; 2y 5m to grant)
Patent 12525049: PERSON RE-IDENTIFICATION METHOD, COMPUTER-READABLE STORAGE MEDIUM, AND TERMINAL DEVICE (granted Jan 13, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 82%
With Interview: 93% (+11.4%)
Median Time to Grant: 2y 7m
PTA Risk: Low
Based on 350 resolved cases by this examiner. Grant probability derived from career allow rate.
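The projection figures are internally consistent under a simple reading (an assumption about how the dashboard derives its numbers, not a description of its actual model): the grant probability tracks the examiner's career allow rate, and the with-interview figure adds the lift as percentage points.

```python
# Back-of-the-envelope check of the displayed projections (assumed derivation).
granted, resolved = 286, 350
allow_rate = granted / resolved               # 0.817 -> displayed as 82%
interview_lift = 0.114                        # +11.4 percentage points
with_interview = allow_rate + interview_lift  # 0.931 -> displayed as 93%
print(f"{allow_rate:.1%} baseline, {with_interview:.1%} with interview")
```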
