Prosecution Insights
Last updated: April 19, 2026
Application No. 18/504,968

DYNAMIC TEMPORAL FUSION FOR VIDEO RECOGNITION

Status: Non-Final Office Action (§103)
Filed: Nov 08, 2023
Examiner: AZIMA, SHAGHAYEGH
Art Unit: 2671
Tech Center: 2600 (Communications)
Assignee: Qualcomm Incorporated
OA Round: 1 (Non-Final)
Grant Probability: 82% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 7m
With Interview: 93%

Examiner Intelligence

Career Allow Rate: 82%, above average (286 granted / 350 resolved; +19.7% vs TC avg)
Interview Lift: +11.4% for resolved cases with an interview (moderate lift)
Typical Timeline: 2y 7m average prosecution; 36 applications currently pending
Career History: 386 total applications across all art units

Statute-Specific Performance

§101: 15.8% (-24.2% vs TC avg)
§103: 42.5% (+2.5% vs TC avg)
§102: 13.9% (-26.1% vs TC avg)
§112: 14.5% (-25.5% vs TC avg)
Comparisons are against the Tech Center average estimate. Based on career data from 350 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

This action is in response to the applicant's communication filed on 12/17/2025. By virtue of that communication, claims 1-28, as elected by the applicant in the reply filed on 12/17/2025, are currently pending in the instant application. Election was made without traverse in the reply filed on 12/17/2025. Non-elected claims 29-30 have been canceled by the applicant.

Information Disclosure Statement

The Information Disclosure Statement (IDS) on form PTO-1449, filed on 07/08/2024, is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosed therein was considered by the examiner.

Drawings

The drawings received on 11/08/2023 have been reviewed by the Examiner and are acceptable.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 2, 5, 16, 17, 20, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Yu et al. (US 2017/0083798) in view of Li, Bing, et al., "Representation learning for compressed video action recognition via attentive cross-modal interaction with motion enhancement," arXiv (2022).

As per claim 1, "An apparatus for performing video action classification, comprising: at least one memory; and at least one processor coupled to at least one memory and configured to:"

"generate, via a first network, frame-level features obtained from a set of input frames;" (Yu, ¶[0023] discloses extracting static visual features from the frames of the video; the computing devices use a deep neural network, which may include a convolutional neural network, to extract the static visual features from each frame.)

"generate, via a first multi-scale temporal feature fusion, first local temporal context features from a first neighboring sub-sequence of the set of input frames;" (Yu, ¶[0024] discloses performing temporal-pyramid pooling on the extracted static visual features.
The static visual features from adjacent frames in level 0 are pooled. ¶[0025] and ¶[0027] disclose higher-level feature sets that represent aggregated temporal segments of the video, describing frames at multiple temporal scales (level 0, level 1, level 2, ...). ¶[0030] discloses that each feature set is pooled with both of the adjacent feature sets, although these poolings are separate.)

"generate, via a multi-scale temporal feature fusion, second local temporal context features from a second neighboring sub-sequence of the set of input frames;" (Yu, ¶[0027] discloses higher-level feature sets that represent aggregated temporal segments of the video, describing frames at multiple temporal scales (level 0, level 1, level 2, ...). ¶[0030] discloses that the second feature set 230B in level 0 is pooled with the first feature set 230A in one pooling to produce feature set 231A and, in a separate pooling, is pooled with a third feature set 230C to produce feature set 231B. ¶[0035] discloses that during the pooling each feature set is merged with only one adjacent feature set, and that when more than two feature sets in a level are pooled to create a higher-level feature set, some feature sets are pooled with both adjacent feature sets in the temporal order and some feature sets are pooled with only one adjacent feature set. Further see ¶[0037]-[0038].)

"and classify the set of input frames based on the first local temporal context features and the second local temporal context features." (Yu, ¶[0033] discloses that after generating the feature sets in the temporal-pooling pyramid, in block B123 the computing devices 100 encode the features in the feature sets from the temporal-pooling pyramid; for example, some embodiments use VLAD (vector of locally-aggregated descriptors) encoding or Fisher Vector encoding to encode the features. ¶[0062] discloses that if in block B910 the video is to be classified, the flow moves to block B916, where the encoded features, which include the encoded temporal-pooling pyramid and the encoded trajectory features, are tested with previously trained classifiers. Next, in block B918, the classification results are stored, and the flow ends in block B920.)

However, Yu does not explicitly disclose the following, which would have been obvious in view of Li, from a similar field of endeavor: "via a first multi-scale temporal feature fusion engine and via a second multi-scale temporal feature fusion engine" (Li, figure 2, page 3, section 3.2, col. 2 discloses that the multi-scale block (MSB) has four separate branches with cascaded connections, and short/long-term dynamics are captured by varying kernel sizes (i.e., 1, 3, and 5), which efficiently extracts multi-scale motion patterns at multiple spatial granularities. Page 4, col. 1 discloses that to aggregate the multi-scale features, the output features from the four branches are concatenated and fused by a 1 x 1 x 1 3D convolution; by this means, MSB represents coarse motion cues in a more comprehensive way.)

Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine Li's video action recognition technique with Yu's technique to provide the known and expected uses and benefits of Li's technique over Yu's video event classification technique. The proposed combination would have constituted a mere arrangement of old elements, with each performing its known function, the combination yielding no more than one would expect from such an arrangement.
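For readers less familiar with the cited art, the multi-scale fusion pattern the examiner attributes to Li's multi-scale block (MSB) can be sketched roughly as follows. This is an illustrative approximation only, not Li's released code: Li describes four cascaded branches, while this sketch uses one branch per cited kernel size (1, 3, and 5) and fuses the concatenated outputs with a 1 x 1 x 1 3D convolution. The class name, channel count, and tensor shapes are assumptions.

```python
# Illustrative sketch of a multi-scale temporal fusion block (assumed structure,
# loosely modeled on the examiner's characterization of Li's MSB).
import torch
import torch.nn as nn

class MultiScaleTemporalFusion(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # One branch per temporal kernel size; padding keeps the temporal length fixed.
        self.branches = nn.ModuleList([
            nn.Conv3d(channels, channels, kernel_size=(k, 1, 1), padding=(k // 2, 0, 0))
            for k in (1, 3, 5)
        ])
        # Fuse the concatenated branch outputs with a 1 x 1 x 1 convolution.
        self.fuse = nn.Conv3d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width) frame-level feature maps.
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(multi_scale)

# Example: one clip of 8 frames with 7x7 feature maps.
features = torch.randn(1, 64, 8, 7, 7)
fused = MultiScaleTemporalFusion(64)(features)   # -> (1, 64, 8, 7, 7)
```

Each branch sees a different temporal extent, which is the behavior the examiner maps onto the claimed "first" and "second" fusion engines applying different kernel values (claim 2).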
Therefore, it would have been obvious to a person of ordinary skill in the art to incorporate Li into Yu in order to improve motion enhancement and multi-modal fusion. (Refer to Li, page 1, col. 2.)

Claim 16 has been analyzed and is rejected for the reasons indicated for claim 1 above.

As per claim 2, "The apparatus of claim 1," Yu as modified by Li further discloses "wherein the first multi-scale temporal feature fusion engine applies a first kernel value for generating the first local temporal context features and wherein the second multi-scale temporal feature fusion engine applies a second kernel value for generating the second local temporal context features." (Li, page 3, col. 2, section 3.2, the multi-scale block (MSB) section, discloses that the MSB has four separate branches with cascaded connections, and short/long-term dynamics are captured by varying kernel sizes (i.e., 1, 3, and 5), which efficiently extracts multi-scale motion patterns at multiple spatial granularities.)

Claim 17 has been analyzed and is rejected for the reasons indicated for claim 2 above.

As per claim 5, "The apparatus of claim 1," Yu as modified by Li further discloses "wherein the first network comprises a two-dimensional convolutional neural network." (Li, Figure 3, discloses using 2D CNNs.)

Claim 20 has been analyzed and is rejected for the reasons indicated for claim 5 above.

As per claim 23, "The method of claim 20, wherein the first neighboring sub-sequence of the set of input frames equals the second neighboring sub-sequence of the set of input frames." (Li, page 2, col. 2, section 3, discloses applying encapsulation to extract I-frame and P-frame clips from compressed videos. Page 3, section 3.1, discloses calculating accumulated residuals and motion vectors, which are iterated to I-frames. Page 4, section 3.3, discloses the feature maps from the RGB (I-frame) and MVR (P-frame) modalities; the basic idea of SMC is incorporating aligned motion cues from the MVR modality into the RGB modality. Further, page 5, col. 1 discloses uniformly sampling 8 frames to generate the input clip from each video.)

Claims 3 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Yu et al. (US 2017/0083798), in view of Li, Bing, et al., "Representation learning for compressed video action recognition via attentive cross-modal interaction with motion enhancement," arXiv (2022), further in view of Paik et al. (US 2021/0216822).

As per claim 3, "The apparatus of claim 1," Yu as modified by Li does not explicitly disclose the following, which would have been obvious in view of Paik, from a similar field of endeavor: "wherein at least one processor is further configured to: classify, via an auxiliary classifier, the first local temporal context features and the second local temporal context features during a training process." (Paik, ¶[0087] discloses auxiliary classifiers, which also consist of one or more artificial neural layers and which inject gradient values into the earlier modules where they would otherwise have been greatly diminished starting from the output. Paik further discloses that figure 25 shows the Inception neural network architecture, in which auxiliary classifiers are added at intermediate modules in order to increase the gradient signal that gets propagated from output back to input; during training, the targets used for computing the loss function are identical between the final classifier and the auxiliary classifiers.
Further, ¶[0088] discloses the use of auxiliary classifiers in Inception, where the intermediate outputs are used in a much narrower way, only to contribute to the loss function.)

Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine Paik's image data analysis technique with the technique of Yu as modified by Li to provide the known and expected uses and benefits of Paik's technique over the video event classification technique of Yu as modified by Li. The proposed combination would have constituted a mere arrangement of old elements, with each performing its known function, the combination yielding no more than one would expect from such an arrangement. Therefore, it would have been obvious to a person of ordinary skill in the art to incorporate Paik into Yu as modified by Li in order to improve the speed and efficiency of image interpretation. (Refer to Paik, ¶[0002].)

Claim 18 has been analyzed and is rejected for the reasons indicated for claim 3 above. Additionally, the rationale and motivation to combine the Yu, Li, and Paik references, presented in the rejection of claim 3, apply to this claim.

Claims 4 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Yu et al. (US 2017/0083798), in view of Li, Bing, et al., "Representation learning for compressed video action recognition via attentive cross-modal interaction with motion enhancement," arXiv (2022), in view of Paik et al. (US 2021/0216822), further in view of Sapoznik et al. (US 2018/0013699).

As per claim 4, "The apparatus of claim 3," Yu as modified by Li and Paik does not explicitly disclose the following, which would have been obvious in view of Sapoznik, from a similar field of endeavor: "wherein the auxiliary classifier comprises a two-layer multilayer perceptron (MLP)." (Sapoznik, ¶[0294] discloses that classifier component 1640 may be implemented using a multi-layer perceptron (MLP) classifier, such as a two-layer MLP with a sigmoid output.)

Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine Sapoznik's classifier technique with the technique of Yu as modified by Li and Paik to provide the known and expected uses and benefits of Sapoznik's technique over the video event classification technique of Yu as modified by Li and Paik. The proposed combination would have constituted a mere arrangement of old elements, with each performing its known function, the combination yielding no more than one would expect from such an arrangement. Therefore, it would have been obvious to a person of ordinary skill in the art to incorporate Sapoznik into Yu as modified by Li and Paik in order to navigate customers accurately using better classifiers. (Refer to Sapoznik, ¶[0004].)

Claim 19 has been analyzed and is rejected for the reasons indicated for claim 4 above. Additionally, the rationale and motivation to combine the Yu, Li, Paik, and Sapoznik references, presented in the rejection of claim 4, apply to this claim.

Claims 6, 8, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Yu et al. (US 2017/0083798), in view of Li, Bing, et al., "Representation learning for compressed video action recognition via attentive cross-modal interaction with motion enhancement," arXiv (2022), further in view of Ning, Zhiqing, et al., "Person-context cross attention for spatio-temporal action detection,"
Huawei Noah's Ark Lab and University of Science and Technology of China, Tech. Rep. (2021).

As per claim 6, "The apparatus of claim 1," Yu as modified by Li further discloses "wherein at least one processor is further configured to generate, via the first multi-scale temporal feature fusion engine, the first local temporal context features from the first neighboring sub-sequence of the set of input frames by: generating, via a first convolutional neural network, first local temporal context features from the set of input frames;" (Yu, ¶[0023] discloses using a deep neural network, which may include a convolutional neural network, to extract the static visual features from each frame. ¶[0024] discloses that the computing devices 100 perform temporal-pyramid pooling on the extracted static visual features; the static visual features from adjacent frames in level 0 are pooled. For example, the static visual features of the first feature set 230A can be pooled with the static visual features of the second feature set 230B to generate a feature set 231A in level 1. The static visual features may be pooled by mean pooling, maximum pooling, or other pooling techniques.)

However, Yu as modified by Li does not explicitly disclose the following, which would have been obvious in view of Ning, from a similar field of endeavor: "generating, via a first cross attention module, a first cross attended feature output based on the first local temporal context features; generating, via a first average pooling module, a first average pooling dataset from the set of input frames; and generating the first local temporal context features by adding the first cross attended feature output to the first average pooling dataset." (Ning, page 2, col. 1, section 2.1 discloses that a video backbone network extracts spatio-temporal features from the video clip; average pooling is performed along the temporal dimension on the video feature, which results in a feature map V. Each pooled person feature, along with the global feature V, is viewed as a person-context pair and fed into a cross attention transformer encoder for relation modeling. Section 2.2 discloses that in the first layer of the cross attention transformer, the query input is a person feature and the key/value input is the person's context feature; the scaled dot-product operation outputs an attention scores matrix, and the projected context feature is multiplied by the matrix. The multiplied feature serves as the inherent dependency for person-context relations and is further added to the person feature through a shortcut connection. Further see Equation 2.)

Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine Ning's image data analysis technique with the technique of Yu as modified by Li to provide the known and expected uses and benefits of Ning's technique over the video event classification technique of Yu as modified by Li. The proposed combination would have constituted a mere arrangement of old elements, with each performing its known function, the combination yielding no more than one would expect from such an arrangement. Therefore, it would have been obvious to a person of ordinary skill in the art to incorporate Ning into Yu as modified by Li in order to improve person-object-scene interaction and reasoning and the performance of spatio-temporal action detection. (Refer to Ning, page 1, col. 2, paragraph 1, and page 2, col. 1, paragraph 1.)
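The cross-attention limitation of claim 6, as the examiner reads it onto Ning, can be sketched roughly as follows. This is an illustrative approximation, not Ning's implementation: a query feature attends over context features via scaled dot-product attention, and the attended output is added back to a temporally average-pooled feature through a shortcut connection. The module name, dimensions, and the use of PyTorch's built-in multi-head attention are assumptions.

```python
# Illustrative sketch of "cross-attend, average-pool, and add" (assumed structure,
# loosely modeled on the examiner's reading of Ning for claim 6).
import torch
import torch.nn as nn

class CrossAttendAndPool(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, local_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # local_feats:   (batch, time, dim) features from a neighboring sub-sequence
        # context_feats: (batch, time, dim) features providing temporal context
        attended, _ = self.cross_attn(query=local_feats,
                                      key=context_feats,
                                      value=context_feats)
        # Average-pool the input over time and add the cross-attended output
        # (broadcast over the time axis) as a shortcut connection.
        pooled = local_feats.mean(dim=1, keepdim=True)
        return attended + pooled

# Example: two 8-frame sub-sequences of 64-dimensional features.
local = torch.randn(2, 8, 64)
context = torch.randn(2, 8, 64)
out = CrossAttendAndPool()(local, context)   # -> (2, 8, 64)
```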
As per claim 8, "The apparatus of claim 6," Yu as modified by Li and Ning further discloses "wherein the first neighboring sub-sequence of the set of input frames equals the second neighboring sub-sequence of the set of input frames." (Li, page 2, col. 2, section 3, discloses applying encapsulation to extract I-frame and P-frame clips from compressed videos. Page 3, section 3.1, discloses calculating accumulated residuals and motion vectors, which are iterated to I-frames. Page 4, section 3.3, discloses the feature maps from the RGB (I-frame) and MVR (P-frame) modalities; the basic idea of SMC is incorporating aligned motion cues from the MVR modality into the RGB modality. Further, page 5, col. 1 discloses uniformly sampling 8 frames to generate the input clip from each video.)

Allowable Subject Matter

Claims 7-15, 22, and 24-28 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, subject to the conditions on the rejected and objected-to matter set forth in this action. The following is a statement of reasons for the indication of allowable subject matter: the prior art of record, alone or in combination, fails to teach or suggest the limitations set forth by each of claims 7-15, 22, and 24-28.

Contact

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAGHAYEGH AZIMA, whose telephone number is (571) 272-1459. The examiner can normally be reached Monday-Friday, 9:30-6:30. Examiner interviews are available via telephone, in person, and by video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, the applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Vincent Rudolph, can be reached at (571) 272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/SHAGHAYEGH AZIMA/
Examiner, Art Unit 2671
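Pulling the pieces of the §103 mapping together, a minimal end-to-end sketch of the claim 1 architecture as the examiner characterizes it, with an auxiliary two-layer MLP head of the kind cited for claims 3-4, might look like the following. Everything here (names, dimensions, the toy 2D backbone, and attaching the auxiliary head to the fused features) is an assumption for illustration; it is not the applicant's implementation or any reference's code.

```python
# Illustrative end-to-end sketch: 2D CNN frame features, two temporal fusion
# branches with different kernel values, a final classifier over both, and an
# auxiliary two-layer MLP head used only to add a training-time loss.
import torch
import torch.nn as nn

class VideoActionClassifier(nn.Module):
    def __init__(self, feat_dim: int = 64, num_classes: int = 10):
        super().__init__()
        # First network: per-frame 2D convolutional feature extractor (toy backbone).
        self.frame_net = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Two temporal fusion branches applying different kernel values (3 and 5).
        self.fuse_a = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.fuse_b = nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2)
        # Final classifier over both sets of local temporal context features.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)
        # Auxiliary two-layer MLP classifier for a training-time auxiliary loss.
        self.aux_classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.frame_net(frames.flatten(0, 1)).view(b, t, -1)  # (b, t, feat_dim)
        feats = feats.transpose(1, 2)                                # (b, feat_dim, t)
        ctx_a = self.fuse_a(feats).mean(dim=2)  # first local temporal context features
        ctx_b = self.fuse_b(feats).mean(dim=2)  # second local temporal context features
        joint = torch.cat([ctx_a, ctx_b], dim=1)
        logits = self.classifier(joint)
        aux_logits = self.aux_classifier(joint)  # trained against the same targets
        return logits, aux_logits

# Example: a batch of two 8-frame clips of 32x32 RGB frames.
logits, aux_logits = VideoActionClassifier()(torch.randn(2, 8, 3, 32, 32))
```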

Prosecution Timeline

Nov 08, 2023: Application Filed
Feb 20, 2026: Non-Final Rejection under §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586350: DETERMINING AUDIO AND VIDEO REPRESENTATIONS USING SELF-SUPERVISED LEARNING (granted Mar 24, 2026; 2y 5m to grant)
Patent 12573209: ROBUST INTERSECTION RIGHT-OF-WAY DETECTION USING ADDITIONAL FRAMES OF REFERENCE (granted Mar 10, 2026; 2y 5m to grant)
Patent 12561989: VEHICLE LOCALIZATION BASED ON LANE TEMPLATES (granted Feb 24, 2026; 2y 5m to grant)
Patent 12530867: Action Recognition System (granted Jan 20, 2026; 2y 5m to grant)
Patent 12525049: PERSON RE-IDENTIFICATION METHOD, COMPUTER-READABLE STORAGE MEDIUM, AND TERMINAL DEVICE (granted Jan 13, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 82%
With Interview: 93% (+11.4%)
Median Time to Grant: 2y 7m
PTA Risk: Low
Based on 350 resolved cases by this examiner. Grant probability derived from career allow rate.
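The projection figures are internally consistent under a simple reading (an assumption about how the dashboard derives its numbers, not a description of its actual model): the grant probability tracks the examiner's career allow rate, and the with-interview figure adds the lift as percentage points.

```python
# Back-of-the-envelope check of the displayed projections (assumed derivation).
granted, resolved = 286, 350
allow_rate = granted / resolved               # 0.817 -> displayed as 82%
interview_lift = 0.114                        # +11.4 percentage points
with_interview = allow_rate + interview_lift  # 0.931 -> displayed as 93%
print(f"{allow_rate:.1%} baseline, {with_interview:.1%} with interview")
```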
