Prosecution Insights
Last updated: April 19, 2026
Application No. 18/681,869

METHOD AND ELECTRONIC DEVICE FOR VIDEO ACTION DETECTION BASED ON END-TO-END FRAMEWORK

Non-Final OA — §102, §103
Filed: Feb 07, 2024
Examiner: BROUGHTON, KATHLEEN M
Art Unit: 2661
Tech Center: 2600 — Communications
Assignee: TCL Technology Group Corporation
OA Round: 1 (Non-Final)
Grant Probability: 83% (Favorable)
Expected OA Rounds: 1-2
Median Time to Grant: 2y 7m
Grant Probability with Interview: 92%

Examiner Intelligence

Career Allow Rate: 83% (219 granted / 263 resolved; +21.3% vs TC avg — above average)
Interview Lift: +8.3% (moderate, based on resolved cases with interview)
Avg Prosecution: 2y 7m typical timeline; 34 applications currently pending
Total Applications: 297 across all art units

Statute-Specific Performance

§101: 10.9% (-29.1% vs TC avg)
§103: 51.2% (+11.2% vs TC avg)
§102: 24.1% (-15.9% vs TC avg)
§112: 11.4% (-28.6% vs TC avg)
Tech Center averages are estimates, based on career data from 263 resolved cases.

Office Action

DETAILED ACTION

Notice of Pre-AIA or AIA Status: The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment: A Preliminary Amendment was filed 02/07/2024 amending the abstract and claims 8 and 10 of pending claims 1-10.

Priority: Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Information Disclosure Statement: The information disclosure statement (IDS) submitted on March 08 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement has been considered by the examiner.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 4, 5, and 10 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Xi et al. (CN 113158723).
Regarding Claim 1, Xi et al. teach a video action detection method based on an end-to-end framework (end-to-end video motion detection and locating system with a set of positioning processing steps; Fig 1 and ¶ [0064]), wherein the end-to-end framework includes a backbone network (video decoding module; Fig 1 and ¶ [0064]-[0065]), a positioning module (spatiotemporal information analysis unit module; Fig 1 and ¶ [0068]) and a classification module (channel information integration mining module; Fig 1 and ¶ [0069], [0093]), wherein the method (processing steps; Fig 1 and ¶ [0065]-[0070]) comprises: performing feature extraction on the video clip to be tested with the backbone network (video decoding unit; Fig 1 and ¶ [0065]) to obtain a video feature map of the video clip to be tested (the video stream data is decoded with the video decoding unit, recombined and analyzed with a calculation operation to determine feature maps; Fig 1 and ¶ [0065]-[0067], [0079]), where the video feature map includes feature maps of all frames in the video clip to be tested (the output feature maps include maps for each channel, width and height; ¶ [0079]); extracting feature maps of key frames from the video feature maps with the backbone network (key space information is extracted by the space-time information analyzing unit module; Fig 1 and ¶ [0068]), obtaining actor position features from the feature maps of the key frames (key spatial information is obtained from the output feature map and used for mining motion information; Fig 1 and ¶ [0068]-[0069], [0081]-[0084]), and obtaining action category features from the video feature maps (from the motion information the types of actions are predicted; Fig 1 and ¶ [0068]-[0069]); determining the actor's location based on the actor's location characteristics with the positioning module (the location features of the action are enhanced based on the spatiotemporal information and background filtering; Fig 1 and ¶ [0068]); and determining the action category corresponding to the actor's location based on the action category characteristics and the actor's location with the classification module (using the spatio-temporal feature data, motion mining is performed to determine types of actions and predict categories of motion; Fig 1 and ¶ [0069], [0108]).

Regarding Claim 4, Xi et al. teach the method according to claim 1 (as described above), wherein the key frame is a frame located in the middle of the video segment to be tested (the spatial key information is based on the location of the object where the behavior occurs and is based on frame regression of the target with video classification to improve recognition; ¶ [0060]-[0061], [0079]).

Regarding Claim 5, Xi et al. teach the method according to claim 1 (as described above), wherein determining, by the classification module, the action category corresponding to the actor position according to the action category characteristics and the actor position (using the spatio-temporal feature data, motion mining is performed to determine types of actions and predict categories of motion; Fig 1 and ¶ [0069], [0108]) includes: extracting the spatial action features and the temporal action features corresponding to the actor's position from the action category features based on the actor's position with the classification module (a spatiotemporal information analysis unit module is used to obtain the spatial and associated time position data, and using the spatio-temporal feature data, motion mining is performed to determine types of actions and predict categories of motion; Fig 1 and ¶ [0068]-[0069], [0081]-[0087]), fusing the spatial action features and temporal action features corresponding to the actor's position (a spatial feature fusion is performed to fuse the spatiotemporal information; Fig 1 and ¶ [0086]-[0091]), and determining the action category corresponding to the actor's position based on the fused features (the channel information and characteristic map, used to determine actions and category, is determined based on the fused feature map; Fig 1 and ¶ [0069], [0092]-[0095]).

Regarding Claim 10, Xi et al. teach an electronic device (system on chip for performing the end-to-end video motion detection and locating system with a set of positioning processing steps; Fig 1 and ¶ [0064], [0065]), wherein the electronic device includes a processor and a memory (a system on chip is recognized to contain a CPU and memory; Fig 1 and ¶ [0065]), the memory stores a computer program that can be executed by the processor, and when executed by the processor, the computer program implements the method (the end-to-end video motion detection and locating system's set of positioning processing instructions are understood to be stored and executed on the system on chip; Fig 1 and ¶ [0063]-[0065]) according to claim 1 (as described above).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2-3 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Xi et al. (CN 113158723) in view of Wu et al. (Context-Aware RCNN: A Baseline for Action Detection in Videos).

Regarding Claim 2, Xi et al. teach the method according to claim 1 (as described above).
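The three-module pipeline mapped for claim 1 above (backbone feature extraction, key-frame-based positioning, then classification) can be sketched as follows. This is a toy illustration only: every shape, module body, and threshold here is an assumption for demonstration, not the applicant's or Xi's actual architecture.

```python
# Hypothetical sketch of the claim 1 pipeline: backbone -> positioning -> classification.
# All internals are illustrative stand-ins, not the disclosed networks.
import numpy as np

def backbone(clip: np.ndarray):
    """Extract a per-frame feature map, then pull out the key (middle) frame."""
    t, h, w, c = clip.shape
    # Stand-in for convolutional feature extraction: per-frame channel averaging.
    video_features = clip.mean(axis=3, keepdims=True)   # (T, H, W, 1)
    key_frame = video_features[t // 2]                  # middle frame, per claim 4
    return video_features, key_frame

def positioning_module(key_frame: np.ndarray):
    """Locate the actor as the peak response in the key-frame feature map."""
    idx = np.unravel_index(np.argmax(key_frame[..., 0]), key_frame[..., 0].shape)
    return int(idx[0]), int(idx[1])

def classification_module(video_features: np.ndarray, loc) -> int:
    """Classify the action from features at the actor's location across time."""
    y, x = loc
    temporal_profile = video_features[:, y, x, 0]       # (T,)
    return int(temporal_profile.mean() > 0.5)           # toy binary category

clip = np.zeros((8, 4, 4, 3))
clip[:, 1, 2, :] = 1.0                                  # synthetic "actor" signal
feats, key = backbone(clip)
loc = positioning_module(key)
category = classification_module(feats, loc)
```

The point of the sketch is only the data flow the claim recites: one feature map feeds both the positioning and classification modules, so the framework is end-to-end rather than a detector chained to a separate classifier.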
Xi et al. does not teach performing multiple stages of feature extraction on the video clip to be tested with the backbone network to obtain video feature maps at each stage, wherein the spatial scales of the video feature maps at different stages are different; selecting the video feature maps of the last several stages among the multiple stages with the backbone network, extracting the feature maps of the key frames from the video feature maps of the last several stages, performing feature extraction on the feature maps of the key frames to obtain the actor position feature, and using the video feature map of the last stage among multiple stages as the action category feature.

Wu et al. is analogous art pertinent to the technological problem addressed in this application and teaches performing multiple stages of feature extraction on the video clip to be tested with the backbone network to obtain video feature maps at each stage (video of a person and action is detected and analyzed over multiple strides of a Faster RCNN with a ResNet-50 backbone with non-local blocks to extract frame feature maps; Fig 3 and 3.2 Extracting Actor Features: Backbone), wherein the spatial scales of the video feature maps at different stages are different (ROI features from convolutional feature maps are extracted for the different layers and the spatial dimensions are downsampled over each stride; Fig 3 and 3.2 Extracting Actor Features: Backbone, Extracting actor features); selecting the video feature maps of the last several stages among the multiple stages with the backbone network (the local representative patterns of the last layers of the network are used to correctly distinguish fine-grained action classes; Fig 3 and 3.2 Extracting Actor Features: Extracting actor features), extracting the feature maps of the key frames from the video feature maps of the last several stages (the local representative patterns of the actor bounding box at key frames of the last layers of the network are used to correctly distinguish fine-grained action classes; Fig 3 and 3.2 Extracting Actor Features: Extracting actor features), performing feature extraction on the feature maps of the key frames to obtain the actor position feature (the local representative patterns of the last layers of the network are enlarged to capture fine-grained details to correctly distinguish fine-grained action classes; Fig 3 and 3.2 Extracting Actor Features: Extracting actor features), and using the video feature map of the last stage among multiple stages as the action category feature (the local representative patterns of the last layers of the network are used to correctly identify fine-grained action classes; Fig 3 and 3.2 Extracting Actor Features: Extracting actor features).

It would have been obvious to one skilled in the art before the effective filing date of the current application to combine the teachings of Xi et al. with Wu et al., including performing multiple stages of feature extraction on the video clip to be tested with the backbone network to obtain video feature maps at each stage, wherein the spatial scales of the video feature maps at different stages are different; selecting the video feature maps of the last several stages among the multiple stages with the backbone network, extracting the feature maps of the key frames from the video feature maps of the last several stages, performing feature extraction on the feature maps of the key frames to obtain the actor position feature, and using the video feature map of the last stage among multiple stages as the action category feature. By performing fine-grained action analysis, differentiation of fine-grained actions may be achieved for actor-centric action recognition with high accuracy, as recognized by Wu et al. (1. Introduction ¶ 2-3).
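The multi-stage extraction mapped for claim 2 above can be sketched as a backbone whose stages progressively shrink the spatial scale, with the last stages kept for key-frame and category features. The stage count, the 2x pooling stand-in for strided convolution, and all shapes are assumptions for illustration, not Xi's or Wu's actual strides.

```python
# Illustrative multi-stage backbone: each stage halves the spatial scale, so the
# feature maps at different stages have different spatial scales (claim 2).
import numpy as np

def multi_stage_features(clip: np.ndarray, num_stages: int = 4):
    """Return a feature map per stage; spatial dims shrink by 2x at each stage."""
    stages = []
    x = clip
    for _ in range(num_stages):
        t, h, w = x.shape
        # 2x2 average pooling as a stand-in for a strided convolution stage.
        x = x[:, : h // 2 * 2, : w // 2 * 2]
        x = x.reshape(t, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
        stages.append(x)
    return stages

clip = np.random.rand(8, 32, 32)               # (T, H, W) toy clip
stages = multi_stage_features(clip)
last_stages = stages[-2:]                      # "last several stages"
key_frames = [s[s.shape[0] // 2] for s in last_stages]   # key-frame maps
action_category_feature = stages[-1]           # final-stage map as category feature
```

Each list entry corresponds to one stage, so selecting `stages[-2:]` mirrors the claim's "last several stages" while `stages[-1]` mirrors "the last stage".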
Regarding Claim 3, Xi et al. in view of Wu et al. teach the method according to claim 2 (as described above), wherein a residual network is used to perform multiple stages of feature extraction on the video clip to be tested (Wu et al.: the backbone model is based on an I3D ResNet-50 residual network (ResNet); Fig 3 and 3.2 Extracting Actor Features: Backbone), and a feature pyramid network is used to perform feature extraction on the feature map of the key frame (Wu et al.: the persons are detected by a Faster RCNN with a ResNeXt-101-FPN (feature pyramid network); Fig 3 and 3.2 Extracting Actor Features: Person detector).

Regarding Claim 6, Xi et al. teach the method according to claim 5 (as described above), including extracting, by the classification module based on the actor's position, the spatial action features and the temporal action features corresponding to the actor's position from the action category features (a spatiotemporal information analysis unit module is used to obtain the spatial and associated time position data, and using the spatio-temporal feature data, motion mining is performed to determine types of actions and predict categories of motion; Fig 1 and ¶ [0068]-[0069], [0081]-[0087]).

Xi et al. does not explicitly teach extracting a fixed-scale feature map of the corresponding area from the action category features based on the actor's position with the classification module; performing a global average pooling operation on the fixed-scale feature map in the time dimension to obtain the spatial action characteristics corresponding to the actor's position; and performing a global average pooling operation on the fixed-scale feature map in the spatial dimension to obtain the temporal action characteristics corresponding to the actor's position.
Wu et al. is analogous art pertinent to the technological problem addressed in this application and teaches extracting a fixed-scale feature map of the corresponding area from the action category features based on the actor's position with the classification module (actor boxes are cropped directly from the original video and resized to a fixed resolution; Fig 2, 3 and 3.2 Extracting Actor Features: Extracting actor features by RCNN); performing a global average pooling operation on the fixed-scale feature map in the time dimension to obtain the spatial action characteristics corresponding to the actor's position (global average pooling is performed for the image patches of fixed resolution over the given T frames; Fig 3 and 3.2 Extracting Actor Features: Extracting actor features by RCNN); and performing a global average pooling operation on the fixed-scale feature map in the spatial dimension to obtain the temporal action characteristics corresponding to the actor's position (global average pooling is performed for the image patches of fixed resolution over the given HxW spatial extent; Fig 3 and 3.2 Extracting Actor Features: Extracting actor features by RCNN).

It would have been obvious to one skilled in the art before the effective filing date of the current application to combine the teachings of Xi et al. with Wu et al., including extracting a fixed-scale feature map of the corresponding area from the action category features based on the actor's position with the classification module; performing a global average pooling operation on the fixed-scale feature map in the time dimension to obtain the spatial action characteristics corresponding to the actor's position; and performing a global average pooling operation on the fixed-scale feature map in the spatial dimension to obtain the temporal action characteristics corresponding to the actor's position. By extracting a fixed-scale feature map corresponding to a position with the classification module, analysis can be performed to correctly identify the action, thereby achieving high accuracy in identification of the action, as recognized by Wu et al. (1. Introduction ¶ 2-3).

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Xi et al. (CN 113158723) in view of Pan et al. (Actor-Context-Action Relation Network for Spatio-Temporal Action Localization).

Regarding Claim 7, Xi et al. teach the method according to claim 5 (as described above), including wherein a plurality of actor locations are determined by the positioning module (the positioning process includes detection of action features and focusing on motion information mining between frames; ¶ [0068]-[0069]), and the action category is determined by the classification module based on each actor location of the plurality of actor locations (the spatial-temporal data is analyzed to determine the type of action occurring; ¶ [0068]-[0069]), extracting spatial action features and temporal action features corresponding to each actor's position from the features (channel information of the data features is obtained based on the spatio-temporal information to predict the feature map of the corresponding channels; ¶ [0069]-[0070]).
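The two complementary pooling operations discussed for claim 6 above reduce one (T, H, W, C) fixed-scale map to a spatial descriptor and a temporal descriptor. A minimal sketch, with all shapes assumed for illustration:

```python
# Global average pooling along complementary axes of a fixed-scale map (claim 6):
# pooling over time yields spatial features; pooling over space yields temporal ones.
import numpy as np

def split_spatiotemporal(fixed_map: np.ndarray):
    """fixed_map: (T, H, W, C). Returns (H, W, C) spatial and (T, C) temporal features."""
    spatial_features = fixed_map.mean(axis=0)        # pool the time axis
    temporal_features = fixed_map.mean(axis=(1, 2))  # pool the spatial axes
    return spatial_features, temporal_features

fixed_map = np.random.rand(16, 7, 7, 256)            # assumed fixed-scale crop
spatial, temporal = split_spatiotemporal(fixed_map)
```

The design intuition is that averaging away time keeps "where" information while averaging away space keeps "when" information, which is why the claim treats the two pooled results as spatial and temporal action features respectively.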
Xi et al. does not explicitly teach inputting the spatial embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the spatial action features corresponding to the multiple actor positions and the output of the self-attention module to update the spatial action characteristics corresponding to each of the plurality of actor positions; and inputting the temporal embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the temporal action features corresponding to the multiple actor positions and the output of the self-attention module to update the temporal action features corresponding to each of the plurality of actor locations.

Pan et al. is analogous art pertinent to the technological problem addressed in this application and teaches inputting the spatial embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the spatial action features corresponding to the multiple actor positions and the output of the self-attention module to update the spatial action characteristics corresponding to each of the plurality of actor positions (an attention mechanism is performed between current and long-term actor context relations with query features based on short-term features and key and value features based on long-term features, thereby accounting for the spatial positioning; Fig 3 and 3.3 Actor-Context Feature Bank ¶ 2-3); and inputting the temporal embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the temporal action features corresponding to the multiple actor positions and the output of the self-attention module to update the temporal action features corresponding to each of the plurality of actor locations (the attention mechanism is performed between current and long-term actor context relations with query features based on short-term features and key and value features based on long-term features, thereby accounting for the temporal context for identifying more accurate action localization; Fig 3 and 3.3 Actor-Context Feature Bank ¶ 2-3).

It would have been obvious to one skilled in the art before the effective filing date of the current application to combine the teachings of Xi et al. with Pan et al., including inputting the spatial embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the spatial action features corresponding to the multiple actor positions and the output of the self-attention module to update the spatial action characteristics corresponding to each of the plurality of actor positions; and inputting the temporal embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the temporal action features corresponding to the multiple actor positions and the output of the self-attention module to update the temporal action features corresponding to each of the plurality of actor locations. By performing self-attention mechanisms for the action of an actor, long-range determinations may be made with high accuracy to correctly classify the action of an actor and spatio-temporal relationships in video, as recognized by Pan et al. (2. Related Work: Relational Reasoning for Video Understanding).

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Xi et al. (CN 113158723) in view of Escorcia et al. (US 2019/0108400).

Regarding Claim 8, Xi et al. teach the method according to claim 1 (as described above).
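The claim 7 mechanism discussed above (self-attention across per-actor embedding vectors, with the attention output mixed back into each actor's features by a convolution-like operation) can be sketched as follows. The softmax attention form, the dimensions, and the fixed 50/50 mixing weights are assumptions for illustration, not Pan's exact formulation.

```python
# Hedged sketch: self-attention over N actor embeddings, then a 1x1-conv-like
# linear mix of each actor's features with the attention output (claim 7).
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_update(features: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """features/embeddings: (N, D) per-actor arrays; returns updated (N, D) features."""
    n, d = embeddings.shape
    # Actor-to-actor attention weights from the embedding vectors.
    attn = softmax(embeddings @ embeddings.T / np.sqrt(d))   # (N, N)
    context = attn @ features                                # attention output, (N, D)
    # 1x1-convolution equivalent: a shared linear map over concatenated channels
    # (weights fixed here for the demo; they would be learned in practice).
    w = np.concatenate([np.eye(d), np.eye(d)], axis=0) * 0.5  # (2D, D)
    return np.concatenate([features, context], axis=1) @ w    # (N, D)

feats = np.random.rand(3, 8)    # 3 actors, 8-dim action features
embeds = np.random.rand(3, 8)   # matching embedding vectors
updated = self_attention_update(feats, embeds)
```

The same update would be applied twice under the claim: once with spatial embeddings over spatial action features, and once with temporal embeddings over temporal action features.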
Xi et al. does not teach wherein determining the actor location includes determining coordinates of an actor's bounding box and a confidence indicating that the actor's bounding box contains the actor; and the method further includes: selecting an actor location with a confidence level higher than a predetermined threshold and an action category corresponding to the actor location.

Escorcia et al. is analogous art pertinent to the technological problem addressed in this application and teaches wherein determining the actor location includes determining coordinates of an actor's bounding box and a confidence indicating that the actor's bounding box contains the actor (a confidence score is determined by the object detector for a bounding box representing that the features obtained are of the actors while in action across frames; Fig 9A, 9B and ¶ [0083]-[0084]); and the method further includes: selecting an actor location with a confidence level higher than a predetermined threshold and an action category corresponding to the actor location (a constraint is used for limiting the action proposal selection based on a minimum cost and/or a maximum affinity based on the confidence value; ¶ [0084]).

It would have been obvious to one skilled in the art before the effective filing date of the current application to combine the teachings of Xi et al. with Escorcia et al., including wherein determining the actor location includes determining coordinates of an actor's bounding box and a confidence indicating that the actor's bounding box contains the actor; and the method further includes: selecting an actor location with a confidence level higher than a predetermined threshold and an action category corresponding to the actor location. By using a confidence value to determine the accuracy of the bounding box, the actor is tracked over time with a high level of success, as recognized by Escorcia et al. (¶ [0086]).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Xi et al. (CN 113158723) in view of Escorcia et al. (US 2019/0108400) and Adhi Pramono et al. (Spatial-Temporal Action Localization with Hierarchical Self-Attention).

Regarding Claim 9, Xi et al. in view of Escorcia et al. teach the method according to claim 8 (as described above). Xi et al. in view of Escorcia et al. does not teach wherein the end-to-end framework is trained based on the following objective function: [equation image], wherein [symbol] represents the actor bounding box localization loss, [symbol] represents the action categorization loss, [symbol] is the cross entropy loss, [symbol] and [symbol] are the respective bounding box losses, [symbol] is the binary cross entropy loss, and [symbol], [symbol] and [symbol] are constant scalars used to balance the loss contribution.
Adhi Pramono et al. is analogous art pertinent to the technological problem addressed in this application and teaches wherein the end-to-end framework is trained based on the claimed objective function: [equation image], wherein [symbol] represents the actor bounding box localization loss, [symbol] represents the action categorization loss, [symbol] is the cross entropy loss, [symbol] and [symbol] are the respective bounding box losses, [symbol] is the binary cross entropy loss, and [symbol], [symbol] and [symbol] are constant scalars used to balance the loss contribution (training is performed that incorporates a loss function of cross entropy with focal loss, the cross entropy loss accounting for the multi-class action classification of all bounding boxes and the focal loss accounting for the classification; IV. Experimental Results, B. Settings).
It would have been obvious to one skilled in the art before the effective filing date of the current application to combine the teachings of Xi et al. in view of Escorcia et al. with Adhi Pramono et al., including wherein the end-to-end framework is trained based on the objective function described above, in which the actor bounding box localization loss, the action categorization loss, the cross entropy loss, the respective bounding box losses, and the binary cross entropy loss are balanced by constant scalars. By accounting for the localization of the bounding box and the classification, the action localization and classification are both considered during end-to-end training, thereby improving the model for higher accuracy in spatial-temporal action localization in video, as recognized by Adhi Pramono et al. (Introduction ¶ 3).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Chen et al. (Watch Only Once: An End-to-End Video Action Detection Framework) teaches an end-to-end pipeline for video action detection based on a spatial-temporal fusion module and was disclosed within one year by the same co-inventors as the current application. Wang et al. (US 2013/0132316) teaches a system and method for space-time modeling that identifies and tracks both sparse and global temporal trajectory probability, applied to continuous action recognition.
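The composite objective described for claim 9 (a bounding box localization term plus classification terms, balanced by constant scalar weights) can be illustrated with a sketch. The actual equation is not recoverable from the rendered images in the record, so every loss term, symbol name, and weight below is an assumption chosen only to show the general weighted-sum structure.

```python
# Illustrative weighted-sum training objective: L = lam_loc * L_box + lam_cls * L_bce.
# All terms and weights are assumptions, not the application's actual equation.
import numpy as np

def smooth_l1(pred: np.ndarray, target: np.ndarray) -> float:
    """A common bounding box regression loss (assumed here, not confirmed)."""
    d = np.abs(pred - target)
    return float(np.where(d < 1.0, 0.5 * d**2, d - 0.5).mean())

def binary_cross_entropy(p: np.ndarray, y: np.ndarray, eps: float = 1e-7) -> float:
    """BCE on predicted confidences, clipped for numerical stability."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def total_loss(box_pred, box_gt, score_pred, score_gt, lam_loc=1.0, lam_cls=1.0):
    """Scalar weights lam_loc/lam_cls balance the two contributions."""
    return (lam_loc * smooth_l1(box_pred, box_gt)
            + lam_cls * binary_cross_entropy(score_pred, score_gt))

loss = total_loss(np.array([0.1, 0.2, 0.8, 0.9]), np.array([0.0, 0.2, 1.0, 1.0]),
                  np.array([0.9]), np.array([1.0]))
```

Tuning the scalar weights trades localization accuracy against classification accuracy during end-to-end training, which is the balancing role the claim assigns to its constant scalars.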
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KATHLEEN M BROUGHTON, whose telephone number is (571) 270-7380. The examiner can normally be reached Monday-Friday, 8:00-5:00.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, John Villecco, can be reached at (571) 272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KATHLEEN M BROUGHTON/ Primary Examiner, Art Unit 2661

Prosecution Timeline

Feb 07, 2024
Application Filed
Dec 25, 2025
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602915 — FEATURE FUSION FOR NEAR FIELD AND FAR FIELD IMAGES FOR VEHICLE APPLICATIONS — granted Apr 14, 2026 (2y 5m to grant)
Patent 12597233 — SYSTEM AND METHOD FOR TRAINING A MACHINE LEARNING MODEL — granted Apr 07, 2026 (2y 5m to grant)
Patent 12586203 — IMAGE CUTTING METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM — granted Mar 24, 2026 (2y 5m to grant)
Patent 12567227 — METHOD AND SYSTEM FOR UNSUPERVISED DEEP REPRESENTATION LEARNING BASED ON IMAGE TRANSLATION — granted Mar 03, 2026 (2y 5m to grant)
Patent 12565240 — METHOD AND SYSTEM FOR GRAPH NEURAL NETWORK BASED PEDESTRIAN ACTION PREDICTION IN AUTONOMOUS DRIVING SYSTEMS — granted Mar 03, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 83%
With Interview: 92% (+8.3%)
Median Time to Grant: 2y 7m
PTA Risk: Low
Based on 263 resolved cases by this examiner. Grant probability derived from career allow rate.
