DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
A Preliminary Amendment filed 02/07/2024 amends the abstract and claims 8 and 10 of pending claims 1-10.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on March 08 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement has been considered by the examiner.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1, 4, 5, and 10 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Xi et al (CN 113158723).
Regarding Claim 1, Xi et al teach a video action detection method based on an end-to-end framework (end-to-end video motion detection locating system with a set of positioning processing steps; Fig 1 and ¶ [0064]), wherein the end-to-end framework includes a backbone network (video decoding module; Fig 1 and ¶ [0064]-[0065]), a positioning module (spatiotemporal information analysis unit module; Fig 1 and ¶ [0068]) and a classification module (channel information integration mining module; Fig 1 and ¶ [0069], [0093]), wherein the method (processing steps; Fig 1 and ¶ [0065]-[0070]) comprises:
performing feature extraction on the video clip to be tested with the backbone network (video decoding unit; Fig 1 and ¶ [0065]) to obtain a video feature map of the video clip to be tested (the video stream data is decoded with the video decoding unit, recombined and analyzed with a calculation operation to determine feature maps; Fig 1 and ¶ [0065]-[0067], [0079]), where the video feature map includes feature maps of all frames in the video clip to be tested (the output feature maps include maps for each channel, width and height; ¶ [0079]);
extracting feature maps of key frames from the video feature maps with the backbone network (key space information is extracted by the space-time information analyzing unit module; Fig 1 and ¶ [0068]), obtaining actor position features from the feature maps of the key frames (key spatial information is obtained from the output feature map and used for mining motion information; Fig 1 and ¶ [0068]-[0069], [0081]-[0084]), and obtaining action category features from the video feature maps (from the motion information the types of actions are predicted; Fig 1 and ¶ [0068]-[0069]);
determining the actor's location based on the actor's location characteristics with the positioning module (the location features of the action are enhanced based on the spatiotemporal information and the background is filtered; Fig 1 and ¶ [0068]); and
determining the action category corresponding to the actor's location based on the action category characteristics and the actor's location with the classification module (using the spatio-temporal feature data, motion mining is performed to determine types of actions and predict categories of motion; Fig 1 and ¶ [0069], [0108]).
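For illustration only, the claimed arrangement of a backbone network, a positioning module, and a classification module may be sketched in PyTorch-style code as follows; this is a minimal, non-authoritative example in which the module internals, tensor shapes, and the use of the middle frame as the key frame are simplifying assumptions and are not taken from Xi et al:

import torch
import torch.nn as nn

class ActionDetector(nn.Module):
    def __init__(self, num_classes=80, feat_dim=256):
        super().__init__()
        # Backbone: a single 3D convolution standing in for the feature-extraction network.
        self.backbone = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)
        # Positioning module: predicts per-location box coordinates from key-frame features.
        self.positioning = nn.Conv2d(feat_dim, 4, kernel_size=1)
        # Classification module: predicts action categories from pooled video features.
        self.classification = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):                        # clip: (B, 3, T, H, W) video clip to be tested
        feats = self.backbone(clip)                 # video feature map covering all frames
        key = feats[:, :, feats.shape[2] // 2]      # key frame taken as the middle frame (simplification)
        boxes = self.positioning(key)               # actor position features -> box coordinates
        cat_feat = feats.mean(dim=(2, 3, 4))        # action category features (globally pooled)
        logits = self.classification(cat_feat)      # action category corresponding to the actor
        return boxes, logits

detector = ActionDetector()
boxes, logits = detector(torch.randn(1, 3, 8, 64, 64))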
Regarding Claim 4, Xi et al teach the method according to claim 1 (as described above), wherein the key frame is a frame located in the middle of the video segment to be tested (the spatial key information is based on the location of the object where the behavior occurs and on frame regression of the target with video classification to improve recognition; ¶ [0060]-[0061], [0079]).
Regarding Claim 5, Xi et al teach the method according to claim 1 (as described above),
wherein determining, by the classification module, the action category corresponding to the actor position according to the action category characteristics and the actor position (using the spatio-temporal feature data, motion mining is performed to determine types of actions and predict categories of motion; Fig 1 and ¶ [0069], [0108]) includes:
extracting the spatial action features and the temporal action features corresponding to the actor's position from the action category features based on the actor's position with the classification module (a spatiotemporal information analysis unit module is used to obtain the spatial and associated time position data, and using the spatio-temporal feature data, motion mining is performed to determine types of actions and predict categories of motion; Fig 1 and ¶ [0068]-[0069], [0081]-[0087]),
fusing the spatial action features and temporal action features corresponding to the actor's position (a spatial feature fusion is performed to fuse the spatiotemporal information; Fig 1 and ¶ [0086]-[0091]), and
determining the action category corresponding to the actor's position based on the fused features (the channel information and characteristic map, to determine actions and category, is determined based on the fused feature map; Fig 1 and ¶ [0069], [0092]-[0095]).
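For illustration only, the extraction and fusion steps recited in claim 5 may be sketched as follows; this non-authoritative example assumes the action category features have already been cropped at the actor's position to a fixed size, and the concatenation-based fusion and all dimensions are assumptions rather than the reference's implementation:

import torch
import torch.nn as nn

C, T, S, NUM_CLASSES = 256, 8, 7, 80                        # assumed channel, time, spatial, and class counts
head = nn.Linear(C * S * S + C * T, NUM_CLASSES)            # classifier over the fused feature

def classify_actor(actor_feat):
    # actor_feat: (C, T, S, S) action category features at the actor's position
    spatial = actor_feat.mean(dim=1)                         # spatial action features (averaged over time) -> (C, S, S)
    temporal = actor_feat.mean(dim=(2, 3))                   # temporal action features (averaged over space) -> (C, T)
    fused = torch.cat([spatial.flatten(), temporal.flatten()])  # fuse the spatial and temporal action features
    return head(fused)                                       # action category corresponding to the actor's position

logits = classify_actor(torch.randn(C, T, S, S))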
Regarding Claim 10, Xi et al teach an electronic device (System on Chip for performing the end-to-end video motion detection locating system with a set of positioning processing steps; Fig 1 and ¶ [0064], [0065]), wherein the electronic device includes a processor and a memory (a system on chip is recognized to contain a CPU and memory; Fig 1 and ¶ [0065]), the memory stores a computer program that can be executed by the processor, and when executed by the processor, the computer program implements the method (the end-to-end video motion detection locating system with a set of positioning processing instructions is understood to be stored and executed on the system on chip; Fig 1 and ¶ [0063]-[0065]) according to claim 1 (as described above).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 2-3 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Xi et al (CN 113158723) in view of Wu et al (Context-Aware RCNN: A Baseline for Action Detection in Videos).
Regarding Claim 2, Xi et al teach the method according to claim 1 (as described above).
Xi et al does not teach performing multiple stages of feature extraction on the video clip to be tested with the backbone network to obtain video feature maps at each stage, wherein the spatial scales of the video feature maps at different stages are different; selecting the video feature maps of the last several stages among the multiple stages with the backbone network, extracting the feature maps of the key frames from the video feature maps of the last several stages, performing feature extraction on the feature maps of the key frames to obtain the actor position feature, and using the video feature map of the last stage among multiple stages as the action category feature.
Wu et al is analogous art pertinent to the technological problem addressed in this application and teaches performing multiple stages of feature extraction on the video clip to be tested with the backbone network to obtain video feature maps at each stage (video of a person and action is detected and analyzed in multiple strides of a Faster RCNN with a ResNet-50 network with non-local blocks backbone to extract frame feature maps; Fig 3 and 3.2 Extracting Actor Features: -Backbone), wherein the spatial scales of the video feature maps at different stages are different (ROI features from convolution feature maps are extracted for the different layers and the spatial dimensions are downsampled over each stride; Fig 3 and 3.2 Extracting Actor Features: -Backbone, -Extracting actor features);
selecting the video feature maps of the last several stages among the multiple stages with the backbone network (the local representative patterns of the last layers of the network are used to correctly distinguish fine-grained action classes; Fig 3 and 3.2 Extracting Actor Features: -Extracting actor features),
extracting the feature maps of the key frames from the video feature maps of the last several stages (the local representative patterns of the actor bounding box at key frames of the last layers of the network are used to correctly distinguish fine-grained action classes; Fig 3 and 3.2 Extracting Actor Features: -Extracting actor features),
performing feature extraction on the feature maps of the key frames to obtain the actor position feature (the local representative patterns of the last layers of the network are enlarged to capture fine-grained details to correctly distinguish fine-grained action classes; Fig 3 and 3.2 Extracting Actor Features: -Extracting actor features), and
using the video feature map of the last stage among multiple stages as the action category feature (the local representative patterns of the last layers of the network are used to correctly identify fine-grained action classes; Fig 3 and 3.2 Extracting Actor Features: -Extracting actor features).
It would have been obvious to one skilled in the art before the effective filing date of the current application to combine the teachings of Xi et al with Wu et al including performing multiple stages of feature extraction on the video clip to be tested with the backbone network to obtain video feature maps at each stage, wherein the spatial scales of the video feature maps at different stages are different; selecting the video feature maps of the last several stages among the multiple stages with the backbone network, extracting the feature maps of the key frames from the video feature maps of the last several stages, performing feature extraction on the feature maps of the key frames to obtain the actor position feature, and using the video feature map of the last stage among multiple stages as the action category feature. By performing fine-grained action analysis, differentiation of fine-grained actions may be achieved for actor-centric action recognition with high accuracy, as recognized by Wu et al (1. Introduction ¶ 2-3).
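For illustration only, multi-stage feature extraction with stage-dependent spatial scales may be sketched as follows; the stage count, channel widths, and strides are assumptions for this example and the code does not reproduce Xi et al's or Wu et al's networks:

import torch
import torch.nn as nn

stages = nn.ModuleList([
    nn.Conv3d(3, 64, 3, stride=(1, 2, 2), padding=1),       # stage 1: 1/2 spatial scale
    nn.Conv3d(64, 128, 3, stride=(1, 2, 2), padding=1),     # stage 2: 1/4 spatial scale
    nn.Conv3d(128, 256, 3, stride=(1, 2, 2), padding=1),    # stage 3: 1/8 spatial scale
    nn.Conv3d(256, 512, 3, stride=(1, 2, 2), padding=1),    # stage 4: 1/16 spatial scale
])

clip = torch.randn(1, 3, 8, 128, 128)                       # video clip to be tested
feat_maps = []
x = clip
for stage in stages:
    x = stage(x)
    feat_maps.append(x)                                     # video feature map at each stage

last_stages = feat_maps[-2:]                                # video feature maps of the last several stages
key_frame_maps = [f[:, :, f.shape[2] // 2] for f in last_stages]  # key-frame feature maps -> actor position feature
action_category_feat = feat_maps[-1]                        # last-stage feature map used as the action category feature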
Regarding Claim 3, Xi et al in view of Wu et al teach the method according to claim 2 (as described above), wherein a residual network is used to perform multiple stages of feature extraction on the video clip to be tested (Wu et al, the backbone model is based on an I3D ResNet-50 residual network (ResNet); Fig 3 and 3.2 Extracting Actor Features: -Backbone), and a feature pyramid network is used to perform feature extraction on the feature map of the key frame (Wu et al, the persons are detected by a Faster RCNN with a ResNeXt-101-FPN (feature pyramid network); Fig 3 and 3.2 Extracting Actor Features: -Person detector).
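For illustration only, a residual network combined with a feature pyramid network may be sketched with standard torchvision components as follows; the specific ResNet-50 variant, the torchvision API used (version 0.13 or later assumed), and the input size are assumptions and this is not the code of the applied references:

from collections import OrderedDict
import torch
from torchvision.models import resnet50
from torchvision.ops import FeaturePyramidNetwork

resnet = resnet50(weights=None)                              # residual network backbone (untrained here)
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

key_frame = torch.randn(1, 3, 224, 224)                      # feature extraction on the key frame
x = resnet.maxpool(resnet.relu(resnet.bn1(resnet.conv1(key_frame))))
c2 = resnet.layer1(x)                                        # residual stages c2..c5 at decreasing spatial scales
c3 = resnet.layer2(c2)
c4 = resnet.layer3(c3)
c5 = resnet.layer4(c4)
pyramid = fpn(OrderedDict(c2=c2, c3=c3, c4=c4, c5=c5))       # multi-scale key-frame feature maps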
Regarding Claim 6, Xi et al teach the method according to claim 5 (as described above), including extracting, by the classification module based on the actor's position, the spatial action features and the temporal action features corresponding to the actor's position from the action category features (a spatiotemporal information analysis unit module is used to obtain the spatial and associated time position data, and using the spatio-temporal feature data, motion mining is performed to determine types of actions and predict categories of motion; Fig 1 and ¶ [0068]-[0069], [0081]-[0087]).
Xi et al does not explicitly teach extracting a fixed-scale feature map of the corresponding area from the action category features based on the actor's position with the classification module; performing a global average pooling operation on the fixed-scale feature map in the time dimension to obtain the spatial action characteristics corresponding to the actor's position; and performing a global average pooling operation on the fixed-scale feature map in the spatial dimension to obtain the temporal action characteristics corresponding to the actor's position.
Wu et al is analogous art pertinent to the technological problem addressed in this application and teaches extracting a fixed-scale feature map of the corresponding area from the action category features based on the actor's position with the classification module (actor boxes are cropped directly from original video and resized to a fixed resolution; Fig 2, 3 and 3.2 Extracting Actor Features – Extracting actor features by RCNN);
performing a global average pooling operation on the fixed-scale feature map in the time dimension to obtain the spatial action characteristics corresponding to the actor's position (global average pooling is performed for the image patches of fixed resolution with given time T frame; Fig 3 and 3.2 Extracting Actor Features – Extracting actor features by RCNN); and
performing a global average pooling operation on the fixed-scale feature map in the spatial dimension to obtain the temporal action characteristics corresponding to the actor's position (global average pooling is performed for the image patches of fixed resolution with given spatial HxW frame; Fig 3 and 3.2 Extracting Actor Features – Extracting actor features by RCNN).
It would have been obvious to one skilled in the art before the effective filing date of the current application to combine the teachings of Xi et al with Wu et al including extracting a fixed-scale feature map of the corresponding area from the action category features based on the actor's position with the classification module; performing a global average pooling operation on the fixed-scale feature map in the time dimension to obtain the spatial action characteristics corresponding to the actor's position; and performing a global average pooling operation on the fixed-scale feature map in the spatial dimension to obtain the temporal action characteristics corresponding to the actor's position. By extracting a fixed-scale feature map corresponding to a position with the classification, analysis can be performed to correctly identify the action, thereby achieving a high accuracy in correct identification of the action, as recognized by Wu et al (1. Introduction ¶ 2-3).
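For illustration only, the fixed-scale cropping and the two global average pooling operations recited in claim 6 may be sketched as follows; the use of torchvision's roi_align and the 7x7 output resolution are assumptions, not the implementation of Xi et al or Wu et al:

import torch
from torchvision.ops import roi_align

def spatial_temporal_features(category_feats, box):
    # category_feats: (C, T, H, W) action category features; box: (x1, y1, x2, y2) actor position
    C, T, H, W = category_feats.shape
    frames = category_feats.permute(1, 0, 2, 3)              # (T, C, H, W), one "image" per frame
    rois = torch.cat([torch.arange(T).float().unsqueeze(1),  # batch index per frame
                      box.unsqueeze(0).repeat(T, 1)], dim=1) # (T, 5) rois: index + box
    fixed = roi_align(frames, rois, output_size=(7, 7))      # fixed-scale feature map of the actor's area, (T, C, 7, 7)
    fixed = fixed.permute(1, 0, 2, 3)                        # (C, T, 7, 7)
    spatial = fixed.mean(dim=1)                              # global average pooling over time -> spatial action features (C, 7, 7)
    temporal = fixed.mean(dim=(2, 3))                        # global average pooling over space -> temporal action features (C, T)
    return spatial, temporal

spatial, temporal = spatial_temporal_features(torch.randn(64, 8, 16, 16),
                                              torch.tensor([2., 2., 10., 10.]))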
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Xi et al (CN 113158723) in view of Pan et al (Actor-Context-Action Relation Network for Spatio-Temporal Action Localization).
Regarding Claim 7, Xi et al teach the method according to claim 5 (as described above), wherein a plurality of actor locations are determined by the positioning module (the positioning process includes detection of action features and focusing on motion information mining between frames; ¶ [0068]-[0069]), and the action category is determined by the classification module based on each actor location of the plurality of actor locations (the spatial-temporal data is analyzed to determine the type of action occurring; ¶ [0068]-[0069]), including extracting spatial action features and temporal action features corresponding to each actor's position from the action category features (channel information of the data features is obtained based on the spatio-temporal information to predict the feature map of the corresponding channels; ¶ [0069]-[0070]).
Xi et al does not explicitly teach inputting the spatial embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the spatial action features corresponding to the multiple actor positions and the output of the self-attention module to update spatial action characteristics corresponding to each of the plurality of actor positions; and inputting the temporal embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the temporal action features corresponding to the multiple actor positions and the output of the self-attention module, to update temporal action features corresponding to each of the plurality of actor locations.
Pan et al is analogous art pertinent to the technological problem addressed in this application and teaches inputting the spatial embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the spatial action features corresponding to the multiple actor positions and the output of the self-attention module to update spatial action characteristics corresponding to each of the plurality of actor positions (an attention mechanism is performed between current and long term actor context relations with query features based on short-term features and key and value features based on long term features, thereby accounting for the spatial positioning; Fig 3 and 3.3 Actor-Context Feature Bank ¶ 2-3); and
inputting the temporal embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the temporal action features corresponding to the multiple actor positions and the output of the self-attention module, to update temporal action features corresponding to each of the plurality of actor locations (the attention mechanism is performed between current and long term actor context relations with query features based on short-term features and key and value features based on long term features, thereby accounting for the temporal context for identifying more accurate action localization; Fig 3 and 3.3 Actor-Context Feature Bank ¶ 2-3).
It would have been obvious to one skilled in the art before the effective filing date of the current application to combine the teachings of Xi et al with Pan et al including inputting the spatial embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the spatial action features corresponding to the multiple actor positions and the output of the self-attention module to update spatial action characteristics corresponding to each of the plurality of actor positions; and inputting the temporal embedding vectors corresponding to the multiple actor positions into the self-attention module, and performing a convolution operation on the temporal action features corresponding to the multiple actor positions and the output of the self-attention module, to update temporal action features corresponding to each of the plurality of actor locations. By performing self-attention mechanisms for the actions of an actor, long-range determinations may be made with high accuracy to correctly classify the action of an actor and spatio-temporal relationships in video, as recognized by Pan et al (2. Related Work. Relational Reasoning for Video Understanding).
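For illustration only, self-attention over per-actor embedding vectors followed by a convolution that updates each actor's features may be sketched as follows; nn.MultiheadAttention and the concatenation plus 1x1 convolution are stand-ins chosen for this example and do not reproduce Pan et al's actor-context-actor relation network:

import torch
import torch.nn as nn

N, C, S = 5, 64, 7                                           # number of actors, channels, spatial size (assumed)
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
update_conv = nn.Conv2d(2 * C, C, kernel_size=1)             # convolution combining features with the attention output

spatial_feats = torch.randn(N, C, S, S)                      # spatial action features, one per actor position
embeddings = spatial_feats.mean(dim=(2, 3)).unsqueeze(0)     # (1, N, C) spatial embedding vectors of the actors
attended, _ = attn(embeddings, embeddings, embeddings)       # self-attention among the multiple actor positions
attended = attended.squeeze(0)[:, :, None, None].expand(-1, -1, S, S)
updated = update_conv(torch.cat([spatial_feats, attended], dim=1))  # updated spatial action features, (N, C, S, S)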
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Xi et al (CN 113158723) in view of Escorcia et al (US 2019/0108400).
Regarding Claim 8, Xi et al teach the method according to claim 1 (as described above).
Xi et al does not teach wherein determining the actor location includes determining coordinates of an actor's bounding box and a confidence indicating that the actor's bounding box contains the actor; and, the method further includes: selecting an actor location with a confidence level higher than a predetermined threshold and an action category corresponding to the actor location.
Escorcia et al is analogous art pertinent to the technological problem addressed in this application and teaches wherein determining the actor location includes determining coordinates of an actor's bounding box and a confidence indicating that the actor's bounding box contains the actor (a confidence score is determined by the object detector for a bounding box representing that the features obtained are of the actors while in action across frames; Fig 9A, 9B and ¶ [0083]-[0084]); and, the method further includes: selecting an actor location with a confidence level higher than a predetermined threshold and an action category corresponding to the actor location (a constraint is used for limiting the action proposal selection based on a minimum cost and/or a maximum affinity based on the confidence value; ¶ [0084]).
It would have been obvious to one skilled in the art before the effective filing date of the current application to combine the teachings of Xi et al with Escorcia et al including wherein determining the actor location includes determining coordinates of an actor's bounding box and a confidence indicating that the actor's bounding box contains the actor; and, the method further includes: selecting an actor location with a confidence level higher than a predetermined threshold and an action category corresponding to the actor location. By using a confidence value to determine the accuracy of the bounding box, the actor is tracked over time with a high level of success, as recognized by Escorcia et al (¶ [0086]).
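For illustration only, selecting actor locations whose confidence exceeds a predetermined threshold may be sketched as follows; the 0.5 threshold and the tensor layout are assumed values for this example, not taken from Escorcia et al:

import torch

def select_confident(boxes, scores, categories, threshold=0.5):
    # boxes: (N, 4) actor bounding box coordinates; scores: (N,) confidence that each box contains the actor
    keep = scores > threshold                                # keep only locations above the predetermined threshold
    return boxes[keep], categories[keep]                     # selected actor locations and their action categories

boxes = torch.tensor([[0., 0., 10., 10.], [5., 5., 20., 20.]])
scores = torch.tensor([0.9, 0.3])
categories = torch.tensor([2, 7])
kept_boxes, kept_categories = select_confident(boxes, scores, categories)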
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Xi et al (CN 113158723) in view of Escorcia et al (US 2019/0108400) and Adhi Pramono et al (Spatial-Temporal Action Localization with Hierarchical Self-Attention).
Regarding Claim 9, Xi et al in view of Escorcia et al teach the method according to claim 8 (as described above).
Xi et al in view of Escorcia et al does not teach wherein the end-to-end framework is trained based on the following objective function:
[objective function reproduced in the claim as an equation image: media_image1.png]
wherein [media_image1.png] represents the actor bounding box localization loss, [media_image1.png] represents the action categorization loss, [media_image2.png] is the cross entropy loss, [media_image3.png] and [media_image1.png] are the respective bounding box losses, [media_image4.png] is the binary cross entropy loss, and [media_image5.png], [media_image1.png], and [media_image1.png] are constant scalars used to balance the loss contribution.
Adhi Pramono et al is analogous art pertinent to the technological problem addressed in this application and teaches wherein the end-to-end framework is trained based on the following objective function:
[objective function reproduced in the claim as an equation image: media_image1.png]
wherein [media_image1.png] represents the actor bounding box localization loss, [media_image1.png] represents the action categorization loss, [media_image2.png] is the cross entropy loss, [media_image3.png] and [media_image1.png] are the respective bounding box losses, [media_image4.png] is the binary cross entropy loss, and [media_image5.png], [media_image1.png], and [media_image1.png] are constant scalars used to balance the loss contribution (training is performed that incorporates a loss function of cross entropy with focal loss, with the cross entropy loss accounting for the multi-class action classification of all bounding boxes and the focal loss accounting for the classification; IV. Experimental Results, B. Settings).
It would have been obvious to one skilled in the art before the effective filing date of the current application to combine the teachings of Xi et al in view of Escorcia et al with Adhi Pramono et al including wherein the end-to-end framework is trained based on the following objective function:
[objective function reproduced in the claim as an equation image: media_image1.png]
wherein [media_image1.png] represents the actor bounding box localization loss, [media_image1.png] represents the action categorization loss, [media_image2.png] is the cross entropy loss, [media_image3.png] and [media_image1.png] are the respective bounding box losses, [media_image4.png] is the binary cross entropy loss, and [media_image5.png], [media_image1.png], and [media_image1.png] are constant scalars used to balance the loss contribution. By accounting for both the localization of the bounding box and the classification, action localization and classification are both considered during end-to-end training, thereby improving the model for higher accuracy in spatial-temporal action localization in video, as recognized by Adhi Pramono et al (Introduction ¶ 3).
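For illustration only, a composite training objective of the general kind described above may be sketched as follows; because the claimed equation appears in the record only as images, its exact form is not reproduced here, and the choice of L1 and generalized IoU as the two bounding box losses, the use of cross entropy and binary cross entropy, and the weighting scalars are all assumptions for this example rather than the applicant's or Adhi Pramono et al's formulation:

import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def total_loss(pred_boxes, gt_boxes, action_logits, gt_actions, conf_logits, gt_conf,
               lambda_l1=5.0, lambda_giou=2.0, lambda_conf=1.0):
    # Actor bounding box localization loss: weighted sum of two bounding box losses (L1 and GIoU assumed).
    loc = lambda_l1 * F.l1_loss(pred_boxes, gt_boxes) \
        + lambda_giou * generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    # Action categorization loss: cross entropy over the action classes.
    cls = F.cross_entropy(action_logits, gt_actions)
    # Box confidence loss: binary cross entropy, weighted by a balancing scalar.
    conf = lambda_conf * F.binary_cross_entropy_with_logits(conf_logits, gt_conf)
    return loc + cls + conf

loss = total_loss(torch.tensor([[0., 0., 10., 10.]]), torch.tensor([[1., 1., 9., 9.]]),
                  torch.randn(1, 80), torch.tensor([3]), torch.randn(1), torch.rand(1))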
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Chen et al (Watch Only Once: An End-to-End Video Action Detection Framework) teaches an end-to-end pipeline for video action detection based on a spatial-temporal fusion module and was disclosed within one year by the same co-inventors as the current application.
Wang et al (US 2013/0132316) teach a system and method for space-time modeling that identifies and tracks both sparse and global temporal trajectory probability and is applied to continuous action recognition.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KATHLEEN M BROUGHTON whose telephone number is (571)270-7380. The examiner can normally be reached Monday-Friday 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John Villecco can be reached at (571) 272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KATHLEEN M BROUGHTON/Primary Examiner, Art Unit 2661