DETAILED ACTION
Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2. This is in response to the applicant’s response filed on 02/18/2026. In the applicant’s response, claims 1-2, 10, and 20 were amended. Accordingly, claims 1-10 and 18-25 are pending and being examined. Claims 1 and 10 are in independent form.
3. The rejections of the claims under 35 U.S.C. 101 set forth in the previous Office action have been withdrawn in view of applicant’s amendments and remarks.
Claim Rejections - 35 USC § 103
4. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
5. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
6. Claims 1-10 and 18-25 are rejected under 35 U.S.C. 103 as being unpatentable over Sun (CN113743362, hereinafter “Sun”) in view of Song et al. (“Temporal–Spatial Mapping for Action Recognition”, 2020, hereinafter “Song”). A machine-translated English version (i.e., CN113743362-Eng) of document CN113743362 was provided by the examiner in the previous Office action.
Regarding claim 1, Sun discloses an action recognition method applied to an electronic device (see Abstract: “a method for real-time correction training action based on deep learning and related device thereof...”; see figs.1-5), comprising: obtaining, by the electronic device, a plurality of image frames of a video to be recognized (see pg.7, line 6, S1: “obtaining the action video in real time, and performing the limb key point detection operation to the video frame in the action video, obtaining the key point video frame.”);
determining a probability distribution of the video to be recognized being similar to a plurality of action categories according to the plurality of image frames and a pre-trained self-attention model (see pg.7, lines 33-41: “inputting the key point video frame according to the time sequence to the pre-trained action identification model, obtaining the action type of the video frame [...] according to the time sequence in the action video; the action identification model determines the probability of each action (such as lifting arm uplifting, lifting leg and so on) for the current video frame; the action corresponding to the maximum probability is used as the action type of the video frame.”);
wherein the self-attention model is used to calculate similarity between an image feature sequence and the plurality of action categories through a self-attention mechanism;
the image feature sequence is obtained in time dimension (see fig.4 and pg.10, lines 5-29: “S31: generating standard characteristic based on all standard action pictures corresponding to the video frame of the current video segment; generating action characteristics based on all video frames of the current video segment [...] generating the action comparison result of the current video segment based on the first target characteristic and the second target characteristic [...] by calculating the cosine similarity between the first target characteristic (GwX1) and the second target characteristic (GwX2), obtaining action comparison result.”); and
the probability distribution includes a probability that the video to be recognized is similar to each action category in the action categories (see pg.7, lines 33-41: “[...] according to the time sequence in the action video; the action identification model determines the probability of each action (such as lifting arm uplifting, lifting leg and so on) for the current video frame; the action corresponding to the maximum probability is used as the action type of the video frame.”);
determining a target action category corresponding to the video to be recognized based on the probability distribution of the video to be recognized being similar to the plurality of action categories; wherein a probability that the video to be recognized is similar to the target action category is greater than or equal to a preset threshold (see pg.7, lines 33-41: “[...] according to the time sequence in the action video; the action identification model determines the probability of each action (such as lifting arm uplifting, lifting leg and so on) for the current video frame; the action corresponding to the maximum probability is used as the action type of the video frame.”).
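As an illustrative note only (not relied upon as evidence of record), the selection step described above, i.e., forming a probability distribution over action categories and selecting the action with the maximum probability subject to the preset threshold recited in amended claim 1, can be sketched as follows. All function and variable names here are hypothetical and do not appear in Sun or in the application:

```python
import numpy as np

def classify_action(scores, categories, threshold=0.5):
    """Pick the action whose probability is maximal, subject to a preset threshold.

    scores: raw per-category scores for the video; categories: action names.
    Hypothetical illustration only; not Sun's or the applicant's actual code.
    """
    exp = np.exp(scores - np.max(scores))   # numerically stable softmax
    probs = exp / exp.sum()                 # probability distribution over categories
    best = int(np.argmax(probs))            # action with maximum probability
    if probs[best] >= threshold:            # preset-threshold check (amended claim 1)
        return categories[best], float(probs[best])
    return None, float(probs[best])         # no category meets the threshold
```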
As explained above, the difference between the claimed invention and the method of Sun is that Sun does not explicitly disclose: “the image feature sequence is obtained in spatial dimension based on the plurality of image frames”, “wherein the self-attention model includes a self-attention coding layer and a classification layer; the image feature sequence is obtained by combining an image feature of each image frame with a corresponding position coding feature of a plurality of position coding features by the electronic device; the corresponding position coding feature identifies a relative position of a corresponding image feature in an input sequence input to the self-attention coding layer; the plurality of position coding features are pre-generated by the electronic device according to image features of the plurality of image frames”, as recited in claim 1.
However, in the same field of endeavor, Song teaches: “a head ConvNet with temporal attention to further transform the temporal-spatial VideoMap to a more compact and effective video-level feature representation for classification, which can better exploit the temporal-spatial dynamics” and “a deep architecture for action recognition which achieves significant performance improvement on the HMDB51 dataset. The source code and trained model will be released to facilitate the research in action recognition.”. See Fig.4; see Sec. 1, paragraph 4, in the left column, on page 749. In addition, Song teaches:
wherein the self-attention model includes a self-attention coding layer and a classification layer (wherein the pre-trained network includes the Backbone ConvNet and the Head ConvNet; see the “Backbone ConvNet” and “Head ConvNet” in fig.1); the image feature sequence is obtained by combining an image feature of each image frame with a corresponding position coding feature of a plurality of position coding features by the electronic device; the corresponding position coding feature identifies a relative position of a corresponding image feature in an input sequence input to the self-attention coding layer; the plurality of position coding features are pre-generated by the electronic device according to image features of the plurality of image frames (see Sec. I, par.4: “[t]o deploy this TSM operation for action recognition, we first train a backbone 2D ConvNet model to extract convolutional features for each frame of a video sequence, then the TSM operation is performed on the features to generate the temporal-spatial VideoMap, which naturally encodes the temporal-spatial information in 2D feature map.” It should be noticed that the convolutional features extracted from a frame of the input temporal-spatial video sequence by the Backbone ConvNet identify a relative position of the corresponding image feature in the input frame input to the Backbone ConvNet, and therefore the plurality of position coding features is pre-generated according to image features of the plurality of image frames.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Song into the teachings of Sun and further obtain the image features from the spatial (RGB) domain of the input video by means of the TSM action recognition network taught by Song in the human action recognition method taught by Sun. Suggestion or motivation for doing so would have been to obtain “a more compact and effective video-level feature representation for classification, which can better exploit the temporal-spatial dynamics” as taught by Song, cf. Sec. 1, paragraph 4, in the left column, on page 749. Therefore, the claim is unpatentable over Sun in view of Song.
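As an illustrative note only (not of record), a position encoding of the general kind recited in claim 1, combining each image feature with a position coding feature that identifies its relative position in the sequence fed to a self-attention coding layer, can be sketched with the standard transformer-style sinusoidal scheme below. This is a generic sketch under that assumption, not the applicant’s, Sun’s, or Song’s actual implementation:

```python
import numpy as np

def add_position_encoding(features):
    """Combine each frame's feature vector with a sinusoidal position code.

    features: a (T, d) array, one d-dimensional feature per image frame.
    The code for row t depends only on t, so it identifies the relative
    position of that feature in the input sequence.
    Generic transformer-style sketch; hypothetical, not from the record.
    """
    T, d = features.shape
    pos = np.arange(T)[:, None]                          # frame index 0..T-1
    i = np.arange(d)[None, :]                            # feature dimension index
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)  # sinusoidal frequency schedule
    enc = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return features + enc                                # merge by element-wise addition
```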
Regarding claims 2 and 20, the combination of Sun and Song discloses: wherein the self-attention model includes a self-attention coding layer and a classification layer, the self-attention coding layer is used to calculate a similarity feature of the image feature sequence relative to the plurality of action categories, and the classification layer is used to calculate the probability distribution corresponding to the similarity feature; determining the probability distribution of the video to be recognized being similar to the plurality of action categories according to the plurality of image frames and the pre-trained self-attention model, includes: determining a target similarity feature of the video to be recognized relative to the plurality of action categories according to the plurality of image frames and the self-attention coding layer; the target similarity feature being used to characterize similarity between the video to be recognized and each action category; and inputting the target similarity feature to the classification layer to obtain the probability distribution of the video to be recognized being similar to the plurality of action categories (Sun, see pg.7 line 43—pg.8, line 31: “the action identification model comprises a bidirectional long term memory network, a self-attention layer and a normalization layer; the step of inputting the key point video frame according to the time sequence to the pre-training action identification model; the step of obtaining the action type of the video frame [based on the probabilities]”).
Regarding claim 3, the combination of Sun and Song discloses the action recognition method according to claim 2, wherein before determining the target similarity feature of the video to be recognized relative to the plurality of action categories according to the plurality of image frames and the self-attention coding layer (Sun, see pg.10, lines 20-29: “generating the action comparison result of the current video segment based on the first target characteristic and the second target characteristic [...] by calculating the cosine similarity between the first target characteristic (GwX1) and the second target characteristic (GwX2), obtaining action comparison result.”), the method further includes: segmenting each image frame in the plurality of image frames to obtain a plurality of sampling sub-images (Sun, see pg.9, lines 24-28: “[the action identification model] intercepting the action video, obtaining a plurality of video segments, and based on the action type from the database to adjust the corresponding standard action picture, the respectively frame of each video segment and corresponding to the standard action picture for action comparison, obtaining a plurality of action comparison result; generating a correction suggestion based on the action comparison result.”); and determining the target similarity feature of the video to be recognized relative to the plurality of action categories according to the plurality of image frames and the self-attention coding layer, includes: determining at least one sequence feature of the video to be recognized according to the plurality of sampling sub-images and the self-attention coding layer, and determining the target similarity feature according to the at least one sequence feature of the video to be recognized; wherein the at least one sequence feature includes a time sequence feature, or both the time sequence feature and a space sequence feature; the time sequence feature is used to characterize similarity between 
the video to be recognized and the plurality of action categories in the time dimension (Sun, see pg.7, lines 33-41: “[...] according to the time sequence in the action video; the action identification model determines the probability of each action (such as lifting arm uplifting, lifting leg and so on) for the current video frame; the action corresponding to the maximum probability is used as the action type of the video frame.”), and the space sequence feature is used to characterize similarity between the video to be recognized and the plurality of action categories in the spatial dimension (Song, see fig.4: “The overall framework with our Temporal-Spatial Mapping operation followed by a head ConvNet for action recognition. Two-stream ConvNets extract features on each frame for the spatial stream (RGB) and temporal stream (Optical flow), respectively. The vectorized feature vectors of the sequential frames form a VideoMap for temporal-spatial representation. A head ConvNet with temporal attention makes action classification based on the VideoMap. Finally the class scores of the VideoMaps from two streams are fused to produce the video-level prediction.”).
Regarding claim 4, the combination of Sun and Song discloses the action recognition method according to claim 3, wherein determining the time sequence feature of the video to be recognized, includes: determining at least one time sampling sequence from the plurality of sampling sub-images; each time sampling sequence including sampling sub-images of all image frames located in same positions; determining a time sequence sub-feature of each time sampling sequence according to each time sampling sequence and the self-attention coding layer; the time sequence sub-feature being used to characterize similarity between each time sampling sequence and the plurality of action categories; and determining the time sequence feature of the video to be recognized according to a time sequence sub-feature of the at least one time sampling sequence (Sun, see pg.9, lines 24-28: “[the action identification model] intercepting the action video, obtaining a plurality of video [time] segments, and based on the action type from the database to adjust the corresponding standard action picture, the respectively frame of each video segment and corresponding to the standard action picture for action comparison, obtaining a plurality of action comparison result; generating a correction suggestion based on the action comparison result.”).
Regarding claims 5 and 21, the combination of Sun and Song discloses: wherein determining the time sequence sub-feature of each time sampling sequence according to each time sampling sequence and the self-attention coding layer, includes: determining a plurality of first image input features and a category input feature; wherein each first image input feature is obtained by performing position encoding merging on image features of sampling sub-images included in a first time sampling sequence, and the first time sampling sequence is any of the at least one time sampling sequence; the category input feature is obtained by performing position encoding merging on a category feature, and the category feature is used to characterize the plurality of action categories; and inputting the plurality of first image input features and the category input feature into the self-attention coding layer, and determining an output feature output by the self-attention coding layer corresponding to the category input feature as a time sequence sub-feature of the first time sampling sequence (Song, see fig.2, “Temporal-Spatial Mapping [TSM] operation, which transforms a sequence of feature maps into a compact VideoMap.” It should be noticed that S1, S2, ... Sk, ...ST are T frames in temporal dimension sampled from the input video; the feature maps with high dimensions, e.g., Sk, are further sub-sampled into C=3 images (i.e., RGB) in spatial dimension; and a spatial vectorization function V is used to encode the feature maps to a low dimension vector of fixed length, i.e., fk = V(Sk). Then, the feature vectors of the frames, each as a row with the row identity corresponding to the time order of the frames, are merged into a two-dimensional temporal-spatial map, i.e., the VideoMap. The “two-stream backbone ConvNets” shown in fig.4 include “self-attention coding layers” for coding feature maps.).
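As an illustrative note only (not of record), Song’s TSM operation as quoted above, vectorizing each frame’s feature map with a function V and stacking the resulting vectors as rows in temporal order to form the VideoMap, can be sketched as follows. Simple flattening is used here as a stand-in assumption for the paper’s vectorization function V:

```python
import numpy as np

def build_videomap(frame_feature_maps):
    """Sketch of a Temporal-Spatial Mapping (TSM)-style operation.

    frame_feature_maps: list of T per-frame feature maps S_1 .. S_T.
    Each map S_k is vectorized to f_k = V(S_k) (flattening as a stand-in
    for the paper's V), and rows are stacked in temporal order so the
    resulting 2D "VideoMap" jointly encodes temporal-spatial information.
    Hypothetical illustration; not Song's released code.
    """
    rows = [s.reshape(-1) for s in frame_feature_maps]  # f_k = V(S_k), one row per frame
    return np.stack(rows, axis=0)                       # T x D temporal-spatial VideoMap
```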
Regarding claims 6 and 22, the combination of Sun and Song discloses: wherein determining the space sequence feature of the video to be recognized, includes: determining at least one space sampling sequence from the plurality of sampling sub-images; each space sampling sequence including sampling sub-images of an image frame; determining a space sequence sub-feature of each space sampling sequence according to each space sampling sequence and the self-attention coding layer; the space sequence sub-feature being used to characterize similarity between each space sampling sequence and the plurality of action categories; and determining the space sequence feature of the video to be recognized according to a space sequence sub-feature of the at least one space sampling sequence (Song, ibid.).
Regarding claim 7, the combination of Sun and Song discloses the action recognition method according to claim 6, wherein determining the at least one space sampling sequence from the plurality of sampling sub-images, includes: for a first image frame, determining a preset number of target sampling sub-images located in preset positions from sampling sub-images included in the first image frame, and determining the target sampling sub-images as a space sampling sequence corresponding to the first image frame; the first image frame being any of the plurality of image frames (Song, see fig.4, wherein the left column shows the target frame images, ...the third column shows the T-S maps sampled in the temporal dimension and sub-sampled in the spatial dimension.).
Regarding claim 8, the combination of Sun and Song discloses the action recognition method according to claim 6, wherein determining the space sequence sub-feature of each space sampling sequence according to each space sampling sequence and the self-attention coding layer, includes: determining a plurality of second image input features and a category input feature; wherein each second image input feature is obtained by performing position encoding merging on image features of sampling sub-images included in a first space sampling sequence, and the first space sampling sequence is any of the at least one space sampling sequence; the category input feature is obtained by performing position encoding merging on a category feature, and the category feature is used to characterize the plurality of action categories; and inputting the plurality of second image input features and the category input feature into the self-attention coding layer, and determining an output feature output by the self-attention coding layer corresponding to the category input feature as a space sequence sub-feature of the first space sampling sequence (Song, see fig.2, “Temporal-Spatial Mapping [TSM] operation, which transforms a sequence of feature maps into a compact VideoMap.” It should be noticed that S1, S2, ... Sk, ...ST are T frames in temporal dimension sampled from the input video; the feature maps with high dimensions, e.g., Sk, are further sub-sampled into 3 images (i.e., RGB) in spatial dimension; and a spatial vectorization function V is used to encode the feature maps to a low dimension vector of fixed length, i.e., fk = V(Sk).).
Regarding claims 9 and 23, the combination of Sun and Song discloses: wherein the plurality of image frames are obtained based on image preprocessing, and the image preprocessing includes at least one operation of cropping, image enhancement or scaling (e.g., Song, see VI-A Datasets, wherein the datasets are preprocessed.).
Regarding claim 10, the combination of Sun and Song discloses a model training method, comprising: obtaining a plurality of sample image frames of a sample video, and a sample action category to which the sample video belongs; and performing self-attention training according to the plurality of sample image frames and the sample action category to obtain a trained self-attention model; wherein the self-attention model is used to calculate similarity between a sample image feature sequence and a plurality of action categories; and the sample image feature sequence is obtained in time dimension or spatial dimension based on the plurality of sample image frames (Sun, see pg.14, line 4—pg.15, line 43, and fig.6, which discloses a real-time correction training action of the deep learning model; Song, see Sec. V, para.3: “we train the networks in two stages. In the first stage, we train the backbone ConvNets. Then we train the head ConvNet for VideoMap classification”; see Sec. I, para.4: “we first train a backbone 2D ConvNet model to extract convolutional features for each frame of a video sequence, then the TSM operation is performed on the features to generate the temporal-spatial VideoMap, which naturally encodes the temporal-spatial information in 2D feature map [...]”).
Regarding claims 18 and 19, each is an inherent variation of claim 1; thus, each is interpreted and rejected for the reasons set forth in the rejection of claim 1.
Regarding claims 24 and 25, each is an inherent variation of claim 10; thus, each is interpreted and rejected for the reasons set forth in the rejection of claim 10.
Response to Arguments
7. Applicant’s arguments, filed on 02/18/2026, have been fully considered but they are not persuasive.
Specifically, on pages 13-14 of applicant’s response, applicant submits:
[Q1] Sun discloses that the key point video frames, which are obtained by the limb recognition model from video frames captured by the device, are input into the pre-trained action identification model, rather than the video frames captured by the device are input into the pre-trained action identification model.
[Q2] Sun discloses inputting each video frame orderly into the action identification model according to the time sequence, but is silent about an independent position coding vector pre-generated by the device used to represent the position of a video frame in the sequence.
[Q3] Third, amended claim 1 discloses that "wherein a probability that the video to be recognized is similar to the target action category is greater than or equal to a preset threshold".
(Emphases added by the applicant.)
The examiner respectfully disagrees with the arguments.
First, regarding Q1, Sun’s device for real-time correction of training actions based on deep learning captures “an action video in real time” as an input of the device and extracts the key points/features from the input video (see S1 in fig.2 and pg.7, lines 6-31), and then “determines the probability of each action (such as lifting arm uplifting, lifting leg and so on) for the input video frame” (see S2 in fig.2 and pg.7, lines 36-41). Therefore, Q1 is unpersuasive.
Regarding Q2, Song discloses the temporal-spatial mapping (TSM) operation neural network, which includes the Backbone 2D ConvNet and the Head ConvNet; wherein the Backbone 2D ConvNet is trained to extract convolutional features for each frame of the input video sequence, and the TSM operation is then performed on the features to generate the temporal-spatial VideoMap, “which naturally encodes the temporal-spatial information in 2D feature map”. See Sec. I, paragraph 4. Therefore, Q2 is unpersuasive.
Regarding Q3, as noted by the applicant, Sun clearly discloses: “the action identification model determines the probability of each action (such as lifting arm uplifting, lifting leg and so on) for the current video frame; the action corresponding to the maximum probability is used as the action type of the video frame”. It is apparent that the probability of the action corresponding to the maximum probability is greater than the other probabilities of the other actions. Besides, using a similarity threshold to determine a similarity pattern is well known to one skilled in the art and has no patentable weight. Q3 is therefore unpersuasive.
8. In view of the above reasons, the examiner maintains the rejections.
Conclusion
9. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
10. Any inquiry concerning this communication or earlier communications from the examiner should be directed to RUIPING LI whose telephone number is (571)270-3376. The examiner can normally be reached 8:30am--5:30pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, HENOK SHIFERAW can be reached on (571)272-4637. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov; https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center, and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RUIPING LI/Primary Examiner, Ph.D., Art Unit 2676