DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
This non-final Office action is in response to the amendment filed 1/2/2026. Claims 1, 3-8, 10-15, and 17-20 are pending in this application and have been considered below. Claims 2, 9, and 16 have been canceled by the applicant.
Applicant’s arguments with respect to claims 1, 3-8, 10-15, and 17-20 regarding the lack of temporal displacements in Liu have been considered but are moot in view of the new ground(s) of rejection based on Yang, which explicitly teaches predicting temporal displacements from every frame (see below).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 3-8, 10-15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Vahdani et al. (Deep Learning-based Action Detection in Untrimmed Videos: A Survey – hereinafter “Vahdani”) in view of Yang et al. (Revisiting Anchor Mechanisms for Temporal Action Localization – hereinafter “Yang”), and further in view of De Souza et al. (US 2018/0053057 A1 – hereinafter “De Souza”).
Claims 1, 8, and 15.
Yang discloses a non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that when executed by a computer (Yang p. 6, § IV. B.: “The proposed A2Net is implemented based on PyTorch 1.4 [40]. We perform experiments with one NVIDIA TITAN Xp GPU, Intel Xeon E5-2683 v3 CPU and 128G memory”), cause the computer to …
Vahdani discloses a method comprising:
retrieving a frame sequence depicting an event (Abstract discloses “activity detection in untrimmed videos”; Fig. 1), the frame sequence having a plurality of frames (p. 2, left column, discloses “predict action labels at every frame of the video”), each frame corresponding to a temporal location of the frame sequence (p. 3, top right column, discloses “The start time, end time, and label”);
extracting a feature from each frame of the plurality of frames (p. 2, left column, ¶3 discloses “When targeting fine-grained actions, temporal action detection (segmentation) is similar to semantic segmentation as both aim to classify every single instance, i.e., frames in temporal domain”; p. 4, left column, §2.2 discloses equation 4 a sequence S with l frames where, “RGB frame xtn is fed to spatial network ResNet [25], extracting feature vector fS,n”; );
generating an input matrix by (p. 4, left column, §2.2 discloses “spatial and temporal features, fS,n and fT,n, are concatenated to represent the visual feature fn for snippet sn”; where fn is a matrix with dimension d×T, and an output matrix A (T-CAM) contains activation scores (confidences) for each class at each temporal position);
applying an event detection model to the input matrix to combine features across all temporal locations and generate an output matrix (p. 10, left column, § 2.4.1.1 discloses “T-CAM is a matrix denoted by A which represent the possibility of activities at each temporal position. Matrix A has nc rows which is the total number of action classes, and T columns which is the number of temporal positions in the video.”),
and every class, confidences (p. 10, right column, § 2.4.1.2 discloses “each video should be represented using a single confidence score per category. The confidence score for each category is computed as the average of top activation k scores over the temporal dimension for that category”; p. 11, top right column, discloses “nc is the number of action classes” and p. 3, left column, §2.1, definition 1 discloses multiple “categories of action instances”) and temporal displacements from the output matrix (p. 10, left column, § 2.4.1.1 discloses “T-CAM is a matrix denoted by A which represent the possibility of activities at each temporal position … T columns which is the number of temporal positions in the video.”); and
determining, based on the confidences and temporal displacements, (p. 10, right column, § 2.4.1.2 discloses confidences and timing; pp. 17-18, §4.1.3 discloses action spotting in sports and “Human activity localization in sports videos is studied in [192], [193], [194], [195], salient game actions are identified in [196], [197], automatic game highlights identification and summarization are performed in [198], [199], [200], [201], [202]. Moreover, action spotting, which is the task of temporal localization of human-induced events, has been popular in soccer game broadcasts [3], [203] and some methods aimed to automatically detect goals, penalties, corner kicks, and card events [204].”).
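As context for the confidence limitation mapped above, the top-k averaging that Vahdani describes in § 2.4.1.2 (a per-category video-level confidence computed as the mean of the k largest T-CAM activations over the temporal dimension) can be sketched as follows. This is an illustrative sketch only, not code from the cited reference; the function name and `k` default are assumptions.

```python
import numpy as np

def topk_confidence(tcam: np.ndarray, k: int = 8) -> np.ndarray:
    """Per-category video-level confidence from a T-CAM.

    tcam: array of shape (nc, T) -- activation scores for nc action
    classes at each of T temporal positions (Vahdani, def. 15).
    Returns an (nc,) vector: the average of the top-k activations over
    the temporal dimension for each category (Vahdani, sec. 2.4.1.2).
    """
    k = min(k, tcam.shape[1])
    # Sort each row's activations descending and average the k largest.
    top = np.sort(tcam, axis=1)[:, ::-1][:, :k]
    return top.mean(axis=1)

# Toy example: nc = 3 classes, T = 10 temporal positions.
A = np.random.rand(3, 10)
conf = topk_confidence(A, k=3)   # one confidence per category
```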
Vahdani discloses all of the subject matter as described above except for specifically teaching “the output matrix comprising a plurality of classes for each frame,” “determining, for every frame,” and “the event depicted by the frame sequence.” However, Yang, in the same field of endeavor, teaches the output matrix comprising a plurality of classes for each frame (Yang teaches that the model outputs a matrix/map containing the class scores and displacement values for the temporal sequence. Yang’s matrix structure: “The anchor-free module uses two individual branches for predicting classification scores … and regression distances to the starting boundary and ending boundary (rs; re)” (p. 5, § III. D.). Table II lists the “Anchor-Free Localization Module” output size as “B x (C+2) x t”, where t is the temporal length (frames), C is the plurality of classes, and +2 corresponds to the two temporal displacement values (start/end) (p. 4, Table II).), determining, for every frame (Yang p. 2, right column & Fig. 2: “The anchor-free module simultaneously predicts the classification score and regresses the distances to the starting and ending boundaries. The anchor-based module first chooses the closest-matched anchor then refines the action boundaries via regression. These two modules share the same backbone and independently make predictions at each temporal location from every pyramid level” – confidences and displacements at every frame/location; Yang p. 4, § III. D.: “The proposed anchor-free module regresses the distances from the center of an action instance to its boundaries.”), and the event depicted by the frame sequence (p. 6, § III. F.: “During inference … the predicted action boundaries (saf, eaf) can be obtained via inverting equation 3 [using the temporal displacements]. The maximum value of classification score Saf is regarded as the confidence for the localization results … Finally, action localization results for the anchor-free and anchor-based modules are merged together … and obtain the final localization results” [the event depicted].).
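The Yang output mapping cited above (a (C+2) × t map whose first C rows are class scores and whose last two rows are regressed start/end distances) can be illustrated with a short decoding sketch. This is a simplified stand-in, not Yang's actual code: the boundary inversion used here (s = t − rs, e = t + re) approximates the inversion of Yang's equation 3, and the function name is an assumption.

```python
import numpy as np

def decode_anchor_free(out: np.ndarray, num_classes: int):
    """Decode a (C+2, T) anchor-free output map into candidate actions.

    Rows 0..C-1 hold per-class scores at each temporal location; the
    last two rows hold the regressed distances (rs, re) to the start
    and end boundaries (cf. Yang, Table II: B x (C+2) x t).
    """
    C = num_classes
    scores, rs, re = out[:C], out[C], out[C + 1]
    results = []
    for t in range(out.shape[1]):
        cls = int(np.argmax(scores[:, t]))
        # The maximum classification score serves as the confidence
        # for the localization result (cf. Yang, sec. III. F.).
        conf = float(scores[cls, t])
        # Simplified boundary inversion: displace from location t.
        results.append((t - rs[t], t + re[t], cls, conf))
    return results
```

In a full pipeline the per-location candidates would then be merged (e.g., by non-maximum suppression) to obtain final localization results, as Yang's § III. F. describes.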
Vahdani discloses all of the subject matter as described above except for specifically teaching “stacking.” However, De Souza, in the same field of endeavor, teaches “stacking” (¶26: “Feature vectors extracted from the transformation(s) are aggregated (“stacked”), a process referred to herein as Data Augmentation by Feature Stacking (DAFS). The stacked descriptors form a feature matrix”).
It would have been obvious to a person having ordinary skill in the art (POSITA), before the effective filing date of the claimed invention, to combine the teachings of Vahdani, Yang, and De Souza to arrive at the claimed invention. A POSITA would have been motivated to improve the temporal localization capability of Vahdani’s system. Yang teaches an anchor-free prediction head that allows for flexible action detection, treats each temporal location equally (Yang p. 2), and avoids the limitation of pre-defined anchors. A POSITA would have found it obvious to modify Vahdani’s system by incorporating the specific localization techniques taught by Yang to achieve more precise boundary detection, thereby arriving at a system that determines an event based on confidences and temporal displacements.
De Souza teaches that stacking feature vectors into a matrix is a well-known and conventional method for preparing sequential data for input into a deep learning model. Therefore, it would have been an obvious and routine design choice to apply the stacking technique of De Souza to the features within the combined Vahdani and Yang framework, as this would be a predictable step with a reasonable expectation of success.
Claims 3, 10, and 17.
The combination of Vahdani, Yang, and De Souza discloses the method of claim 1, wherein determining confidences and temporal displacements further comprises performing separate convolution operations on the output matrix (Vahdani p. 10, left column, §2.4.1.1, definition 15 discloses “T-CAM is a matrix denoted by A”; p. 6, right column, § 2.3.4 discloses “Several convolutional layers are applied on the features to predict actionness score (def 6), completeness score (def 8), classification score (def 9), and to adjust the temporal boundary of the proposals.”; Yang uses two parallel convolution branches (Fig. 4).).
Claims 4, 11, and 18.
The combination of Vahdani, Yang, and De Souza discloses the method of claim 1, wherein the event detection model includes a model trunk selected from a group consisting of a 1-D U-Net (Vahdani p. 5, right column, ¶2 discloses “U-shaped TFPNs”) and a transformer encoder (TE) (Vahdani p. 8, bottom left column, 2.3.5.3 Transformers disclose “encoder decoder transformer … Their encoder generates a context graph where the nodes are initially video level features and the interactions among nodes are modeled as learnable edge weights. Also, positional information for each node is provided using learnable positional encodings.”).
Claims 5, 12, and 19.
The combination of Vahdani, Yang, and De Souza discloses the method of claim 1, wherein determining the event depicted by the frame sequence based on the confidences and temporal displacements includes consolidating the confidences and temporal displacements by displacing the confidences by the temporal displacements (Vahdani p. 10, left column, § 2.4.1.1 discloses “T-CAM is a matrix denoted by A which represent the possibility of activities at each temporal position.”; p. 10, right column, § 2.4.1.2, last paragraph, discloses “A[c, tcl] is the activation (def 15) of class c at temporal position tcl”).
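The consolidation operation recited in these claims (displacing each position's confidence by its predicted temporal displacement and accumulating the results) can be sketched as follows. This is an illustrative sketch of the claimed step, not code from any cited reference; integer displacements are assumed for simplicity.

```python
import numpy as np

def consolidate(conf: np.ndarray, disp: np.ndarray) -> np.ndarray:
    """Consolidate per-position confidences via temporal displacements.

    conf[t] is the confidence predicted at temporal position t, and
    disp[t] the signed integer displacement to the event's predicted
    position. Each confidence is moved to t + disp[t] and summed, so
    positions that many frames "vote" for accumulate high scores.
    """
    T = conf.shape[0]
    out = np.zeros(T)
    for t in range(T):
        target = t + int(disp[t])
        if 0 <= target < T:   # drop votes displaced off the sequence
            out[target] += conf[t]
    return out
```

For example, three neighboring positions that all point at the same frame concentrate their confidences there, sharpening the event peak.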
Claims 6 and 13.
The combination of Vahdani, Yang, and De Souza discloses the method of claim 1, wherein prior to applying the event detection model, a dimensionality reduction technique is applied to the input matrix (Vahdani p. 4, top left column, discloses “In some cases, the recognition scores of sampled frames are aggregated with the Top-k pooling”; where pooling layers are commonly used in neural networks to summarize features from a region).
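The top-k pooling cited above can serve as a simple dimensionality reduction along the temporal axis. The sketch below, which keeps only the k largest values per feature row, is an illustrative assumption in the spirit of the pooling Vahdani describes, not code from the reference.

```python
import numpy as np

def topk_temporal_pool(X: np.ndarray, k: int) -> np.ndarray:
    """Reduce a d x T input matrix to d x k by top-k pooling.

    For each of the d feature rows, keep only the k largest values
    across the T temporal positions, discarding the rest -- shrinking
    the matrix fed to the event detection model.
    """
    k = min(k, X.shape[1])
    return np.sort(X, axis=1)[:, ::-1][:, :k]
```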
Claims 7, 14, and 20.
The combination of Vahdani, Yang, and De Souza discloses the method of claim 1, further comprising training the event detection model by optimizing a confidence loss and a temporal displacement loss (Vahdani p. 11, top left column, discloses “MIL loss is a cross-entropy loss applied over all videos and all action classes”; p. 6, § 2.3.4 discloses loss functions for proposal evaluation (definition 14) and an action regression loss to adjust the temporal boundaries of the proposals. Yang details a “classification loss” with a focal loss structure (§ III. E., Eq. 5) for confidence and an “L1 loss” for displacement (§ III. E.)).
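The two-term training objective mapped above (a classification/confidence loss plus an L1 displacement loss) can be sketched as a weighted sum. This is an illustrative sketch only: Yang uses a focal classification loss, for which plain cross-entropy is substituted here for brevity, and the function name and `lam` weight are assumptions.

```python
import numpy as np

def detection_loss(cls_probs, cls_targets, disp_pred, disp_true, lam=1.0):
    """Joint confidence + displacement loss (sketch).

    cls_probs:   (N, nc) predicted class probabilities per sample
    cls_targets: (N,) ground-truth class indices
    disp_pred / disp_true: predicted and ground-truth displacements
    lam:         weight on the regression term
    """
    # Confidence term: cross-entropy on the true-class probabilities.
    ce = -np.mean(np.log(cls_probs[np.arange(len(cls_targets)), cls_targets]))
    # Displacement term: L1 loss on the temporal displacements.
    l1 = np.mean(np.abs(disp_pred - disp_true))
    return ce + lam * l1
```

During training both terms are minimized jointly, so the model learns class confidences and boundary displacements from the same backbone features.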
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Ross Varndell whose telephone number is (571)270-1922. The examiner can normally be reached M-F 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, O’Neal Mistry, can be reached at (313)446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Ross Varndell/Primary Examiner, Art Unit 2674