DETAILED ACTION
Claims 1-20 are currently pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/07/2025 has been considered by the Examiner.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are:
intra-snippet recognition module
inter-snippet recognition module
in claim 11.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 101
Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention encompasses a signal per se. The claimed computer-readable storage medium is not defined in the specification as a statutory-only embodiment. See [0202]-[0210]. The broadest reasonable interpretation of a claim drawn to a computer readable medium (also called a machine readable medium, among other variations) typically covers forms of non-transitory tangible media and transitory propagating signals per se in view of the ordinary and customary meaning of computer readable media, particularly when the specification is silent. See MPEP 2111.01. When the broadest reasonable interpretation of a claim covers a signal per se, the claim must be rejected under 35 U.S.C. § 101 as covering non-statutory subject matter. See In re Nuijten, 500 F.3d 1346, 1356-57 (Fed. Cir. 2007) (transitory embodiments are not directed to statutory subject matter); see also Interim Examination Instructions for Evaluating Subject Matter Eligibility Under 35 U.S.C. § 101, Aug. 24, 2009, p. 2. Applicant is encouraged to contact the Examiner to discuss possible amendments.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-8 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Gu (“Delving into the Local: Dynamic Inconsistency Learning for DeepFake Video Detection”, provided on Applicant’s IDS).
The applied reference has authors (Zhihao Gu, Taiping Yao, Yang Chen, and Shouhong Ding) in common with the inventive entity of the instant application. However, the applied reference names additional authors, Jinlin Li and Lizhuang Ma, and therefore qualifies as prior art under 35 U.S.C. 102(a)(1).
Regarding claim 1, Gu teaches:
A video detection method, comprising:
extracting N video snippets from a video, each video snippet of the N video snippets comprising M frames, the N video snippets comprising an initial object, and both N and M being positive integers greater than or equal to 2; (Gu, page 746 To this end we sample video sequence uniformly into U snippets, each of which contains T successive frames rather than a single frame like previous works do. See also Figure 2 with 3 snippets containing 4 frames) and
determining a representation vector of the N video snippets, and determining a recognition result based on the representation vector, the recognition result representing a probability that the initial object is an edited object, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6. Page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kinds of interaction modeling, as shown in Fig 2. See also Equation 11. See also Figure 2, determination if video is fake/real) wherein
the representation vector is determined based on intra-snippet representation vectors and inter-snippet representation vectors, each intra-snippet representation vector corresponding to a respective video snippet of the N video snippets and representing inconsistent information between frames in the respective video snippet of the N video snippets, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6) and
each inter-snippet representation vector corresponding to a respective video snippet of the N video snippets and representing inconsistent information between the respective video snippet and one or more adjacent video snippets of the N video snippets. (Gu page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kind of interaction modeling, as shown in Fig 2. See also Equation 11)
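For illustration only (this sketch is not part of the record; the function name and start-of-segment sampling policy are the editor's assumptions), the uniform snippet sampling Gu describes (U snippets of T successive frames) can be sketched in numpy:

```python
import numpy as np

def sample_snippets(video: np.ndarray, n_snippets: int, frames_per_snippet: int) -> np.ndarray:
    """Uniformly divide a video (F, H, W, C) into n_snippets segments and take
    frames_per_snippet consecutive frames from the start of each segment."""
    total = video.shape[0]
    seg_len = total // n_snippets
    assert seg_len >= frames_per_snippet, "segments too short for requested snippet length"
    snippets = [video[i * seg_len : i * seg_len + frames_per_snippet]
                for i in range(n_snippets)]
    return np.stack(snippets)  # (N, M, H, W, C)

video = np.zeros((24, 8, 8, 3))
snips = sample_snippets(video, n_snippets=3, frames_per_snippet=4)
print(snips.shape)  # (3, 4, 8, 8, 3)
```

This mirrors Figure 2 of Gu, which shows 3 snippets of 4 frames each.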
Regarding claim 2, Gu teaches:
The method according to claim 1, wherein the method further comprises:
dividing a first representation vector corresponding to the respective video snippet of the N video snippets along a channel dimension to obtain first representation sub-vectors; (Gu, page 746 We first split I into two equal parts along the channel dimension to get I1 and I2 and then feed them into subsequent branches)
determining a convolution kernel based on the first representation sub-vectors, wherein the convolution kernel is a convolution kernel corresponding to the first representation vector; (Gu, page 746, equation (1))
determining a weight matrix corresponding to the first representation sub-vectors, wherein the weight matrix is configured for extracting motion information between adjacent frames based on an attention mechanism; (Gu page 747, Averaging these features and applying a sigmoid function, the horizontal and vertical AttenH and AttenW can be obtained)
determining second representation sub-vectors based on the first representation sub-vectors, the weight matrix, and the convolution kernel;(Gu page 747 Equation 5) and
splicing the first representation sub-vectors and the second representation sub-vectors into the intra-snippet representation vector corresponding to the respective video snippet of the N video snippets. (Gu page 747, Equation 6)
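For illustration only (not part of the record; the skeleton below is the editor's simplification, with a placeholder `transform` standing in for Gu's attention and dynamic-convolution branch), the channel split and re-splice mapped to claim 2 can be sketched as:

```python
import numpy as np

def intra_snippet_skeleton(x: np.ndarray, transform) -> np.ndarray:
    """x: (T, C, H, W). Split along the channel dimension into two equal
    parts, transform one branch, then splice the halves back together."""
    c = x.shape[1] // 2
    x1, x2 = x[:, :c], x[:, c:]          # I1 and I2 in Gu's notation
    y2 = transform(x2)                   # placeholder for attention + dynamic conv
    return np.concatenate([x1, y2], axis=1)  # splice into the output vector

x = np.random.rand(4, 8, 5, 5)
out = intra_snippet_skeleton(x, lambda t: t * 2.0)
print(out.shape)  # (4, 8, 5, 5)
```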
Regarding claim 3, Gu teaches:
The method according to claim 2, wherein the determining the convolution kernel based on the first representation sub-vectors comprises:
performing a global average pooling operation on each of the first representation sub-vectors to obtain respective first representation sub-vectors with a compressed spatial dimension; (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension)
performing a fully connected operation on the first representation sub-vectors with the compressed spatial dimension to determine an initial convolution kernel; (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension and then two fully connected layers are performed) and
performing a normalization operation on the initial convolution kernel to obtain the convolution kernel. (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension and then two fully connected layers are performed finally a softmax operation comes up)
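For illustration only (not part of the record; the weight shapes and ReLU between the two fully connected layers are the editor's assumptions), the GAP → two fully connected layers → softmax kernel-learning sequence mapped to claim 3 can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def learn_kernel(x, w1, w2):
    """x: (T, C, H, W). Global average pooling squeezes the spatial
    dimension; two fully connected layers produce an initial kernel;
    softmax normalizes it into the final convolution kernel."""
    pooled = x.mean(axis=(2, 3))                        # (T, C): spatial dims squeezed
    hidden = np.maximum(pooled @ w1, 0.0)               # first FC layer (+ assumed ReLU)
    logits = hidden @ w2                                # second FC layer -> initial kernel
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)            # softmax normalization

x = rng.random((4, 8, 5, 5))
k = learn_kernel(x, rng.random((8, 16)), rng.random((16, 3)))
print(k.shape)  # (4, 3); each row sums to 1
```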
Regarding claim 4, Gu teaches:
The method according to claim 2, wherein the determining the weight matrix corresponding to the first representation sub-vectors comprises:
performing a bidirectional temporal difference operation on the first representation sub-vectors to determine a first difference matrix between adjacent frames in the respective video snippet corresponding to the first representation vector; (Gu page 746 To model the temporal relation, Intra-SMA applies temporal bi-directional temporal difference guided coordinate attention to make the network attend to local motions)
reshaping the first difference matrix into a horizontal inconsistency parameter matrix and a vertical inconsistency parameter matrix along a horizontal dimension and a vertical dimension respectively; (Gu, pages 746-747, the difference is reshaped into two coordinate-wise representations, which further go through a multi-scale structure to capture fine-grained and short-term motion information….forward vertical inconsistency and forward horizontal inconsistency) and
determining a vertical attention weight matrix and a horizontal attention weight matrix based on the horizontal inconsistency parameter matrix and the vertical inconsistency parameter matrix respectively, wherein the weight matrix comprises the vertical attention weight matrix and the horizontal attention weight matrix. (Gu page 747, Averaging these features and applying a sigmoid function, the horizontal and vertical AttenH and AttenW can be obtained)
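For illustration only (not part of the record; the absolute-value step is the editor's simplification standing in for Gu's learned multi-scale processing, which prevents forward and backward differences from cancelling), the bidirectional temporal difference and coordinate attention mapped to claim 4 can be sketched as:

```python
import numpy as np

def coord_attention(x: np.ndarray):
    """x: (T, C, H, W). Forward/backward frame differences are pooled along
    each spatial axis, averaged, and squashed with a sigmoid to produce
    per-axis attention maps (AttenH- and AttenW-like)."""
    fwd = x[1:] - x[:-1]                       # forward temporal difference
    bwd = x[:-1] - x[1:]                       # backward temporal difference
    diff = np.concatenate([np.abs(fwd), np.abs(bwd)], axis=0)  # magnitudes (simplification)
    horiz = diff.mean(axis=2)                  # pool over height -> (.., C, W)
    vert = diff.mean(axis=3)                   # pool over width  -> (.., C, H)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    return sig(vert.mean(axis=0)), sig(horiz.mean(axis=0))

x = np.random.rand(4, 8, 6, 5)
att_h, att_w = coord_attention(x)
print(att_h.shape, att_w.shape)  # (8, 6) (8, 5)
```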
Regarding claim 5, Gu teaches:
The method according to claim 4, wherein the determining the second representation sub-vectors based on the first representation sub-vectors, the weight matrix, and the convolution kernel comprises:
performing an element-wise multiplication operation on the vertical attention weight matrix, the horizontal attention weight matrix, and the first representation sub-vectors, and combining a result of the element-wise multiplication operation with the first representation sub-vectors to obtain third representation sub-vectors; (Gu equation 5, element wise multiplication of the attention matrices and the sub-vectors) and
performing a convolution operation on the third representation sub-vectors by using the convolution kernel to obtain the second representation sub-vectors. (Gu, equation 5 depth wise convolution)
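For illustration only (not part of the record; the replicate padding and the per-channel-shared temporal kernel are the editor's assumptions), the element-wise attention, residual combination, and convolution with the learned kernel mapped to claim 5 can be sketched as:

```python
import numpy as np

def apply_attention_and_kernel(x, att_h, att_w, kernel):
    """x: (T, C, H, W); att_h: (C, H); att_w: (C, W); kernel: (K,) temporal
    weights. Element-wise multiplication with both attention maps, a
    residual add of the input, then a temporal convolution."""
    attended = x * att_h[None, :, :, None] * att_w[None, :, None, :]
    y = x + attended                               # combine with the sub-vectors
    t, k = y.shape[0], kernel.shape[0]
    pad = k // 2
    yp = np.concatenate([y[:1]] * pad + [y] + [y[-1:]] * pad, axis=0)  # replicate pad
    return sum(kernel[i] * yp[i : i + t] for i in range(k))  # conv along time

x = np.random.rand(4, 2, 3, 3)
out = apply_attention_and_kernel(x, np.ones((2, 3)), np.ones((2, 3)),
                                 np.array([0.25, 0.5, 0.25]))
print(out.shape)  # (4, 2, 3, 3)
```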
Regarding claim 6, Gu teaches:
The method according to claim 1, wherein the method further comprises:
performing a global average pooling operation on a second representation vector corresponding to the respective video snippet of the N video snippets to obtain a global representation vector with a compressed spatial dimension; (Gu page 747, Formally, let tensor be the module input. It is processed by GAP (global average pooling) to obtain a global representation)
dividing the global representation vector into a first global representation sub-vector and a second global representation sub-vector, wherein the first global representation sub-vector represents the respective video snippet corresponding to the second representation vector, and the second global representation sub-vector represents interaction information between the respective video snippet corresponding to the second representation vector and at least one adjacent video snippet; (Gu, page 747, It is first processed by the GAP to obtain a global representation and then passed through a two-branch structure for different interaction modelling…Among them one branch directly captures inter-snippet interaction without introducing intra-snippet information….The other branch inter-snippet motion attention which is designed to be computationally efficient while containing a larger intra-snippet field of view) and
determining the inter-snippet representation vector for the respective video snippet based on the global representation vector, the first global representation sub-vector, and the second global representation sub-vector. (Gu page 748 equation 11)
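For illustration only (not part of the record; the two branch callables are placeholders for Gu's interaction-modelling and motion-attention branches), the Inter-SIM skeleton mapped to claim 6 (GAP, then a two-branch division of the global representation) can be sketched as:

```python
import numpy as np

def inter_sim_skeleton(snippet_feats, branch_a, branch_b):
    """snippet_feats: (N, T, C, H, W). Global average pooling squeezes the
    spatial dimension; the pooled global representation is then divided
    across two branches that model different interactions."""
    g = snippet_feats.mean(axis=(3, 4))   # (N, T, C): global representation
    a = branch_a(g)                       # first sub-vector: snippet-wise interaction
    b = branch_b(g)                       # second sub-vector: inter-snippet motion
    return g, a, b

feats = np.random.rand(3, 4, 8, 5, 5)
g, a, b = inter_sim_skeleton(feats, lambda z: z, lambda z: z)
print(g.shape)  # (3, 4, 8)
```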
Regarding claim 7, Gu teaches:
The method according to claim 6, wherein the dividing the global representation vector comprises:
performing a convolution operation on the global representation vector by using a first convolution kernel to obtain a global representation vector with a reduced dimension; (Gu page 747, Among them, one branch directly captures the inter-snippet interaction without introducing intra-snippet information….Equation 7…where Conv is a spatial convolution with a kernel size 3X1 for snippet-wise feature extraction and dimension reduction)
performing a normalization operation on the global representation vector with the reduced dimension to obtain a normalized global representation vector; (Gu, page 748 equation 7, BN batch normalization)
performing a deconvolution operation on the normalized global representation vector by using a second convolution kernel to obtain the first global representation sub-vector with a same dimension as the global representation vector; (Gu, page 747, Conv 1×1 stands for convolution with size 1×1 for dimension recovery)
performing a bidirectional temporal difference operation on the global representation vector to determine a second difference matrix and a third difference matrix between the respective video snippet corresponding to the second representation vector and adjacent video snippets; (Gu, page 747, equations 8 and 9 are the bi-directional facial movements) and
generating the second global representation sub-vector based on the second difference matrix and the third difference matrix. (Gu, page 747 equation 10)
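For illustration only (not part of the record; the edge-snippet handling, where boundary snippets reuse themselves as neighbors, and the additive combination are the editor's assumptions), the bidirectional inter-snippet differences mapped to claim 7 can be sketched as:

```python
import numpy as np

def snippet_motion(g: np.ndarray) -> np.ndarray:
    """g: (N, C), one global vector per snippet. Differences with the next
    and previous snippet approximate the two bidirectional difference
    matrices; their combination forms the motion-based sub-vector."""
    nxt = np.concatenate([g[1:], g[-1:]], axis=0)   # next-snippet neighbor
    prv = np.concatenate([g[:1], g[:-1]], axis=0)   # previous-snippet neighbor
    d_fwd = nxt - g                                 # second difference matrix
    d_bwd = prv - g                                 # third difference matrix
    return d_fwd + d_bwd                            # combined motion cue

g = np.arange(12, dtype=float).reshape(4, 3)
m = snippet_motion(g)
print(m.shape)  # (4, 3)
```

On a linearly increasing sequence the forward and backward differences cancel for interior snippets, so only the boundary snippets carry nonzero motion here.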
Regarding claim 8, Gu teaches:
The method according to claim 6, wherein the determining the inter-snippet representation vector for the respective video snippet based on the global representation vector, the first global representation sub-vector, and the second global representation sub-vector comprises:
performing an element-wise multiplication operation on the first global representation sub-vector, the second global representation sub-vector, and the global representation vector, and combining a result of the element-wise multiplication operation with the global representation vector to obtain a third global representation sub-vector; (Gu page 748 equation 11) and
performing a convolution operation on the third global representation sub-vector by using a third convolution kernel to obtain the inter-snippet representation vector for the respective video snippet. (Gu, page 748 equation 11)
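For illustration only (not part of the record; the final convolution is simplified to a matrix multiplication), the fusion mapped to claim 8 (element-wise multiplication of both sub-vectors with the global representation, a residual combination, then a final convolution) can be sketched as:

```python
import numpy as np

def fuse(global_rep, branch_a, branch_b, w):
    """Element-wise product of both branch outputs with the global
    representation, residual-added back onto it, then a linear map
    (stand-in for the third convolution kernel)."""
    gated = global_rep * branch_a * branch_b   # element-wise multiplication
    combined = global_rep + gated              # combine with global representation
    return combined @ w                        # final 'convolution' as a matmul

g = np.ones((3, 4))
out = fuse(g, np.full((3, 4), 0.5), np.full((3, 4), 2.0), np.eye(4))
print(out)  # every entry 2.0: 1 + 1 * 0.5 * 2.0
```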
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 9-20 are rejected under 35 U.S.C. 103 as being unpatentable over Gu (the same reference applied above) in view of Jourads (US 2020/0110970).
Regarding claim 9, Gu teaches:
A video detection apparatus, comprising:
configured to extract N video snippets from a video, each video snippet of the N video snippets comprising M frames, the N video snippets comprising an initial object, and both N and M being positive integers greater than or equal to 2; (Gu, page 746 To this end we sample video sequence uniformly into U snippets, each of which contains T successive frames rather than a single frame like previous works do. See also Figure 2 with 3 snippets containing 4 frames) and
determine a representation vector of the N video snippets, and determine a target recognition result based on the representation vector, the recognition result representing a probability that the initial object is an edited object, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6. Page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kinds of interaction modeling, as shown in Fig 2. See also Equation 11. See also Figure 2, determination if video is fake/real) wherein
the representation vector is determined based on intra-snippet representation vectors and inter-snippet representation vectors, each intra-snippet representation vector corresponding to a respective video snippet of the N video snippets and representing inconsistent information between frames in the respective video snippet of the N video snippets, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6) and each inter-snippet representation vector corresponding to a respective video snippet of the N video snippets and representing inconsistent information between the respective video snippet and one or more adjacent video snippets of the N video snippets. (Gu, page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kinds of interaction modeling, as shown in Fig 2. See also Equation 11)
Gu fails to explicitly teach:
processing circuitry
Jourads teaches:
processing circuitry (Jourads, [0115] processor and memory)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to implement the method of Gu on the processor of Jourads. The rationale for the combination is the combination of prior art elements according to known methods to yield the predictable result of a computer-implemented neural network.
Regarding claim 10, the combination of Gu and Jourads teaches:
A video detection apparatus, comprising:
processing circuitry (Jourads, [0115] processor and memory)
configured to extract N video snippets from a video, each video snippet of the N video snippets comprising M frames, the N video snippets comprising an initial object, and both N and M being positive integers greater than or equal to 2; (Gu, page 746 To this end we sample video sequence uniformly into U snippets, each of which contains T successive frames rather than a single frame like previous works do. See also Figure 2 with 3 snippets containing 4 frames) and
a neural network model, configured to obtain a recognition result based on the N video snippets, the recognition result representing a probability that the initial object is an edited object, the neural network model comprising a backbone network and a classification network, (Gu Figure 2, ResNet backbone determines if video is real/fake) the
backbone network being configured to determine a representation vector of the N video snippets, and the classification network being configured to determine the recognition result based on the representation vector, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6. Page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kinds of interaction modeling, as shown in Fig 2. See also Equation 11. See also Figure 2 determination if video is fake/real)
wherein the backbone network comprises an intra-snippet recognition module and an inter-snippet recognition module, the intra-snippet recognition module being configured to determine intra-snippet representation vectors, each corresponding to a respective video snippet of the N video snippets and representing inconsistent information between frames in the respective video snippet of the N video snippets, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6) and the inter-snippet recognition module being configured to determine inter-snippet representation vectors, each corresponding to a respective video snippet of the N video snippets and representing inconsistent information between the respective video snippet and one or more adjacent video snippets of the N video snippets, (Gu, page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kinds of interaction modeling, as shown in Fig 2. See also Equation 11) and the representation vector being based on the intra-snippet representation vectors and the inter-snippet representation vectors. (Please see the citations above; see also Figure 2, determination if video is fake/real)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to implement the method of Gu on the processor of Jourads. The rationale for the combination is the combination of prior art elements according to known methods to yield the predictable result of a computer-implemented neural network.
Regarding claim 11, the combination of Gu and Jourads teaches:
The apparatus according to claim 10, wherein the model further comprises:
processing circuitry (Jourads, [0115] processor and memory) configured to obtain original representation vectors of the N video snippets; (Gu, page 746 To this end we sample video sequence uniformly into U snippets, each of which contains T successive frames rather than a single frame like previous works do. See also Figure 2 with 3 snippets containing 4 frames)
a first network structure, configured to determine a first representation vector corresponding to the respective video snippet of the N video snippets inputted to the intra-snippet recognition module based on the original representation vectors; (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6)
the intra-snippet recognition module, configured to determine an intra-snippet representation vector based on the first representation vector; (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6)
a second network structure, configured to determine a second representation vector corresponding to the respective video snippet of the N video snippets inputted to the inter-snippet recognition module based on the original representation vectors; (Gu page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kind of interaction modeling, as shown in Fig 2. See also Equation 11)
the inter-snippet recognition module, configured to determine an inter-snippet representation vector based on the second representation vector; and a third network structure, configured to determine the representation vector based on the intra-snippet representation vector and the inter-snippet representation vector. (Gu page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kind of interaction modeling, as shown in Fig 2. See also Equation 11)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to implement the method of Gu on the processor of Jourads. The rationale for the combination is the combination of prior art elements according to known methods to yield the predictable result of a computer-implemented neural network.
Regarding claim 12, the combination of Gu and Jourads teaches:
The apparatus according to claim 10, wherein the backbone network comprises: plural intra-snippet recognition modules and inter-snippet recognition modules that are alternately placed. (Gu, see figure 2)
Regarding claim 13, the combination of Gu and Jourads teaches:
A computer-readable storage medium storing computer-readable instructions thereon (Jourads, [0115] processor and memory), which, when executed by a computer, cause the computer to implement the method according to claim 1. (Please see discussion of claim 1 above)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to implement the method of Gu on the processor of Jourads. The rationale for the combination is the combination of prior art elements according to known methods to yield the predictable result of a computer-implemented neural network.
Regarding claim 14, the combination of Gu and Jourads teaches:
The apparatus according to claim 10, wherein the neural network model is further configured to:
divide a first representation vector corresponding to the respective video snippet of the N video snippets along a channel dimension to obtain first representation sub-vectors; (Gu, page 746 We first split I into two equal parts along the channel dimension to get I1 and I2 and then feed them into subsequent branches)
determine a convolution kernel based on the first representation sub-vectors, wherein the convolution kernel is a convolution kernel corresponding to the first representation vector; (Gu, page 746, equation (1))
determine a weight matrix corresponding to the first representation sub-vectors, wherein the weight matrix is configured for extracting motion information between adjacent frames based on an attention mechanism; (Gu page 747, Averaging these features and applying a sigmoid function, the horizontal and vertical AttenH and AttenW can be obtained)
determine second representation sub-vectors based on the first representation sub-vectors, the weight matrix, and the convolution kernel; (Gu page 747 Equation 5) and
splice the first representation sub-vectors and the second representation sub-vectors into the intra-snippet representation vector corresponding to the respective video snippet of the N video snippets. (Gu page 747, Equation 6)
Regarding claim 15, the combination of Gu and Jourads teaches:
The apparatus according to claim 14, wherein the neural network model is further configured to:
perform a global average pooling operation on each of the first representation sub-vectors to obtain respective first representation sub-vectors with a compressed spatial dimension; (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension)
perform a fully connected operation on the first representation sub-vectors with the compressed spatial dimension to determine an initial convolution kernel; (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension and then two fully connected layers are performed) and
perform a normalization operation on the initial convolution kernel to obtain the convolution kernel. (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension and then two fully connected layers are performed finally a softmax operation comes up)
Regarding claim 16, the combination of Gu and Jourads teaches:
The apparatus according to claim 14, wherein the neural network model is further configured to:
perform a bidirectional temporal difference operation on the first representation sub-vectors to determine a first difference matrix between adjacent frames in the respective video snippet corresponding to the first representation vector; (Gu page 746 To model the temporal relation, Intra-SMA applies temporal bi-directional temporal difference guided coordinate attention to make the network attend to local motions)
reshape the first difference matrix into a horizontal inconsistency parameter matrix and a vertical inconsistency parameter matrix along a horizontal dimension and a vertical dimension respectively; (Gu, pages 746-747, the difference is reshaped into two coordinate-wise representations, which further go through a multi-scale structure to capture fine-grained and short-term motion information….forward vertical inconsistency and forward horizontal inconsistency) and
determine a vertical attention weight matrix and a horizontal attention weight matrix based on the horizontal inconsistency parameter matrix and the vertical inconsistency parameter matrix respectively, wherein the weight matrix comprises the vertical attention weight matrix and the horizontal attention weight matrix. (Gu page 747, Averaging these features and applying a sigmoid function, the horizontal and vertical AttenH and AttenW can be obtained)
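The claimed steps (bidirectional adjacent-frame differences, pooling along each spatial axis into horizontal and vertical inconsistency matrices, and a sigmoid to obtain the attention weight matrices) may be sketched as follows. In the full model each direction is transformed by its own learned layers; here raw difference magnitudes stand in as the inconsistency measure, an assumption of the sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention_weights(frames):
    """frames: (T, C, H, W) sub-vectors for the T frames of one snippet."""
    # forward and backward differences between adjacent frames
    diff_fwd = frames[1:] - frames[:-1]
    diff_bwd = frames[:-1] - frames[1:]
    # stand-in inconsistency measure: averaged difference magnitudes
    inconsistency = 0.5 * (np.abs(diff_fwd) + np.abs(diff_bwd))  # (T-1, C, H, W)
    vert = inconsistency.mean(axis=3)     # pool width  -> vertical matrix   (T-1, C, H)
    horz = inconsistency.mean(axis=2)     # pool height -> horizontal matrix (T-1, C, W)
    # average over frame pairs, then sigmoid -> AttenH, AttenW
    atten_h = sigmoid(vert.mean(axis=0))  # vertical attention weights   (C, H)
    atten_w = sigmoid(horz.mean(axis=0))  # horizontal attention weights (C, W)
    return atten_h, atten_w
```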
Regarding claim 17, the combination of Gu and Jourads teaches:
The apparatus according to claim 16, wherein the neural network model is further configured to:
perform an element-wise multiplication operation on the vertical attention weight matrix, the horizontal attention weight matrix, and the first representation sub-vectors, and combine a result of the element-wise multiplication operation with the first representation sub-vectors to obtain third representation sub-vectors; (Gu, equation 5, element-wise multiplication of the attention matrices and the sub-vectors) and
perform a convolution operation on the third representation sub-vectors by using the convolution kernel to obtain the second representation sub-vectors. (Gu, equation 5, depth-wise convolution)
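The two steps of this claim (element-wise multiplication by both attention weight matrices with a residual combination, then filtering with the generated kernel) can be sketched as below. A plain per-row 1-D correlation stands in for Gu's depth-wise convolution; broadcasting details are assumptions of the sketch:

```python
import numpy as np

def apply_attention_and_conv(x, atten_h, atten_w, kernel):
    """x: (C, H, W) first representation sub-vector;
    atten_h: (C, H) vertical weights; atten_w: (C, W) horizontal weights."""
    # element-wise multiplication by both attention matrices (broadcast)
    attended = x * atten_h[:, :, None] * atten_w[:, None, :]
    # combine the result with the original sub-vector (residual)
    third = x + attended
    # filter each row with the generated 1-D kernel (depth-wise stand-in)
    out = np.empty_like(third)
    for c in range(third.shape[0]):
        for i in range(third.shape[1]):
            out[c, i] = np.convolve(third[c, i], kernel, mode="same")
    return out
```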
Regarding claim 18, the combination of Gu and Jourads teaches:
The apparatus according to claim 10, wherein the neural network model is further configured to:
perform a global average pooling operation on a second representation vector corresponding to the respective video snippet of the N video snippets to obtain a global representation vector with a compressed spatial dimension; (Gu page 747, Formally, let tensor be the module input. It is processed by GAP (global average pooling) to obtain a global representation)
divide the global representation vector into a first global representation sub-vector and a second global representation sub-vector, wherein the first global representation sub-vector represents the respective video snippet corresponding to the second representation vector, and the second global representation sub-vector represents interaction information between the respective video snippet corresponding to the second representation vector and at least one adjacent video snippet; (Gu, page 747 It is first processed by the GAP to obtain a global representation and then passed through a two-branch structure for different interaction modelling …Among them one branch directly captures inter-snippet interaction without introducing intra-snippet information….The other branch inter-snippet motion attention which is designed to be computationally efficient while containing a larger intra-snippet field of view) and
determine the inter-snippet representation vector for the respective video snippet based on the global representation vector, the first global representation sub-vector, and the second global representation sub-vector. (Gu page 748 equation 11)
Regarding claim 19, the combination of Gu and Jourads teaches:
The apparatus according to claim 18, wherein the neural network model is further configured to:
perform a convolution operation on the global representation vector by using a first convolution kernel to obtain a global representation vector with a reduced dimension; (Gu page 747, Among them, one branch directly captures the inter-snippet interaction without introducing intra-snippet information….Equation 7…where Conv is a spatial convolution with a kernel size 3X1 for snippet-wise feature extraction and dimension reduction)
perform a normalization operation on the global representation vector with the reduced dimension to obtain a normalized global representation vector; (Gu, page 748 equation 7, BN batch normalization)
perform a deconvolution operation on the normalized global representation vector by using a second convolution kernel to obtain the first global representation sub-vector with a same dimension as the global representation vector; (Gu, page 747, Conv 1X1 stands for convolution with size 1X1 for dimension recovery)
perform a bidirectional temporal difference operation on the global representation vector to determine a second difference matrix and a third difference matrix between the respective video snippet corresponding to the second representation vector and adjacent video snippets; (Gu, page 747, equations 8 and 9 are the bi-directional facial movements) and
generate the second global representation sub-vector based on the second difference matrix and the third difference matrix. (Gu, page 747 equation 10)
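The two branches recited in this claim (dimension reduction, normalization, and dimension recovery in one branch; forward and backward snippet differences combined in the other) can be sketched as below. The weight shapes, the batch-norm-style normalization, and the zero-padding of the boundary snippets are assumptions of the sketch, not Gu's exact formulation:

```python
import numpy as np

def inter_snippet_branches(g, w_reduce, w_recover):
    """g: (N, C) global representation, one row per video snippet."""
    # branch 1: snippet-wise dimension reduction, normalization, recovery
    reduced = g @ w_reduce                             # (N, C) -> (N, C//r)
    normed = (reduced - reduced.mean(axis=0)) / (reduced.std(axis=0) + 1e-5)
    first_sub = normed @ w_recover                     # dimension recovery -> (N, C)
    # branch 2: bidirectional differences with adjacent snippets
    zeros = np.zeros((1, g.shape[1]))
    fwd = np.vstack([g[1:] - g[:-1], zeros])           # g[t+1] - g[t] (none for last)
    bwd = np.vstack([zeros, g[1:] - g[:-1]])           # g[t] - g[t-1] (none for first)
    second_sub = 0.5 * (fwd + bwd)                     # combine the two difference matrices
    return first_sub, second_sub
```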
Regarding claim 20, the combination of Gu and Jourads teaches:
The apparatus according to claim 18, wherein the neural network model is further configured to:
perform an element-wise multiplication operation on the first global representation sub-vector, the second global representation sub-vector, and the global representation vector, and combine a result of the element-wise multiplication operation with the global representation vector to obtain a third global representation sub-vector; (Gu page 748 equation 11) and
perform a convolution operation on the third global representation sub-vector by using a third convolution kernel to obtain the inter-snippet representation vector for the respective video snippet. (Gu, page 748 equation 11)
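The fusion recited in this claim (element-wise multiplication of both global sub-vectors with the global representation, residual combination, then convolution with a third kernel) can be sketched as below. The per-snippet 1-D convolution is a stand-in for the operation of Gu's equation 11, whose exact form is not reproduced here:

```python
import numpy as np

def fuse_inter_snippet(g, first_sub, second_sub, kernel):
    """g, first_sub, second_sub: (N, C); kernel: 1-D third convolution kernel."""
    # element-wise multiplication of the two sub-vectors with g,
    # combined with g (residual) to form the third global sub-vector
    third = g + first_sub * second_sub * g
    # convolution with the third kernel -> inter-snippet representation
    out = np.empty_like(third)
    for n in range(third.shape[0]):
        out[n] = np.convolve(third[n], kernel, mode="same")
    return out
```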
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. Refer to PTO-892, Notice of References Cited for a listing of analogous art.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Molly K Wilburn whose telephone number is (571)272-3589. The examiner can normally be reached Monday-Friday 8am-4pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emily Terrell can be reached at (571) 270-3717. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Molly Wilburn/Primary Examiner, Art Unit 2666