DETAILED ACTION
Claims 1-20 are currently pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/07/2025 has been considered by the Examiner.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are:
intra-snippet recognition module
inter-snippet recognition module
in claim 11.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 101
Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention encompasses a signal per se. The claimed computer-readable storage medium is not defined in the specification as a statutory-only embodiment. See [0202]-[0210]. The broadest reasonable interpretation of a claim drawn to a computer readable medium (also called a machine readable medium, among other variations) typically covers forms of non-transitory tangible media and transitory propagating signals per se in view of the ordinary and customary meaning of computer readable media, particularly when the specification is silent. See MPEP 2111.01. When the broadest reasonable interpretation of a claim covers a signal per se, the claim must be rejected under 35 U.S.C. § 101 as covering non-statutory subject matter. See In re Nuijten, 500 F.3d 1346, 1356-57 (Fed. Cir. 2007) (transitory embodiments are not directed to statutory subject matter); see also Interim Examination Instructions for Evaluating Subject Matter Eligibility Under 35 U.S.C. § 101, Aug. 24, 2009, p. 2. Applicant is encouraged to contact the Examiner to discuss possible amendments.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-8 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Gu (“Delving into the Local: Dynamic Inconsistency Learning for DeepFake Video Detection”, provided on Applicant’s IDS).
The applied reference has authors (Zhihao Gu, Taiping Yao, Yang Chen, and Shouhong Ding) in common with the inventive entity of the instant application. However, the applied reference names additional authors, Jinlin Li and Lizhuang Ma, and therefore qualifies as prior art under 35 U.S.C. 102(a)(1).
Regarding claim 1, Gu teaches:
A video detection method, comprising:
extracting N video snippets from a video, each video snippet of the N video snippets comprising M frames, the N video snippets comprising an initial object, and both N and M being positive integers greater than or equal to 2; (Gu, page 746 To this end we sample video sequence uniformly into U snippets, each of which contains T successive frames rather than a single frame like previous works do. See also Figure 2 with 3 snippets containing 4 frames) and
determining a representation vector of the N video snippets, and determining a recognition result based on the representation vector, the recognition result representing a probability that the initial object is an edited object, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6. Page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kinds of interaction modeling, as shown in Fig 2. See also Equation 11. See also Figure 2, determination if video is fake/real) wherein
the representation vector is determined based on intra-snippet representation vectors and inter-snippet representation vectors, each intra-snippet representation vector corresponding to a respective video snippet of the N video snippets and representing inconsistent information between frames in the respective video snippet of the N video snippets, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6) and
each inter-snippet representation vector corresponding to a respective video snippet of the N video snippets and representing inconsistent information between the respective video snippet and one or more adjacent video snippets of the N video snippets. (Gu page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kind of interaction modeling, as shown in Fig 2. See also Equation 11)
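For illustration only (this sketch is not part of the record; the function name and start-of-segment sampling policy are the editor's assumptions), the uniform snippet sampling Gu describes (U snippets of T successive frames) can be sketched in numpy:

```python
import numpy as np

def sample_snippets(video: np.ndarray, n_snippets: int, frames_per_snippet: int) -> np.ndarray:
    """Uniformly divide a video (F, H, W, C) into n_snippets segments and take
    frames_per_snippet consecutive frames from the start of each segment."""
    total = video.shape[0]
    seg_len = total // n_snippets
    assert seg_len >= frames_per_snippet, "segments too short for requested snippet length"
    snippets = [video[i * seg_len : i * seg_len + frames_per_snippet]
                for i in range(n_snippets)]
    return np.stack(snippets)  # (N, M, H, W, C)

video = np.zeros((24, 8, 8, 3))
snips = sample_snippets(video, n_snippets=3, frames_per_snippet=4)
print(snips.shape)  # (3, 4, 8, 8, 3)
```

This mirrors Figure 2 of Gu, which shows 3 snippets of 4 frames each.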
Regarding claim 2, Gu teaches:
The method according to claim 1, wherein the method further comprises:
dividing a first representation vector corresponding to the respective video snippet of the N video snippets along a channel dimension to obtain first representation sub-vectors; (Gu, page 746 We first split I into two equal parts along the channel dimension to get I1 and I2 and then feed them into subsequent branches)
determining a convolution kernel based on the first representation sub-vectors, wherein the convolution kernel is a convolution kernel corresponding to the first representation vector; (Gu, page 746, equation (1))
determining a weight matrix corresponding to the first representation sub-vectors, wherein the weight matrix is configured for extracting motion information between adjacent frames based on an attention mechanism; (Gu page 747, Averaging these features and applying a sigmoid function, the horizontal and vertical AttenH and AttenW can be obtained)
determining second representation sub-vectors based on the first representation sub-vectors, the weight matrix, and the convolution kernel;(Gu page 747 Equation 5) and
splicing the first representation sub-vectors and the second representation sub-vectors into the intra-snippet representation vector corresponding to the respective video snippet of the N video snippets. (Gu page 747, Equation 6)
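For illustration only (not part of the record; the skeleton below is the editor's simplification, with a placeholder `transform` standing in for Gu's attention and dynamic-convolution branch), the channel split and re-splice mapped to claim 2 can be sketched as:

```python
import numpy as np

def intra_snippet_skeleton(x: np.ndarray, transform) -> np.ndarray:
    """x: (T, C, H, W). Split along the channel dimension into two equal
    parts, transform one branch, then splice the halves back together."""
    c = x.shape[1] // 2
    x1, x2 = x[:, :c], x[:, c:]          # I1 and I2 in Gu's notation
    y2 = transform(x2)                   # placeholder for attention + dynamic conv
    return np.concatenate([x1, y2], axis=1)  # splice into the output vector

x = np.random.rand(4, 8, 5, 5)
out = intra_snippet_skeleton(x, lambda t: t * 2.0)
print(out.shape)  # (4, 8, 5, 5)
```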
Regarding claim 3, Gu teaches:
The method according to claim 2, wherein the determining the convolution kernel based on the first representation sub-vectors comprises:
performing a global average pooling operation on each of the first representation sub-vectors to obtain respective first representation sub-vectors with a compressed spatial dimension; (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension)
performing a fully connected operation on the first representation sub-vectors with the compressed spatial dimension to determine an initial convolution kernel; (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension and then two fully connected layers are performed) and
performing a normalization operation on the initial convolution kernel to obtain the convolution kernel. (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension and then two fully connected layers are performed finally a softmax operation comes up)
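For illustration only (not part of the record; the weight shapes and ReLU between the two fully connected layers are the editor's assumptions), the GAP → two fully connected layers → softmax kernel-learning sequence mapped to claim 3 can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def learn_kernel(x, w1, w2):
    """x: (T, C, H, W). Global average pooling squeezes the spatial
    dimension; two fully connected layers produce an initial kernel;
    softmax normalizes it into the final convolution kernel."""
    pooled = x.mean(axis=(2, 3))                        # (T, C): spatial dims squeezed
    hidden = np.maximum(pooled @ w1, 0.0)               # first FC layer (+ assumed ReLU)
    logits = hidden @ w2                                # second FC layer -> initial kernel
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)            # softmax normalization

x = rng.random((4, 8, 5, 5))
k = learn_kernel(x, rng.random((8, 16)), rng.random((16, 3)))
print(k.shape)  # (4, 3); each row sums to 1
```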
Regarding claim 4, Gu teaches:
The method according to claim 2, wherein the determining the weight matrix corresponding to the first representation sub-vectors comprises:
performing a bidirectional temporal difference operation on the first representation sub-vectors to determine a first difference matrix between adjacent frames in the respective video snippet corresponding to the first representation vector; (Gu page 746 To model the temporal relation, Intra-SMA applies temporal bi-directional temporal difference guided coordinate attention to make the network attend to local motions)
reshaping the first difference matrix into a horizontal inconsistency parameter matrix and a vertical inconsistency parameter matrix along a horizontal dimension and a vertical dimension respectively; (Gu, pages 746-747, the difference is reshaped into two coordinate-wise representations, which further go through a multi-scale structure to capture fine-grained and short-term motion information….forward vertical inconsistency and forward horizontal inconsistency) and
determining a vertical attention weight matrix and a horizontal attention weight matrix based on the horizontal inconsistency parameter matrix and the vertical inconsistency parameter matrix respectively, wherein the weight matrix comprises the vertical attention weight matrix and the horizontal attention weight matrix. (Gu page 747, Averaging these features and applying a sigmoid function, the horizontal and vertical AttenH and AttenW can be obtained)
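For illustration only (not part of the record; the absolute-value step is the editor's simplification standing in for Gu's learned multi-scale processing, which prevents forward and backward differences from cancelling), the bidirectional temporal difference and coordinate attention mapped to claim 4 can be sketched as:

```python
import numpy as np

def coord_attention(x: np.ndarray):
    """x: (T, C, H, W). Forward/backward frame differences are pooled along
    each spatial axis, averaged, and squashed with a sigmoid to produce
    per-axis attention maps (AttenH- and AttenW-like)."""
    fwd = x[1:] - x[:-1]                       # forward temporal difference
    bwd = x[:-1] - x[1:]                       # backward temporal difference
    diff = np.concatenate([np.abs(fwd), np.abs(bwd)], axis=0)  # magnitudes (simplification)
    horiz = diff.mean(axis=2)                  # pool over height -> (.., C, W)
    vert = diff.mean(axis=3)                   # pool over width  -> (.., C, H)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    return sig(vert.mean(axis=0)), sig(horiz.mean(axis=0))

x = np.random.rand(4, 8, 6, 5)
att_h, att_w = coord_attention(x)
print(att_h.shape, att_w.shape)  # (8, 6) (8, 5)
```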
Regarding claim 5, Gu teaches:
The method according to claim 4, wherein the determining the second representation sub-vectors based on the first representation sub-vectors, the weight matrix, and the convolution kernel comprises:
performing an element-wise multiplication operation on the vertical attention weight matrix, the horizontal attention weight matrix, and the first representation sub-vectors, and combining a result of the element-wise multiplication operation with the first representation sub-vectors to obtain third representation sub-vectors; (Gu equation 5, element wise multiplication of the attention matrices and the sub-vectors) and
performing a convolution operation on the third representation sub-vectors by using the convolution kernel to obtain the second representation sub-vectors. (Gu, equation 5 depth wise convolution)
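For illustration only (not part of the record; the replicate padding and the per-channel-shared temporal kernel are the editor's assumptions), the element-wise attention, residual combination, and convolution with the learned kernel mapped to claim 5 can be sketched as:

```python
import numpy as np

def apply_attention_and_kernel(x, att_h, att_w, kernel):
    """x: (T, C, H, W); att_h: (C, H); att_w: (C, W); kernel: (K,) temporal
    weights. Element-wise multiplication with both attention maps, a
    residual add of the input, then a temporal convolution."""
    attended = x * att_h[None, :, :, None] * att_w[None, :, None, :]
    y = x + attended                               # combine with the sub-vectors
    t, k = y.shape[0], kernel.shape[0]
    pad = k // 2
    yp = np.concatenate([y[:1]] * pad + [y] + [y[-1:]] * pad, axis=0)  # replicate pad
    return sum(kernel[i] * yp[i : i + t] for i in range(k))  # conv along time

x = np.random.rand(4, 2, 3, 3)
out = apply_attention_and_kernel(x, np.ones((2, 3)), np.ones((2, 3)),
                                 np.array([0.25, 0.5, 0.25]))
print(out.shape)  # (4, 2, 3, 3)
```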
Regarding claim 6, Gu teaches:
The method according to claim 1, wherein the method further comprises:
performing a global average pooling operation on a second representation vector corresponding to the respective video snippet of the N video snippets to obtain a global representation vector with a compressed spatial dimension; (Gu page 747, Formally, let tensor be the module input. It is processed by GAP (global average pooling) to obtain a global representation)
dividing the global representation vector into a first global representation sub-vector and a second global representation sub-vector, wherein the first global representation sub-vector represents the respective video snippet corresponding to the second representation vector, and the second global representation sub-vector represents interaction information between the respective video snippet corresponding to the second representation vector and at least one adjacent video snippet; (Gu, page 747, It is first processed by the GAP to obtain a global representation and then passed through a two-branch structure for different interaction modelling…Among them one branch directly captures inter-snippet interaction without introducing intra-snippet information….The other branch inter-snippet motion attention which is designed to be computationally efficient while containing a larger intra-snippet field of view) and
determining the inter-snippet representation vector for the respective video snippet based on the global representation vector, the first global representation sub-vector, and the second global representation sub-vector. (Gu page 748 equation 11)
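For illustration only (not part of the record; the two branch callables are placeholders for Gu's interaction-modelling and motion-attention branches), the Inter-SIM skeleton mapped to claim 6 (GAP, then a two-branch division of the global representation) can be sketched as:

```python
import numpy as np

def inter_sim_skeleton(snippet_feats, branch_a, branch_b):
    """snippet_feats: (N, T, C, H, W). Global average pooling squeezes the
    spatial dimension; the pooled global representation is then divided
    across two branches that model different interactions."""
    g = snippet_feats.mean(axis=(3, 4))   # (N, T, C): global representation
    a = branch_a(g)                       # first sub-vector: snippet-wise interaction
    b = branch_b(g)                       # second sub-vector: inter-snippet motion
    return g, a, b

feats = np.random.rand(3, 4, 8, 5, 5)
g, a, b = inter_sim_skeleton(feats, lambda z: z, lambda z: z)
print(g.shape)  # (3, 4, 8)
```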
Regarding claim 7, Gu teaches:
The method according to claim 6, wherein the dividing the global representation vector comprises:
performing a convolution operation on the global representation vector by using a first convolution kernel to obtain a global representation vector with a reduced dimension; (Gu page 747, Among them, one branch directly captures the inter-snippet interaction without introducing intra-snippet information….Equation 7…where Conv is a spatial convolution with a kernel size 3X1 for snippet-wise feature extraction and dimension reduction)
performing a normalization operation on the global representation vector with the reduced dimension to obtain a normalized global representation vector; (Gu, page 748 equation 7, BN batch normalization)
performing a deconvolution operation on the normalized global representation vector by using a second convolution kernel to obtain the first global representation sub-vector with a same dimension as the global representation vector; (Gu, page 747, Conv 1×1 stands for convolution with size 1×1 for dimension recovery)
performing a bidirectional temporal difference operation on the global representation vector to determine a second difference matrix and a third difference matrix between the respective video snippet corresponding to the second representation vector and adjacent video snippets; (Gu, page 747, equations 8 and 9 are the bi-directional facial movements) and
generating the second global representation sub-vector based on the second difference matrix and the third difference matrix. (Gu, page 747 equation 10)
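For illustration only (not part of the record; the edge-snippet handling, where boundary snippets reuse themselves as neighbors, and the additive combination are the editor's assumptions), the bidirectional inter-snippet differences mapped to claim 7 can be sketched as:

```python
import numpy as np

def snippet_motion(g: np.ndarray) -> np.ndarray:
    """g: (N, C), one global vector per snippet. Differences with the next
    and previous snippet approximate the two bidirectional difference
    matrices; their combination forms the motion-based sub-vector."""
    nxt = np.concatenate([g[1:], g[-1:]], axis=0)   # next-snippet neighbor
    prv = np.concatenate([g[:1], g[:-1]], axis=0)   # previous-snippet neighbor
    d_fwd = nxt - g                                 # second difference matrix
    d_bwd = prv - g                                 # third difference matrix
    return d_fwd + d_bwd                            # combined motion cue

g = np.arange(12, dtype=float).reshape(4, 3)
m = snippet_motion(g)
print(m.shape)  # (4, 3)
```

On a linearly increasing sequence the forward and backward differences cancel for interior snippets, so only the boundary snippets carry nonzero motion here.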
Regarding claim 8, Gu teaches:
The method according to claim 6, wherein the determining the inter-snippet representation vector for the respective video snippet based on the global representation vector, the first global representation sub-vector, and the second global representation sub-vector comprises:
performing an element-wise multiplication operation on the first global representation sub-vector, the second global representation sub-vector, and the global representation vector, and combining a result of the element-wise multiplication operation with the global representation vector to obtain a third global representation sub-vector; (Gu page 748 equation 11) and
performing a convolution operation on the third global representation sub-vector by using a third convolution kernel to obtain the inter-snippet representation vector for the respective video snippet. (Gu, page 748 equation 11)
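For illustration only (not part of the record; the final convolution is simplified to a matrix multiplication), the fusion mapped to claim 8 (element-wise multiplication of both sub-vectors with the global representation, a residual combination, then a final convolution) can be sketched as:

```python
import numpy as np

def fuse(global_rep, branch_a, branch_b, w):
    """Element-wise product of both branch outputs with the global
    representation, residual-added back onto it, then a linear map
    (stand-in for the third convolution kernel)."""
    gated = global_rep * branch_a * branch_b   # element-wise multiplication
    combined = global_rep + gated              # combine with global representation
    return combined @ w                        # final 'convolution' as a matmul

g = np.ones((3, 4))
out = fuse(g, np.full((3, 4), 0.5), np.full((3, 4), 2.0), np.eye(4))
print(out)  # every entry 2.0: 1 + 1 * 0.5 * 2.0
```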
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 9-20 are rejected under 35 U.S.C. 103 as being unpatentable over Gu (the same reference applied above) in view of Jourads (US 2020/0110970).
Regarding claim 9, Gu teaches:
A video detection apparatus, comprising:
configured to extract N video snippets from a video, each video snippet of the N video snippets comprising M frames, the N video snippets comprising an initial object, and both N and M being positive integers greater than or equal to 2; (Gu, page 746 To this end we sample video sequence uniformly into U snippets, each of which contains T successive frames rather than a single frame like previous works do. See also Figure 2 with 3 snippets containing 4 frames) and
determine a representation vector of the N video snippets, and determine a target recognition result based on the representation vector, the recognition result representing a probability that the initial object is an edited object, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6. Page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kinds of interaction modeling, as shown in Fig 2. See also Equation 11. See also Figure 2, determination if video is fake/real) wherein
the representation vector is determined based on intra-snippet representation vectors and inter-snippet representation vectors, each intra-snippet representation vector corresponding to a respective video snippet of the N video snippets and representing inconsistent information between frames in the respective video snippet of the N video snippets, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6) and each inter-snippet representation vector corresponding to a respective video snippet of the N video snippets and representing inconsistent information between the respective video snippet and one or more adjacent video snippets of the N video snippets. (Gu, page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kinds of interaction modeling, as shown in Fig 2. See also Equation 11)
Gu fails to explicitly teach:
processing circuitry
Jourads teaches:
processing circuitry (Jourads, [0115] processor and memory)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to implement the method of Gu on the processor of Jourads. The rationale for the combination is the combination of prior art elements according to known methods to yield the predictable result of a computer-implemented neural network.
Regarding claim 10, the combination of Gu and Jourads teaches:
A video detection apparatus, comprising:
processing circuitry (Jourads, [0115] processor and memory)
configured to extract N video snippets from a video, each video snippet of the N video snippets comprising M frames, the N video snippets comprising an initial object, and both N and M being positive integers greater than or equal to 2; (Gu, page 746 To this end we sample video sequence uniformly into U snippets, each of which contains T successive frames rather than a single frame like previous works do. See also Figure 2 with 3 snippets containing 4 frames) and
a neural network model, configured to obtain a recognition result based on the N video snippets, the recognition result representing a probability that the initial object is an edited object, the neural network model comprising a backbone network and a classification network, (Gu Figure 2, ResNet backbone determines if video is real/fake) the
backbone network being configured to determine a representation vector of the N video snippets, and the classification network being configured to determine the recognition result based on the representation vector, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6. Page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kinds of interaction modeling, as shown in Fig 2. See also Equation 11. See also Figure 2 determination if video is fake/real)
wherein the backbone network comprises an intra-snippet recognition module and an inter-snippet recognition module, the intra-snippet recognition module being configured to determine intra-snippet representation vectors, each corresponding to a respective video snippet of the N video snippets and representing inconsistent information between frames in the respective video snippet of the N video snippets, (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6) and the inter-snippet recognition module being configured to determine inter-snippet representation vectors, each corresponding to a respective video snippet of the N video snippets and representing inconsistent information between the respective video snippet and one or more adjacent video snippets of the N video snippets, (Gu, page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kinds of interaction modeling, as shown in Fig 2. See also Equation 11) and the representation vector being based on the intra-snippet representation vectors and the inter-snippet representation vectors. (Please see the citations above; see also Figure 2, determination if video is fake/real)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to implement the method of Gu on the processor of Jourads. The rationale for the combination is the combination of prior art elements according to known methods to yield the predictable result of a computer-implemented neural network.
Regarding claim 11, the combination of Gu and Jourads teaches:
The apparatus according to claim 10, wherein the model further comprises:
processing circuitry (Jourads, [0115] processor and memory) configured to obtain original representation vectors of the N video snippets; (Gu, page 746 To this end we sample video sequence uniformly into U snippets, each of which contains T successive frames rather than a single frame like previous works do. See also Figure 2 with 3 snippets containing 4 frames)
a first network structure, configured to determine a first representation vector corresponding to the respective video snippet of the N video snippets inputted to the intra-snippet recognition module based on the original representation vectors; (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6)
the intra-snippet recognition module, configured to determine an intra-snippet representation vector based on the first representation vector; (Gu, page 746, Intra-Snippet Inconsistency Module (Intra-SIM) then takes frames within each snippet to model the local inconsistency encoded in subtle motions. See also page 747, equation 6)
a second network structure, configured to determine a second representation vector corresponding to the respective video snippet of the N video snippets inputted to the inter-snippet recognition module based on the original representation vectors; (Gu page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kind of interaction modeling, as shown in Fig 2. See also Equation 11)
the inter-snippet recognition module, configured to determine an inter-snippet representation vector based on the second representation vector; and a third network structure, configured to determine the representation vector based on the intra-snippet representation vector and the inter-snippet representation vector. (Gu page 747, Therefore, our Inter-Snippet Interaction Module (Inter-SIM) focuses on promoting the interaction across snippets from a global view to enhance the representation via a novel structure with different kind of interaction modeling, as shown in Fig 2. See also Equation 11)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to implement the method of Gu on the processor of Jourads. The rationale for the combination is the combination of prior art elements according to known methods to yield the predictable result of a computer-implemented neural network.
Regarding claim 12, the combination of Gu and Jourads teaches:
The apparatus according to claim 10, wherein the backbone network comprises: plural intra-snippet recognition modules and inter-snippet recognition modules that are alternately placed. (Gu, see figure 2)
Regarding claim 13, the combination of Gu and Jourads teaches:
A computer-readable storage medium storing computer-readable instructions thereon (Jourads, [0115] processor and memory), which, when executed by a computer, cause the computer to implement the method according to claim 1. (Please see discussion of claim 1 above)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to implement the method of Gu on the processor of Jourads. The rationale for the combination is the combination of prior art elements according to known methods to yield the predictable result of a computer-implemented neural network.
Regarding claim 14, the combination of Gu and Jourads teaches:
The apparatus according to claim 10, wherein the neural network model is further configured to:
divide a first representation vector corresponding to the respective video snippet of the N video snippets along a channel dimension to obtain first representation sub-vectors; (Gu, page 746 We first split I into two equal parts along the channel dimension to get I1 and I2 and then feed them into subsequent branches)
determine a convolution kernel based on the first representation sub-vectors, wherein the convolution kernel is a convolution kernel corresponding to the first representation vector; (Gu, page 746, equation (1))
determine a weight matrix corresponding to the first representation sub-vectors, wherein the weight matrix is configured for extracting motion information between adjacent frames based on an attention mechanism; (Gu page 747, Averaging these features and applying a sigmoid function, the horizontal and vertical AttenH and AttenW can be obtained)
determine second representation sub-vectors based on the first representation sub-vectors, the weight matrix, and the convolution kernel; (Gu page 747 Equation 5) and
splice the first representation sub-vectors and the second representation sub-vectors into the intra-snippet representation vector corresponding to the respective video snippet of the N video snippets. (Gu page 747, Equation 6)
Regarding claim 15, the combination of Gu and Jourads teaches:
The apparatus according to claim 14, wherein the neural network model is further configured to:
perform a global average pooling operation on each of the first representation sub-vectors to obtain respective first representation sub-vectors with a compressed spatial dimension; (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension)
perform a fully connected operation on the first representation sub-vectors with the compressed spatial dimension to determine an initial convolution kernel; (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension and then two fully connected layers are performed) and
perform a normalization operation on the initial convolution kernel to obtain the convolution kernel. (Gu page 747 In this learning process we first exploit a global average pooling operation to squeeze the spatial dimension and then two fully connected layers are performed finally a softmax operation comes up)
Regarding claim 16, the combination of Gu and Jourads teaches:
The apparatus according to claim 14, wherein the neural network model is further configured to:
perform a bidirectional temporal difference operation on the first representation sub-vectors to determine a first difference matrix between adjacent frames in the respective video snippet corresponding to the first representation vector; (Gu page 746 To model the temporal relation, Intra-SMA applies temporal bi-directional temporal difference guided coordinate attention to make the network attend to local motions)
reshape the first difference matrix into a horizontal inconsistency parameter matrix and a vertical inconsistency parameter matrix along a horizontal dimension and a vertical dimension respectively; (Gu, pages 746-747, the difference is reshaped into two coordinate-wise representations, which further go through a multi-scale structure to capture fine-grained and short-term motion information….forward vertical inconsistency and forward horizontal inconsistency) and
determine a vertical attention weight matrix and a horizontal attention weight matrix based on the horizontal inconsistency parameter matrix and the vertical inconsistency parameter matrix respectively, wherein the weight matrix comprises the vertical attention weight matrix and the horizontal attention weight matrix. (Gu page 747, Averaging these features and applying a sigmoid function, the horizontal and vertical AttenH and AttenW can be obtained)
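The claimed steps (bidirectional adjacent-frame differences, pooling along each spatial axis into horizontal and vertical inconsistency matrices, and a sigmoid to obtain the attention weight matrices) may be sketched as follows. In the full model each direction is transformed by its own learned layers; here raw difference magnitudes stand in as the inconsistency measure, an assumption of the sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention_weights(frames):
    """frames: (T, C, H, W) sub-vectors for the T frames of one snippet."""
    # forward and backward differences between adjacent frames
    diff_fwd = frames[1:] - frames[:-1]
    diff_bwd = frames[:-1] - frames[1:]
    # stand-in inconsistency measure: averaged difference magnitudes
    inconsistency = 0.5 * (np.abs(diff_fwd) + np.abs(diff_bwd))  # (T-1, C, H, W)
    vert = inconsistency.mean(axis=3)     # pool width  -> vertical matrix   (T-1, C, H)
    horz = inconsistency.mean(axis=2)     # pool height -> horizontal matrix (T-1, C, W)
    # average over frame pairs, then sigmoid -> AttenH, AttenW
    atten_h = sigmoid(vert.mean(axis=0))  # vertical attention weights   (C, H)
    atten_w = sigmoid(horz.mean(axis=0))  # horizontal attention weights (C, W)
    return atten_h, atten_w
```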
Regarding claim 17, the combination of Gu and Jourads teaches:
The apparatus according to claim 16, wherein the neural network model is further configured to:
perform an element-wise multiplication operation on the vertical attention weight matrix, the horizontal attention weight matrix, and the first representation sub-vectors, and combine a result of the element-wise multiplication operation with the first representation sub-vectors to obtain third representation sub-vectors; (Gu, equation 5, element-wise multiplication of the attention matrices and the sub-vectors) and
perform a convolution operation on the third representation sub-vectors by using the convolution kernel to obtain the second representation sub-vectors. (Gu, equation 5, depth-wise convolution)
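The two steps of this claim (element-wise multiplication by both attention weight matrices with a residual combination, then filtering with the generated kernel) can be sketched as below. A plain per-row 1-D correlation stands in for Gu's depth-wise convolution; broadcasting details are assumptions of the sketch:

```python
import numpy as np

def apply_attention_and_conv(x, atten_h, atten_w, kernel):
    """x: (C, H, W) first representation sub-vector;
    atten_h: (C, H) vertical weights; atten_w: (C, W) horizontal weights."""
    # element-wise multiplication by both attention matrices (broadcast)
    attended = x * atten_h[:, :, None] * atten_w[:, None, :]
    # combine the result with the original sub-vector (residual)
    third = x + attended
    # filter each row with the generated 1-D kernel (depth-wise stand-in)
    out = np.empty_like(third)
    for c in range(third.shape[0]):
        for i in range(third.shape[1]):
            out[c, i] = np.convolve(third[c, i], kernel, mode="same")
    return out
```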
Regarding claim 18, the combination of Gu and Jourads teaches:
The apparatus according to claim 10, wherein the neural network model is further configured to:
perform a global average pooling operation on a second representation vector corresponding to the respective video snippet of the N video snippets to obtain a global representation vector with a compressed spatial dimension; (Gu page 747, Formally, let tensor be the module input. It is processed by GAP (global average pooling) to obtain a global representation)
divide the global representation vector into a first global representation sub-vector and a second global representation sub-vector, wherein the first global representation sub-vector represents the respective video snippet corresponding to the second representation vector, and the second global representation sub-vector represents interaction information between the respective video snippet corresponding to the second representation vector and at least one adjacent video snippet; (Gu, page 747 It is first processed by the GAP to obtain a global representation and then passed through a two-branch structure for different interaction modelling …Among them one branch directly captures inter-snippet interaction without introducing intra-snippet information….The other branch inter-snippet motion attention which is designed to be computationally efficient while containing a larger intra-snippet field of view) and
determine the inter-snippet representation vector for the respective video snippet based on the global representation vector, the first global representation sub-vector, and the second global representation sub-vector. (Gu page 748 equation 11)
Regarding claim 19, the combination of Gu and Jourads teaches:
The apparatus according to claim 18, wherein the neural network model is further configured to:
perform a convolution operation on the global representation vector by using a first convolution kernel to obtain a global representation vector with a reduced dimension; (Gu page 747, Among them, one branch directly captures the inter-snippet interaction without introducing intra-snippet information….Equation 7…where Conv is a spatial convolution with a kernel size 3X1 for snippet-wise feature extraction and dimension reduction)
perform a normalization operation on the global representation vector with the reduced dimension to obtain a normalized global representation vector; (Gu, page 748 equation 7, BN batch normalization)
perform a deconvolution operation on the normalized global representation vector by using a second convolution kernel to obtain the first global representation sub-vector with a same dimension as the global representation vector; (Gu, page 747, Conv 1X1 stands for convolution with size 1X1 for dimension recovery)
perform a bidirectional temporal difference operation on the global representation vector to determine a second difference matrix and a third difference matrix between the respective video snippet corresponding to the second representation vector and adjacent video snippets; (Gu, page 747, equations 8 and 9 are the bi-directional facial movements) and
generate the second global representation sub-vector based on the second difference matrix and the third difference matrix. (Gu, page 747 equation 10)
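The two branches recited in this claim (dimension reduction, normalization, and dimension recovery in one branch; forward and backward snippet differences combined in the other) can be sketched as below. The weight shapes, the batch-norm-style normalization, and the zero-padding of the boundary snippets are assumptions of the sketch, not Gu's exact formulation:

```python
import numpy as np

def inter_snippet_branches(g, w_reduce, w_recover):
    """g: (N, C) global representation, one row per video snippet."""
    # branch 1: snippet-wise dimension reduction, normalization, recovery
    reduced = g @ w_reduce                             # (N, C) -> (N, C//r)
    normed = (reduced - reduced.mean(axis=0)) / (reduced.std(axis=0) + 1e-5)
    first_sub = normed @ w_recover                     # dimension recovery -> (N, C)
    # branch 2: bidirectional differences with adjacent snippets
    zeros = np.zeros((1, g.shape[1]))
    fwd = np.vstack([g[1:] - g[:-1], zeros])           # g[t+1] - g[t] (none for last)
    bwd = np.vstack([zeros, g[1:] - g[:-1]])           # g[t] - g[t-1] (none for first)
    second_sub = 0.5 * (fwd + bwd)                     # combine the two difference matrices
    return first_sub, second_sub
```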
Regarding claim 20, the combination of Gu and Jourads teaches:
The apparatus according to claim 18, wherein the neural network model is further configured to:
perform an element-wise multiplication operation on the first global representation sub-vector, the second global representation sub-vector, and the global representation vector, and combine a result of the element-wise multiplication operation with the global representation vector to obtain a third global representation sub-vector; (Gu page 748 equation 11) and
perform a convolution operation on the third global representation sub-vector by using a third convolution kernel to obtain the inter-snippet representation vector for the respective video snippet. (Gu, page 748 equation 11)
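The fusion recited in this claim (element-wise multiplication of both global sub-vectors with the global representation, residual combination, then convolution with a third kernel) can be sketched as below. The per-snippet 1-D convolution is a stand-in for the operation of Gu's equation 11, whose exact form is not reproduced here:

```python
import numpy as np

def fuse_inter_snippet(g, first_sub, second_sub, kernel):
    """g, first_sub, second_sub: (N, C); kernel: 1-D third convolution kernel."""
    # element-wise multiplication of the two sub-vectors with g,
    # combined with g (residual) to form the third global sub-vector
    third = g + first_sub * second_sub * g
    # convolution with the third kernel -> inter-snippet representation
    out = np.empty_like(third)
    for n in range(third.shape[0]):
        out[n] = np.convolve(third[n], kernel, mode="same")
    return out
```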
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. Refer to PTO-892, Notice of References Cited for a listing of analogous art.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Molly K Wilburn whose telephone number is (571)272-3589. The examiner can normally be reached Monday-Friday 8am-4pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emily Terrell can be reached at (571) 270-3717. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Molly Wilburn/Primary Examiner, Art Unit 2666