Prosecution Insights
Last updated: April 19, 2026
Application No. 18/467,160

MACHINE LEARNING TO DETECT FAKE VIDEOS

Final Rejection §103
Filed
Sep 14, 2023
Examiner
ZAK, JACQUELINE ROSE
Art Unit
2666
Tech Center
2600 — Communications
Assignee
International Business Machines Corporation
OA Round
2 (Final)
Grant Probability
67% (Favorable)
OA Rounds
3-4
To Grant
2y 10m
With Interview
55%

Examiner Intelligence

Grants 67% of resolved cases (above average)

Career Allow Rate
67% (8 granted / 12 resolved; +4.7% vs TC avg)
Interview Lift
-11.4% across resolved cases with interview (minimal)
Avg Prosecution
2y 10m (typical timeline)
Total Applications
58 across all art units (46 currently pending)
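The roll-ups above can be reproduced from per-case disposition records. Below is a minimal sketch, assuming a hypothetical record layout and sample values; it is not the actual pipeline behind this report.

```python
# Hypothetical sketch of the examiner metrics above (allow rate, interview
# lift, average prosecution time). Records and values are illustrative only.
from dataclasses import dataclass

@dataclass
class ResolvedCase:
    granted: bool
    had_interview: bool
    months_to_disposition: int

cases = [
    ResolvedCase(True, False, 30), ResolvedCase(True, False, 28),
    ResolvedCase(True, True, 36),  ResolvedCase(False, True, 40),
    # ... one record per resolved case (12 in this examiner's career data)
]

allow_rate = sum(c.granted for c in cases) / len(cases)

with_iv = [c for c in cases if c.had_interview]
without_iv = [c for c in cases if not c.had_interview]
# Interview lift = allowance rate with an interview minus the rate without one.
# A negative value, as in the panel above (-11.4%), means interviews
# correlated with fewer grants for this examiner.
lift = (sum(c.granted for c in with_iv) / len(with_iv)
        - sum(c.granted for c in without_iv) / len(without_iv))

avg_years = sum(c.months_to_disposition for c in cases) / len(cases) / 12
print(f"allow rate {allow_rate:.0%}, interview lift {lift:+.1%}, avg {avg_years:.1f}y")
```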

Statute-Specific Performance

§101
5.7% (-34.3% vs TC avg)
§103
56.3% (+16.3% vs TC avg)
§102
21.1% (-18.9% vs TC avg)
§112
13.8% (-26.2% vs TC avg)

Tech Center averages are estimates • Based on career data from 12 resolved cases
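Note that all four deltas imply the same baseline: 5.7 + 34.3 = 56.3 - 16.3 = 21.1 + 18.9 = 13.8 + 26.2 = 40.0. The panel can therefore be reproduced from a single estimated Tech Center average, as in this sketch (layout hypothetical):

```python
# Reproduce the statute panel above from a single estimated TC baseline of 40.0%.
TC_BASELINE = 0.400  # estimated Tech Center average allowance after rejection

examiner_rate = {"§101": 0.057, "§103": 0.563, "§102": 0.211, "§112": 0.138}
for statute, rate in examiner_rate.items():
    print(f"{statute}: {rate:.1%} ({rate - TC_BASELINE:+.1%} vs TC avg)")
```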

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Status

Claims 1-8, 10-14, 16-19, and 21-23 are pending for examination following the response filed 12/23/2025. Claims 1, 5, 7, 10-11, 13-14, 16, and 18-19 have been amended; claims 9, 15, and 20 have been cancelled; claims 21-23 are new.

Response to Arguments and Amendments

The objections to claims 7, 11, and 16 have been withdrawn in light of the amendments. Applicant’s arguments with respect to independent claims 1, 11, and 16 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument, as necessitated by the newly added amendments.

Claim Objections

Amended claims 11 and 16 are objected to because additions and deletions are not marked. The amendments to the claims filed on 12/23/2025 do not comply with the requirements of 37 CFR 1.121(c) because independent claims 11 and 16 include the amended limitation: “identifying one or more characteristics of the statement, wherein the one or more characteristics comprise at least one of source information for the video, or one or more timestamps in the video”. The previously filed claims 11 and 16, filed on 11/20/2023, include the limitation: “identifying one or more characteristics of the statement, wherein the one or more characteristics comprise source information for the video, source information for an reference video, and one or more timestamps in the video”. Additions and deletions that were not marked have been italicized for clarity. Claim text with markings is required for additions and deletions. Amendments to the claims filed on or after July 30, 2003 must comply with 37 CFR 1.121(c). Appropriate correction is required.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors.
In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-2, 7, 11-12, 14, 16-17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Michaeli (US20240127630A1) in view of Dalins (Dalins, Janis, Campbell Wilson, and Douglas Boudry. "PDQ & TMK+ PDQF--A Test Drive of Facebook's Perceptual Hashing Algorithms." arXiv preprint arXiv:1912.07745 (2019)).

Regarding claim 1, Michaeli teaches a method (Fig. 2) comprising: accessing a statement asserting that a video is fake ([0013] Where any of the sequential analyses detect an anomaly, an alert indicating the presence of deepfake content is generated. This process may repeat as a loop for a series of synchronous observations, analyzing the audio-visual content frame by frame in two dimensions to detect deepfake modification to the speech or video of the human speaker); identifying one or more characteristics of the statement, wherein the one or more characteristics comprise at least one of source information for the video or one or more timestamps in the video ([0088] Identifiers for the manipulated portions of content (recorded as described in process block 225 above) are retrieved from storage and written into the alert to indicate pixel locations and audio frequency ranges of the deepfake content…In one embodiment, the alert includes a time (such as a time stamp, a range of time, an observation, or a frame) at which the detected deepfake content is present in the audio-visual content); performing consistency checks between audio and video of the video ([0114] To synchronize the uniformly sampled video and audio time series signals, the signals are phase-shifted so as to maximize correlation among the signals. The uniformly sampled video and audio time series signals are synchronized with each other using a synchronization technique such as a correlogram technique, a cross power spectral density technique, or a genetic algorithm technique); examining one or more video features of the video to detect modifications ([0122] in one embodiment, the ML model includes a variable for each color channel of each pixel of a frame of the video content, and a variable for each audio-frequency range (bin) of the audio content. For example, where a frame of the video content is 426×240 three-channel pixels, the number of variables is 57,600 (as discussed below with reference to Table 1). Thus, the ML model is configured to predict the value of variable 1 based on one or more of variables 2 through 57,600, and so on for all the variables.
[0123] The ML model has thus learned correlation patterns between variables that indicate when speech by the human speaker is authentic and free of deepfake injection of words or changes to mouth, facial, or other motions of the speaker); identifying overlapping clips associated with the video ([0102] For example, reference data set 172 may be time series signals derived from a reference audio-video recording of the speaker speaking on a different occasion from that shown in audio-visual content 130. Before generating the residual time series signals, deepfake detection method 200 trains a machine learning model to generate the machine learning estimates of authentic delivery of the speech based on the reference set of time series signals. [0011] For example, the deepfake detection system analyzes a two-dimensional matrix of residuals between ML-estimated and actual values of audio-visual content for one point in time with a two-dimensional sequential analysis to detect anomalies in the residuals. [0013] The selected residual values are placed into a two-dimensional array representing a frame of the audio-visual content); and generating an output indicating that the video is fake based at least in part on the one or more characteristics of the statement ([0088] In one embodiment, the alert is an electronic message. In one embodiment, deepfake detection method 200 composes an electronic message which indicates or communicates that there is deepfake content in the audio-visual content), the consistency checks ([0093] At the completion of process block 230, an electronic alert message that indicates deepfake content included a synchronous observation (for a particular point in time) of the audio-visual content has been generated), the video features ([0123] The ML model has thus learned correlation patterns between variables that indicate when speech by the human speaker is authentic and free of deepfake injection of words or changes to mouth, facial, or other motions of the speaker), and the identified overlapping clips ([0011] In one embodiment, a deepfake detection system detects deepfake content with a moment-by-moment two-dimensional analysis of video and audio of a human speaker. For example, the deepfake detection system analyzes a two-dimensional matrix of residuals between ML-estimated and actual values of audio-visual content for one point in time with a two-dimensional sequential analysis to detect anomalies in the residuals. The ML-estimated values are consistent with authentic speech by the human speaker. An anomaly in the residuals indicates the presence of deepfake content. [Abstract] In response to detection of the anomaly, an alert that deepfake content is detected in the audio-visual content is generated).

Michaeli does not teach identifying overlapping clips associated with the video, comprising: generating, using a hash function, a sequence of hash values representing a plurality of sampling frames of the video, identifying a reference video by comparing the sequence of hash values with a plurality of reference hash sequences stored in a database, and identifying one or more segments of the video as the overlapping clips by determining, for each respective sampling frame of the video, that a frame level similarity between a hash value of the respective sampling frame of the video and a hash value of a corresponding frame of the reference video exceeds a defined threshold.
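Before turning to Dalins, the hash-based clip identification recited in this limitation can be sketched at toy scale. The sketch below substitutes hypothetical 64-bit frame hashes and Hamming-bit similarity for the PDQ/TMK descriptors discussed next; the function names are invented, and the 0.7 threshold is borrowed from Dalins for illustration only.

```python
# Toy sketch of the claimed hash-based overlap detection (not PDQ/TMK itself):
# hash each sampled frame, compare against stored reference hash sequences,
# and mark frames whose frame-level similarity clears a defined threshold.
from typing import Dict, List

def hamming_similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of identical bits between two 64-bit frame hashes."""
    return 1.0 - bin(a ^ b).count("1") / bits

def find_overlapping_clips(
    query: List[int],                    # one hash per sampled frame of the video
    reference_db: Dict[str, List[int]],  # stored reference hash sequences
    threshold: float = 0.7,              # cf. Dalins' recommended 0.7
) -> Dict[str, List[int]]:
    """Return, per reference video, the query frame indices that match."""
    overlaps: Dict[str, List[int]] = {}
    for ref_id, ref in reference_db.items():
        n = min(len(query), len(ref))
        matched = [i for i in range(n)
                   if hamming_similarity(query[i], ref[i]) >= threshold]
        if matched:
            overlaps[ref_id] = matched
    return overlaps

# Toy usage: "ref_a" shares its first two frame hashes with the query video.
db = {"ref_a": [0xFFFF0000FFFF0000, 0x0F0F0F0F0F0F0F0F, 0x0, 0x0]}
query = [0xFFFF0000FFFF0000, 0x0F0F0F0F0F0F0F0F, 2**64 - 1]
print(find_overlapping_clips(query, db))  # {'ref_a': [0, 1]}
```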
Dalins, in the same field of endeavor of video hashing, teaches identifying overlapping clips associated with the video, comprising: generating, using a hash function, a sequence of hash values representing a plurality of sampling frames of the video ([2.2] TMK + PDQF(‘TMK’ for brevity) uses a modified version of PDQ6 for image similarity, combined with the Temporal Match Kernel for measuring time-related information. Its operation is summarised thus within documentation: (1) resampling videos to a common frame rate (15 frames per second), (2) calculating a frame descriptor, (3) computing averages within various periods and (4) generating a hash from the trigonometrically weighted averages…TMK stores hashes as 258KB binaries (extension .tmk)), identifying a reference video by comparing the sequence of hash values with a plurality of reference hash sequences stored in a database, and identifying one or more segments of the video as the overlapping clips by determining, for each respective sampling frame of the video, that a frame level similarity between a hash value of the respective sampling frame of the video and a hash value of a corresponding frame of the reference video exceeds a defined threshold ([3.1] All tests reported within this paper were conducted on a corpus of 225,887 images and 3,366 videos manually reviewed by Police and annotated as child exploitation materials. [5] As with PDQ, we utilised executables packaged with the project for our tests- in this case, tmk-hash-video for hash generation. Unlike PDQ, given the relative complexity of calculating similarities, we used the packaged binary (tmk-two-level-score) for hash comparison. Unlike PDQ, the TMK algorithm takes a two phased approach- if the first phase passes a pre-defined match threshold, then the second phase is attempted. If both results are higher (note contrast to PDQ) than the threshold, then a video is regarded as a match. For our tests, we followed the recommended threshold of 0.7 for both phases).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the method of Michaeli with the teachings of Dalins to identify overlapping clips by generating and comparing hash values because "Events such as the Christchurch terror attack of March 2019 demonstrated the potential not only for publication of such acts, but also their re-distribution en masse after minor modification in order to avoid traditional (cryptographic hash-based) detection and blocking methods by online content providers. The mandatory reporting of such materials by multiple providers worldwide would rapidly overwhelm law enforcement’s ability to review and investigate, unless a reliable, portable and acceptable means for measuring similarity is adopted" [Dalins 1] and "Unlike binary-level digest algorithms such as MD5 and SHA-1, perceptual hashes are a mode of fuzzy hashing operating on materials as rendered to the end user, making them highly suitable for detecting lightly or imperceptibly altered materials" [Dalins 2].

Regarding claim 2, Michaeli and Dalins teach the method of claim 1. Michaeli further teaches synchronizing the consistency checks, the video features, and the identified overlapping clips with the one or more timestamps within the video identified by the statement ([0088] In one embodiment, the alert is an electronic message.
In one embodiment, deepfake detection method 200 composes an electronic message which indicates or communicates that there is deepfake content in the audio-visual content. [0093] At the completion of process block 230, an electronic alert message that indicates deepfake content included a synchronous observation (for a particular point in time) of the audio-visual content has been generated. [0027] Multivariate spatiotemporal characterization of video and integrated audio refers to individual description of many discrete portions of audio-visual content as variables in a spatial structure (such as an array) over a sequence of discrete points in time, as described herein. Multivariate spatiotemporal analysis of video and integrated audio refers to examination of the discrete portions of audio-visual content over dimensions of the array structure at synchronous observations, as described herein. [0011] For example, the deepfake detection system analyzes a two-dimensional matrix of residuals between ML-estimated and actual values of audio-visual content for one point in time with a two-dimensional sequential analysis to detect anomalies in the residuals).

Regarding claim 7, Michaeli and Dalins teach the method of claim 1. Michaeli further teaches wherein examining the one or more video features of the video to detect modifications comprises: processing the video to create the plurality of sampling frames at a defined interval; analyzing each of the sampling frames to extract the one or more video features using one or more image processing algorithms, wherein the one or more video features comprise at least one of (i) pixilation, (ii) shadows, (iii) colors, (iv) edges, or (v) textures; and determining the video has been manipulated based on processing the one or more video features using a machine learning model ([0122] in one embodiment, the ML model includes a variable for each color channel of each pixel of a frame of the video content, and a variable for each audio-frequency range (bin) of the audio content. For example, where a frame of the video content is 426×240 three-channel pixels, the number of variables is 57,600 (as discussed below with reference to Table 1). Thus, the ML model is configured to predict the value of variable 1 based on one or more of variables 2 through 57,600, and so on for all the variables. [0123] The ML model has thus learned correlation patterns between variables that indicate when speech by the human speaker is authentic and free of deepfake injection of words or changes to mouth, facial, or other motions of the speaker).
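As a toy illustration of the claim 7 pipeline just mapped (sampling frames at a defined interval, extracting per-frame features, feeding a downstream model), consider the following sketch. The features, shapes, and sampling interval are invented for illustration and are not taken from Michaeli.

```python
# Sketch of claim 7's steps: sample frames at a fixed interval, extract simple
# per-frame features (per-channel color means plus a crude edge-energy
# statistic), and stack them into a matrix for a downstream ML classifier.
import numpy as np

def sample_frames(video: np.ndarray, interval: int) -> np.ndarray:
    """video: (num_frames, H, W, 3) array; keep every `interval`-th frame."""
    return video[::interval]

def frame_features(frame: np.ndarray) -> np.ndarray:
    color_means = frame.mean(axis=(0, 1))          # one mean per color channel
    gray = frame.mean(axis=2)
    gy, gx = np.gradient(gray)
    edge_energy = np.hypot(gx, gy).mean()          # crude edge/texture statistic
    return np.concatenate([color_means, [edge_energy]])

video = np.random.rand(30, 8, 8, 3)                # stand-in for decoded frames
features = np.stack([frame_features(f) for f in sample_frames(video, 5)])
print(features.shape)  # (6, 4): one feature row per sampled frame
```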
Regarding claim 11, Michaeli teaches a system (Fig. 6), comprising: one or more computer processors (processor(s) 610); and one or more memories collectively containing one or more programs which when executed by the one or more computer processors performs an operation, the operation comprising ([0176] In one or more embodiments, one or more of the components described herein are configured as program modules stored in a non-transitory computer readable medium. The program modules are configured with stored instructions that when executed by at least a processor cause the computing device to perform the corresponding function(s) as described herein): accessing a statement asserting that a video is fake ([0013] Where any of the sequential analyses detect an anomaly, an alert indicating the presence of deepfake content is generated. This process may repeat as a loop for a series of synchronous observations, analyzing the audio-visual content frame by frame in two dimensions to detect deepfake modification to the speech or video of the human speaker); identifying one or more characteristics of the statement, wherein the one or more characteristics comprise at least one of source information for the video or one or more timestamps in the video ([0088] Identifiers for the manipulated portions of content (recorded as described in process block 225 above) are retrieved from storage and written into the alert to indicate pixel locations and audio frequency ranges of the deepfake content…In one embodiment, the alert includes a time (such as a time stamp, a range of time, an observation, or a frame) at which the detected deepfake content is present in the audio-visual content); performing consistency checks between audio and video of the video ([0114] To synchronize the uniformly sampled video and audio time series signals, the signals are phase-shifted so as to maximize correlation among the signals. The uniformly sampled video and audio time series signals are synchronized with each other using a synchronization technique such as a correlogram technique, a cross power spectral density technique, or a genetic algorithm technique); examining one or more video features of the video to detect modifications ([0122] in one embodiment, the ML model includes a variable for each color channel of each pixel of a frame of the video content, and a variable for each audio-frequency range (bin) of the audio content. For example, where a frame of the video content is 426×240 three-channel pixels, the number of variables is 57,600 (as discussed below with reference to Table 1). Thus, the ML model is configured to predict the value of variable 1 based on one or more of variables 2 through 57,600, and so on for all the variables. [0123] The ML model has thus learned correlation patterns between variables that indicate when speech by the human speaker is authentic and free of deepfake injection of words or changes to mouth, facial, or other motions of the speaker); identifying overlapping clips associated with the video ([0102] For example, reference data set 172 may be time series signals derived from a reference audio-video recording of the speaker speaking on a different occasion from that shown in audio-visual content 130. Before generating the residual time series signals, deepfake detection method 200 trains a machine learning model to generate the machine learning estimates of authentic delivery of the speech based on the reference set of time series signals. [0011] For example, the deepfake detection system analyzes a two-dimensional matrix of residuals between ML-estimated and actual values of audio-visual content for one point in time with a two-dimensional sequential analysis to detect anomalies in the residuals. [0013] The selected residual values are placed into a two-dimensional array representing a frame of the audio-visual content); and generating an output indicating that the video is fake based at least in part on the one or more characteristics of the statement ([0088] In one embodiment, the alert is an electronic message.
In one embodiment, deepfake detection method 200 composes an electronic message which indicates or communicates that there is deepfake content in the audio-visual content), the consistency checks ([0093] At the completion of process block 230, an electronic alert message that indicates deepfake content included a synchronous observation (for a particular point in time) of the audio-visual content has been generated), the video features ([0123] The ML model has thus learned correlation patterns between variables that indicate when speech by the human speaker is authentic and free of deepfake injection of words or changes to mouth, facial, or other motions of the speaker), and the identified overlapping clips ([0011] In one embodiment, a deepfake detection system detects deepfake content with a moment-by-moment two-dimensional analysis of video and audio of a human speaker. For example, the deepfake detection system analyzes a two-dimensional matrix of residuals between ML-estimated and actual values of audio-visual content for one point in time with a two-dimensional sequential analysis to detect anomalies in the residuals. The ML-estimated values are consistent with authentic speech by the human speaker. An anomaly in the residuals indicates the presence of deepfake content. [Abstract] In response to detection of the anomaly, an alert that deepfake content is detected in the audio-visual content is generated).

Michaeli does not teach identifying overlapping clips associated with the video, comprising: generating, using a hash function, a sequence of hash values representing a plurality of sampling frames of the video, identifying a reference video by comparing the sequence of hash values with a plurality of reference hash sequences stored in a database, and identifying one or more segments of the video as the overlapping clips by determining, for each respective sampling frame of the video, that a frame level similarity between a hash value of the respective sampling frame of the video and a hash value of a corresponding frame of the reference video exceeds a defined threshold.

Dalins, in the same field of endeavor of video hashing, teaches identifying overlapping clips associated with the video, comprising: generating, using a hash function, a sequence of hash values representing a plurality of sampling frames of the video ([2.2] TMK + PDQF(‘TMK’ for brevity) uses a modified version of PDQ6 for image similarity, combined with the Temporal Match Kernel for measuring time-related information. Its operation is summarised thus within documentation: (1) resampling videos to a common frame rate (15 frames per second), (2) calculating a frame descriptor, (3) computing averages within various periods and (4) generating a hash from the trigonometrically weighted averages…TMK stores hashes as 258KB binaries (extension .tmk)), identifying a reference video by comparing the sequence of hash values with a plurality of reference hash sequences stored in a database, and identifying one or more segments of the video as the overlapping clips by determining, for each respective sampling frame of the video, that a frame level similarity between a hash value of the respective sampling frame of the video and a hash value of a corresponding frame of the reference video exceeds a defined threshold ([3.1] All tests reported within this paper were conducted on a corpus of 225,887 images and 3,366 videos manually reviewed by Police and annotated as child exploitation materials.
[5] As with PDQ, we utilised executables packaged with the project for our tests- in this case, tmk-hash-video for hash generation. Unlike PDQ, given the relative complexity of calculating similarities, we used the packaged binary (tmk-two-level-score) for hash comparison. Unlike PDQ, the TMK algorithm takes a two phased approach- if the first phase passes a pre-defined match threshold, then the second phase is attempted. If both results are higher (note contrast to PDQ) than the threshold, then a video is regarded as a match. For our tests, we followed the recommended threshold of 0.7 for both phases).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the system of Michaeli with the teachings of Dalins to identify overlapping clips by generating and comparing hash values because "Events such as the Christchurch terror attack of March 2019 demonstrated the potential not only for publication of such acts, but also their re-distribution en masse after minor modification in order to avoid traditional (cryptographic hash-based) detection and blocking methods by online content providers. The mandatory reporting of such materials by multiple providers worldwide would rapidly overwhelm law enforcement’s ability to review and investigate, unless a reliable, portable and acceptable means for measuring similarity is adopted" [Dalins 1] and "Unlike binary-level digest algorithms such as MD5 and SHA-1, perceptual hashes are a mode of fuzzy hashing operating on materials as rendered to the end user, making them highly suitable for detecting lightly or imperceptibly altered materials" [Dalins 2].

Regarding claim 12, Michaeli and Dalins teach the system of claim 11. Michaeli further teaches synchronizing the consistency checks, the video features, and the identified overlapping clips with the one or more timestamps within the video identified by the statement ([0088] In one embodiment, the alert is an electronic message. In one embodiment, deepfake detection method 200 composes an electronic message which indicates or communicates that there is deepfake content in the audio-visual content. [0093] At the completion of process block 230, an electronic alert message that indicates deepfake content included a synchronous observation (for a particular point in time) of the audio-visual content has been generated. [0027] Multivariate spatiotemporal characterization of video and integrated audio refers to individual description of many discrete portions of audio-visual content as variables in a spatial structure (such as an array) over a sequence of discrete points in time, as described herein. Multivariate spatiotemporal analysis of video and integrated audio refers to examination of the discrete portions of audio-visual content over dimensions of the array structure at synchronous observations, as described herein. [0011] For example, the deepfake detection system analyzes a two-dimensional matrix of residuals between ML-estimated and actual values of audio-visual content for one point in time with a two-dimensional sequential analysis to detect anomalies in the residuals).
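Michaeli's variable-prediction scheme quoted above ([0122]-[0123]), in which each pixel or audio variable is estimated from the others and anomalous residuals flag manipulation, can be illustrated at toy scale. In the sketch below, ordinary least squares stands in for the reference's ML model, and the synthetic data and 3-sigma threshold are hypothetical.

```python
# Toy residual-based anomaly detection in the spirit of Michaeli [0122]-[0123]:
# learn to predict variable 0 from the remaining variables on authentic frames,
# then flag observations whose prediction residual is anomalously large.
import numpy as np

rng = np.random.default_rng(0)
authentic = rng.normal(size=(200, 10))             # rows = frame observations
authentic[:, 0] = authentic[:, 1:].sum(axis=1) * 0.1 + rng.normal(scale=0.01, size=200)

# Least-squares fit: predict variable 0 from variables 1..9.
coef, *_ = np.linalg.lstsq(authentic[:, 1:], authentic[:, 0], rcond=None)

resids = np.abs(authentic[:, 0] - authentic[:, 1:] @ coef)
threshold = resids.mean() + 3 * resids.std()       # hypothetical 3-sigma rule

def is_anomalous(frame: np.ndarray) -> bool:
    return abs(frame[0] - frame[1:] @ coef) > threshold

tampered = authentic[0].copy()
tampered[0] += 1.0                                 # simulate a manipulated value
print(is_anomalous(authentic[0]), is_anomalous(tampered))  # expect: False True
```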
Regarding claim 14, Michaeli and Dalins teach the system of claim 11. Michaeli further teaches wherein examining the one or more video features of the video to detect modifications comprises: processing the video to create the plurality of sampling frames at a defined interval; analyzing each of the sampling frames to extract the one or more video features using one or more image processing algorithms, wherein the one or more video features comprise at least one of (i) pixilation, (ii) shadows, (iii) colors, (iv) edges, or (v) textures; and determining the video has been manipulated based on processing the one or more video features using a machine learning model ([0122] in one embodiment, the ML model includes a variable for each color channel of each pixel of a frame of the video content, and a variable for each audio-frequency range (bin) of the audio content. For example, where a frame of the video content is 426×240 three-channel pixels, the number of variables is 57,600 (as discussed below with reference to Table 1). Thus, the ML model is configured to predict the value of variable 1 based on one or more of variables 2 through 57,600, and so on for all the variables. [0123] The ML model has thus learned correlation patterns between variables that indicate when speech by the human speaker is authentic and free of deepfake injection of words or changes to mouth, facial, or other motions of the speaker).

Regarding claim 16, Michaeli teaches a computer program product comprising one or more computer-readable storage media collectively containing computer-readable program code that, when executed by operation of one or more computer processors, performs an operation comprising ([0176] In one or more embodiments, one or more of the components described herein are configured as program modules stored in a non-transitory computer readable medium. The program modules are configured with stored instructions that when executed by at least a processor cause the computing device to perform the corresponding function(s) as described herein): accessing a statement alleging that a video is fake ([0013] Where any of the sequential analyses detect an anomaly, an alert indicating the presence of deepfake content is generated. This process may repeat as a loop for a series of synchronous observations, analyzing the audio-visual content frame by frame in two dimensions to detect deepfake modification to the speech or video of the human speaker); identifying one or more characteristics of the statement, wherein the one or more characteristics comprise at least one of source information for the video or one or more timestamps in the video ([0088] Identifiers for the manipulated portions of content (recorded as described in process block 225 above) are retrieved from storage and written into the alert to indicate pixel locations and audio frequency ranges of the deepfake content…In one embodiment, the alert includes a time (such as a time stamp, a range of time, an observation, or a frame) at which the detected deepfake content is present in the audio-visual content); performing consistency checks between audio and video of the video ([0114] To synchronize the uniformly sampled video and audio time series signals, the signals are phase-shifted so as to maximize correlation among the signals.
The uniformly sampled video and audio time series signals are synchronized with each other using a synchronization technique such as a correlogram technique, a cross power spectral density technique, or a genetic algorithm technique); examining one or more video features of the video to detect modifications ([0122] in one embodiment, the ML model includes a variable for each color channel of each pixel of a frame of the video content, and a variable for each audio-frequency range (bin) of the audio content. For example, where a frame of the video content is 426×240 three-channel pixels, the number of variables is 57,600 (as discussed below with reference to Table 1). Thus, the ML model is configured to predict the value of variable 1 based on one or more of variables 2 through 57,600, and so on for all the variables. [0123] The ML model has thus learned correlation patterns between variables that indicate when speech by the human speaker is authentic and free of deepfake injection of words or changes to mouth, facial, or other motions of the speaker); identifying overlapping clips associated with the video ([0102] For example, reference data set 172 may be time series signals derived from a reference audio-video recording of the speaker speaking on a different occasion from that shown in audio-visual content 130. Before generating the residual time series signals, deepfake detection method 200 trains a machine learning model to generate the machine learning estimates of authentic delivery of the speech based on the reference set of time series signals. [0011] For example, the deepfake detection system analyzes a two-dimensional matrix of residuals between ML-estimated and actual values of audio-visual content for one point in time with a two-dimensional sequential analysis to detect anomalies in the residuals. [0013] The selected residual values are placed into a two-dimensional array representing a frame of the audio-visual content); and generating an output indicating that the video is fake based at least in part on the one or more characteristics of the statement ([0088] In one embodiment, the alert is an electronic message. In one embodiment, deepfake detection method 200 composes an electronic message which indicates or communicates that there is deepfake content in the audio-visual content), the consistency checks ([0093] At the completion of process block 230, an electronic alert message that indicates deepfake content included a synchronous observation (for a particular point in time) of the audio-visual content has been generated), the video features ([0123] The ML model has thus learned correlation patterns between variables that indicate when speech by the human speaker is authentic and free of deepfake injection of words or changes to mouth, facial, or other motions of the speaker), and the identified overlapping clips ([0011] In one embodiment, a deepfake detection system detects deepfake content with a moment-by-moment two-dimensional analysis of video and audio of a human speaker. For example, the deepfake detection system analyzes a two-dimensional matrix of residuals between ML-estimated and actual values of audio-visual content for one point in time with a two-dimensional sequential analysis to detect anomalies in the residuals. The ML-estimated values are consistent with authentic speech by the human speaker. An anomaly in the residuals indicates the presence of deepfake content. 
[Abstract] In response to detection of the anomaly, an alert that deepfake content is detected in the audio-visual content is generated).

Michaeli does not teach identifying overlapping clips associated with the video, comprising: generating, using a hash function, a sequence of hash values representing a plurality of sampling frames of the video, identifying a reference video by comparing the sequence of hash values with a plurality of reference hash sequences stored in a database, and identifying one or more segments of the video as the overlapping clips by determining, for each respective sampling frame of the video, that a frame level similarity between a hash value of the respective sampling frame of the video and a hash value of a corresponding frame of the reference video exceeds a defined threshold.

Dalins, in the same field of endeavor of video hashing, teaches identifying overlapping clips associated with the video, comprising: generating, using a hash function, a sequence of hash values representing a plurality of sampling frames of the video ([2.2] TMK + PDQF(‘TMK’ for brevity) uses a modified version of PDQ6 for image similarity, combined with the Temporal Match Kernel for measuring time-related information. Its operation is summarised thus within documentation: (1) resampling videos to a common frame rate (15 frames per second), (2) calculating a frame descriptor, (3) computing averages within various periods and (4) generating a hash from the trigonometrically weighted averages…TMK stores hashes as 258KB binaries (extension .tmk)), identifying a reference video by comparing the sequence of hash values with a plurality of reference hash sequences stored in a database, and identifying one or more segments of the video as the overlapping clips by determining, for each respective sampling frame of the video, that a frame level similarity between a hash value of the respective sampling frame of the video and a hash value of a corresponding frame of the reference video exceeds a defined threshold ([3.1] All tests reported within this paper were conducted on a corpus of 225,887 images and 3,366 videos manually reviewed by Police and annotated as child exploitation materials. [5] As with PDQ, we utilised executables packaged with the project for our tests- in this case, tmk-hash-video for hash generation. Unlike PDQ, given the relative complexity of calculating similarities, we used the packaged binary (tmk-two-level-score) for hash comparison. Unlike PDQ, the TMK algorithm takes a two phased approach- if the first phase passes a pre-defined match threshold, then the second phase is attempted. If both results are higher (note contrast to PDQ) than the threshold, then a video is regarded as a match. For our tests, we followed the recommended threshold of 0.7 for both phases).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the product of Michaeli with the teachings of Dalins to identify overlapping clips by generating and comparing hash values because "Events such as the Christchurch terror attack of March 2019 demonstrated the potential not only for publication of such acts, but also their re-distribution en masse after minor modification in order to avoid traditional (cryptographic hash-based) detection and blocking methods by online content providers.
The mandatory reporting of such materials by multiple providers worldwide would rapidly overwhelm law enforcement’s ability to review and investigate, unless a reliable, portable and acceptable means for measuring similarity is adopted" [Dalins 1] and "Unlike binary-level digest algorithms such as MD5 and SHA-1, perceptual hashes are a mode of fuzzy hashing operating on materials as rendered to the end user, making them highly suitable for detecting lightly or imperceptibly altered materials" [Dalins 2].

Regarding claim 17, Michaeli and Dalins teach the product of claim 16. Michaeli further teaches synchronizing the consistency checks, the video features, and the identified overlapping clips with the one or more timestamps within the video identified by the statement ([0088] In one embodiment, the alert is an electronic message. In one embodiment, deepfake detection method 200 composes an electronic message which indicates or communicates that there is deepfake content in the audio-visual content. [0093] At the completion of process block 230, an electronic alert message that indicates deepfake content included a synchronous observation (for a particular point in time) of the audio-visual content has been generated. [0027] Multivariate spatiotemporal characterization of video and integrated audio refers to individual description of many discrete portions of audio-visual content as variables in a spatial structure (such as an array) over a sequence of discrete points in time, as described herein. Multivariate spatiotemporal analysis of video and integrated audio refers to examination of the discrete portions of audio-visual content over dimensions of the array structure at synchronous observations, as described herein. [0011] For example, the deepfake detection system analyzes a two-dimensional matrix of residuals between ML-estimated and actual values of audio-visual content for one point in time with a two-dimensional sequential analysis to detect anomalies in the residuals).

Regarding claim 19, Michaeli and Dalins teach the product of claim 16. Michaeli further teaches wherein examining the one or more video features of the video to detect modifications comprises: processing the video to create the plurality of sampling frames at a defined interval; analyzing each of the sampling frames to extract the one or more video features using one or more image processing algorithms, wherein the one or more video features comprise at least one of (i) pixilation, (ii) shadows, (iii) colors, (iv) edges, or (v) textures; and determining the video has been manipulated based on processing the one or more video features using a machine learning model ([0122] in one embodiment, the ML model includes a variable for each color channel of each pixel of a frame of the video content, and a variable for each audio-frequency range (bin) of the audio content. For example, where a frame of the video content is 426×240 three-channel pixels, the number of variables is 57,600 (as discussed below with reference to Table 1). Thus, the ML model is configured to predict the value of variable 1 based on one or more of variables 2 through 57,600, and so on for all the variables. [0123] The ML model has thus learned correlation patterns between variables that indicate when speech by the human speaker is authentic and free of deepfake injection of words or changes to mouth, facial, or other motions of the speaker).

Claim 3 is rejected under 35 U.S.C.
103 as being unpatentable over Michaeli in view of Dalins and Dimitrova (US6469749B1).

Regarding claim 3, Michaeli and Dalins teach the method of claim 1. Dimitrova, in the same field of endeavor of video signature comparison, teaches wherein the statement further comprises one or more tags describing contents of the video ([col. 2 ln. 58-67] Other signatures in accordance with the invention include, e.g., closed caption text describing an advertised product or service, a frame number plus information from a subimage of identified text associated with the frame, such as an 800 number, a company name, a product or service name, a uniform resource locator (URL), etc., or a frame number and a position and size of a face or other object in the image, as identified by an appropriate bounding box, as well as various combinations of these and other signature types).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the method of Michaeli with the teachings of Dimitrova to have tags describing contents of the video "to determine which of the identified segments are in fact associated with the particular type of video content" [Dimitrova col. 2 ln 38-40].

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Michaeli in view of Dalins, Dimitrova, and Altinisik (E. Altinisik, H. T. Sencar and D. Tabaa, "Video Source Characterization Using Encoding and Encapsulation Characteristics," in IEEE Transactions on Information Forensics and Security, vol. 17, pp. 3211-3224, 2022).

Regarding claim 10, Michaeli and Dalins teach the method of claim 1. Michaeli does not teach further comprising: extracting one or more tags from the statements; converting audio track of the video into written texts using one or more speech-to-text algorithms; identifying one or more key entities from the written texts representing topics of the video; generating an output context comprising the extracted tags and the identified key entities; and performing a web crawling operation using the output context to search for additional user-generated posts or reference videos associated with the video.

Dimitrova teaches comprising: extracting one or more tags from the statements ([col. 2 ln. 58-67] Other signatures in accordance with the invention include, e.g., closed caption text describing an advertised product or service, a frame number plus information from a subimage of identified text associated with the frame, such as an 800 number, a company name, a product or service name, a uniform resource locator (URL), etc., or a frame number and a position and size of a face or other object in the image, as identified by an appropriate bounding box, as well as various combinations of these and other signature types); converting audio track of the video into written texts using one or more speech-to-text algorithms; identifying one or more key entities from the written texts representing topics of the video ([col. 4 ln. 61-67] In this case, the speech may be extracted, converted to text and the resulting text analyzed against the above-noted stored text file to detect known company names, product or service names, 800 numbers or other telephone numbers, URLs, etc. (c) Absence of closed caption information combined with high cut rate); generating an output context comprising the extracted tags and the identified key entities ([col. 2 ln.
54-67] As another example, a given extracted signature may be an audio signature based at least in part on a characteristic of an audio signal associated with at least a portion of the video segment. Other signatures in accordance with the invention include, e.g., closed caption text describing an advertised product or service, a frame number plus information from a subimage of identified text associated with the frame, such as an 800 number, a company name, a product or service name, a uniform resource locator (URL), etc., or a frame number and a position and size of a face or other object in the image, as identified by an appropriate bounding box, as well as various combinations of these and other signature types. [col. 3 ln. 1-8] In accordance with another aspect of the invention, a video processing system maintains different sets of lists of signatures, the sets of lists including one or more of a set of probable lists, a set of candidate lists and a set of found lists, with each entry in a given one of the lists corresponding to a signature associated with a particular video segment. The sets of lists are updated as the various extracted signatures are processed); and using the output context to search for additional user-generated posts or reference videos associated with the video ([col. 3 ln. 6-16] The sets of lists are updated as the various extracted signatures are processed. For example, a given one of the signatures identified as likely to be associated with the particular video content is initially placed on one of the probable lists if it does not match any signature already on one of the probable lists. If the given signature matches a signature already on one of the probable lists, the given signature is placed on one of the candidate lists. A given one of the signatures on a candidate list is moved to a found list if it matches a signature already on one of the candidate lists).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the method of Michaeli with the teachings of Dimitrova to generate output based on extracted tags and identified key entities "to determine which of the identified segments are in fact associated with the particular type of video content" [Dimitrova col. 2 ln 38-40].

Altinisik, in the same field of endeavor of video analysis, teaches performing a web crawling operation to search for additional user-generated posts or reference videos associated with the video ([Abstract] At the first level, our method groups videos into metaclasses considering several abstractions that represent high-level structural properties of file metadata. This is followed by a more nuanced classification of classes that comprise each metaclass. [pg. 3220 para. 1] For this, we obtained 92,603 videos from the public video sharing website lbry.com which, unlike other popular video sharing websites, does not by default transcode and re-encapsulate uploaded user videos. Videos were obtained from 14,916 platform user accounts by iteratively crawling the suggested video links on the main webpage).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the method of Michaeli with the teachings of Altinisik to perform web crawling because "An essential requisite for media forensics is the ability to track the provenance of multimedia content. Such a capability may answer several questions about the origin of an image or video with differing levels of specificity. At the one end, the focus may be on identifying whether a given media is camera-captured or a deepfake, i.e., synthesized by a deep neural network. At the other end, it may be about attributing the media to the particular device that generated it" [Altinisik pg. 3211 para. 1].
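The claim 10 sequence just mapped (tags from the statement, key entities from a speech-to-text transcript, and an output context that drives a search) can be sketched naively as follows. The capitalized-token heuristic stands in for real entity recognition, the transcript and tags are invented, and no actual crawler is implemented.

```python
# Naive sketch of claim 10's context-building step: merge statement tags with
# "key entities" pulled from a transcript, then form a single search query.
import re

def key_entities(transcript: str) -> list[str]:
    # Hypothetical heuristic: treat runs of capitalized words as entities.
    return [m.strip() for m in re.findall(r"(?:[A-Z][a-z]+ ?){1,3}", transcript)]

statement_tags = ["#election", "#speech"]                   # from the statement
transcript = "Senator Jane Doe addressed the National Press Club on Tuesday."

context = {"tags": statement_tags, "entities": key_entities(transcript)}
query = " ".join(context["tags"] + context["entities"])
print(query)  # would seed a web crawl for related posts / reference videos
```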
Claims 4 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Michaeli in view of Dalins and Alford (US20230172510A1).

Regarding claim 4, Michaeli and Dalins teach the method of claim 1. Michaeli further teaches accessing the statement alleging that the video is fake ([0013] Where any of the sequential analyses detect an anomaly, an alert indicating the presence of deepfake content is generated. This process may repeat as a loop for a series of synchronous observations, analyzing the audio-visual content frame by frame in two dimensions to detect deepfake modification to the speech or video of the human speaker). Michaeli does not teach analyzing texts within the statement using a binary classifier. Alford, in the same field of endeavor of video analysis, teaches analyzing texts within the statement using a binary classifier ([0268] Metadata typically possesses time and date information to associate the classification with the source data. Metadata may be human readable text, such as labels, binary or numeric classifications of confidence against those labels).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the method of Michaeli with the teachings of Alford to analyze text using a binary classifier because "The resulting metadata identifies and enriches the context of the individual's actions and experiences in a complex and dynamic environment" [Alford 0028].

Regarding claim 21, Michaeli and Dalins teach the system of claim 11. Michaeli further teaches accessing the statement alleging that the video is fake ([0013] Where any of the sequential analyses detect an anomaly, an alert indicating the presence of deepfake content is generated. This process may repeat as a loop for a series of synchronous observations, analyzing the audio-visual content frame by frame in two dimensions to detect deepfake modification to the speech or video of the human speaker). Michaeli does not teach analyzing texts within the statement using a binary classifier. Alford, in the same field of endeavor of video analysis, teaches analyzing texts within the statement using a binary classifier ([0268] Metadata typically possesses time and date information to associate the classification with the source data. Metadata may be human readable text, such as labels, binary or numeric classifications of confidence against those labels).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the system of Michaeli with the teachings of Alford to analyze text using a binary classifier because "The resulting metadata identifies and enriches the context of the individual's actions and experiences in a complex and dynamic environment" [Alford 0028].
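For claims 4 and 21, "analyzing texts within the statement using a binary classifier" admits a conventional bag-of-words sketch like the one below. The training texts and labels are invented for illustration, and nothing here is drawn from Alford's disclosure.

```python
# Sketch of a binary classifier over statement text: 1 = the statement alleges
# the video is fake, 0 = it does not. Bag-of-words + logistic regression is
# one conventional choice; the tiny training set here is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "this video is obviously fake, the lips don't match the audio",
    "great speech, well delivered",
    "deepfake alert: the shadows are wrong in every frame",
    "lovely event, nothing unusual here",
]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)
print(clf.predict(vec.transform(["the mouth movement looks fake"])))  # likely [1]
```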
Claims 5-6, 13, 18, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Michaeli in view of Dalins and Stewart (US20210327431A1).

Regarding claim 5, Michaeli and Dalins teach the method of claim 1. Stewart, in the same field of endeavor of fake video detection, teaches wherein performing the consistency checks between audio and video of the video comprises: extracting audio track of the video ([0151] Initial pre-processing of the data consists of extracting audio from all video files and creating appropriate transcriptions for each individual utterance); converting the audio track into a sequence of syllables, wherein the sequence of syllables comprises a plurality of syllables that are spoken at different time points within the video ([0212] The VSR system can produce output in various forms. One form of output is a single highest scoring utterance with word-level and phonetic-level time stamps which indicate when each word and phoneme started and stopped. From these time stamps it is possible to determine the speaking rate for an individual in a video based on the “words per minute” or “phonemes per minute” metrics. However, syllables, rather than words or phonemes, are thought to be a more stable unit of pronunciation to measure rate of speech. We therefore convert the phoneme and word sequences to time-stamped syllable sequences through a process known as automatic syllabification. We can then use these time stamps to determine a speaker's “syllables per minute” rate of speech and also measure changes in duration of a person's syllables throughout a video); processing the video to create the plurality of sampling frames, wherein each respective sampling frame, of the plurality of sampling frames, is captured at a time point that a respective syllable is spoken; and comparing each respective syllable, of the plurality of syllables, with each respective sampling frame, of the plurality of sampling frames, to generate a plurality of consistency values ([0038] (iii) a computer vision subsystem configured to analyse the video stream received, using a lip reading or viseme processing subsystem, and to track and extract the movement of the end-user lip, and to recognise a word or sentence based on the lip movement, [0039] (iv) a merging subsystem configured to analyse the recognized words or sentences by the speech recognition subsystem and lip reading processing subsystem and to output a list of candidate words or sentences, each associated with a confidence score or rank referring to a likelihood or probability that the candidate word or sentence has been spoken by the end-user).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the method of Michaeli with the teachings of Stewart to compare the syllables spoken with video sampling frames "thus ensuring a ‘live’ person is present and the authentication is valid" [Stewart 0074].

Regarding claim 6, Michaeli, Dalins, and Stewart teach the method of claim 5. Stewart teaches wherein the plurality of consistency values is mapped onto a timeline, with each respective value representing a degree of consistency between the audio and video of the video at a time point that a respective syllable is spoken ([0413] The lip reading processing subsystem processes the video stream and extracts viseme features. [0414] Viseme features are converted into time-stamped syllable sequences using an automatic syllabification process.
[0464] (vi) a merging subsystem configured to analyse the recognized words or sentences by the speech recognition subsystem and lip reading processing subsystem and to output a list of candidate words or sentences, each associated with a confidence score or rank referring to a likelihood or probability that the candidate word or sentence has been spoken by the end-user. [0483] Merging subsystem dynamically updates the list of candidate words or sentences and their confidence score as the video and audio streams are being processed).

Therefore, it would have been obvious to a person of ordinary skill in the art at the time that the invention was made to modify the method of Michaeli with the teachings of Stewart to create a degree of consistency between the audio and video at time points "thus ensuring a ‘live’ person is present and the authentication is valid" [Stewart 0074].

Regarding claim 13, Michaeli and Dalins teach the system of claim 11. Stewart, in the same field of endeavor of fake video detection, teaches wherein performing the consistency checks between audio and video of the video comprises: extracting audio track of the video ([0151] Initial pre-processing of the data consists of extracting audio from all video files and creating appropriate transcriptions for each individual utterance); converting the audio track into a sequence of syllables, wherein the sequence of syllables comprises a plurality of syllables that are spoken at different time points within the video ([0212] The VSR system can produce output in various forms. One form of output is a single highest scoring utterance with word-level and phonetic-level time stamps which indicate when each word and phoneme started and stopped. From these time stamps it is possible to determine the speaking rate for an individual in a video based on the “words per minute” or “phonemes per minute” metrics. However, syllables, rather than words or phonemes, are thought to be a more stable unit of pronunciation to measure rate of speech. We therefore convert the phoneme and word sequences to time-stamped syllable sequences through a process known as automatic syllabification. We can then use these time stamps to determine a speaker's “syllables per minute” rate of speech and also measure changes in duration of a person's syllables throughout a video); processing the video to create the plurality of sampling frames, wherein each respective sampling frame, of the plurality of sampling frames, is captured at a time point that a respective syllable is spoken; and comparing each respective syllable, of the plurality of syllables, with each respective sampling frame, of the plurality of sampling frames, to generate a plurality of consistency values ([0038] (iii) a computer vision subsystem configured to analyse the video stream received, using a lip reading or viseme processing subsystem, and to track and extract the movement of the end-user lip, and to recognise a word or sentence based on the lip movement, [0039] (iv) a merging subsystem configured to analyse the recognized words or sentences by the speech recognition subsystem and lip reading processing subsystem and to output a list of candidate words or sentences, each associated with a confidence score or rank referring to a likelihood or probability that the candidate word or sentence has been spoken by the end-user).
Regarding claim 13, Michaeli and Dalins teach the system of claim 11. Stewart, in the same field of endeavor of fake video detection, teaches wherein performing the consistency checks between audio and video of the video comprises: extracting an audio track of the video (Stewart [0151]); converting the audio track into a sequence of syllables, wherein the sequence of syllables comprises a plurality of syllables that are spoken at different time points within the video (Stewart [0212]); processing the video to create the plurality of sampling frames, wherein each respective sampling frame, of the plurality of sampling frames, is captured at a time point at which a respective syllable is spoken; and comparing each respective syllable, of the plurality of syllables, with each respective sampling frame, of the plurality of sampling frames, to generate a plurality of consistency values (Stewart [0038]-[0039]). The cited passages are quoted in full in the rejection of claim 5 above.

Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Michaeli with the teachings of Stewart to compare the syllables spoken with the video sampling frames, "thus ensuring a ‘live’ person is present and the authentication is valid" [Stewart 0074].

Regarding claim 18, Michaeli and Dalins teach the product of claim 16. Stewart, in the same field of endeavor of fake video detection, teaches the same limitations of performing the consistency checks between audio and video of the video for the reasons given for claim 13 (Stewart [0151], [0212], [0038]-[0039], quoted in full in the rejection of claim 5 above).

Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the product of Michaeli with the teachings of Stewart to compare the syllables spoken with the video sampling frames, "thus ensuring a ‘live’ person is present and the authentication is valid" [Stewart 0074].
Regarding claim 22, Michaeli, Dalins, and Stewart teach the system of claim 13. Stewart teaches wherein the plurality of consistency values is mapped onto a timeline, with each respective value representing a degree of consistency between the audio and video of the video at a time point at which a respective syllable is spoken (Stewart [0413]-[0414], [0464], [0483], quoted in full in the rejection of claim 6 above).

Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Michaeli with the teachings of Stewart to create a degree of consistency between the audio and video at time points, "thus ensuring a ‘live’ person is present and the authentication is valid" [Stewart 0074].

Claims 8 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Michaeli in view of Dalins and Fuhrman (US11856311B1).

Regarding claim 8, Michaeli and Dalins teach the method of claim 7. Fuhrman, in the same field of endeavor of video feature extraction, teaches wherein analyzing each of the sampling frames to extract the one or more video features further comprises passing outputs of the one or more image processing algorithms into a lowpass filter to remove high-frequency noise in the extracted one or more video features ([col. 13 ln. 52-62] The enhancement filters can include one or more filters such as, for example, a series of low-pass, high-pass, or band-pass filters followed by thresholding. The enhancement filters can be used to extract particular features within the video signal stream. These features can include, for example, edges and blobs. Other features can include corners, ridges, the total image intensity, the total image intensity within each masked area, and any other features associated with image objects for which motion tracking is desired. [col. 14 ln. 22-24] the high-pass filter 502 can be preceded by or followed by a low-pass filter to attenuate high frequency noise, which may corrupt the first feature video signal stream 510).

Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Michaeli with the teachings of Fuhrman to use a lowpass filter "to detect the low frequency components of the video signal stream 812" [Fuhrman col. 16 ln. 27-28].
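The claim 8 step is a conventional signal-processing stage: the per-frame output of the image processing algorithms is passed through a low-pass filter so that high-frequency noise does not corrupt the extracted feature. A minimal sketch follows, assuming the feature is a per-frame scalar such as the total image intensity that Fuhrman lists at col. 13; the single-pole exponential smoother is one common low-pass choice, not the specific filter of Fuhrman or of the application.

def lowpass(feature_series: list[float], alpha: float = 0.2) -> list[float]:
    """Single-pole exponential smoother: passes the low-frequency motion
    signal and attenuates high-frequency noise in the extracted feature."""
    out: list[float] = []
    prev = feature_series[0] if feature_series else 0.0
    for x in feature_series:
        prev = prev + alpha * (x - prev)   # one common low-pass update
        out.append(prev)
    return out

# e.g. a per-frame intensity series with one noisy spike:
# lowpass([10.0, 10.4, 9.7, 10.1, 25.0, 10.2]) damps the 25.0 outlier.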
Regarding claim 23, Michaeli and Dalins teach the system of claim 14. Fuhrman, in the same field of endeavor of video feature extraction, teaches wherein analyzing each of the sampling frames to extract the one or more video features further comprises passing outputs of the one or more image processing algorithms into a lowpass filter to remove high-frequency noise in the extracted one or more video features (Fuhrman col. 13 ln. 52-62 and col. 14 ln. 22-24, quoted in full in the rejection of claim 8 above).

Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Michaeli with the teachings of Fuhrman to use a lowpass filter "to detect the low frequency components of the video signal stream 812" [Fuhrman col. 16 ln. 27-28].

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jacqueline R Zak, whose telephone number is (571) 272-4077. The examiner can normally be reached M-F 9-5. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Emily Terrell, can be reached at (571) 270-3717. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JACQUELINE R ZAK/
Examiner, Art Unit 2666

/EMILY C TERRELL/
Supervisory Patent Examiner, Art Unit 2666

Prosecution Timeline

Sep 14, 2023
Application Filed
Sep 23, 2025
Non-Final Rejection — §103
Dec 15, 2025
Examiner Interview Summary
Dec 15, 2025
Applicant Interview (Telephonic)
Dec 23, 2025
Response Filed
Feb 25, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586340
PIXEL PERSPECTIVE ESTIMATION AND REFINEMENT IN AN IMAGE
2y 5m to grant · Granted Mar 24, 2026
Patent 12462343
MEDICAL DIAGNOSTIC APPARATUS AND METHOD FOR EVALUATION OF PATHOLOGICAL CONDITIONS USING 3D OPTICAL COHERENCE TOMOGRAPHY DATA AND IMAGES
2y 5m to grant · Granted Nov 04, 2025
Patent 12373946
ASSAY READING METHOD
2y 5m to grant · Granted Jul 29, 2025
Study what changed to get past this examiner. Based on 3 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
67%
Grant Probability
55%
With Interview (-11.4%)
2y 10m
Median Time to Grant
Moderate
PTA Risk
Based on 12 resolved cases by this examiner. Grant probability derived from career allow rate.
