Prosecution Insights
Last updated: May 29, 2026
Application No. 18/644,697

MULTIMODAL DEEPFAKE DETECTION VIA LIP-AUDIO CROSS-ATTENTION AND FACIAL SELF-ATTENTION

Status: Non-Final OA (§101, §102, §103, §112)
Filed: Apr 24, 2024
Priority: Jun 27, 2023 (provisional 63/510,416)
Examiner: DULANEY, KATHLEEN YUAN
Art Unit: 2666
Tech Center: 2600 (Communications)
Assignee: Purdue Research Foundation
OA Round: 1 (Non-Final)
Grant Probability: 77% (Favorable)
OA Rounds: 1-2
Est. Remaining: 1y 0m
With Interview: 99%

Examiner Intelligence

Grants 77% (above average)
Career Allowance Rate: 77% (508 granted / 659 resolved; +15.1% vs TC avg)
Interview Lift: +23.7% across resolved cases with interview (rated a strong lift, about +24%)
Typical timeline: 3y 1m average prosecution; 24 currently pending
Career history: 693 total applications across all art units

Statute-Specific Performance

§101: 1.4% (-38.6% vs TC avg)
§103: 78.7% (+38.7% vs TC avg)
§102: 6.3% (-33.7% vs TC avg)
§112: 13.0% (-27.0% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 659 resolved cases
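
As a quick sanity check on the figures above, subtracting each quoted difference from its statute rate gives the baseline the comparison implies. The sketch below assumes the "vs TC avg" values are percentage-point differences, which the page does not state; the variable names are illustrative only.

```python
# Rates and "vs TC avg" deltas copied from the panel above.
# Assumption (not stated on the page): deltas are percentage-point differences.
stats = {
    "§101": (1.4, -38.6),
    "§103": (78.7, 38.7),
    "§102": (6.3, -33.7),
    "§112": (13.0, -27.0),
}

# Implied baseline = examiner rate minus the quoted delta.
implied_tc_avg = {s: round(rate - delta, 1) for s, (rate, delta) in stats.items()}

for statute, avg in implied_tc_avg.items():
    print(f"{statute}: implied TC average = {avg}%")
```

Under that assumption every row implies the same 40.0% baseline, which would be consistent with the note that the black line marks a single Tech Center average estimate.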

Office Action

§101 §102 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

It is noted that claims 1-20 are considered eligible subject matter. Even if the claims are interpreted as an abstract idea, the claims provide limitations that provide a practical application, i.e. video manipulation/deepfake detection.

Claim Rejections - 35 USC § 112

The following is a quotation of the first paragraph of 35 U.S.C. 112(a):

(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:

The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 16-18 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the enablement requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to enable one skilled in the art to which it pertains, or with which it is most nearly connected, to make and/or use the invention. Claims 16-18 claim that the first embedding comprises determinations used from the second embedding. The examiner cannot find support for these limitations in the specification. It is not clear how the second embeddings can be used to determine the first embedding, given there are steps regarding items from the second embedding’s determination, and no connection as to how they are used to determine the first embedding.

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 15-18 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 15 recites the limitation "the second plurality of cropped frames" in lines 3-4. There is insufficient antecedent basis for this limitation in the claim.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 19 and 20 are rejected under 35 U.S.C. 102(a)(1) as being unpatentable by U.S. Patent Application Publication No. 20220121868 (Chen et al).

Regarding claim 1, Chen et al discloses a method (fig. 6, 8) for detecting whether a video has been manipulated, the method comprising: receiving, with a processor (fig. 2, item 202), a video (fig. 2, item 206) including a plurality of frames (fig. 2, item 210) and audio (fig. 2, item 208) of a person speaking (page 1, paragraph 10); determining, with the processor (page 10, paragraph 101), a first embedding (fig. 6, item 622) based on the plurality of frames of the video (fig. 6, item 606) using a first neural network (fig. 6, item 612), the first neural network incorporating a self-attention mechanism, since the first neural network only takes the image data as an input (fig. 6, item 606 input in 612) and being configured to detect artifacts in a facial region of the plurality of frames of the video by detecting if the frames match artifact spoofprint (page 5, paragraph 50-51); determining, with the processor (page 10, paragraph 101), a second embedding (fig. 6, item 618) based on the audio of the video (fig. 6, item 604) and the plurality of frames of the video (fig. 6, item 606) using a second neural network (fig. 6, item 610), the second neural network incorporating a cross-attention mechanism, because the two inputs are combined in cross attention network of fig. 6, item 610, and being configured to identify discrepancies between (i) lip movements of the person in the plurality of frames of the video and (ii) words spoken in the audio of the video (page 5, paragraph 54); and determining, with the processor (page 10, paragraph 101), whether the video has been manipulated based on both the first embedding and the second embedding (fig. 6, item 624, 626, page 6, paragraph 61).

Regarding claim 19, Chen et al discloses the determining whether the video has been manipulated further comprising: determining a joint embedding by concatenating the first embedding and the second embedding (fig. 6, item 624, page 11, paragraph 106); and determining whether the video has been manipulated based on the joint embedding (fig. 6, item 624, page 11, paragraph 107) using a linear neural network layer (the last layer of machine learning architecture fig. 6, item 607 of fig. 6, item 624 to fig. 6, item 626).

Claim 20 is rejected for the same reasons as claim 1. Thus, the arguments analogous to that presented above for claim 1 are equally applicable to claim 20. Claim 20 distinguishes from claim 1 only in that claim 20 claims a non-transitory computer readable medium that stores program instructions for detecting whether a video has been manipulated, the program instructions being configured to, when executed by a processor, cause the processor to carry out the method of claim 1. Chen et al further teaches this feature (page 13, paragraph 135).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2 and 4 are rejected under 35 U.S.C. 103(a) as being unpatentable over Chen et al, as applied to claim 1 above, and further in view of U.S. Patent No. 11373449 (Genner).

Regarding claim 2, Chen et al discloses all of the claimed elements as set forth above and incorporated herein by reference. Chen et al discloses the determining the first embedding further comprising processing the facial video/images (fig. 6, item 512). Chen et al does not disclose expressly processing the facial images includes generating a first plurality of cropped frames, each cropped frame of the first plurality of cropped frames being cropped by a bounding box around a face of the person in a respective frame of the plurality of frames. Genner discloses processing the facial video includes generating a first plurality of cropped frames (col. 10, lines 60-61), each cropped frame of the first plurality of cropped frames being cropped by a bounding box around a face of the person in a respective frame of the plurality of frames, a square (col 10, lines 61-62). Chen et al and Genner are combinable because they are from the same field of endeavor, i.e. processing facial images. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to crop the facial area. The suggestion/motivation for doing so would have been to provide a more robust method by focusing on relevant data. Therefore, it would have been obvious to combine the method of Chen et al with the cropping of Genner to obtain the invention as specified in claim 2.

Regarding claim 4, Genner et al discloses generating the first plurality of cropped frames further comprising: resizing each of the first plurality of cropped frames to a first predetermined resolution (col 10, lines 61-62).

Claim 3 is rejected under 35 U.S.C. 103(a) as being unpatentable over Chen et al and Genner, as applied to claim 2 above, and further in view of U.S. Patent No. 12266212 (Gupta).

Regarding claim 3, Chen et al (as modified by Genner) discloses all of the claimed elements as set forth above and incorporated herein by reference. Chen et al (as modified by Genner) does not disclose expressly generating the first plurality of cropped frames further comprising: identifying a subset of the plurality of frames of the video; and generating the first plurality of cropped frames by cropping the subset of the plurality of frames of the video. Gupta et al discloses generating the first plurality of cropped frames further comprising: identifying a subset of the plurality of frames of the video (col. 8, lines 15-20); and generating the first plurality of cropped frames by cropping the subset of the plurality of frames of the video (col. 8, lines 40-41). Chen et al (as modified by Genner) and Gupta et al are combinable because they are from the same field of endeavor, i.e. processing facial images. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to use a subset of frames. The suggestion/motivation for doing so would have been to provide a faster method by processing only essential data. Therefore, it would have been obvious to combine the method of Chen et al (as modified by Genner) with a subset of frames of Gupta et al to obtain the invention as specified in claim 3.

Claims 5, 7, 8 and 15 are rejected under 35 U.S.C. 103(a) as being unpatentable over Chen et al as modified by Genner, as applied to claim 1 above, and further in view of U.S. Patent Application Publication No. 20230401824 (Khan et al).

Regarding claim 5, Chen et al discloses all of the claimed elements as set forth above and incorporated herein by reference. Genner discloses cropping the face from the image (col. 10, lines 60-61), and Chen et al discloses the first neural network is a network for facial features in video (fig. 6, item 612). Chen et al (as modified by Genner) does not disclose expressly determining the first embedding further comprising: determining a plurality of patches by extracting features from each of the first plurality of facial frames using a feature extractor of the facial network. Khan et al discloses determining the first embedding further comprising: determining a plurality of patches (fig. 8, item 812, 816) by extracting features from each of the first plurality of facial frames (fig. 6, item 602; the features that are extracted in fig. 8, item 812, or in fig. 8, item 814) using a feature extractor of the facial network, fig. 8 being part of the facial network. Chen et al (as modified by Genner) and Khan et al are combinable because they are from the same field of endeavor, i.e. detecting deepfakes. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to extract patches. The suggestion/motivation for doing so would have been to provide a more robust system by extracting features from significant areas of the image. Therefore, it would have been obvious to combine the method of Chen et al (as modified by Genner) with patch extraction of Khan et al to obtain the invention as specified in claim 5.

Regarding claim 7, Khan et al discloses the determining the first embedding further comprising: determining the first embedding based on the plurality of patches using a Transformer encoder of the first neural network (fig. 8, item 808) having a multi-headed self-attention mechanism (fig. 8, item 826).

Regarding claim 8, Khan et al discloses the determining the first embedding further comprising: determining a plurality of patch embeddings based on the plurality of patches using a linear layer (fig. 8, item 814); and determining a first plurality of position-encoded embeddings by embedding position information into the plurality of patch embeddings (fig. 8, item 806), wherein the first embedding is determined based on the first plurality of position-encoded embeddings (fig. 8, output of transform encoder 800 is based on item 816).

Regarding claim 15, Chen et al (as modified by Genner and Khan et al) discloses all of the claimed elements as set forth above and incorporated herein by reference. Chen et al further discloses the second embedding is from audio and video data (fig. 6, item 610), determining the embedding based on the second plurality of frames and the audio of the video of the second neural network having a cross-attention mechanism, since the audio and video are both input and processed together (fig. 6, item 604, 606 input to 610). Genner discloses cropping the face images (col. 10, lines 60-61). Khan et al discloses using a Transformer encoder (fig. 8, item 826).

Claims 11-14 are rejected under 35 U.S.C. 103(a) as being unpatentable over Chen et al, as applied to claim 1 above, and further in view of U.S. Patent Application Publication No. 20220318349 (Wasnik et al).

Regarding claim 11, Chen et al discloses all of the claimed elements as set forth above and incorporated herein by reference. Chen et al further discloses determining the second embedding further comprises analyzing a bounding box around lips of the person in a respective frame of the plurality of frames (page 5, paragraph 56). Chen et al does not disclose expressly generating a second plurality of cropped frames, each cropped frame of the second plurality of cropped frames being cropped by a bounding box around lips of the person. Wasnik et al discloses generating a second plurality of cropped frames, each cropped frame of the second plurality of cropped frames being cropped by a bounding box around lips of the person, around a mouth region (page 7, paragraph 85). Chen et al and Wasnik et al are combinable because they are from the same field of endeavor, i.e. analyzing speech in videos. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to crop the mouth area. The suggestion/motivation for doing so would have been to provide a faster system by lowering the data to be processed. Therefore, it would have been obvious to combine the method of Chen et al with the cropping of Wasnik et al to obtain the invention as specified in claim 11.

Regarding claim 12, Wasnik et al discloses the generating the second plurality of cropped frames further comprising: resizing each of the second plurality of cropped frames to a second predetermined resolution, 60x100 pixels (page 7, paragraph 85).

Regarding claim 13, Wasnik et al discloses generating the second plurality of cropped frames further comprising: converting the second plurality of cropped frames to greyscale (page 7, paragraph 85).

Regarding claim 14, Wasnik et al discloses converting the audio of the video into mono audio, i.e. a single channel stream (page 7, paragraph 86).

Allowable Subject Matter

Claims 6, 9, 10, 16-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Claims 16-18 also would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.

Claim 6 contains allowable subject matter regarding determining the claimed plurality of patches comprises: determining a plurality of raw patches corresponding to features extracted from portions of each of the first plurality of cropped frames using the feature extractor of the first neural network; based on the plurality of raw patches, determining a plurality of tubelets corresponding to features extracted from corresponding portions over a plurality of temporally sequential frames from the claimed first plurality of cropped frames; and determining the plurality of patches by flattening the plurality of tubelets.

Claim 9 contains allowable subject matter regarding determining the claimed first embedding further comprises: determining Query, Key, and Value matrices based on the claimed plurality of patches, determined as claimed; and determining the first embedding using both the claimed Transformer encoder of the first neural network and the Query, Key, and Value matrices.

Claim 16 contains allowable subject matter regarding the determining of the claimed first embedding further comprises determining a second plurality of position-encoded embeddings by embedding position information into the claimed second plurality of cropped frames and the audio of the video, wherein the second embedding is determined based on the second plurality of position-encoded embeddings.

Claim 17 contains allowable subject matter regarding the determining the claimed first embedding further comprises determining a Query matrix based on the claimed second plurality of cropped frames; determining Key and Value matrices based on the audio of the video; and determining the second embedding using both the claimed Transformer encoder of the second neural network and the Query, Key, and Value matrices.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KATHLEEN YUAN DULANEY whose telephone number is (571)272-2902. The examiner can normally be reached M-F, 9AM-5PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emily Terrell can be reached at 5712703717. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KATHLEEN Y DULANEY/
Primary Examiner, Art Unit 2666
2/12/2026
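
To make the claim language discussed in the office action easier to follow, here is a rough, standard-library-only sketch of the data flow it describes: a self-attention branch over facial tokens (claim 1), a cross-attention branch whose Query comes from lip frames and whose Key and Value come from audio (as recited for claim 17), and a joint embedding formed by concatenation and scored by a single linear layer (claim 19). Everything concrete here is an assumption made for illustration: the token counts, the embedding size, the random untrained weights, the mean pooling, and the sigmoid readout. It is not the applicant's implementation and not the Chen et al architecture.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q_src, kv_src, wq, wk, wv):
    """Scaled dot-product attention.

    Self-attention when q_src and kv_src are the same sequence,
    cross-attention when they come from different modalities.
    """
    q, k, v = matmul(q_src, wq), matmul(kv_src, wk), matmul(kv_src, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def mean_pool(rows):
    """Collapse a token sequence to one vector (pooling choice is an assumption)."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d = 4  # hypothetical embedding size

# Hypothetical stand-ins for already-extracted features.
face_tokens = rand_matrix(rng, 6, d)   # facial patch embeddings
lip_tokens = rand_matrix(rng, 5, d)    # cropped lip-frame embeddings
audio_tokens = rand_matrix(rng, 7, d)  # audio embeddings

face_w = [rand_matrix(rng, d, d) for _ in range(3)]       # Wq, Wk, Wv (untrained)
lip_audio_w = [rand_matrix(rng, d, d) for _ in range(3)]  # Wq, Wk, Wv (untrained)

# First embedding: self-attention over the facial tokens.
first_embedding = mean_pool(attention(face_tokens, face_tokens, *face_w))
# Second embedding: Query from lip frames, Key and Value from audio.
second_embedding = mean_pool(attention(lip_tokens, audio_tokens, *lip_audio_w))

# Joint embedding by concatenation, then one linear layer and a sigmoid.
joint = first_embedding + second_embedding
w_out = [rng.uniform(-1, 1) for _ in joint]
logit = sum(w * x for w, x in zip(w_out, joint))
prob_manipulated = 1 / (1 + math.exp(-logit))

print(len(joint), round(prob_manipulated, 3))
```

The same `attention` function serves both branches; the only difference between self-attention and cross-attention in this sketch is which sequence supplies the Query and which supplies the Key and Value.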

Prosecution Timeline

Apr 24, 2024: Application Filed
May 07, 2026: Non-Final Rejection mailed (§101, §102, §103; current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12638668
Method and Astrophotographic Apparatus for Acquiring Images of Targets in Sky Area
3y 3m to grant Granted May 26, 2026
Patent 12631569
METHOD, DEVICE, SYSTEM AND COMPUTER READABLE MEDIUM FOR RAPIDLY DETECTING PEST EGG IN GRAIN BASED ON PEST EGG AND PEST HOLE STRUCTURE FEATURES
3y 2m to grant Granted May 19, 2026
Patent 12620110
IMAGE PROCESSING DEVICE, STEREO CAMERA DEVICE, MOBILE OBJECT, DISPARITY CALCULATING METHOD, AND IMAGE PROCESSING METHOD
3y 3m to grant Granted May 05, 2026
Patent 12605131
A SYSTEM AND METHOD FOR THE QUANTIFICATION OF CONTRAST AGENT
3y 3m to grant Granted Apr 21, 2026
Patent 12602801
IMAGE PROCESSING CIRCUITRY AND IMAGE PROCESSING METHOD FOR DEPTH ESTIMATION IN A TIME-OF-FLIGHT SYSTEM
3y 1m to grant Granted Apr 14, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 77%
With Interview: 99% (+23.7%)
Median Time to Grant: 3y 1m (~1y 0m remaining)
PTA Risk: Low
Based on 659 resolved cases by this examiner. Grant probability derived from career allowance rate.
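
The headline projections can be roughly reproduced from numbers that appear on this page. The sketch below is only a plausibility check: the page states that grant probability derives from the career allowance rate, but the whole-month rounding, the use of the "Last updated" date as the reference date, and the 99% cap on the with-interview figure are assumptions.

```python
from datetime import date

# Career allowance rate from the examiner statistics on this page.
granted, resolved = 508, 659
allowance_rate = 100 * granted / resolved

# Time remaining: median time to grant minus time elapsed since filing.
filed = date(2024, 4, 24)
as_of = date(2026, 5, 29)      # the page's "Last updated" date (assumed reference)
median_months = 3 * 12 + 1     # 3y 1m

elapsed_months = (as_of.year - filed.year) * 12 + (as_of.month - filed.month)
if as_of.day < filed.day:
    elapsed_months -= 1        # count only completed months
remaining_months = max(median_months - elapsed_months, 0)

# Assumption: the interview lift is added in percentage points and capped at 99%.
with_interview = min(allowance_rate + 23.7, 99.0)

print(f"Allowance rate: {allowance_rate:.1f}%")
print(f"Remaining: {remaining_months // 12}y {remaining_months % 12}m")
print(f"With interview: {with_interview:.0f}%")
```

Under these assumptions the arithmetic lands on the displayed 77%, 1y 0m, and 99%, which suggests the with-interview figure is a capped sum rather than an independently measured rate.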
