DETAILED ACTION
Claims 1, 4-12, 15-18, 22-23 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 4, 12, 15, 22-23 is/are rejected under 35 U.S.C. 103 as being unpatentable over Siagian et al. (US 11871068) in view of Jelonek et al. (US 2008/0002892).
Claim 1, Siagian teaches an audio and video synchronization detection method, comprising:
extracting image data and audio data of a video segment of a target length (i.e. audio/video data of conversation segment) (col. 2, lines 17-55);
obtaining a plurality of face image lists (i.e. neural network of facial images) by performing face detection and tracking based on the extracted image data (i.e. portion of conversation segment) (col. 14, lines 35-58);
extracting mouth features corresponding to each face image list based on a traversal result of the face image list (i.e. open or closed mouths), wherein the mouth features are used to characterize changes of lips (i.e. open or closed mouth) (col. 8, lines 4-42, col. 14, lines 35-58); and
determining a synchronization result of the video segment based on the audio data and the mouth features (i.e. determining a synchronization error) (col. 2, lines 17-55);
wherein extracting the image data and the audio data of the video segment of the target length, comprises:
extracting audio frames in the video segment of the target length (i.e. conversation segments), and generating an audio data list based on the audio frames (i.e. neural network for determining audio transition points) (col. 2, lines 17-55; col. 5, line 52 - col. 6, line 21);
wherein obtaining the plurality of face image lists by performing face detection and tracking based on the extracted image data, comprises:
obtaining one or more face images each with a face area greater than a preset threshold (i.e. threshold for detecting facial images in frames) in each image frame by detecting the image frame in the image data list (i.e. using neural network which contains training data of images) (col. 14, lines 35-58);
obtaining one or more face identity documents (IDs) (i.e. open or closed mouths) by tracking face images in each image frame based on face features (i.e. open or closed mouths) (col. 8, lines 4-42, col. 14, lines 35-58);
extracting image frames in the video segment of the target length (i.e. conversation segment), and generating an image data list based on the image frames (i.e. neural network of facial images) (col. 2, lines 17-55, col. 14, lines 35-58).
Siagian is not entirely clear in teaching the specific features of:
denoting the one or more face identifiers to corresponding image frames; and
obtaining respective facial lists corresponding to the one or more faces by grouping image frames containing a same facial identifier.
Jelonek teaches the specific features of:
denoting the one or more face identifiers (i.e. facial features) to corresponding image frames (i.e. images) (p. 0043-0044); and
obtaining respective facial lists (i.e. appearance attributes) corresponding to the one or more faces by grouping image frames containing a same facial identifier (i.e. all the image frames that show a particular attribute are grouped together and assessed using a neural net to determine a metatag description of the face) (fig. 8; p. 0044, 0103-0108).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have provided grouping of facial image data as taught by Jelonek to the system of Siagian to provide facial attribute analysis (p. 0044).
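For illustration only, the following minimal sketch shows one way the face-grouping steps recited above (filtering faces by a preset area threshold, assigning face IDs by feature matching, and grouping frames that share an ID) might be expressed in code. The FaceDetection and FaceTrack structures, the threshold value, and the exact-match tracking rule are all assumptions introduced here; the sketch does not represent the claimed method or the implementations of Siagian or Jelonek.

```python
# Illustration only: hypothetical data structures and a naive tracking rule.
from dataclasses import dataclass, field

@dataclass
class FaceDetection:
    frame_index: int
    face_area: float      # detected face-box area in pixels
    embedding: tuple      # placeholder for a face feature vector

@dataclass
class FaceTrack:
    face_id: int
    detections: list = field(default_factory=list)

def group_by_face_id(detections, area_threshold=900.0):
    """Discard faces not larger than a preset area threshold, assign face IDs by
    matching features against existing tracks, and group frames sharing the same ID."""
    tracks = []
    for det in sorted(detections, key=lambda d: d.frame_index):
        if det.face_area <= area_threshold:
            continue                                   # keep only faces with area greater than the threshold
        match = next((t for t in tracks
                      if t.detections[-1].embedding == det.embedding), None)
        if match is None:                              # unseen identity -> new face ID
            match = FaceTrack(face_id=len(tracks))
            tracks.append(match)
        match.detections.append(det)
    return tracks                                      # one face image list per face ID

if __name__ == "__main__":
    dets = [FaceDetection(0, 1500.0, (1, 2)),
            FaceDetection(1, 1600.0, (1, 2)),
            FaceDetection(1, 400.0, (9, 9))]           # small face is filtered out
    for t in group_by_face_id(dets):
        print(t.face_id, [d.frame_index for d in t.detections])
```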
Claim 4, Siagian teaches the method of claim 3, wherein extracting the mouth features corresponding to each face image list based on the traversal result of the face image list, comprises:
traversing face images in each face image list (i.e. neural network of facial images) and extracting mouth features of each face image (i.e. open or closed mouths) (col. 8, lines 4-42, col. 14, lines 35-58); and
generating a mouth feature list of each face image based on the mouth features of each face image (i.e. open or closed mouths training data of neural network) (col. 8, lines 4-42, col. 14, lines 35-58).
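For illustration only, a minimal sketch of traversing each face image list and generating a per-face mouth feature list is given below. The mouth-opening ratio and the landmark keys are hypothetical stand-ins for whatever mouth features the claim contemplates; nothing here is drawn from the cited references.

```python
# Illustration only: per-frame mouth feature as a hypothetical opening ratio.
def mouth_opening_ratio(landmarks):
    """landmarks: dict of (x, y) points for the outer lips."""
    _, top_y = landmarks["upper_lip"]
    _, bot_y = landmarks["lower_lip"]
    left_x, _ = landmarks["left_corner"]
    right_x, _ = landmarks["right_corner"]
    width = max(abs(right_x - left_x), 1e-6)
    return abs(bot_y - top_y) / width                  # vertical lip distance over mouth width

def build_mouth_feature_lists(face_image_lists):
    """face_image_lists: {face_id: [per-frame lip landmarks, ...]}"""
    return {face_id: [mouth_opening_ratio(lm) for lm in frames]
            for face_id, frames in face_image_lists.items()}
```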
Claims 12 and 22 are analyzed and interpreted as apparatus claims corresponding to claim 1.
Claim 15 is analyzed and interpreted as an apparatus claim corresponding to claim 4.
Claim 23 recites “A non-transitory computer program product comprising a computer program, wherein when the computer program is executed by a processor” to perform the steps of claim 1. Siagian teaches “A non-transitory computer program product comprising a computer program, wherein when the computer program is executed by a processor” to perform the steps of claim 1 (col. 18, lines 4-29).
Claim(s) 5-11, 16-18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Siagian et al. (US 11871068) in view of Jelonek et al. (US 2008/0002892), and further in view of Mathews (US 2022/0269922).
Claim 5, Siagian teaches the method of claim 4, wherein determining the synchronization result of the video segment based on the audio data and the mouth features, comprises:
performing audio and video synchronization detection on the video segment based on the sound similarity corresponding to the face image list (i.e. audio speech detection of portions), to determine the synchronization result of the video segment (i.e. errors in synchronization) (col. 2, lines 17-55).
Siagian is not entirely clear in teaching the specific feature of:
obtaining a labial-sound similarity corresponding to the face image list based on a mouth feature list containing a mouth feature corresponding to an opening and closing change and an audio feature sequence of the audio data list.
Mathews teaches the specific feature of:
obtaining a labial-sound similarity (i.e. sounds for audio/speech) corresponding to the face image list based on a mouth feature list containing a mouth feature corresponding to an opening and closing change and an audio feature sequence of the audio data list (i.e. neural network training for speech sounds corresponding to mouth movements) (p. 0044).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have provided speech sound detection as taught by Mathews to the system of Siagian to provide video analysis of speech segments (p. 0044).
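For illustration only, the sketch below scores agreement between a mouth-opening sequence and an audio-energy sequence with a normalized correlation. This is a stand-in for the labial-sound similarity discussed above, not the claimed computation and not Mathews' neural-network approach.

```python
# Illustration only: normalized correlation as a stand-in similarity score.
import math

def labial_sound_similarity(mouth_seq, audio_seq):
    n = min(len(mouth_seq), len(audio_seq))
    if n == 0:
        return 0.0
    m, a = mouth_seq[:n], audio_seq[:n]
    mean_m, mean_a = sum(m) / n, sum(a) / n
    num = sum((x - mean_m) * (y - mean_a) for x, y in zip(m, a))
    den = math.sqrt(sum((x - mean_m) ** 2 for x in m) *
                    sum((y - mean_a) ** 2 for y in a))
    return num / den if den else 0.0
```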
Claim 6, Siagian teaches the method of claim 5, wherein a process of determining the mouth feature list containing the mouth feature corresponding to the opening and closing change, comprises:
determining the mouth feature list containing the mouth feature corresponding to the opening and closing change by extracting lip movement features from the mouth feature list (i.e. open or closed mouths training data of neural network) (col. 8, lines 4-42, col. 14, lines 35-58).
Claim 7, Siagian is silent regarding the method of claim 5, wherein determining the audio feature sequence, comprises:
obtaining an audio frame sequence by ranking the audio frames based on time stamps of the audio frames in the audio data list; and
obtaining the audio feature sequence by extracting audio features from the audio frame sequence.
Mathews teaches the method of claim 5, wherein determining the audio feature sequence, comprises:
obtaining an audio frame sequence by ranking the audio frames based on time stamps of the audio frames in the audio data list (i.e. classification score and timestamps within metadata of frames) (p. 0030); and
obtaining the audio feature sequence by extracting audio features from the audio frame sequence (i.e. obtaining from report) (p. 0030).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have provided speech sound detection as taught by Mathews to the system of Siagian to provide video analysis of speech segments (p. 0044).
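For illustration only, a minimal sketch of ordering audio frames by time stamp and extracting a per-frame feature follows. The RMS-energy feature and the frame dictionary layout are assumptions introduced here, not Mathews' metadata format.

```python
# Illustration only: rank audio frames by time stamp, then extract a simple feature.
import math

def audio_feature_sequence(audio_frames):
    """audio_frames: list of dicts like {"timestamp": float, "samples": [float, ...]}"""
    ordered = sorted(audio_frames, key=lambda f: f["timestamp"])   # rank by time stamp
    features = []
    for frame in ordered:
        samples = frame["samples"]
        rms = math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0
        features.append(rms)                                       # per-frame RMS energy
    return features
```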
Claim 8, Siagian is silent regarding the method of claim 5, wherein obtaining the labial-sound similarity corresponding to the face image list based on the mouth feature list containing the mouth feature corresponding to the opening and closing change and the audio feature sequence of the audio data list, comprises:
inputting the mouth feature list containing the mouth feature corresponding to the opening and closing change and the audio feature sequence into a pre-trained labial-sound synchronization detection model; and
obtaining the labial-sound similarity corresponding to the face image list by performing cross-modal similarity calculation on the mouth feature list containing the mouth feature corresponding to the opening and closing change and the audio feature sequence through the labial-sound synchronization detection model.
Mathews teaches the specific features of:
inputting the mouth feature list containing the mouth feature corresponding to the opening and closing change and the audio feature sequence into a pre-trained labial-sound synchronization detection model (i.e. neural network training for speech sounds corresponding to mouth movements) (p. 0044); and
obtaining the labial-sound similarity corresponding to the face image list by performing cross-modal similarity calculation (i.e. using neural network) on the mouth feature list containing the mouth feature corresponding to the opening and closing change and the audio feature sequence through the labial-sound synchronization detection model (i.e. neural network training for speech sounds corresponding to mouth movements) (p. 0044).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have provided speech sound detection as taught by Mathews to the system of Siagian to provide video analysis of speech segments (p. 0044).
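For illustration only, the sketch below computes a cross-modal similarity as the cosine similarity between a visual (mouth) embedding and an audio embedding. In a real system both vectors would come from a jointly trained labial-sound synchronization model; here they are plain feature vectors, and nothing below reflects the claimed model or Mathews' network.

```python
# Illustration only: cosine similarity between two feature vectors.
import math

def cosine_similarity(vec_a, vec_b):
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    print(cosine_similarity([0.2, 0.8, 0.1], [0.3, 0.7, 0.0]))   # near 1.0 for well-aligned vectors
```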
Claim 9, Siagian teaches the method of claim 5, wherein performing audio and video synchronization detection on the video segment based on the sound similarity corresponding to the face image list, to determine the synchronization result of the video segment, comprises:
performing audio and video synchronization detection on the video segment based on the statistical count, to determine the synchronization result of the video segment (i.e. errors in synchronization) (col. 2, lines 17-55).
Siagian is silent regarding the specific features of:
determining a preset similarity threshold;
determining whether the labial-sound similarity corresponding to the face image list is greater than the preset similarity threshold; and
obtaining a statistical count of face image lists each with the labial-sound similarity greater than or equal to the preset similarity threshold.
Mathews teaches the specific features of:
determining a preset similarity threshold (i.e. threshold similarity values) (p. 0021);
determining whether the labial-sound similarity corresponding to the face image list is greater than the preset similarity threshold (i.e. speech sounds are determined by a trained neural network, which inherently has preset thresholds) (p. 0044); and
obtaining a statistical count of face image lists each with the labial-sound similarity greater than or equal to the preset similarity threshold (i.e. a neural network inherently performs a comparison between training data and input data using at least statistical counts) (p. 0021, 0030, 0044).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the present invention to have provided speech sound detection as taught by Mathews to the system of Siagian to provide video analysis of speech segments (p. 0044).
Claim 10, Siagian teaches the method of claim 6, wherein performing audio and video synchronization detection on the video segment based on the statistical count, to determine the synchronization result of the video segment, comprises:
determining a preset quantity threshold (i.e. threshold for detecting synchronization errors);
in response to the statistical count being greater than or equal to the preset quantity threshold, determining that the video segment is audio-video synchronized (i.e. neural networks inherently utilize statistical counts, in this case for determining synchronization errors) (col. 2, lines 17-55; col. 5, line 52 - col. 6, line 21); and
in response to the statistical count being less than the preset quantity threshold, determining that the video segment is audio-video unsynchronized (i.e. neural networks inherently utilize statistical counts, in this case for determining synchronization errors) (col. 2, lines 17-55; col. 5, line 52 - col. 6, line 21).
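For illustration only, a minimal sketch of the thresholding and counting logic discussed for claims 9 and 10 follows: face image lists whose labial-sound similarity meets a preset similarity threshold are counted, and the segment is treated as synchronized when that count meets a preset quantity threshold. The threshold values are arbitrary examples, not values from the claims or the cited references.

```python
# Illustration only: statistical count of passing face image lists and sync decision.
def is_segment_synchronized(similarities_by_face,
                            similarity_threshold=0.5,
                            quantity_threshold=1):
    """similarities_by_face: {face_id: labial-sound similarity score}"""
    count = sum(1 for score in similarities_by_face.values()
                if score >= similarity_threshold)       # statistical count of passing lists
    return count >= quantity_threshold                  # synchronized vs. unsynchronized

if __name__ == "__main__":
    print(is_segment_synchronized({0: 0.82, 1: 0.31}))  # True with the example defaults
```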
Claim 11, Siagian teaches the method of claim 6, before determining the audio feature sequence, further comprising:
extracting key points (i.e. facial movements) of a mouth from a mouth feature image in the mouth feature list, and tracking the key points to obtain a motion trajectory of the key points (i.e. open or closed mouths based on facial movements) (col. 2, line 57 - col. 3, line 14; col. 8, lines 4-42; col. 14, lines 35-58); and
determining whether the mouth exhibits an opening and closing change based on the motion trajectory (i.e. open or closed mouths based on facial movements) (col. 2, line 57 - col. 3, line 14; col. 8, lines 4-42; col. 14, lines 35-58).
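For illustration only, the sketch below decides whether a mouth exhibits an opening and closing change from the trajectory of a tracked lip key point, modeled here as a list of mouth-opening values over time. The range-above-tolerance test is a hypothetical placeholder, not the claimed criterion or Siagian's implementation.

```python
# Illustration only: open/close change detected from a key-point trajectory.
def exhibits_open_close_change(opening_trajectory, tolerance=0.05):
    """opening_trajectory: mouth-opening measure per frame for one tracked key point."""
    if len(opening_trajectory) < 2:
        return False
    return (max(opening_trajectory) - min(opening_trajectory)) > tolerance

if __name__ == "__main__":
    print(exhibits_open_close_change([0.10, 0.35, 0.12, 0.40]))  # True: mouth opens and closes
    print(exhibits_open_close_change([0.10, 0.11, 0.10]))        # False: essentially static
```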
Claim 16 is analyzed and interpreted as an apparatus claim corresponding to claim 5.
Claim 17 is analyzed and interpreted as an apparatus claim corresponding to claim 6.
Claim 18 is analyzed and interpreted as an apparatus claim corresponding to claim 7.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1, 4-12, 15-18, 22-23 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Conclusion
Claims 1, 4-12, 15-18, 22-23 are rejected.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20230410290 A1 Desai; Deshana et al.
US 20190278804 A1 Matsushita; Masahiro et al.
US 20190034706 A1 el Kaliouby; Rana et al.
US 20180018508 A1 TUSCH; Michael
US 20100205541 A1 Rapaport; Jeffrey A. et al.
US 7027621 B1 Prokoski; Francine J.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Inquiries
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MUSHFIKH I ALAM whose telephone number is (571) 270-1710. The examiner can normally be reached 1:00 PM - 9:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Nasser Goodarzi can be reached at 571-272-4195. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
MUSHFIKH I. ALAM
Primary Examiner
Art Unit 2426
/MUSHFIKH I ALAM/Primary Examiner, Art Unit 2426 4/2/2026