Prosecution Insights
Last updated: May 29, 2026
Application No. 18/120,375

SYSTEM AND METHOD FOR AUTOMATICALLY GENERATING SYNTHETIC HEAD VIDEOS USING A MACHINE LEARNING MODEL

Final Rejection §103
Filed: Mar 11, 2023
Priority: Mar 11, 2022 (IN 202241013438)
Examiner: WITHEY, THEODORE JOHN
Art Unit: 2655
Tech Center: 2600 (Communications)
Assignee: International Institute Of Information Technology Hyderabad
OA Round: 4 (Final)
Grant Probability: 44% (Moderate)
OA Rounds: 5-6
Est. Remaining: 0m
With Interview: 95%

Examiner Intelligence

Grants 44% of resolved cases
Career Allowance Rate: 44% (11 granted / 25 resolved, -18.0% vs TC avg)
Interview Lift: +51.3% (strong; resolved cases with interview)
Avg Prosecution: 2y 11m (typical timeline; 17 currently pending)
Total Applications: 64 (career history, across all art units)

Statute-Specific Performance

§101: 0.6% (-39.4% vs TC avg)
§103: 99.4% (+59.4% vs TC avg)
Tech Center averages are estimates. Based on career data from 25 resolved cases.
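The headline figures above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes each "vs TC avg" delta and the interview lift are percentage-point differences, which the page does not state, and the helper name `implied_baseline` is hypothetical.

```python
# Figures copied from the dashboard; treating every delta as a
# percentage-point difference is an assumption, not something the page states.
granted, resolved = 11, 25
allowance_rate = 100 * granted / resolved  # career allowance rate, in percent


def implied_baseline(value_pct, delta_pts):
    """Back out the comparison value, assuming delta = value - baseline."""
    return round(value_pct - delta_pts, 1)


tc_avg_allowance = implied_baseline(allowance_rate, -18.0)  # allowance rate vs TC avg
tc_avg_101 = implied_baseline(0.6, -39.4)                   # §101 share vs TC avg
tc_avg_103 = implied_baseline(99.4, 59.4)                   # §103 share vs TC avg

# If the lift is additive in percentage points, this is the with-interview rate.
with_interview = round(allowance_rate + 51.3, 1)

print(allowance_rate, tc_avg_allowance, tc_avg_101, tc_avg_103, with_interview)
```

Under that reading, 11 of 25 gives the 44% allowance rate, both statute deltas imply the same Tech Center baseline, and adding the lift to the allowance rate lands close to the 95% "With Interview" figure.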

Office Action

§103
DETAILED ACTION

This office action is in response to Applicant’s Amendment/Request for Reconsideration, received on 02/17/2026. Claims 1, 3, 5-7, 9-13, 15, 17-19 are pending and have been considered.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 02/17/2026 have been fully considered but they are not persuasive. Applicant’s representative asserts, “The Office Action states that Song anticipates and/or teaches the claimed limitation of ‘extracting at least one feature from each frame of at least one video that is extracted from at least one data source’ by allegedly disclosing the generation of lip-sync data. Specifically, the Examiner relies on paragraph [0073], which describes generating lip-sync data including identification information regarding frames in a template video, generated lip images, and position information of lip images within the template frames, and further relies on paragraph [0135], which states that the server obtains a template video by loading the template video from memory. The Office Action asserts that the template video stored in memory corresponds to a data source, and that the lip-sync data, including lip position tracking information, corresponds to the claimed extraction of features from each frame, and therefore concludes that the claimed feature extraction step is anticipated by Song. Applicant amended claim 1 recites ‘extracting at least one feature from each frame of at least one video that is extracted from at least one data source.’ The Applicant's amended claims 1, 12 and 13 disclose extracting at least one feature from each frame of at least one video that is extracted from at least one data source, where the extracted features are subsequently analyzed by a face-detection model to determine whether the video qualifies as a driving face video.
These extracted features form the technical basis for multi-scale box tiling, anchor generation, and anchor classification to verify that a single speaker identity is consistently present across all frames, before any speech alignment or lip modification is performed. In the portions relied upon by the Examiner, Song discloses that the server generates a lip image corresponding to a voice for each frame of a template video and generates lip-sync data including identification information of frames, generated lip images, and positional information of lip images within the template frames ([0073]). Song further discloses that the template video may be obtained by loading a stored template video from memory ([0135]). Thus, Song describes using a pre-existing template video and generating lip images and positional metadata associated with frames, rather than extracting analytical features from the video frames themselves. The applicant's amended claims 1, 12 and 13 require active, frame-level feature derivation from raw video data for analytical decision-making, rather than merely loading a video or associating positional metadata with frames. The extracted features directly enable downstream face detection, anchor-based classification, and speaker validation. In contrast, Song merely generates lip images or stores frame identifiers, but does not disclose or suggest this feature-centric, pre-processing extraction stage, and therefore fails to teach or anticipate the claimed extracting limitation, as recited in the applicant amended claims 1, 12 and 13. 
Thus, the applicant submits that Song fails to teach the following limitation of the applicant amended independent claims 1, 12 and 13: ‘extracting at least one feature from each frame of at least one video that is extracted from at least one data source;’” In response, the examiner respectfully disagrees with Applicant’s assertion that the template video of Song does not apply to a “video that is extracted from at least one data source” as currently claimed. [0062] of Song discloses “obtain a template video including at least one frame”. Further, [0147] discloses “the server 100 may input a first lip image obtained from a first frame of a template video”. Lip images obtained from template videos on a frame-basis tracks to an extracted feature from that frame, wherein the template video is stored ([0135]) as agreed with by Applicant. The examiner understands that Song also discloses generating lip images corresponding to frames of video, but respectfully asserts that these lip images are generated based on modifications to originally present lip images, extracted on a frame-level of a video. Applicant’s representative continues, “The Office Action states that Song anticipates and/or teaches the claimed limitation of ‘generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the at least one data source.’ Specifically, the Examiner relies on Figure 9 and paragraph [0139], which disclose that the server may generate a target voice from text using a trained second artificial neural network, and further relies on paragraph [0086], which describes memory storing training data. The Office Action asserts that storing training data in memory corresponds to a data source and that generating a target voice from text implies automatic selection of words and sentences, thereby concluding that Song anticipates the claimed synthetic speech generation step. 
The applicant's amended claims 1, 12 and 13 recite ‘generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the at least one data source;’ The Applicant's claims 1, 12 and 13 disclose generating, using a text-to-speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from at least one data source, in which the system itself selects words and sentences from the data source and generates corresponding synthetic speech utterances using the text-to-speech model, without relying on externally supplied or manually authored text input. In the portions relied upon by the Examiner, Song discloses that a server may generate a target voice from text using a trained second artificial neural network, as illustrated in Figure 9 and described in paragraph [0139]. Song further discloses that memory may store coefficients of a trained neural network, training data for training the neural network, or data received from another server ([0086]). Thus, Song describes speech generation from already-available text and the storage of training data or model parameters in memory but does not describe how the text itself is selected or generated. Song does not disclose or suggest automatically selecting a vocabulary of words and sentences from a data source as required by the Applicant's claims, as recited in the applicant amended claims 1, 12 and 13. Song merely teaches generating a voice output when text is provided and storing training data in memory, which is different from the claimed automatic selection of textual content used to generate synthetic speech utterances, as recited in the applicant amended claims 1, 12 and 13. The Examiner's position improperly equates the storage of training data with automatic vocabulary selection and conflates text-to-speech conversion with text selection. 
Because Song lacks any disclosure of an automated mechanism that selects words or sentences from a data source prior to speech synthesis, it does not anticipate or render obvious the applicant claimed limitation of ‘generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the at least one data source’. Accordingly, the cited portions of Song do not teach ‘generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the at least one data source’ , as recited in the applicant amended claims 1, 12 and 13, and the objection should be withdrawn.” In response, the examiner respectfully disagrees with Applicant’s assertion that the data of Song does not apply to “automatically selecting a vocabulary of words and sentences from the at least one data source”. Specifically, Song discloses generating a target voice from text using a neural network ([0139]), wherein the text “may be generated by the server 100 according to a certain rule or method” ([0140]). It is unclear to the examiner how generation of a target voice is conflating text-to-speech conversion to text selection when speech is generated. The certain rule or method being employed is generation via neural network. Presented with instructions, the neural network will “automatically” perform those instructions. Further, there is no indication in the claims as to how “automatically selecting a vocabulary of words and sentences” is performed other than retrieval from a data source. Applicant’s representative continues, “The Office Action states that Song anticipates the claimed limitation of ‘modifying lip movements of a single speaker corresponding to synthetic speech utterances’ by relying on paragraph [0146], which discloses modifying a first lip image based on an input voice, and Figure 9, which shows that the voice may be generated from input text. 
The Examiner asserts that modifying a lip image according to a synthesized voice corresponds to the claimed modification of lip movements in a driving face video. The applicant's amended claims 1, 12 and 13 recite ‘generating, using the machine learning model, the at least one synthetic talking head video by providing the driving face video and the aligned speech utterances to a talking face video generator framework which modifies lip movements of a single speaker originally present in the driving face video corresponding to the synthetic speech utterances.’ The Applicant's amended claims 1, 12 and 13 state that a synthetic talking-head video is generated by using a machine learning model that takes a driving face video and aligned speech utterances as inputs and modifies the lip movements of the single speaker originally present in the driving face video to match the synthetic speech utterances. In the portions relied upon by the Examiner, Song discloses that a first artificial neural network is trained to output a second lip image by modifying a first lip image based on an input voice, where the voice and the first lip image are provided as inputs ([0146]). Song further discloses, with reference to Figure 9, that the voice may be generated from input text using a neural network. Thus, Song teaches lip image modification driven by a voice input, independent of a driving face video or aligned speech utterances. The Applicant's amended claims 1, 12, and 13 go beyond merely modifying a lip image based on a voice. The applicant claims require generating a synthetic talking-head video by providing a driving face video and aligned speech utterances to a talking-face video generator framework that modifies lip movements of a single speaker originally present in the driving face video. This results in a concrete technical improvement namely, accurate audio visual synchronization within a full video sequence while preserving speaker identity rather than an abstract idea. 
Unlike Song, the applicant claimed invention integrates video-level processing, speech alignment, and selective lip-motion modification in a unified machine-learning pipeline, , as recited in the applicant amended claims 1, 12 and 13, thereby providing a specific, that improves video synthesis technology and overcomes the §103 objection. Thus, the applicant submits that Song fails to teach the following limitation of the applicant's amended independent claims 1, 12 and 13: ‘generating, using the machine learning model, the at least one synthetic talking head video by providing the driving face video and the aligned speech utterances to a talking face video generator framework which modifies lip movements of a single speaker originally present in the driving face video corresponding to the synthetic speech utterances.’ The Applicant respectfully submits that the Examiner has acknowledged that Song does not disclose analyzing, using a face-detection model, extracted features to determine a driving face video by confirming that the number of identities and speaker faces is equal to one across all frames of the video. This acknowledged deficiency demonstrates that the claimed invention recites specific, concrete technical steps that are not taught by Song and that go beyond any abstract idea. Accordingly, the applicant claims 1, 12 and 13 are directed to a practical application of machine learning for video processing, providing a technical improvement in accurately selecting a driving face video, and therefore satisfy the requirements of patent eligibility under § 103.” In response, the examiner respectfully disagrees with Applicant’s assertions that Song does not disclose “video-level processing, speech alignment, and selective-lip motion modification”. Specifically, it is unclear to the examiner how a template video being modified on a frame-level is not video-level processing. A video is necessarily comprised of individual frames. 
Further, considering the previously cited section of Song which discloses modifying lip images extracted from the template video, this will result in a generated, modified video (compared to the original template video) as output. With respect to Applicant’s assertion against Song not disclosing the “analyzing…” step, the examiner respectfully asserts that Song need not disclose an element which is disclosed by another piece of art (Qin). The template video of Song, wherein the video will be only containing one speaker as disclosed in Qin, tracks to a driving face video as currently claimed. In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Applicant’s representative continues, “The Office Action states that Qin discloses and/or teaches the claimed limitation of ‘analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video.’ Specifically, the Examiner relies on paragraph [0128], which describes using a face detection model provided by the DLIB library to perform face detection on video clips and deleting videos in which two or more faces are detected. The Office Action asserts that deleting videos with multiple detected faces corresponds to determining a driving face video in which only a single speaker is present, particularly when viewed in combination with the lip-sync method disclosed by Song.
The applicant's amended claims 1, 12, and 13 recite ‘analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video, wherein the face-detection model detects a single speaker in the driving face video by, tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, wherein each box is used to refine detection of the single speaker in each frame; generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video, wherein each anchor represents a location of the single speaker, a shape of the single speaker, and a size of the single speaker; and classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors.’ The Applicant's amended claims 1, 12 and 13 recite a structured and frame-level analysis in which extracted features are analyzed using a face-detection model to determine a driving face video only if a single speaker identity is present across all frames. This determination is performed through multi-scale box tiling on each frame, anchor generation representing location, shape, and size of the detected speaker, and classification of the anchors against pre-set anchors, thereby ensuring consistent and refined detection of the same single speaker throughout the video. In the portion relied upon by the Examiner, Qin discloses using a face detection model provided by the DLIB library to detect faces in video segments and deleting video segments that contain two or more faces ([0128]). Qin further categorizes videos containing the same face to prepare training data for an audio-driven face video generation model, with the stated purpose of avoiding interference from multiple faces during training and improving model accuracy.
While Qin removes video segments containing multiple faces, Qin does not disclose or suggest the Applicant's claimed frame-by-frame determination of a driving face video using anchor-based detection, multi-scale box tiling, or anchor classification against pre-set anchors , as recited in the applicant amended claims 1, 12 and 13. Qin's disclosure is limited to filtering training data by deleting multi-face videos and does not perform refined, per-frame speaker verification or anchor-based consistency checks across all frames of a video. Accordingly, Qin does not teach or suggest the applicant claimed technical mechanism for determining a driving face video with a single speaker, as recited in the applicant amended claims 1, 12 and 13, and even when considered in combination with Song, does not render the Applicant's claims obvious. Therefore, the applicant claimed subject matter involves a technical advancement and satisfies the inventive step requirement under § 103. Thus, the applicant submits that Song fails to teach the following limitation of the applicant amended independent claims 1, 12 and 13: ‘analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video, wherein the face-detection model detects a single speaker in the driving face video by, tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, wherein each box is used to refine detection of the single speaker in each frame; generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video, wherein each anchor represents a location of the single speaker, a shape of the single speaker, and a size of the single speaker; and classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors.’ The Office Action 
asserts that Song and Qin are analogous art in the field of video frame analysis and modification. The Examiner states that it would have been obvious to a person of ordinary skill in the art, prior to the effective filing date, to modify Song by incorporating Qin's teaching of identifying the number of people or faces in a video and discarding videos containing more than one person in order to reduce training data cost and improve model training efficiency, as described in Qin ([0096]). The Applicant's amended claims 1, 12 and 13 disclose a method that analyzes video frames using a face-detection model to determine a driving face video only when exactly one speaker identity is present across all frames. This analysis is performed by placing multiple detection boxes of different sizes and shapes on each frame, generating anchors that represent the speaker's location, shape, and size, and classifying those anchors to consistently identify and track the same single speaker throughout the video. After analysis and alignment, the Applicant's method modifies only the lip movements of that identified single speaker in the driving face video using aligned synthetic speech, while keeping the rest of the facial features unchanged, thereby producing a synchronized synthetic talking-head video. In the portion relied upon by the Examiner, Qin discloses selectively using only certain synthesized images as training data to improve model training efficiency. Specifically, Qin teaches cropping synthesized images and using only those images that contain text content as target images for training, while discarding other images, thereby reducing training data volume and improving training efficiency ([0096]). This disclosure is directed to training data selection and optimization, rather than to runtime video analysis or speaker determination. 
The Applicant respectfully submits that, even if Song and Qin are considered analogous art, their combination does not render the Applicant's claims obvious. Qin's teaching is limited to filtering or cropping training images based on the presence of text content to improve training efficiency and does not disclose or suggest analysing video frames to determine a driving face video based on single-speaker consistency across all frames, nor does it teach multi-scale box tiling, anchor generation, or anchor classification for refined speaker detection. The Applicant's amended claimed invention operates at a different technical level by performing frame-by-frame face analysis during video processing, followed by selective modification of lip movements of the identified single speaker in a driving face video using aligned synthetic speech. A person of ordinary skill in the art would not be motivated to modify Song with Qin's training-data cropping technique to arrive at the Applicant's claimed anchor-based, single-speaker analysis and lip-modification pipeline. Accordingly, the asserted combination relies on hindsight reconstruction and fails to establish a prima facie case of obviousness under §103.” In response, the examiner respectfully asserts that the Applicant is not considering the complete combination of art featured to reject this part of the claims. For example, Applicant's claimed frame-by-frame determination of a driving face video using anchor-based detection, multi-scale box tiling, or anchor classification against pre-set anchors, as recited in the applicant amended claims 1, 12 and 13, are taught by Yoo and Mann. Applicant does not acknowledge these pieces of art or their teachings as cited in the non-final action mailed 11/17/2025. The rejection relies upon all pieces of art to arrive at the claimed invention.
In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). In response to applicant's argument that the examiner's conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning. But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from the applicant's disclosure, such a reconstruction is proper. See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971). Applicant’s representative continues, “The Applicant respectfully submits that Yoo does not disclose the claimed limitation ‘wherein the face-detection model detects a single speaker in the driving face video by tiling a plurality of boxes on each frame of the video with different scales and aspect ratios, wherein each box is used to refine detection of the single speaker in each frame, as recited in the applicant amended claims 1, 12 and 13.’ Although Yoo describes setting predefined recognition target areas (A1-A3) corresponding to facial elements such as eyes, nose, and mouth (e.g., Figs. 9A-9C), these regions are fixed facial sub-areas derived from detected facial landmarks, rather than a tiled plurality of boxes at multiple scales and aspect ratios used to iteratively refine single-speaker detection across frames. Accordingly, Yoo fails to teach or suggest the claimed multi-scale box-tiling-based detection mechanism, as recited in the applicant amended claims 1, 12 and 13. In the portion relied on Figs.
9A-9C, Qin discloses filtering video or image data by detecting the presence of faces and discarding samples that include more than one person, with the goal of reducing training data cost and improving model training efficiency. Qin's analysis is limited to a coarse determination of whether a frame or clip contains a single face versus multiple faces, and the disclosed processing is applied primarily as a pre-processing or data selection step for training. Qin does not describe frame-by-frame analysis of a driving face video, nor does it disclose detecting a single speaker by tiling a plurality of boxes on each frame with different scales and aspect ratios to iteratively refine speaker detection across frames. The Applicant respectfully submits that neither Qin alone nor Qin in view of Yoo teaches or suggests the claimed analysing and modifying operations. The Applicant's claims require a face-detection model that performs per-frame analysis of a video to confirm single-speaker consistency across all frames and, critically, detects the single speaker by tiling multiple boxes of different scales and aspect ratios on each frame to refine detection, as recited in the applicant amended claims 1, 12 and 13. Yoo's predefined facial sub-regions (eyes, nose, mouth) are landmark-based recognition areas, not a multi-scale, anchor-like box tiling mechanism used to refine single-speaker detection, and Qin provides no motivation to modify Yoo to arrive at such an approach. Any assertion that these disparate teachings could be combined to yield the claimed multi-scale, frame-wise detection pipeline relies on hindsight reconstruction rather than a teaching, suggestion, or motivation in the prior art. Accordingly, the cited references fail to establish a prima facie case of obviousness under 35 U.S.C. §103. 
Thus, the applicant submits that Yoo fails to teach the following limitation of the applicant amended independent claims 1, 12 and 13: ‘wherein the face-detection model detects a single speaker in the driving face video by, tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, wherein each box is used to refine detection of the single speaker in each frame; generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video, wherein each anchor represents a location of the single speaker, a shape of the single speaker, and a size of the single speaker; and classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors.’” In response, the examiner would like to refer to the combination of art used to make the rejection. Specifically, it is unclear to the examiner how “predefined facial sub-regions” which are boxed (see Figs. 9-10) as disclosed in Yoo do not track to a tiled plurality of boxes with different scales and aspect ratios as currently claimed. Qin discloses keeping videos with a single speaker. Yoo discloses generating facial region boxes for a single speaker. There appears to be a clear motivation for combination. Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims patentably distinguishes them from the references. Applicant's arguments do not comply with 37 CFR 1.111(c) because they do not clearly point out the patentable novelty which he or she thinks the claims present in view of the state of the art disclosed by the references cited or the objections made. Further, they do not show how the amendments avoid such references or objections. 
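For readers unfamiliar with the anchor vocabulary in the disputed limitation, the following is a minimal sketch of the general technique, not the method of the application or of Song, Qin, Yoo, or Mann. It assumes boxes are tiled on a fixed grid, that intersection-over-union (IoU) stands in for "correlating" against pre-set boxes (the claim does not name a measure), and that per-frame face counts are already available. All function names, the stride, and the threshold are hypothetical.

```python
from itertools import product


def tile_anchors(frame_w, frame_h, stride, scales, aspect_ratios):
    """Tile boxes (x1, y1, x2, y2) over a frame at several scales and aspect ratios."""
    anchors = []
    centers = product(range(stride // 2, frame_h, stride),
                      range(stride // 2, frame_w, stride))
    for cy, cx in centers:
        for scale, ratio in product(scales, aspect_ratios):
            w = scale * ratio ** 0.5  # ratio is width / height
            h = scale / ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def classify_anchors(anchors, preset_boxes, threshold=0.5):
    """Mark an anchor positive if it overlaps any pre-set box by at least `threshold` IoU."""
    return [any(iou(a, p) >= threshold for p in preset_boxes) for a in anchors]


def is_single_face_video(faces_per_frame):
    """True only if every frame has exactly one detected face (empty input is rejected)."""
    return len(faces_per_frame) > 0 and all(n == 1 for n in faces_per_frame)


anchors = tile_anchors(64, 64, stride=32, scales=[16, 32], aspect_ratios=[0.5, 1.0, 2.0])
labels = classify_anchors(anchors, preset_boxes=[(0, 0, 32, 32)])
```

Each tiled box carries a location (its centre), a shape (its aspect ratio), and a size (its scale), which is the sense in which an anchor can be said to represent all three. The last helper mirrors the coarse "exactly one face in every frame" filter discussed in connection with Qin, and is kept separate from the anchor steps to show that the two are distinct operations.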
Applicant’s representative continues, “The Office Action asserts that Song, Qin, and Yoo are analogous art in the field of video feature monitoring and tracking, particularly with respect to facial feature analysis. The Examiner states that it would have been obvious to a person of ordinary skill in the art, prior to the effective filing date of the claimed invention, to modify Song in view of Qin and further incorporate the teachings of Yoo. According to the Office Action, Yoo discloses a novel multi-modal emotion recognition technique applied to received video, including mouth-shape tracking to determine conversation states, which purportedly improves emotion recognition accuracy (see Yoo, [0006]). The Examiner therefore concludes that combining Song's video processing with Qin's filtering approach and Yoo's facial feature and mouth-tracking techniques would have rendered the claimed subject matter obvious. Applicant respectfully submits that the Examiner's §103 rejection based on Song in view of Qin and Yoo is improper. Qin is limited to filtering or cropping synthesized images based on the presence of text content to improve training efficiency, and does not disclose or suggest analysing video frames to determine whether a driving face video contains exactly one speaker identity across all frames. Yoo, on the other hand, is directed to multi-modal emotion recognition, using predefined facial sub-regions (eyes, nose, mouth) derived from facial landmarks to infer emotional or conversational states. Yoo does not teach tiling a plurality of detection boxes of different scales and aspect ratios across each video frame, generating and classifying anchors, or consistently identifying a single speaker throughout an entire video for qualification as a driving face video. 
In contrast, the Applicant's claimed invention performs a fundamentally different technical process by conducting frame-by-frame face analysis to verify single-speaker consistency using multi-scale box tiling and anchor-based detection, and only thereafter selectively modifying the lip movements of that identified single speaker using aligned synthetic speech while leaving other facial features unchanged, as recited in the applicant amended claims 1, 12 and 13. A person of ordinary skill in the art would have no motivation to combine Qin's training-data filtering approach with Yoo's emotion-recognition techniques to modify Song in a manner that yields the Applicant's claimed single-speaker analysis and targeted lip-modification pipeline. The Examiner's rationale thus relies on hindsight reconstruction and fails to establish a prima facie case of obviousness under § 103. Applicant respectfully submits that the Examiner has correctly identified that Song in view of Qin, further in view of Yoo, does not disclose or suggest the following limitations of the applicant amended independent claims 1, 12 and 13: ‘generating a plurality of anchors based on a plurality of boxes tiled on each frame of the video, wherein each anchor represents a location, shape, and size of a single speaker; classifying the plurality of anchors by correlating them with a series of pre-set anchors; and aligning each synthetic speech utterance with regions of the driving face video by detecting lip landmarks in each frame and tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip landmarks.’ Neither Song nor Qin teaches an anchor-based face detection framework involving multi-scale box tiling and anchor classification. Qin is limited to filtering or cropping data to improve training efficiency and does not disclose anchor generation or anchor classification for speaker detection.
Yoo, while describing facial element regions for emotion recognition, relies on predefined facial sub-areas derived from landmarks and does not disclose generating or classifying anchors based on tiled boxes, nor tracking lip-landmark motion relative to such anchors across frames. Accordingly, even when Song, Qin, and Yoo are considered together, they fail to disclose or suggest the claimed anchor-based face detection and lip-movement alignment mechanism, as recited in the applicant amended claims 1, 12 and 13. The applicant claimed invention introduces a distinct technical approach that integrates anchor generation, anchor classification, and anchor-relative lip-landmark tracking to precisely align synthetic speech with a single speaker's mouth movements. As such, the cited references, alone or in combination, do not render the claimed subject matter obvious, and the § 103 rejection should therefore be withdrawn.” In response, the examiner respectfully asserts that none of Song, Qin, or Yoo are relied upon for disclosure of the anchors argued against by Applicant. Mann is incorporated to resolve the deficiencies of the combination of Song + Qin + Yoo. In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Applicant’s representative continues, “The Office Action asserts that Mann discloses the claimed limitation of generating a plurality of anchors based on a plurality of boxes tiled on each frame of a video. In particular, the Examiner relies on paragraph [0065], which describes creating boxed image sequences for each character and performing precision face tracking to analyze facial features. 
The Examiner interprets this face-tracking process, including tracking facial landmarks (see [0093]), as generating anchors based on the boxed regions in each frame. The Applicant's amended claims 1, 12, and 13 recite generating a plurality of anchors from a plurality of boxes tiled on each frame of a video, where each anchor explicitly represents the location, shape, and size of a single speaker. These anchors are used as a structured, frame-by- frame representation to consistently identify and track the same speaker across all frames, enabling reliable downstream processing. In the portion relied on by the Examiner, Mann (para [0065]) describes creating a boxed image sequence for each character and performing a precision face-tracking process to analyses facial features, where the tracked points are referred to as facial landmarks. The Office Action interprets this disclosure, together with references to facial landmark tracking, as teaching the generation of anchors based on boxed regions in each frame. The applicant respectfully submits that, Mann does not disclose or suggest the applicant's amended claimed limitation of "generating a plurality of anchors from a plurality of boxes tiled on each frame of a video, where each anchor explicitly represents the location, shape, and size of a single speaker ", as recited in the applicant amended claims 1, 12 and 13. Mann's facial landmarks are feature points used for analyzing facial motion within an already-identified face, not anchors generated from tiled multi-scale boxes that encode location, shape, and size of a single speaker across frames. Mann lacks any teaching of systematically tiling boxes of varying scales and aspect ratios on each frame, generating corresponding anchors, or using those anchors to enforce single-speaker consistency. 
The Examiner's position equates generic facial landmark tracking with the Applicant's structured anchor generation, which operates at a different technical level and serves a distinct purpose. Accordingly, Mann fails to teach or suggest the claimed limitation, and the § 103 objection should be withdrawn. Further, the Office Action states that Mann discloses that each anchor represents the location, shape, and size of a single speaker. Specifically, paragraph [0090] describes determining the X-Y coordinates of the face (location), creating an effective box around the face (shape), and determining the size of the face (size). The Examiner therefore concludes that Mann teaches anchors corresponding to the location, shape, and size of the detected speaker across video frames. The Applicant's amended claims require generating a plurality of anchors from a plurality of boxes tiled on each frame of the video, where each anchor jointly represents the location, shape, and size of a single speaker. These anchors are not merely descriptive outputs, but structured reference elements generated from multi-scale, multi-aspect-ratio box tiling and used to consistently identify and track the same single speaker across successive frames. In the portion relied on by the Examiner, Mann (Para [0090]) describes a coarse face identification process in which original source footage is scanned to detect the existence of a face, and, if detected, the X-Y coordinates of the face and the size of the face are determined, thereby creating an effective box around the face. The Office Action interprets this disclosure as teaching that each detected face box represents the location (X-Y coordinates), shape (box), and size of the speaker. 
The applicant respectfully, Mann's disclosure does not teach or suggest the claimed limitation of "generating a plurality of anchors from a plurality of boxes tiled on each frame of a video, where each anchor explicitly represents the location, shape, and size of a single speaker ", as recited in the applicant amended claims 1, 12 and 13. Mann describes a single coarse bounding box per detected face used for initial face identification, not a systematic tiling of multiple boxes with different scales and aspect ratios, nor the generation of multiple anchors per frame derived from such tiling. Further, Mann's "box" is an output of face detection, whereas the Applicant's anchors are deliberately generated reference structures used for refined single-speaker detection, tracking, and downstream processing across frames. Equating Mann's coarse face bounding box with the Applicant's anchor-based framework improperly conflates basic face detection with a specialized anchor-generation mechanism. Accordingly, Mann does not disclose or render obvious the claimed limitation, and the § 103 objection should be withdrawn. Thus, the applicant submits that Mann fails to teach the following limitation of the applicant amended independent claims 1, 12 and 13: "generating a plurality of anchors from a plurality of boxes tiled on each frame of a video, where each anchor explicitly represents the location, shape, and size of a single speaker’”. In response, the examiner would like to refer to the combination of art used to reject the claims. Specifically, the examiner respectfully asserts that Mann need not disclose “tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios” because Yoo discloses this element. These boxes of Yoo are being taken in view of the landmarks of Mann. The examiner respectfully asserts that it would have been obvious to apply the landmark for one box of Mann to the multiple boxes located on a face of Yoo. 
Further, as currently claimed, the anchors have no connection to single-speaker detection. Single-speaker detection is a step which has been performed prior to this element, taught by Qin. The anchors are used for classifying through correlation to pre-set anchors. This does not indicate a connection to the number of people located in a frame/video. Refining detection of a single speaker indicates an original detection of an original speaker must be made in order to be refined. Further, the ability to box a face, regardless of operation performed with relation to the face boxing, will necessarily result in a box having a location, shape, and size of a speaker. In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Applicant’s representative continues, “The Examiner asserts that Song, in view of Qin, discloses the applicant claimed limitations of (i) classifying a plurality of anchors by correlating them with a series of pre-set anchors to detect a single speaker, and (ii) detecting mouth movements in each frame by detecting lip landmarks. In particular, the Examiner relies on paragraph [0093] of Song, which describes running precise face tracking to analyze facial features by tracking key facial points (facial landmarks) in a 2D image, and interprets the initially determined facial landmarks (see [0060]) as pre-set anchors that are compared against subsequently tracked landmarks to classify and track changes of a talking head. 
The Examiner further relies on paragraph [0060] and related disclosures describing precision face tracking across frames, wherein facial landmarks-including lip landmarks-are identified for every frame and used to assess lip sync, thereby asserting that lip-landmark-based mouth movement detection is disclosed. The applicant's amended claims disclose ‘generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video, wherein each anchor represents a location of the single speaker, a shape of the single speaker, and a size of the single speaker; and classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors; generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the at least one data source; aligning each synthetic speech utterance with a region in the driving face video by detecting mouth movements in each frame of the driving face video by detecting lip- landmarks and tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames, wherein portions of the synthetic speech utterance corresponding to regions with no detected mouth movements are padded with silence to determine an aligned speech utterance.’ The Applicant claims 1, 12, 13 disclose multiple boxes are placed on each video frame to generate anchors that represent the location, size, and shape of a single speaker, and these anchors are classified by comparison with predefined reference anchors to consistently identify the speaker. 
The applicant method then automatically selects words and sentences from a data source to generate synthetic speech, and aligns the speech with the video by tracking lip landmarks across successive frames to detect mouth movements and their changes over time, padding silence where no mouth movement is detected to achieve accurate lip-synchronized output. In the portions relied upon by the Examiner, Song describes example structures of first and second artificial neural networks used by a server. Paragraphs [0060] and [0093] generally discuss neural-network architectures and facial landmark-based face tracking for analysis. However, these disclosures are limited to describing trained neural networks and tracking facial landmarks, without detailing how anchors are generated from tiled boxes, how such anchors encode location/shape/size, or how anchors are classified by correlation with pre-set anchors. Song does not disclose or suggest the following limitation of the applicant amended independent claims 1, 12 and 13: (i) tiling a plurality of boxes on each frame to generate anchors representing location, shape, and size, (ii) classifying anchors by correlating with pre-set anchors, (iii) automatically selecting vocabulary from a data source for text-to-speech generation, or (iv) aligning synthetic speech to video using lip-landmark changes tracked relative to anchors across frames with silence padding. Song's neural-network descriptions and facial landmark tracking are therefore structurally and functionally different from the Applicant's anchor-based, speech-selection, and anchor- relative lip-alignment pipeline, and do not teach or render obvious the claimed limitations.” In response, the examiner would like to refer to the art used to reject the argued elements. 
Specifically, the examiner respectfully asserts that Mann was used to reject (i) classifying a plurality of anchors by correlating them with a series of pre-set anchors to detect a single speaker and (ii) detecting mouth movements in each frame by detecting lip landmarks. It is unclear to the examiner why Song is relevant to these elements. Applicant fails to consider the anchors of Mann in view of the combination of Song + Qin + Yoo. In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims patentably distinguishes them from the references. Applicant's arguments do not comply with 37 CFR 1.111(c) because they do not clearly point out the patentable novelty which he or she thinks the claims present in view of the state of the art disclosed by the references cited or the objections made. Further, they do not show how the amendments avoid such references or objections. Applicant’s representative continues, “The Office Action asserts that Song, Qin, Yoo, and Mann are analogous art in the field of video analysis for dubbing. The Examiner contends that it would have been obvious to a person of ordinary skill in the art, prior to the effective filing date, to modify Song in view of Qin, further in view of Yoo, to incorporate the teachings of Mann. Specifically, Mann is cited for tracking specific facial features using facial landmarks during an audio dubbing process to improve audio/visual synchronization ([0005]). 
The Examiner therefore concludes that combining Song's video processing, Qin's filtering, Yoo's multi-modal facial feature tracking, and Mann's landmark-based facial tracking would have rendered the claimed invention obvious. The Applicant respectfully submits that the cited references Song, Qin, Yoo, and Mann do not disclose several critical features of the claimed invention. Specifically, none of the references teach tiling a plurality of detection boxes on each video frame with different scales and aspect ratios to refine single-speaker detection, generating multiple anchors from these boxes where each anchor represents the speaker's location, shape, and size, or classifying the anchors by correlating with a series of pre-set anchors for accurate tracking across frames. Further, the prior art does not disclose aligning synthetic speech with the driving face video by detecting mouth movements and tracking lip-landmarks relative to the anchors, nor handling portions of synthetic speech corresponding to frames with no mouth movement by padding with silence. These features are essential to the claimed invention and are absent from the combined teachings of the cited references, demonstrating a non-obvious technical advancement. The Office Action asserts that Yoo discloses tracking changes of lip landmarks relative to a plurality of anchors across successive frames to determine a rate of change of lip landmarks between a predefined threshold of frames. In particular, the Examiner relies on Figs. 10A-B, 12A- 12G, and paragraphs [068], [071], [082], [083], and [179], which describe using a landmark detection module and location adjustment module to track facial elements such as the eyes, nose, mouth, and forehead and to determine conversation states based on changes in mouth features. The Office Action contends that these disclosures correspond to the claimed lip-tracking and alignment step. In the portion relied on, Yoo (Figs. 
10A-B, and Para [068], [071], [082], [083], [179]) describes tracking facial elements such as the eyes, nose, mouth, and forehead using a landmark detection module and a location adjustment module. Yoo further discloses determining conversation states by analysing changes in the mouth region over successive frames of a video. The focus of these techniques is on inferring whether a user is in a conversation or detecting emotional states from facial movements, rather than generating synthetic speech or aligning it with lip movements. The Applicant's amended claims require aligning each synthetic speech utterance with a region in the driving face video by detecting mouth movements in each frame, tracking changes of lip landmarks relative to a plurality of anchors across successive frames, and determining a rate of change of the lip-landmarks between a predefined threshold of frames, as recited in the applicant amended claims 1, 12 and 13. Yoo does not teach generating anchors from tiled detection boxes, correlating anchors with pre-set anchors, or using the rate of change of lip landmarks to synchronize synthetic speech with lip movements. The applicant claimed invention operates at a fundamentally different technical level, providing precise frame-by-frame lip-speech alignment, which is neither disclosed nor suggested by Yoo. The Office Action asserts that Song discloses generating lip-sync data for a plurality of template video frames, including frame identification, lip images, and position information (Figs. 7, 10, 11, and Para [062], [080], [146], [151], [152]). The Examiner states that Song modifies a first lip image according to a voice input, generates aligned synthetic speech utterances, and produces output frames by overlaying the lip images on the driving face video on a frame-by- frame basis. 
The Office Action further asserts that Song pads portions of the synthetic speech corresponding to regions with no detected mouth movements with silence, and that the combination of these modified frames generates a synthetic talking-head video with lip movements of the single speaker modified in correspondence with the synthetic speech. The Applicant's amended claims 1, 12 and 13 require aligning each synthetic speech utterance with a region in the driving face video by detecting mouth movements, tracking changes of the lip landmarks relative to a plurality of anchors across successive frames, and determining a rate of change of the lip landmarks between a predefined threshold of frames. Song does not disclose generating anchors from tiled boxes, correlating these anchors with pre-set anchors, or performing rate-of-change-based lip alignment for synthetic speech. The applicant's claimed invention provides a technical improvement by precisely synchronizing synthetic speech with lip movements using anchor-based tracking and temporal analysis, which is not taught or suggested by Song. The Office Action asserts that Song, Qin, Yoo, and Mann disclose aspects of lip-syncing, face tracking, and synthetic video generation, and suggests it would have been obvious to combine these references. However, the references operate on fundamentally incompatible architectures and data types. Mann's high-fidelity 3D dubbing pipeline relies on full 3D facial reconstruction and neural rendering, whereas Qin's system generates only 2D text- overlaid images for training datasets. Integrating Qin's 2D approach into Mann's 3D pipeline would destroy the 3D geometry and prevent proper performance capture. Similarly, Song's 2D overlay system for real-time delivery is incompatible with Qin's offline batch process, as Qin is not designed for frame-by-frame real-time rendering. Moreover, the technical approaches of Song and Mann are mutually exclusive. 
Song pastes 2D lip images onto video frames, while Mann regenerates face using 3D meshes. Yoo's 2D emotion-based feature extraction, which determines only binary conversation states, is too coarse to drive Mann's phoneme-level lip synchronization. The prior art provides no teaching of integrating 2D conversation-state or emotion data into a 3D generative pipeline for accurate speech-lip alignment, highlighting a clear technical incompatibility. The applicant respectfully submits that amended claims invention includes a novel rate-of-change lip alignment mechanism, which measures changes in lip landmarks across a predefined threshold of frames to synchronize synthetic speech and insert silence where no mouth movement occurs. Neither Song, Mann, nor Yoo teach or suggest this rate-of-change alignment, and forcing their combination would require a complete architectural redesign rather than a simple combination. For these reasons, the Applicant respectfully submits that the asserted combination fails to render the claimed invention obvious and respectfully requests that the Examiner reconsider and withdraw the §103 rejection.” In response, the examiner respectfully disagrees with Applicant’s assertion that “Mann's high-fidelity 3D dubbing pipeline relies on full 3D facial reconstruction and neural rendering, whereas Qin's system generates only 2D text-overlaid images for training datasets. Integrating Qin's 2D approach into Mann's 3D pipeline would destroy the 3D geometry and prevent proper performance capture. Similarly, Song's 2D overlay system for real-time delivery is incompatible with Qin's offline batch process, as Qin is not designed for frame-by-frame real-time rendering”. Specifically, Mann discloses 2D synthetic facial model image renders ([0044]) which have a corresponding original boxed image of the character ([0062]). This indicates a boxed 2D image which would be applicable to the 2D system of Qin. 
Further, with regard to “frame-by-frame real-time rendering”, the examiner respectfully asserts that Qin explicitly discloses video frame analysis for determining which videos contain one person (see facial key point detection for each frame of the face video, [0130]). It is unclear why the frame-analysis of Qin could not be applied to the frame-analysis of Song. With regard to Applicant’s assertions that the prior art does not teach individual elements of the claims, the examiner would like to refer to the previous analysis given above. Further, Yoo need not disclose synthetic speech alignment because Song discloses this. Song need not disclose generating anchors from tiled boxes because Yoo in view of Mann discloses this. In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Applicant’s conclusory arguments, see pgs. 41-42, are not persuasive in view of the above analysis maintaining the combination of art used to reject the claims. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) 1, 3, 6, 10-13, 15, 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Song et al. (US-20230023102-A1), hereinafter Song, in view of Qin et al. (CN-111798543-A), hereinafter Qin, further in view of Yoo et al. (US-20190163965-A1), hereinafter Yoo, further in view of Mann et al. (US-20230015971-A1), hereinafter Mann. Regarding claim 1, Song discloses: a processor-implemented method ([Fig. 
2, first processor 120]) for automatically generating at least one synthetic talking head video using a machine learning model ([0120] the first artificial neural network 520 according to an embodiment of the present disclosure may refer to an artificial neural network that is trained (or learns) to output a second lip image 543, which is an image generated by modifying the first lip image 542 according to the voice 531), the method comprising: extracting at least one feature from each frame of at least one video that is extracted from at least one data source ([0073] generate lip-sync data including identification information regarding frames in the template video, generated lip images, and information of positions of the lip images in template frames [lip-sync data including lip position information tracks to feature data], where, [0135] the server 100 may obtain a template video by loading the template video from the memory 130 [Memory maps to a data source]); generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the at least one data source ([Fig. 9], [0139] In a selective embodiment of the present disclosure, the server 100 may generate a target voice from a text by using a trained second artificial neural network, where, [0086] the memory 330 may store training data [Gathering training data from a memory, i.e. data source, indicates that any training sample can be automatically selected as training data, text tracks to a vocabulary of words and sentences]); and, modifying lip movements of the single speaker that are originally present in the driving face video corresponding to the synthetic speech utterances ([0146] modifying the first lip image 542 according to a voice, as the voice 531 and the first lip image 542 are input, [Wherein the voice can be generated through input text 570, indicating synthesis, as disclosed in Fig. 9]). 
Song does not disclose: analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video. Qin discloses: analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video ([0128] Use the face detection model provided by the DLIB library to perform face detection on the video clips obtained in the first part one by one, delete the videos with two or more faces in each video clip [Deleting videos with more than one face detected indicates a similar determination to keep videos for processing when there is only one face, in view of the lip-sync method of Song]). Song and Qin are considered analogous art within video frame analysis/modification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song to incorporate the teachings of Qin, because of the novel way to identify the number of people/faces within a video and to discard video accordingly (Qin, [0096]). Song in view of Qin does not disclose: wherein the face-detection model detects a single speaker in the driving face video by, tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, wherein each box is used to refine detection of the single speaker in each frame. Yoo discloses: wherein the face-detection model detects a single speaker in the driving face video by, tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, wherein each box is used to refine detection of the single speaker in each frame ([Figs. 
9A-C, Boxes A1-A3], [0175] the video pre-processing unit 210 sets the recognition target area A including the eye recognition area A.sub.1, the nose recognition area A.sub.2, and the mouth recognition area A.sub.3 based on the face element location information AL of the recognizable eyes, nose, and mouth of the user, [Wherein the boxes on the face of the speaker in Fig. 9 clearly have different scales and aspect ratios demonstrated through the differing box sizes and change based on positions, i.e. frame, of the person in the video]). Song, Qin, and Yoo are considered analogous art within video feature monitoring/tracking, wherein the features being tracked are facial. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin to incorporate the teachings of Yoo, because of the novel way to implement multi-modal emotion recognition to received video, including mouth shape tracking to determine conversation states, improving the accuracy of emotion recognition (Yoo, [0005]-[0006]). Song in view of Qin, further in view of Yoo does not disclose: generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video, wherein each anchor represents a location of the single speaker, a shape of the single speaker, and a size of the single speaker; and classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors, and aligning each synthetic speech utterance with a region in the driving face video by detecting mouth movements in each frame of the driving face video by detecting lip landmarks and tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames. 
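Illustrative note (not part of the record): the limitation recited above names three operations, namely tiling boxes of different scales and aspect ratios to generate anchors, classifying anchors by correlating them with pre-set anchors, and measuring a rate of change of lip landmarks relative to the anchors over a set number of frames. The Python sketch below is one possible reading of those operations and is offered only as an aid to following the dispute. It is not taken from the application or from Song, Qin, Yoo, or Mann. All function names, the (cx, cy, w, h) anchor layout, the use of intersection-over-union as the "correlation", the 0.5 threshold, and the normalization of landmarks by anchor size are assumptions made for illustration.

```python
import math
from itertools import product


def tile_anchors(frame_w, frame_h, stride, scales, aspect_ratios):
    """Tile boxes over a frame. Each anchor is (cx, cy, w, h): a location,
    a shape (aspect ratio w/h), and a size (area scale**2). Hypothetical layout."""
    anchors = []
    centers_y = range(stride // 2, frame_h, stride)
    centers_x = range(stride // 2, frame_w, stride)
    for cy, cx in product(centers_y, centers_x):
        for scale, ratio in product(scales, aspect_ratios):
            w = scale * math.sqrt(ratio)
            h = scale / math.sqrt(ratio)
            anchors.append((cx, cy, w, h))
    return anchors


def iou(a, b):
    """Intersection-over-union of two (cx, cy, w, h) boxes."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def classify_anchors(anchors, preset_anchors, threshold=0.5):
    """Assumed reading of 'correlating with pre-set anchors': an anchor is
    positive when its best IoU against any pre-set anchor meets the threshold."""
    return [max(iou(a, p) for p in preset_anchors) >= threshold for a in anchors]


def normalize(points, anchor):
    """Express lip landmarks relative to an anchor (centered and size-scaled)."""
    cx, cy, w, h = anchor
    return [((x - cx) / w, (y - cy) / h) for x, y in points]


def lip_rate_of_change(lips_per_frame, anchor_per_frame, window):
    """Mean anchor-relative landmark displacement per frame, measured between
    frames that are `window` frames apart."""
    rates = []
    for t in range(len(lips_per_frame) - window):
        before = normalize(lips_per_frame[t], anchor_per_frame[t])
        after = normalize(lips_per_frame[t + window], anchor_per_frame[t + window])
        mean_dist = sum(math.dist(p, q) for p, q in zip(before, after)) / len(before)
        rates.append(mean_dist / window)
    return rates
```

Under these assumptions, a face that merely translates within the frame (anchor and landmarks shifting together) yields a rate of zero, whereas a mouth opening inside a stationary anchor yields a positive rate. That distinction is the one at issue between a box that is an output of detection and an anchor used as a reference for landmark motion.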
Mann discloses: generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video ([0065] Once each character's boxed SD image sequence has been created, a precision face tracking process is undertaken to accurately analyze the facial features of each character in their training material [Face tracking indicates an anchor location in order to be able to distinguish and track changes within the facial features of the frames, see tracking facial landmarks [0093]]); wherein each anchor represents a location of the single speaker ([0090] the X-Y coordinates… of the face is determined), a shape of the single speaker ([0090] creating an effective “box” around the face [Boxing faces gives the speaker’s “shape” in the form of a box dependent on location and size]), and a size of the single speaker ([0090] the size of the face is determined); and, classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors to detect the single speaker ([0093] precise face tracking is run to give the most accurate analysis of the face. This tracks the key points of the face (facial landmarks) in the 2D image [Tracking points of facial landmarks indicates stored data to be used for the tracking in view of the initial facial landmark determinations ([0060]), which further indicates a comparison of original pre-set anchors, i.e. initial facial landmark determinations, to the plurality of anchors, i.e. facial landmarks after initial location is determined, in order to track and classify changes to the talking head. In view of the face detection model of Song in view of Qin]), and, detecting mouth movements in each frame of the driving face video by detecting lip landmarks ([0060] Once each boxed TD image sequence has been created, a precision face tracking process is undertaken to accurately analyze the facial features as each character speaks. 
The tracking points identified on the faces are referred to as facial landmarks, where, [0070] In combination with the FL dialogue edit, the material can be assessed for image quality, lip sync, [0076] The face-tracking finds facial landmarks—key points on the face—for every frame of footage [Using facial landmarks to track speakers, wherein the produced material can be assessed for lip sync, indicates lip landmarks as part of the facial landmarks]). Song, Qin, Yoo, and Mann are considered analogous art within video analysis for dubbing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin, further in view of Yoo to incorporate the teachings of Mann, because of the novel way to track specific facial features using facial landmarks when going through an audio dubbing process for improved audio/visual synchronization (Mann, [0005]). Yoo further discloses: tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames ([Figs. 10A-B], [Figs. 
12A-12G], [0077] a landmark detection module 213, a location adjustment module 214, [0082] the landmark detection module 213 can confirm the location information of a important face element such as a face, eyes, a mouth, a nose, and a forehead in the recognition target area A, [0083] The location adjustment module 214 can adjust the location based on the face element location information, [0057] the system analyzes one or more video frames of the video to obtain a first set of indices (face feature data) indicating characteristic of one or more face elements of a person appearing in the video, [0059] the system analyzes a video comprising a first portion (frames F.sub.1, F.sub.2), a second portion (frames F.sub.3, F.sub.4, F.sub.5) following the first portion and a third portion (frames F.sub.6, F.sub.7) following the second portion, [0179] the conversation state inference module 310 determines whether or not it is in a conversation state through the mouth recognition area A.sub.3 based on the mouth video data, [0068] The preliminary inference unit 300 can generate situation determination data P as to whether or not the user's situation changes according to a temporal sequence based on the video data DV, [0071] the preliminary inference unit 300 can generate the conversation determination data P.sub.1 for determining whether or not the user is in a conversation state based on the face feature data DF.sub.1, [Each facial element bounding box location tracks to an anchor for each element in view of the previously disclosed anchors of Mann (consider the A and A' wherein A’ represents the anchor of Figs. 10A-B). Further, analyzing portions of video containing a defined number of frames, i.e. 2 or 3, indicates this number of frames to be representative of a threshold of frames. 
Further still, determining changes in mouth shape to be associated with a speaking state indicates the speaking state is determined based on a rate of change between two portions of mouth facial element locations, i.e. including lip landmarks. Determining whether or not a user is in a conversation state based on changes in mouth feature data indicates a required rate of change measurement of the facial feature data to determine changes in mouth shape for identifying conversational states, i.e. no mouth changes (rate of change of facial features close to zero) means the user is not in a conversational state]). Song further discloses: aligning each synthetic speech utterance with a region in the driving face video ([Figs. 10, 11, Generate Output Frame S645, S655], [0151] the user terminal 200 may generate an output frame by overlapping a lip image included in the first lip-sync data on the frame read in operation S644 based on the position information of the lip image included in the first lip-sync data (operation S645) and display the same [Obtaining a template video 610 and target voice 620 and then going through a frame-by-frame lip-syncing modification indicates the speech, i.e. lip-sync data, and driving face, i.e. lip image data (In view of the driving face video of Song in view of Qin), are aligned on a frame-by-frame basis as well considering the portion/frame analysis of Yoo]); wherein portions of the synthetic speech utterance corresponding to regions with no detected mouth movements are padded with silence to determine an aligned speech utterance ([Fig. 
7, 531, 542, 543], [0062] generate lip-sync data including frame identification information of a frame in the template video, the lip image, and position information regarding the lip image in a frame in the template video… [0080] lip-sync data including generated lip images [Generating lip sync data based on positions of lips in images of the video and associated synthesized speech 531, in view of the first lip image of Fig. 7, 542 which is “modified” to remain the same as shown in a second lip image 543, indicating a region with no mouth movement, further indicating the synthesized speech associated with the movement has been silence padded for this region, i.e. there is no speaking (see the end of the example signal 531), in view of the lip image 641 + target voice 620 used to generate frames of Figs. 10, 11 indicating a correspondence between the two pieces of data, in view of the previously disclosed synthesized speech and driving face video, for alignment on a frame, i.e. portion, basis. There is no limitation that this lip “modification” has to be at the end of a speaking interval, i.e. consider the plurality of frames being analyzed in Song, indicating zero padding in regions with no mouth movements. The example frame of Fig. 
6 with no mouth movement could be representative of any time interval of the input signal 511]); and, generating, using the machine learning model ([In view of the previously disclosed machine learning model]), the at least one synthetic talking head video by providing the driving face video and the aligned speech utterances to a talking face video generator framework which modifies lip movements of a single speaker originally present in the driving face video ([0146] modifying the first lip image 542 according to a voice, as the voice 531 and the first lip image 542 are input… [0152] The server 100 according to an embodiment of the present disclosure may generate the lip-sync data for a plurality of template video frames on a frame-by-frame basis. In this case, the user terminal 200 may receive lip-sync data generated on the frame-by-frame basis and generate output frames for each lip-sync data [Generating lip-synced output frames, wherein that output comprises modification of lip movements, on a frame-by-frame basis indicates the combination of those frames will result in a synthetic talking head video, wherein the lip movements are modified based on synthetic speech and the video contains a single speaker, i.e. voice 531, as previously discussed]). Regarding claim 3, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the processor-implemented method of claim 1. 
Song further discloses: generating a synthetic talking head videos database based on the at least one synthetic talking head video that is generated ([0086] the memory 330 may store training data for training an artificial neural network or data received from the service server 300, where, [0080] the service server 300 may be a device that receives lip-sync data including generated lip images from the server 100, generates output frames by using the lip-sync data, and provide the output frames to another device [Memory receiving modified lip-sync output from service server 300 indicates the memory 330 is a talking head video database]). Regarding claim 6, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the processor-implemented method of claim 1. Song further discloses: wherein the at least one feature comprises at least one of: faces of speakers ([Fig. 12, 590]); lip structures ([Fig. 7, first lip image 542]); and, lip movements ([Fig. 7, second lip images 543] [Transitioning from a first to second lip image indicates a movement]). Qin further discloses: wherein the at least one feature comprises at least one of: number of identities ([0128] delete the videos with two or more faces in each video clip [Indicating a number of identity determination to know which videos to delete]); Mann further discloses: wherein the at least one feature comprises at least one of: head-pose of speaker ([0095] The precise 3D model of the face takes account of head pose); and, head movements ([0086] The novel techniques revolve around the facial landmark tracking of the extracted face [Facial landmark tracking indicates the face is moving which is also indicative of a head moving]). The examiner would like to note that the disjunctive nature of the claim does not require a mapping to every element featured. Regarding claim 10, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the processor-implemented method of claim 1. 
Song further discloses: retaining the background, the camera variations, and the head movements that correspond to the at least one video when the at least one synthetic head video is generated ([Fig. 12], [0157] the user terminal 200 may generate an individual output frame 711 by overlapping the lip image 544 generated by the server 100 on a specific frame 590 of the template video [In view of the video editing done of Fig. 12, it is demonstrated that only the lips are altered, indicating the background, camera variations, and head movements that correspond to talking head 590 are retained]). Regarding claim 11, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the processor-implemented method of claim 1. Song further discloses: training the machine learning model using historical driving face videos ([0119] The first artificial neural network 520 according to an embodiment of the present disclosure may refer to a neural network that is trained (or learns) correlations between a first lip image, a voice, and a second lip image included in each of the plurality of pieces of training data 510 [Training a lip-syncing model to select a plurality of alternate lip images to replace first lip images indicates the replacement images and associated training is based on historical driving face videos for lip-syncing, in view of the driving face video of Fig. 
12]); and, training the machine learning model using historical synthetic speech utterances ([0126] The second artificial neural network 560 according to an embodiment of the present disclosure may refer to a neural network that is trained (or learns) correlations between a text included in each of the plurality of training data 550 and a target voice corresponding to a reading sound of the corresponding voice [Training a model to learn a correlation between text and voice data indicates that the training is composed of historical speech utterances and their associated texts]); and, training the machine learning model using historical native accents ([0127] an artificial neural network that is trained (or learns) to output the target voice 580 corresponding to a text 570 [Generating target voices without reference to translation anywhere within Song indicates that the training to generate a target voice based on a text is performed on a historical native accent, i.e. that which is familiar to the user of a system, e.g. in a native sense, so they can analyze accuracy of generated lip movements corresponding to the synthesized speech. Accuracy can only be determined if you know what the correct lip-sync is supposed to look like, requiring a native accent]). Mann further discloses: training the machine learning model using historical non-native accents ([0106] once trained, the audio of a performance in one language can be fed to the neural network and a translation, with matching performance, nuances, and audio characteristics—is generated in the secondary language [The ability of a system to produce matching performance of a generated translation in another language indicates the training was done using non-native accents, i.e. those corresponding to the other languages into which the text is being translated, in view of the text input of Song]). 
Regarding claim 12, Song discloses: a system for automatically generating at least one synthetic talking head video using a machine learning model ([0120] the first artificial neural network 520 according to an embodiment of the present disclosure may refer to an artificial neural network that is trained (or learns) to output a second lip image 543, which is an image generated by modifying the first lip image 542 according to the voice 531), the system comprising: a device processor ([Fig. 2, first processor 120]); and, a non-transitory computer-readable storage medium storing one or more sequences of instructions ([0167] may be implemented in the form of a computer program that can be executed through various components on a computer, such a computer program may be recorded on a computer readable medium…a hard disk, a floppy disk), which when executed by the device processor, causes: extracting at least one feature from each frame of at least one video that is extracted from at least one data source ([0073] generate lip-sync data including identification information regarding frames in the template video, generated lip images, and information of positions of the lip images in template frames [lip-sync data including lip position information tracks to feature data], where, [0135] the server 100 may obtain a template video by loading the template video from the memory 130 [Memory maps to a data source]); generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the at least one data source ([Fig. 9], [0139] In a selective embodiment of the present disclosure, the server 100 may generate a target voice from a text by using a trained second artificial neural network, where, [0086] the memory 330 may store training data [Gathering training data from a memory, i.e. 
data source, indicates that any training sample can be automatically selected as training data; text tracks to a vocabulary of words and sentences]); and, modifying lip movements of the single speaker that are originally present in the driving face video corresponding to the synthetic speech utterances ([0146] modifying the first lip image 542 according to a voice, as the voice 531 and the first lip image 542 are input, [Wherein the voice can be generated through input text 570, indicating synthesis, as disclosed in Fig. 9]). Song does not disclose: analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video. Qin discloses: analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video ([0128] Use the face detection model provided by the DLIB library to perform face detection on the video clips obtained in the first part one by one, delete the videos with two or more faces in each video clip [Deleting videos with more than one face detected indicates a similar determination to keep videos for processing when there is only one face, in view of the lip-sync method of Song]). Song and Qin are considered analogous art within video frame analysis/modification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song to incorporate the teachings of Qin, because of the novel way to identify the number of people/faces within a video and to discard video (Qin, [0096]). 
Song in view of Qin does not disclose: wherein the face-detection model detects a single speaker in the driving face video by, tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, wherein each box is used to refine detection of the single speaker in each frame. Yoo discloses: wherein the face-detection model detects a single speaker in the driving face video by, tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, wherein each box is used to refine detection of the single speaker in each frame ([Figs. 9A-C, Boxes A1-A3], [0175] the video pre-processing unit 210 sets the recognition target area A including the eye recognition area A.sub.1, the nose recognition area A.sub.2, and the mouth recognition area A.sub.3 based on the face element location information AL of the recognizable eyes, nose, and mouth of the user, [Wherein the boxes on the face of the speaker in Fig. 9 clearly have different scales and aspect ratios demonstrated through the differing box sizes and change based on positions, i.e. frame, of the person in the video]). Song, Qin, and Yoo are considered analogous art within video feature monitoring/tracking, wherein the features being tracked are facial. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin to incorporate the teachings of Yoo, because of the novel way to implement multi-modal emotion recognition to received video, including mouth shape tracking to determine conversation states, improving the accuracy of emotion recognition (Yoo, [0005]-[0006]). 
Song in view of Qin, further in view of Yoo does not disclose: generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video, wherein each anchor represents a location of the single speaker, a shape of the single speaker, and a size of the single speaker; and classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors, and aligning each synthetic speech utterance with a region in the driving face video by detecting mouth movements in each frame of the driving face video by detecting lip landmarks and tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames. Mann discloses: generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video ([0065] Once each character's boxed SD image sequence has been created, a precision face tracking process is undertaken to accurately analyze the facial features of each character in their training material [Face tracking indicates an anchor location in order to be able to distinguish and track changes within the facial features of the frames, see tracking facial landmarks [0093]]); wherein each anchor represents a location of the single speaker ([0090] the X-Y coordinates… of the face is determined), a shape of the single speaker ([0090] creating an effective “box” around the face [Boxing faces gives the speaker’s “shape” in the form of a box dependent on location and size]), and a size of the single speaker ([0090] the size of the face is determined); and, classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors to detect the single speaker ([0093] precise face tracking is run to give the most accurate analysis of the face. 
This tracks the key points of the face (facial landmarks) in the 2D image [Tracking points of facial landmarks indicates stored data to be used for the tracking in view of the initial facial landmark determinations ([0060]), which further indicates a comparison of original pre-set anchors, i.e. initial facial landmark determinations, to the plurality of anchors, i.e. facial landmarks after initial location is determined, in order to track and classify changes to the talking head. In view of the face detection model of Song in view of Qin]), and, detecting mouth movements in each frame of the driving face video by detecting lip landmarks ([0060] Once each boxed TD image sequence has been created, a precision face tracking process is undertaken to accurately analyze the facial features as each character speaks. The tracking points identified on the faces are referred to as facial landmarks, where, [0070] In combination with the FL dialogue edit, the material can be assessed for image quality, lip sync, [0076] The face-tracking finds facial landmarks—key points on the face—for every frame of footage [Using facial landmarks to track speakers, wherein the produced material can be assessed for lip sync, indicates lip landmarks as part of the facial landmarks]). Song, Qin, Yoo, and Mann are considered analogous art within video analysis for dubbing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin, further in view of Yoo to incorporate the teachings of Mann, because of the novel way to track specific facial features using facial landmarks when going through an audio dubbing process for improved audio/visual synchronization (Mann, [0005]). 
Yoo further discloses: tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames ([Figs. 10A-B], [Figs. 12A-12G], [0077] a landmark detection module 213, a location adjustment module 214, [0082] the landmark detection module 213 can confirm the location information of a important face element such as a face, eyes, a mouth, a nose, and a forehead in the recognition target area A, [0083] The location adjustment module 214 can adjust the location based on the face element location information, [0057] the system analyzes one or more video frames of the video to obtain a first set of indices (face feature data) indicating characteristic of one or more face elements of a person appearing in the video, [0059] the system analyzes a video comprising a first portion (frames F.sub.1, F.sub.2), a second portion (frames F.sub.3, F.sub.4, F.sub.5) following the first portion and a third portion (frames F.sub.6, F.sub.7) following the second portion, [0179] the conversation state inference module 310 determines whether or not it is in a conversation state through the mouth recognition area A.sub.3 based on the mouth video data, [0068] The preliminary inference unit 300 can generate situation determination data P as to whether or not the user's situation changes according to a temporal sequence based on the video data DV, [0071] the preliminary inference unit 300 can generate the conversation determination data P.sub.1 for determining whether or not the user is in a conversation state based on the face feature data DF.sub.1, [Each facial element bounding box location tracks to an anchor for each element in view of the previously disclosed anchors of Mann (consider the A and A' wherein A’ represents the anchor of Figs. 10A-B). Further, analyzing portions of video containing a defined number of frames, i.e. 
2 or 3, indicates this number of frames to be representative of a threshold of frames. Further still, determining changes in mouth shape to be associated with a speaking state indicates the speaking state is determined based on a rate of change between two portions of mouth facial element locations, i.e. including lip landmarks. Determining whether or not a user is in a conversation state based on changes in mouth feature data indicates a required rate of change measurement of the facial feature data to determine changes in mouth shape for identifying conversational states, i.e. no mouth changes (rate of change of facial features close to zero) means the user is not in a conversational state]). Song further discloses: aligning each synthetic speech utterance with a region in the driving face video ([Figs. 10, 11, Generate Output Frame S645, S655], [0151] the user terminal 200 may generate an output frame by overlapping a lip image included in the first lip-sync data on the frame read in operation S644 based on the position information of the lip image included in the first lip-sync data (operation S645) and display the same [Obtaining a template video 610 and target voice 620 and then going through a frame-by-frame lip-syncing modification indicates the speech, i.e. lip-sync data, and driving face, i.e. lip image data (In view of the driving face video of Song in view of Qin), are aligned on a frame-by-frame basis as well considering the portion/frame analysis of Yoo]); wherein portions of the synthetic speech utterance corresponding to regions with no detected mouth movements are padded with silence to determine an aligned speech utterance ([Fig. 
7, 531, 542, 543], [0062] generate lip-sync data including frame identification information of a frame in the template video, the lip image, and position information regarding the lip image in a frame in the template video… [0080] lip-sync data including generated lip images [Generating lip sync data based on positions of lips in images of the video and associated synthesized speech 531, in view of the first lip image of Fig. 7, 542 which is “modified” to remain the same as shown in a second lip image 543, indicating a region with no mouth movement, further indicating the synthesized speech associated with the movement has been silence padded for this region, i.e. there is no speaking (see the end of the example signal 531), in view of the lip image 641 + target voice 620 used to generate frames of Figs. 10, 11 indicating a correspondence between the two pieces of data, in view of the previously disclosed synthesized speech and driving face video, for alignment on a frame, i.e. portion, basis. There is no limitation that this lip “modification” has to be at the end of a speaking interval, i.e. consider the plurality of frames being analyzed in Song, indicating zero padding in regions with no mouth movements. The example frame of Fig. 
6 with no mouth movement could be representative of any time interval of the input signal 511]); and, generating, using the machine learning model ([In view of the previously disclosed machine learning model]), the at least one synthetic talking head video by providing the driving face video and the aligned speech utterances to a talking face video generator framework which modifies lip movements of a single speaker originally present in the driving face video ([0146] modifying the first lip image 542 according to a voice, as the voice 531 and the first lip image 542 are input… [0152] The server 100 according to an embodiment of the present disclosure may generate the lip-sync data for a plurality of template video frames on a frame-by-frame basis. In this case, the user terminal 200 may receive lip-sync data generated on the frame-by-frame basis and generate output frames for each lip-sync data [Generating lip-synced output frames, wherein that output comprises modification of lip movements, on a frame-by-frame basis indicates the combination of those frames will result in a synthetic talking head video, wherein the lip movements are modified based on synthetic speech and the video contains a single speaker, i.e. voice 531, as previously discussed]). Regarding claim 13, Song discloses: one or more non-transitory computer-readable storage medium storing the one or more sequence of instructions ([0167] may be implemented in the form of a computer program that can be executed through various components on a computer, such a computer program may be recorded on a computer readable medium…a hard disk, a floppy disk), which when executed by one or more processors ([Fig. 
2, first processor 120]), causes to perform a method for automatically generating at least one synthetic talking head video using a machine learning model ([0120] the first artificial neural network 520 according to an embodiment of the present disclosure may refer to an artificial neural network that is trained (or learns) to output a second lip image 543, which is an image generated by modifying the first lip image 542 according to the voice 531), the method comprising: extracting at least one feature from each frame of at least one video that is extracted from at least one data source ([0073] generate lip-sync data including identification information regarding frames in the template video, generated lip images, and information of positions of the lip images in template frames [lip-sync data including lip position information tracks to feature data], where, [0135] the server 100 may obtain a template video by loading the template video from the memory 130 [Memory maps to a data source]); generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the at least one data source ([Fig. 9], [0139] In a selective embodiment of the present disclosure, the server 100 may generate a target voice from a text by using a trained second artificial neural network, where, [0086] the memory 330 may store training data [Gathering training data from a memory, i.e. data source, indicates that any training sample can be automatically selected as training data; text tracks to a vocabulary of words and sentences]); and, modifying lip movements of the single speaker that are originally present in the driving face video corresponding to the synthetic speech utterances ([0146] modifying the first lip image 542 according to a voice, as the voice 531 and the first lip image 542 are input, [Wherein the voice can be generated through input text 570, indicating synthesis, as disclosed in Fig. 9]). 
Song does not disclose: analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video. Qin discloses: analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video ([0128] Use the face detection model provided by the DLIB library to perform face detection on the video clips obtained in the first part one by one, delete the videos with two or more faces in each video clip [Deleting videos with more than one face detected indicates a similar determination to keep videos for processing when there is only one face, in view of the lip-sync method of Song]). Song and Qin are considered analogous art within video frame analysis/modification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song to incorporate the teachings of Qin, because of the novel way to identify the number of people/faces within a video and to discard video (Qin, [0096]). Song in view of Qin does not disclose: wherein the face-detection model detects a single speaker in the driving face video by, tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, wherein each box is used to refine detection of the single speaker in each frame. Yoo discloses: wherein the face-detection model detects a single speaker in the driving face video by, tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, wherein each box is used to refine detection of the single speaker in each frame ([Figs. 
9A-C, Boxes A1-A3], [0175] the video pre-processing unit 210 sets the recognition target area A including the eye recognition area A.sub.1, the nose recognition area A.sub.2, and the mouth recognition area A.sub.3 based on the face element location information AL of the recognizable eyes, nose, and mouth of the user, [Wherein the boxes on the face of the speaker in Fig. 9 clearly have different scales and aspect ratios demonstrated through the differing box sizes and change based on positions, i.e. frame, of the person in the video]). Song, Qin, and Yoo are considered analogous art within video feature monitoring/tracking, wherein the features being tracked are facial. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin to incorporate the teachings of Yoo, because of the novel way to implement multi-modal emotion recognition to received video, including mouth shape tracking to determine conversation states, improving the accuracy of emotion recognition (Yoo, [0005]-[0006]). Song in view of Qin, further in view of Yoo does not disclose: generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video, wherein each anchor represents a location of the single speaker, a shape of the single speaker, and a size of the single speaker; and classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors, and aligning each synthetic speech utterance with a region in the driving face video by detecting mouth movements in each frame of the driving face video by detecting lip landmarks and tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames. 
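For illustration only, the limitation recited above (tracking lip landmarks relative to anchors across a set number of frames to obtain a rate of change) can be sketched as follows. This is one possible reading, not the applicant's implementation or that of any cited reference; the function names, the (x, y, width, height) anchor layout, the normalization by anchor size, and the sample coordinates are all assumptions.

```python
from math import hypot

def normalize(landmarks, anchor):
    """Express lip landmarks relative to an anchor box (x, y, width, height)."""
    ax, ay, aw, ah = anchor
    return [((x - ax) / aw, (y - ay) / ah) for x, y in landmarks]

def rate_of_change(frames, window):
    """Mean per-frame landmark displacement over each run of `window` frames.

    `frames` is a list of (landmarks, anchor) pairs, one pair per video frame.
    `window` plays the role of the "predefined threshold of frames" (assumed).
    """
    if window < 2:
        raise ValueError("window must span at least two frames")
    norm = [normalize(lm, anchor) for lm, anchor in frames]
    rates = []
    for start in range(len(norm) - window + 1):
        first, last = norm[start], norm[start + window - 1]
        # Average distance moved by each landmark between the window ends.
        moved = sum(hypot(x1 - x0, y1 - y0)
                    for (x0, y0), (x1, y1) in zip(first, last)) / len(first)
        rates.append(moved / (window - 1))
    return rates

# Hypothetical frames: two lip-corner landmarks plus a face anchor box.
still = ([(40, 60), (60, 60)], (0, 0, 100, 100))
shifted = ([(50, 70), (70, 70)], (10, 10, 100, 100))  # face moved, lips did not
opened = ([(50, 80), (70, 80)], (10, 10, 100, 100))   # lips moved within the box
rates = rate_of_change([still, shifted, opened], window=2)
```

Because landmarks are measured relative to the anchor, moving the whole face leaves the rate at zero, while lip motion inside the box does not. Windows whose rate stays near zero would correspond to the "no detected mouth movements" regions that the claim pads with silence; the cutoff value is left open here.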
Mann discloses: generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video ([0065] Once each character's boxed SD image sequence has been created, a precision face tracking process is undertaken to accurately analyze the facial features of each character in their training material [Face tracking indicates an anchor location in order to be able to distinguish and track changes within the facial features of the frames, see tracking facial landmarks [0093]]); wherein each anchor represents a location of the single speaker ([0090] the X-Y coordinates… of the face is determined), a shape of the single speaker ([0090] creating an effective “box” around the face [Boxing faces gives the speaker’s “shape” in the form of a box dependent on location and size]), and a size of the single speaker ([0090] the size of the face is determined); and, classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors to detect the single speaker ([0093] precise face tracking is run to give the most accurate analysis of the face. This tracks the key points of the face (facial landmarks) in the 2D image [Tracking points of facial landmarks indicates stored data to be used for the tracking in view of the initial facial landmark determinations ([0060]), which further indicates a comparison of original pre-set anchors, i.e. initial facial landmark determinations, to the plurality of anchors, i.e. facial landmarks after initial location is determined, in order to track and classify changes to the talking head. In view of the face detection model of Song in view of Qin]), and, detecting mouth movements in each frame of the driving face video by detecting lip landmarks ([0060] Once each boxed TD image sequence has been created, a precision face tracking process is undertaken to accurately analyze the facial features as each character speaks. 
The tracking points identified on the faces are referred to as facial landmarks, where, [0070] In combination with the FL dialogue edit, the material can be assessed for image quality, lip sync, [0076] The face-tracking finds facial landmarks—key points on the face—for every frame of footage [Using facial landmarks to track speakers, wherein the produced material can be assessed for lip sync, indicates lip landmarks as part of the facial landmarks]). Song, Qin, are considered analogous art within video analysis for dubbing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin, further in view of Yoo to incorporate the teachings of Mann, because of the novel way to track specific facial features using facial landmarks when going through an audio dubbing process for improved audio/visual synchronization (Mann, [0005]). Yoo further discloses: tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames ([Figs. 10A-B], [Figs. 
12A-12G], [0077] a landmark detection module 213, a location adjustment module 214, [0082] the landmark detection module 213 can confirm the location information of a important face element such as a face, eyes, a mouth, a nose, and a forehead in the recognition target area A, [0083] The location adjustment module 214 can adjust the location based on the face element location information, [0057] the system analyzes one or more video frames of the video to obtain a first set of indices (face feature data) indicating characteristic of one or more face elements of a person appearing in the video, [0059] the system analyzes a video comprising a first portion (frames F.sub.1, F.sub.2), a second portion (frames F.sub.3, F.sub.4, F.sub.5) following the first portion and a third portion (frames F.sub.6, F.sub.7) following the second portion, [0179] the conversation state inference module 310 determines whether or not it is in a conversation state through the mouth recognition area A.sub.3 based on the mouth video data, [0068] The preliminary inference unit 300 can generate situation determination data P as to whether or not the user's situation changes according to a temporal sequence based on the video data DV, [0071] the preliminary inference unit 300 can generate the conversation determination data P.sub.1 for determining whether or not the user is in a conversation state based on the face feature data DF.sub.1, [Each facial element bounding box location tracks to an anchor for each element in view of the previously disclosed anchors of Mann (consider the A and A' wherein A’ represents the anchor of Figs. 10A-B). Further, analyzing portions of video containing a defined number of frames, i.e. 2 or 3, indicates this number of frames to be representative of a threshold of frames. 
Further still, determining changes in mouth shape to be associated with a speaking state indicates the speaking state is determined based on a rate of change between two portions of mouth facial element locations, i.e. including lip landmarks. Determining whether or not a user is in a conversation state based on changes in mouth feature data indicates a required rate of change measurement of the facial feature data to determine changes in mouth shape for identifying conversational states, i.e. no mouth changes (rate of change of facial features close to zero) means the user is not in a conversational state]). Song further discloses: aligning each synthetic speech utterance with a region in the driving face video ([Figs. 10, 11, Generate Output Frame S645, S655], [0151] the user terminal 200 may generate an output frame by overlapping a lip image included in the first lip-sync data on the frame read in operation S644 based on the position information of the lip image included in the first lip-sync data (operation S645) and display the same [Obtaining a template video 610 and target voice 620 and then going through a frame-by-frame lip-syncing modification indicates the speech, i.e. lip-sync data, and driving face, i.e. lip image data (In view of the driving face video of Song in view of Qin), are aligned on a frame-by-frame basis as well considering the portion/frame analysis of Yoo]); wherein portions of the synthetic speech utterance corresponding to regions with no detected mouth movements are padded with silence to determine an aligned speech utterance ([Fig. 
7, 531, 542, 543], [0062] generate lip-sync data including frame identification information of a frame in the template video, the lip image, and position information regarding the lip image in a frame in the template video… [0080] lip-sync data including generated lip images [Generating lip sync data based on positions of lips in images of the video and associated synthesized speech 531, in view of the first lip image of Fig. 7, 542 which is “modified” to remain the same as shown in a second lip image 543, indicating a region with no mouth movement, further indicating the synthesized speech associated with the movement has been silence padded for this region, i.e. there is no speaking (see the end of the example signal 531), in view of the lip image 641 + target voice 620 used to generate frames of Figs. 10, 11 indicating a correspondence between the two pieces of data, in view of the previously disclosed synthesized speech and driving face video, for alignment on a frame, i.e. portion, basis. There is no limitation that this lip “modification” has to be at the end of a speaking interval, i.e. consider the plurality of frames being analyzed in Song, indicating zero padding in regions with no mouth movements. The example frame of Fig. 
6 with no mouth movement could be representative of any time interval of the input signal 511]); and, generating, using the machine learning model ([In view of the previously disclosed machine learning model]), the at least one synthetic talking head video by providing the driving face video and the aligned speech utterances to a talking face video generator framework which modifies lip movements of a single speaker originally present in the driving face video ([0146] modifying the first lip image 542 according to a voice, as the voice 531 and the first lip image 542 are input… [0152] The server 100 according to an embodiment of the present disclosure may generate the lip-sync data for a plurality of template video frames on a frame-by-frame basis. In this case, the user terminal 200 may receive lip-sync data generated on the frame-by-frame basis and generate output frames for each lip-sync data [Generating lip-synced output frames, wherein that output comprises modification of lip movements, on a frame-by-frame basis indicates the combination of those frames will result in a synthetic talking head video, wherein the lip movements are modified based on synthetic speech and the video contains a single speaker, i.e. voice 531, as previously discussed]). Regarding claim 15, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the system of claim 12. 
Song further discloses: Wherein the processor is configured to generate a synthetic talking head videos database based on the at least one synthetic talking head video that is generated ([0086] the memory 330 may store training data for training an artificial neural network or data received from the service server 300, where, [0080] the service server 300 may be a device that receives lip-sync data including generated lip images from the server 100, generates output frames by using the lip-sync data, and provide the output frames to another device [Memory receiving modified lip-sync output from service server 300 indicates the memory 330 is a talking head video database]). Regarding claim 18, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the system of claim 12. Song further discloses: wherein the at least one feature comprises at least one of: faces of speakers ([Fig. 12, 590]); lip structures ([Fig. 7, first lip image 542]); and, lip movements ([Fig. 7, second lip images 543] [Transitioning from a first to second lip image indicates a movement]). Qin further discloses: wherein the at least one feature comprises at least one of: number of identities ([0128] delete the videos with two or more faces in each video clip [Indicating a number of identity determination to know which videos to delete]); Mann further discloses: wherein the at least one feature comprises at least one of: head-pose of speaker ([0095] The precise 3D model of the face takes account of head pose); and, head movements ([0086] The novel techniques revolve around the facial landmark tracking of the extracted face [Facial landmark tracking indicates the face is moving which is also indicative of a head moving]). The examiner would like to note that the disjunctive nature of the claim does not require a mapping to every element featured. Claim(s) 5, 17 is/are rejected under 35 U.S.C. 
103 as being unpatentable over Song in view of Qin, further in view of Yoo, further in view of Mann, further in view of Theobald et al. ("Lip-reading Enhancement for Law Enforcement"), hereinafter Theobald. Regarding claim 5, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the processor-implemented method of claim 3. Song in view of Qin, further in view of Yoo, further in view of Mann does not disclose: training a user in lip reading using the synthetic talking head videos database comprised of (i) the lip reading on isolated words, (ii) the lip reading missing words in sentences, and (iii) the lip reading the sentences with a context. Theobald discloses: training a user ([4.1, Par. 2] listener responses [Indicating the selected sentences of Theobald are used to train the listeners to improve their lip reading ability]) in lip reading using the synthetic talking head videos database comprised of (i) the lip reading on isolated words ([4.1, Par. 1] seemingly simple tasks, such as word spotting [Further, the input sentences could be changed to single words without a change in functionality of the system of Theobald]), (ii) the lip reading missing words in sentences ([4.2, Par. 3] The user was provided with the transcriptions for each utterance, but each contained a missing keyword, which the participants were asked to identify), and (iii) the lip reading the sentences with a context ([4.2, Table 1], [The plurality of test sentences all are in specific contexts, in view of the generated lip-syncing videos of Song]). Song, Qin, are considered analogous art within lip-reading.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin, further in view of Yoo, further in view of Mann to incorporate the teachings of Theobald, because of the novel way to improve the lip-reading ability of users not skilled in the task by exaggerating lip motion in video sequences (Theobald, Abstract). Regarding claim 17, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the system of claim 15. Song in view of Qin, further in view of Yoo, further in view of Mann does not disclose: wherein the processor is configured to train a user in lip reading using the synthetic talking head videos database comprised of (i) the lip reading on isolated words, (ii) the lip reading missing words in sentences, and (iii) the lip reading the sentences with a context. Theobald discloses: wherein the processor is configured to train a user ([4.1, Par. 2] listener responses [Indicating the selected sentences of Theobald are used to train the listeners to improve their lip reading ability]) in lip reading using the synthetic talking head videos database comprised of (i) the lip reading on isolated words ([4.1, Par. 1] seemingly simple tasks, such as word spotting [Further, the input sentences could be changed to single words without a change in functionality of the system of Theobald]), (ii) the lip reading missing words in sentences ([4.2, Par. 3] The user was provided with the transcriptions for each utterance, but each contained a missing keyword, which the participants were asked to identify), and (iii) the lip reading the sentences with a context ([4.2, Table 1], [The plurality of test sentences all are in specific contexts, in view of the generated lip-syncing videos of Song]). Song, Qin, are considered analogous art within lip-reading.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin, further in view of Mann, further in view of Yoo to incorporate the teachings of Theobald, because of the novel way to improve the lip-reading ability of users not skilled in the task by exaggerating lip motion in video sequences (Theobald, Abstract). Claim(s) 7, 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Song in view of Qin, further in view of Yoo, further in view of Mann, further in view of Mukherjee et al. (US-20230099732-A1), hereinafter Mukherjee. Regarding claim 7, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the processor-implemented method of claim 1. Song further discloses: training the text to speech model to generate the synthetic speech utterances by obtaining the vocabulary of the words and the sentences that are selected from the at least one data source ([Fig. 9], [0139] In a selective embodiment of the present disclosure, the server 100 may generate a target voice from a text by using a trained second artificial neural network, where, [0086] the memory 330 may store training data [Gathering training data from a memory, i.e. data source, indicates that any training sample can be automatically selected as training data]); and, converting the words and the sentences from the vocabulary to a sequence of sounds ([0127] Therefore, as shown in FIG. 9, the second artificial neural network 560 according to an embodiment of the present disclosure may refer to an artificial neural network that is trained (or learns) to output the target voice 580 corresponding to a text 570 as the text 570 is input). 
Song in view of Qin, further in view of Yoo, further in view of Mann does not disclose: adding, using a variance adaptor, a duration, pitch, and energy in to the sequence of speech sounds to obtain the synthetic speech utterances. Mukherjee discloses: adding, using a variance adaptor, a duration, pitch, and energy in to the sequence of speech sounds to obtain the synthetic speech utterances ([0038] According to embodiments, a variance adapter (not shown in FIG. 2) adds variance information to the phoneme encoding 222 (e.g., duration, pitch, and energy) [In view of the generated phonemes, i.e. target voice, of Song]). Song, Qin, are considered analogous art within text-to-speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin, further in view of Yoo, further in view of Mann to incorporate the teachings of Mukherjee, because of the novel way to generate synthesized speech from text without an additional emotional label improving run-time and user experience (Mukherjee, [0010]). Regarding claim 19, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the system of claim 12. Song further discloses: Wherein the processor is configured to train the text to speech model to generate the synthetic speech utterances by obtaining the vocabulary of the words and the sentences that are selected from the at least one data source ([Fig. 9], [0139] In a selective embodiment of the present disclosure, the server 100 may generate a target voice from a text by using a trained second artificial neural network, where, [0086] the memory 330 may store training data [Gathering training data from a memory, i.e. 
data source, indicates that any training sample can be automatically selected as training data]); and, converting the words and the sentences from the vocabulary to a sequence of sounds ([0127] Therefore, as shown in FIG. 9, the second artificial neural network 560 according to an embodiment of the present disclosure may refer to an artificial neural network that is trained (or learns) to output the target voice 580 corresponding to a text 570 as the text 570 is input). Song in view of Qin, further in view of Yoo, further in view of Mann does not disclose: adding, using a variance adaptor, a duration, pitch, and energy in to the sequence of speech sounds to obtain the synthetic speech utterances. Mukherjee discloses: adding, using a variance adaptor, a duration, pitch, and energy in to the sequence of speech sounds to obtain the synthetic speech utterances ([0038] According to embodiments, a variance adapter (not shown in FIG. 2) adds variance information to the phoneme encoding 222 (e.g., duration, pitch, and energy) [In view of the generated phonemes, i.e. target voice, of Song]). Song, Qin, are considered analogous art within text-to-speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin, further in view of Yoo, further in view of Mann to incorporate the teachings of Mukherjee, because of the novel way to generate synthesized speech from text without an additional emotional label improving run-time and user experience (Mukherjee, [0010]). Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Song in view of Qin, further in view of Yoo, further in view of Mann, further in view of Chen et al. (CN-111507311-A), hereinafter Chen. Regarding claim 9, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the processor-implemented method of claim 1. 
Qin further discloses: discarding the at least one video if the face detection model detects multiple speakers ([0128] Use the face detection model provided by the DLIB library to perform face detection on the video clips obtained in the first part one by one, delete the videos with two or more faces in each video clip). Song in view of Qin, further in view of Yoo, further in view of Mann does not disclose: discarding the at least one video if the face detection model detects no speakers in the at least one video. Chen discloses: discarding the at least one video if the face detection model detects no speakers in the at least one video ([0048] Videos lacking facial features are skipped and discarded without being used for training. [In view of the face-detection model of Song in view of Qin]). Song, Qin, are considered analogous art within face detection in video signals. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin, further in view of Yoo, further in view of Mann to incorporate the teachings of Chen, because of the novel way to improve the learning ability and video recognition capabilities of a video character recognition model by not considering video without characters (Chen, [0006]). Conclusion THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Hubert-Brierre (US-20160030744-A1) discloses “Embodiments presented herein are generally directed to techniques for compensating for tonal deafness experienced by a recipient of an auditory prosthesis. More specifically, an auditory prosthesis system includes an external device configured to generate a graphical representation that enables the recipient to compensate for reduced tonal perception associated with delivery of the stimulation signals representative of speech signals. The external device is configured to analyze received speech signals to determine vocal articulator movement of the speaker of the speech signals and/or emotion of the speaker. The external device is further configured to display one or more animated visual cues representative of the detected vocal articulator movement and/or emotion” (abstract). Specifically, there is disclosed “speech-to-lip movement synthesis” ([0044]) based on facial regions (Fig. 3A). See entire document. Johnson et al. (US-11368652-B1) discloses “Audio content and played frames may be received. The audio content may correspond to first video content. The played frames may be included in the first video content. The first video content may further include a replaced frame. The played frames and the replaced frame may include a face of a person. 
Location data may also be received that indicates locations of facial features of the face of the person within the replaced frame. A replacement frame may be generated, such as by rendering the facial features in the replacement frame based at least in part on the locations indicated by the location data and positions indicated by a portion of the audio content that is associated with the replaced frame. Second video content may be played including the played frames and the replacement frame. The replacement frame may replace the replaced frame in the second video content.” (abstract). See entire document. Any inquiry concerning this communication or earlier communications from the examiner should be directed to THEODORE JOHN WITHEY whose telephone number is (703)756-1754. The examiner can normally be reached Monday - Friday, 8am-5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached on (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/THEODORE WITHEY/ Examiner, Art Unit 2655
/ANDREW C FLANDERS/ Supervisory Patent Examiner, Art Unit 2655
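As an illustrative aside, the video selection rule at issue in the rejections of claims 1 and 9 (keep a video only when exactly one face appears in every frame; discard it when several faces appear, as cited from Qin [0128], or when none appears, as cited from Chen [0048]) can be sketched as below. This is a minimal sketch, not the applicant's or any reference's implementation: the per-frame face counts are assumed to come from some face detector that is not shown, and the function names and sample clips are hypothetical.

```python
def is_driving_face_video(face_counts):
    """True only if every frame contains exactly one detected face.

    `face_counts` holds one integer per frame (assumed detector output).
    An empty video is rejected because nothing can be confirmed about it.
    """
    return len(face_counts) > 0 and all(n == 1 for n in face_counts)

def select_driving_videos(videos):
    """Return the names of the videos that pass the single-face rule."""
    return [name for name, counts in videos.items()
            if is_driving_face_video(counts)]

# Hypothetical per-frame face counts for three clips.
clips = {
    "single_speaker": [1, 1, 1, 1],
    "two_people": [1, 2, 2, 1],   # discarded: multiple faces in some frames
    "empty_room": [0, 0, 1, 1],   # discarded: no face in some frames
}
kept = select_driving_videos(clips)
```

The rule is deliberately strict: a single frame with zero or several faces removes the whole video, which matches the "in all frames" wording of the claim.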

Prosecution Timeline

(4 earlier events not shown)
Sep 05, 2025: Applicant Interview (Telephonic)
Sep 05, 2025: Examiner Interview Summary
Sep 11, 2025: Response after Non-Final Action
Oct 10, 2025: Request for Continued Examination
Oct 16, 2025: Response after Non-Final Action
Nov 17, 2025: Non-Final Rejection mailed (§103)
Feb 17, 2026: Response Filed
Apr 14, 2026: Final Rejection mailed (§103, current)
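The reply periods stated in the Conclusion of the office action (a shortened statutory period of three months, a two-month mark that affects how extension fees are counted if an advisory action is late, and an outer limit of six months, all measured from the mailing date) can be worked out from the Apr 14, 2026 mailing date in the timeline. The sketch below is a rough calendar calculation only; the helper name `add_months` is hypothetical, and it assumes no adjustment for weekends, federal holidays, or a later advisory action.

```python
import calendar
from datetime import date

def add_months(start, months):
    """Return the same day-of-month `months` later, clamped to month end."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

mailed = date(2026, 4, 14)  # Final Rejection mailing date from the timeline

two_month_mark = add_months(mailed, 2)        # 2026-06-14
shortened_period_end = add_months(mailed, 3)  # 2026-07-14
statutory_limit = add_months(mailed, 6)       # 2026-10-14

for label, when in [("Two-month mark", two_month_mark),
                    ("Shortened statutory period ends", shortened_period_end),
                    ("Six-month statutory limit", statutory_limit)]:
    print(f"{label}: {when.isoformat()}")
```

The clamp in `add_months` only matters for mailing dates near the end of a month (for example, Nov 30 plus three months lands on the last day of February).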

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12632670: Natural Language Processing for Identifying Bias in a Span of Text (3y 2m to grant; granted May 19, 2026)
Patent 12591744: METHOD FOR TRAINING SEMANTIC REPRESENTATION MODEL, DEVICE AND STORAGE MEDIUM (4y 0m to grant; granted Mar 31, 2026)
Patent 12536994: APPARATUS FOR CLASSIFYING SOUNDS BASED ON NEURAL CODE IN SPIKING NEURAL NETWORK AND METHOD THEREOF (2y 9m to grant; granted Jan 27, 2026)
Patent 12475330: METHOD FOR IDENTIFYING NOISE SAMPLES, ELECTRONIC DEVICE, AND STORAGE MEDIUM (3y 1m to grant; granted Nov 18, 2025)
Patent 12417759: SPEECH RECOGNITION USING CADENCE PATTERNS (3y 1m to grant; granted Sep 16, 2025)
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 44%
With Interview: 95% (+51.3%)
Median Time to Grant: 2y 11m (~0m remaining)
PTA Risk: High
Based on 25 resolved cases by this examiner. Grant probability derived from career allowance rate.
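The headline percentages can be reproduced from the examiner statistics shown on this page (11 grants out of 25 resolved cases, and a +51.3% interview lift). The short sketch below assumes the lift is added in percentage points and capped at 100%; that reading is consistent with the displayed 95% figure but is an assumption, not a documented formula, and the function name is hypothetical.

```python
def allowance_rate(granted, resolved):
    """Share of resolved cases that ended in a grant."""
    return granted / resolved

base = allowance_rate(11, 25)            # career allowance rate
with_interview = min(base + 0.513, 1.0)  # assumed additive lift, capped at 100%

print(f"Grant probability: {base:.0%}")
print(f"With interview: {with_interview:.0%}")
```

With only 25 resolved cases behind the rate, a single additional grant or abandonment would move the base figure by about four percentage points.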