Last updated: April 18, 2026
Application No. 18/120,375
SYSTEM AND METHOD FOR AUTOMATICALLY GENERATING SYNTHETIC HEAD VIDEOS USING A MACHINE LEARNING MODEL

Final Rejection §103
Filed: Mar 11, 2023
Examiner: WITHEY, THEODORE JOHN
Art Unit: 2655
Tech Center: 2600 (Communications)
Assignee: International Institute Of Information Technology Hyderabad
OA Round: 4 (Final)
Interview: Optional
+46.9% interview lift. This examiner has a relatively high allow rate; a written response may suffice.
Based on 23 resolved cases, 2023–2026
Examiner Intelligence

WITHEY, THEODORE JOHN
Career Allow Rate: 44% of resolved cases granted (10 granted / 23 resolved), -18.5% vs TC avg
Interview Lift: +46.9% (resolved cases with interview vs. without)
Avg Prosecution (typical timeline): 2y 11m
Currently pending: 39
Statute-Specific Performance

§101: 22.0% (-18.0% vs TC avg)
§103: 48.6% (+8.6% vs TC avg)
§102: 17.1% (-22.9% vs TC avg)
§112: 12.0% (-28.0% vs TC avg)
Based on career data from 23 resolved cases
Office Action

§103
DETAILED ACTION
	This office action is in response to Applicant’s request for continued examination (RCE), received on 10/10/2025. Claims 1, 12, and 13 have been amended. Claims 8 and 20 have been cancelled. Claims 1, 3, 5-7, 9-13, 15, 17-19 are pending and have been considered.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 10/10/2025 has been entered.
 
Response to Arguments
Applicant's arguments filed 09/11/2025, see pgs. 14-17, have been fully considered but they are not persuasive.
With regard to Applicant’s argument that Song does not disclose detecting a single speaker by tiling boxes on each video frame, generating anchors and classifying them, see pgs. 14-15, the examiner notes that all these elements were previously mapped to other references, namely, Qin and Song. Song does not need to disclose these elements.
With regard to the “generating a plurality of anchors…wherein each anchor represents a location…” arguments against Qin, see pgs. 15-16, the examiner would like to note that Mann was previously used to cite these elements. It is unclear to the examiner why Qin is being referenced for anchor generation when Mann was the cited source for this element in the previous office action.
With regard to the “generating a plurality of anchors…” and “classifying, using the face detection model…” steps (previously elements of dependent claims 8/20 rejected under Mann), see pgs. 16-17, this rejection under Mann will now be used in conjunction with the plurality of tiled boxes of Yoo. Applicant makes no arguments against the mappings of Mann used in previous rejection of claims 8/20 apart from the “predefined threshold of frames” element, argued against below. Applicant only argues that Song/Qin does not disclose the elements previously cited to Mann. 
In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).

Considering Applicant’s arguments against Mann for not disclosing “the rate of change of lip landmarks between a predefined threshold of frames”, the examiner would like to note that these are also persuasive, but Yoo will be incorporated to resolve these deficiencies (see citation of reference further below in arguments). Specifically, Yoo discloses analysis of portions of video, wherein the portions consist of a number of frames, to gather feature data to be then used for making determinations about characteristics of faces in the video, [0057]-[0060]. See updated rejections below.
With further regard to Applicant’s argument that the examiner’s interpretation that “a frame inherently equals a threshold of one” is hindsight reconstruction, the examiner would like to refer to [0089] of Mann, which discloses processing a video as an image sequence on a frame-by-frame basis. This clearly indicates a predefined threshold of frames to be equal to one and/or the total number of video frames, wherein each frame is comprised of a plurality of images for landmark tracking, [0080] “tracking the actor’s facial landmarks for each frame”. Further, [0076] discloses 24 frames per second of footage. This indicates a predefined threshold of frames based on the length of video, i.e. the length of video in seconds multiplied by 24, wherein the resulting product can also represent a predefined threshold of frames. Applicant’s assertion that Mann does not disclose or suggest the use of anchors is unsupported by any reasoning, rationale, and/or indication as to why Mann fails to disclose the anchors as currently claimed.
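The frame-count arithmetic described in the paragraph above can be illustrated with a minimal sketch. The function name `frames_in_clip` and the 10-second example clip are hypothetical and chosen only for illustration; the 24 frames-per-second figure is the one cited from Mann [0076].

```python
def frames_in_clip(duration_seconds: float, fps: int = 24) -> int:
    """Total frame count for a clip: duration in seconds times frames per second."""
    if duration_seconds < 0 or fps <= 0:
        raise ValueError("duration must be non-negative and fps must be positive")
    return round(duration_seconds * fps)


# Hypothetical 10-second clip at the 24 fps rate cited from Mann [0076].
print(frames_in_clip(10))
```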
Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims patentably distinguishes them from the references.
In response to applicant's argument that the examiner's conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning.  But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from the applicant's disclosure, such a reconstruction is proper.  See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971).
Applicant’s arguments, see pgs. 18-20, filed 09/11/2025, with respect to the rejection(s) of claim(s) 1, 12, 13 under 35 U.S.C. 103 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Yoo et al. (US-20190163965-A1), hereinafter Yoo. With regard to the “tiling a plurality of boxes…” step, Yoo discloses a video input to be processed for facial feature tracking, wherein each individual feature of the face is boxed into its own bounding region, see Figs. 9-11. 
With regard to “(i)” (see pg. 18 of Applicant remarks), Yoo is being incorporated to introduce the concept of tiling a plurality of boxes, from which the anchors of Mann will be applied. 
With regard to “(ii)”, this will also be updated in view of Yoo to account for the newly amended anchors for tracking changes relative to lip-landmarks and determining a rate of change of the lip landmarks into the independent claims. See updated rejections below.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 1, 3, 6, 10-13, 15, 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Song et al. (US-20230023102-A1), hereinafter Song, in view of Qin et al. (CN-111798543-A), hereinafter Qin, further in view of Yoo et al. (US-20190163965-A1), hereinafter Yoo, further in view of Mann et al. (US-20230015971-A1), hereinafter Mann.

Regarding claim 1, Song discloses: a processor-implemented method ([Fig. 2, first processor 120]) for automatically generating at least one synthetic talking head video using a machine learning model ([0120] the first artificial neural network 520 according to an embodiment of the present disclosure may refer to an artificial neural network that is trained (or learns) to output a second lip image 543, which is an image generated by modifying the first lip image 542 according to the voice 531), the method comprising:
extracting at least one feature from each frame of at least one video that is extracted from at least one data source ([0073] generate lip-sync data including identification information regarding frames in the template video, generated lip images, and information of positions of the lip images in template frames [lip-sync data including lip position information tracks to feature data], where, [0135] the server 100 may obtain a template video by loading the template video from the memory 130 [Memory maps to a data source]);
	generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the at least one data source ([Fig. 9], [0139] In a selective embodiment of the present disclosure, the server 100 may generate a target voice from a text by using a trained second artificial neural network, where, [0086] the memory 330 may store training data [Gathering training data from a memory, i.e. data source, indicates that any training sample can be automatically selected as training data, text tracks to a vocabulary of words and sentences]); and,
	modifying lip movements of the single speaker that are originally present in the driving face video corresponding to the synthetic speech utterances ([0146] modifying the first lip image 542 according to a voice, as the voice 531 and the first lip image 542 are input, [Wherein the voice can be generated through input text 570, indicating synthesis, as disclosed in Fig. 9]).
	Song does not disclose:
	analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video.
	Qin discloses:
	analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video ([0128] Use the face detection model provided by the DLIB library to perform face detection on the video clips obtained in the first part one by one, delete the videos with two or more faces in each video clip [Deleting videos with more than one face detected indicates a similar determination to keep videos for processing when there is only one face, in view of the lip-sync method of Song]).
Song and Qin are considered analogous art within video frame analysis/modification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song to incorporate the teachings of Qin, because of the novel way to identify the number of people/faces within a video (Qin, [0096]).
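For illustration only, the single-face condition recited in the claim (number of faces equal to one in all frames) can be sketched as a filter over clips. This is a hedged sketch, not the method of Qin, Song, or the applicant: the names `is_single_face_clip` and `keep_single_face_clips` are hypothetical, and the face detector is passed in as an assumed callable `count_faces` rather than tied to any particular library.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")  # stands in for whatever a decoded video frame is


def is_single_face_clip(frames: Sequence[Frame],
                        count_faces: Callable[[Frame], int]) -> bool:
    """True only if the clip is non-empty and every frame has exactly one face."""
    return len(frames) > 0 and all(count_faces(f) == 1 for f in frames)


def keep_single_face_clips(clips: Sequence[Sequence[Frame]],
                           count_faces: Callable[[Frame], int]) -> List[Sequence[Frame]]:
    """Drop every clip in which any frame has zero faces or more than one face."""
    return [clip for clip in clips if is_single_face_clip(clip, count_faces)]


# Toy usage: each "frame" is just its face count, so the detector is the identity.
clips = [[1, 1, 1], [1, 2, 1], [1, 1]]
kept = keep_single_face_clips(clips, count_faces=lambda frame: frame)
```

In this toy usage the second clip is dropped because one of its frames contains two faces.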
Song in view of Qin does not disclose:
wherein the face-detection model detects a single speaker in the driving face video by,
tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, each box being used to refine detection of the single speaker in each frame.
Yoo discloses:
wherein the face-detection model detects a single speaker in the driving face video by,
tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, each box being used to refine detection of the single speaker in each frame ([Figs. 9A-C, Boxes A1-A3], [0175] the video pre-processing unit 210 sets the recognition target area A including the eye recognition area A.sub.1, the nose recognition area A.sub.2, and the mouth recognition area A.sub.3 based on the face element location information AL of the recognizable eyes, nose, and mouth of the user, [Wherein the boxes on the face of the speaker in Fig. 9 clearly have different scales and aspect ratios demonstrated through the differing box sizes and change based on positions, i.e. frame, of the person in the video]).
Song, Qin, and Yoo are considered analogous art within video feature monitoring/tracking, wherein the features being tracked are facial. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin to incorporate the teachings of Yoo, because of the novel way to implement multi-modal emotion recognition on received video, including mouth shape tracking to determine conversation states, improving the accuracy of emotion recognition (Yoo, [0005]-[0006]).
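As a reading aid for the claim language "tiling a plurality of boxes on each frame ... with different scales and aspect ratios", the following is a minimal sketch of grid-based box tiling of the kind commonly used by anchor-based detectors. It is an assumption-laden illustration, not the disclosure of Yoo or the applicant's implementation: the function name `tile_boxes`, the `stride` parameter, and the convention that aspect ratio means width divided by height are all hypothetical choices.

```python
from itertools import product
from math import sqrt
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height)


def tile_boxes(frame_w: int, frame_h: int, stride: int,
               scales: Sequence[float],
               aspect_ratios: Sequence[float]) -> List[Box]:
    """Place one box per (scale, aspect ratio) pair at every grid cell center.

    Each box has area scale**2 and width/height equal to the aspect ratio.
    """
    boxes: List[Box] = []
    for cy in range(stride // 2, frame_h, stride):
        for cx in range(stride // 2, frame_w, stride):
            for scale, ratio in product(scales, aspect_ratios):
                width = scale * sqrt(ratio)
                height = scale / sqrt(ratio)
                boxes.append((float(cx), float(cy), width, height))
    return boxes


# Hypothetical 64x64 frame, 32-pixel grid, two scales and three aspect ratios.
boxes = tile_boxes(64, 64, stride=32, scales=(16, 32), aspect_ratios=(0.5, 1.0, 2.0))
```

With these hypothetical settings there are four grid positions and six boxes per position.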
	Song in view of Qin, further in view of Yoo does not disclose:
generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video, 
wherein each anchor represents a location of the single speaker, a shape of the single speaker, and a size of the single speaker; and 
classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors, and
aligning each synthetic speech utterance with a region in the driving face video by detecting mouth movements in each frame of the driving face video by detecting lip landmarks and tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames.
Mann discloses:
generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video ([0065] Once each character's boxed SD image sequence has been created, a precision face tracking process is undertaken to accurately analyze the facial features of each character in their training material [Face tracking indicates an anchor location in order to be able to distinguish and track changes within the facial features of the frames, see tracking facial landmarks [0093]]);
wherein each anchor represents a location of the single speaker ([0090] the X-Y coordinates… of the face is determined), a shape of the single speaker ([0090] creating an effective “box” around the face [Boxing faces gives the speaker’s “shape” in the form of a box dependent on location and size]), and a size of the single speaker ([0090] the size of the face is determined); and,
classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors to detect the single speaker ([0093] precise face tracking is run to give the most accurate analysis of the face. This tracks the key points of the face (facial landmarks) in the 2D image [Tracking points of facial landmarks indicates stored data to be used for the tracking in view of the initial facial landmark determinations ([0060]), which further indicates a comparison of original pre-set anchors, i.e. initial facial landmark determinations, to the plurality of anchors, i.e. facial landmarks after initial location is determined, in order to track and classify changes to the talking head. In view of the face detection model of Song in view of Qin]), and,
detecting mouth movements in each frame of the driving face video by detecting lip landmarks ([0060] Once each boxed TD image sequence has been created, a precision face tracking process is undertaken to accurately analyze the facial features as each character speaks. The tracking points identified on the faces are referred to as facial landmarks, where, [0070] In combination with the FL dialogue edit, the material can be assessed for image quality, lip sync, [0076] The face-tracking finds facial landmarks—key points on the face—for every frame of footage [Using facial landmarks to track speakers, wherein the produced material can be assessed for lip sync, indicates lip landmarks as part of the facial landmarks]).
Song, Qin, Yoo, and Mann are considered analogous art within video analysis for dubbing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin, further in view of Yoo to incorporate the teachings of Mann, because of the novel way to track specific facial features using facial landmarks when going through an audio dubbing process for improved audio/visual synchronization (Mann, [0005]).
Yoo further discloses:
tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames ([Figs. 10A-B], [Figs. 12A-12G], [0077] a landmark detection module 213, a location adjustment module 214, [0082] the landmark detection module 213 can confirm the location information of a important face element such as a face, eyes, a mouth, a nose, and a forehead in the recognition target area A, [0083] The location adjustment module 214 can adjust the location based on the face element location information, [0057] the system analyzes one or more video frames of the video to obtain a first set of indices (face feature data) indicating characteristic of one or more face elements of a person appearing in the video, [0059] the system analyzes a video comprising a first portion (frames F.sub.1, F.sub.2), a second portion (frames F.sub.3, F.sub.4, F.sub.5) following the first portion and a third portion (frames F.sub.6, F.sub.7) following the second portion, [0179] the conversation state inference module 310 determines whether or not it is in a conversation state through the mouth recognition area A.sub.3 based on the mouth video data, [0068] The preliminary inference unit 300 can generate situation determination data P as to whether or not the user's situation changes according to a temporal sequence based on the video data DV, [0071] the preliminary inference unit 300 can generate the conversation determination data P.sub.1 for determining whether or not the user is in a conversation state based on the face feature data DF.sub.1, [Each facial element bounding box location tracks to an anchor for each element in view of the previously disclosed anchors of Mann (consider the A and A' wherein A’ represents the anchor of Figs. 10A-B). Further, analyzing portions of video containing a defined number of frames, i.e. 2 or 3, indicates this number of frames to be representative of a threshold of frames. Further still, determining changes in mouth shape to be associated with a speaking state indicates the speaking state is determined based on a rate of change between two portions of mouth facial element locations, i.e. including lip landmarks. Determining whether or not a user is in a conversation state based on changes in mouth feature data indicates a required rate of change measurement of the facial feature data to determine changes in mouth shape for identifying conversational states, i.e. no mouth changes (rate of change of facial features close to zero) means the user is not in a conversational state]).
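The notion of a "rate of change of the lip-landmarks between a predefined threshold of frames" can be made concrete with a small sketch. Everything here is an assumption made for illustration and is not taken from Yoo, Mann, or the application: mouth movement is reduced to a single hypothetical scalar (the distance between one upper-lip and one lower-lip landmark), the names `mouth_opening`, `rate_of_change`, and `is_moving` are invented, and `window` plays the role of the predefined number of frames.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mouth_opening(upper_lip: Point, lower_lip: Point) -> float:
    """Euclidean distance between an upper-lip and a lower-lip landmark."""
    dx = upper_lip[0] - lower_lip[0]
    dy = upper_lip[1] - lower_lip[1]
    return (dx * dx + dy * dy) ** 0.5


def rate_of_change(openings: Sequence[float], window: int) -> List[float]:
    """Average per-frame change in mouth opening measured across `window` frames."""
    if window < 1:
        raise ValueError("window must be at least one frame")
    return [(openings[i + window] - openings[i]) / window
            for i in range(len(openings) - window)]


def is_moving(openings: Sequence[float], window: int, min_rate: float) -> bool:
    """True if any windowed rate of change reaches `min_rate` in magnitude."""
    return any(abs(rate) >= min_rate for rate in rate_of_change(openings, window))


# Hypothetical per-frame mouth openings: opening for two frames, then held still.
rates = rate_of_change([0.0, 2.0, 4.0, 4.0, 4.0], window=2)
```

A rate near zero over every window corresponds to the examiner's "no mouth changes" case; a nonzero rate corresponds to detected movement.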
Song further discloses:
aligning each synthetic speech utterance with a region in the driving face video ([Figs. 10, 11, Generate Output Frame S645, S655], [0151] the user terminal 200 may generate an output frame by overlapping a lip image included in the first lip-sync data on the frame read in operation S644 based on the position information of the lip image included in the first lip-sync data (operation S645) and display the same [Obtaining a template video 610 and target voice 620 and then going through a frame-by-frame lip-syncing modification indicates the speech, i.e. lip-sync data, and driving face, i.e. lip image data (In view of the driving face video of Song in view of Qin), are aligned on a frame-by-frame basis as well considering the portion/frame analysis of Yoo]);
wherein portions of the synthetic speech utterance corresponding to regions with no detected mouth movements are padded with silence to determine an aligned speech utterance ([Fig. 7, 531, 542, 543], [0062] generate lip-sync data including frame identification information of a frame in the template video, the lip image, and position information regarding the lip image in a frame in the template video… [0080] lip-sync data including generated lip images [Generating lip sync data based on positions of lips in images of the video and associated synthesized speech 531, in view of the first lip image of Fig. 7, 542 which is “modified” to remain the same as shown in a second lip image 543, indicating a region with no mouth movement, further indicating the synthesized speech associated with the movement has been silence padded for this region, i.e. there is no speaking (see the end of the example signal 531), in view of the lip image 641 + target voice 620 used to generate frames of Figs. 10, 11 indicating a correspondence between the two pieces of data, in view of the previously disclosed synthesized speech and driving face video, for alignment on a frame, i.e. portion, basis. There is no limitation that this lip “modification” has to be at the end of a speaking interval, i.e. consider the plurality of frames being analyzed in Song, indicating zero padding in regions with no mouth movements. The example frame of Fig. 6 with no mouth movement could be representative of any time interval of the input signal 511]); and,
	generating, using the machine learning model ([In view of the previously disclosed machine learning model]), the at least one synthetic talking head video by providing the driving face video and the aligned speech utterances to a talking face video generator framework which modifies lip movements of a single speaker originally present in the driving face video ([0146] modifying the first lip image 542 according to a voice, as the voice 531 and the first lip image 542 are input… [0152] The server 100 according to an embodiment of the present disclosure may generate the lip-sync data for a plurality of template video frames on a frame-by-frame basis. In this case, the user terminal 200 may receive lip-sync data generated on the frame-by-frame basis and generate output frames for each lip-sync data [Generating lip-synced output frames, wherein that output comprises modification of lip movements, on a frame-by-frame basis indicates the combination of those frames will result in a synthetic talking head video, wherein the lip movements are modified based on synthetic speech and the video contains a single speaker, i.e. voice 531, as previously discussed]).
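The limitation that portions of the utterance corresponding to regions with no detected mouth movements are "padded with silence" can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of Song or of the application: audio is modeled as a plain list of samples, each video frame is assumed to span a fixed hypothetical `samples_per_frame`, silence is represented as zero-valued samples, and the per-frame movement flags are assumed to come from a separate detector.

```python
from typing import List, Sequence


def pad_with_silence(speech: Sequence[float],
                     moving: Sequence[bool],
                     samples_per_frame: int) -> List[float]:
    """Lay speech samples onto frames with mouth movement; emit zeros elsewhere.

    Speech left over after the last moving frame is dropped in this sketch.
    """
    aligned: List[float] = []
    cursor = 0  # index of the next unused speech sample
    for frame_is_moving in moving:
        if frame_is_moving:
            chunk = list(speech[cursor:cursor + samples_per_frame])
            cursor += samples_per_frame
            # Pad a short final chunk so every frame has the same sample count.
            chunk += [0.0] * (samples_per_frame - len(chunk))
        else:
            chunk = [0.0] * samples_per_frame
        aligned.extend(chunk)
    return aligned


# Hypothetical 4-frame clip: still, moving, moving, still; 2 samples per frame.
aligned = pad_with_silence([1.0, 2.0, 3.0], [False, True, True, False], 2)
```

The output always has one fixed-size block of samples per video frame, so the aligned utterance and the driving video have matching durations.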

		
	Regarding claim 3, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the processor-implemented method of claim 1.
	Song further discloses:
	generating a synthetic talking head videos database based on the at least one synthetic talking head video that is generated ([0086] the memory 330 may store training data for training an artificial neural network or data received from the service server 300, where, [0080] the service server 300 may be a device that receives lip-sync data including generated lip images from the server 100, generates output frames by using the lip-sync data, and provide the output frames to another device [Memory receiving modified lip-sync output from service server 300 indicates the memory 330 is a talking head video database]).


Regarding claim 6, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the processor-implemented method of claim 1.
	Song further discloses:
	wherein the at least one feature comprises at least one of:
	faces of speakers ([Fig. 12, 590]);
	lip structures ([Fig. 7, first lip image 542]); and,
	lip movements ([Fig. 7, second lip image 543] [Transitioning from a first to second lip image indicates a movement]).
	Qin further discloses:
	wherein the at least one feature comprises at least one of:
	number of identities ([0128] delete the videos with two or more faces in each video clip [Indicating a determination of the number of identities in order to know which videos to delete]);
	Mann further discloses:
wherein the at least one feature comprises at least one of:
	head-pose of speaker ([0095] The precise 3D model of the face takes account of head pose); and,
	head movements ([0086] The novel techniques revolve around the facial landmark tracking of the extracted face [Facial landmark tracking indicates the face is moving which is also indicative of a head moving]).
	The examiner would like to note that the disjunctive nature of the claim does not require a mapping to every element featured.

Regarding claim 10, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the processor-implemented method of claim 1.
	Song further discloses:
	retaining the background, the camera variations, and the head movements that correspond to the at least one video when the at least one synthetic head video is generated ([Fig. 12], [0157] the user terminal 200 may generate an individual output frame 711 by overlapping the lip image 544 generated by the server 100 on a specific frame 590 of the template video [In view of the video editing done of Fig. 12, it is demonstrated that only the lips are altered, indicating the background, camera variations, and head movements that correspond to talking head 590 are retained]).

Regarding claim 11, Song in view of Qin, further in view of Yoo, further in view of Mann discloses: the processor-implemented method of claim 1.
Song further discloses:
training the machine learning model using historical driving face videos ([0119] The first artificial neural network 520 according to an embodiment of the present disclosure may refer to a neural network that is trained (or learns) correlations between a first lip image, a voice, and a second lip image included in each of the plurality of pieces of training data 510 [Training a lip-syncing model to select a plurality of alternate lip images to replace first lip images indicates the replacement images and associated training is based on historical driving face videos for lip-syncing, in view of the driving face video of Fig. 12]); and,
training the machine learning model using historical synthetic speech utterances ([0126] The second artificial neural network 560 according to an embodiment of the present disclosure may refer to a neural network that is trained (or learns) correlations between a text included in each of the plurality of training data 550 and a target voice corresponding to a reading sound of the corresponding voice [Training a model to learn a correlation between text and voice data indicates that the training is composed of historical speech utterances and their associated texts]); and,
training the machine learning model using historical native accents ([0127] an artificial neural network that is trained (or learns) to output the target voice 580 corresponding to a text 570 [Generating target voices without reference to translation anywhere within Song indicates that the training to generate a target voice based on a text is performed on a historical native accent, i.e. that which is familiar to the user of a system, e.g. in a native sense, so they can analyze accuracy of generated lip movements corresponding to the synthesized speech. Accuracy can only be determined if you know what the correct lip-sync is supposed to look like, requiring a native accent]).
Mann further discloses:
training the machine learning model using historical non-native accents ([0106] once trained, the audio of a performance in one language can be fed to the neural network and a translation, with matching performance, nuances, and audio characteristics—is generated in the secondary language [The ability of a system to produce matching performance of a generated translation in another language indicates the training was done using non-native accents, i.e. those corresponding to the other languages text is being translated into, in view of the text input of Song]).

Regarding claim 12, Song discloses: a system for automatically generating at least one synthetic talking head video using a machine learning model ([0120] the first artificial neural network 520 according to an embodiment of the present disclosure may refer to an artificial neural network that is trained (or learns) to output a second lip image 543, which is an image generated by modifying the first lip image 542 according to the voice 531), the system comprising:
a device processor ([Fig. 2, first processor 120]); and,
a non-transitory computer-readable storage medium storing one or more sequences of instructions ([0167] may be implemented in the form of a computer program that can be executed through various components on a computer, such a computer program may be recorded on a computer readable medium…a hard disk, a floppy disk), which when executed by the device processor, causes:
extracting at least one feature from each frame of at least one video that is extracted from at least one data source ([0073] generate lip-sync data including identification information regarding frames in the template video, generated lip images, and information of positions of the lip images in template frames [lip-sync data including lip position information tracks to feature data], where, [0135] the server 100 may obtain a template video by loading the template video from the memory 130 [Memory maps to a data source]);
	generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the at least one data source ([Fig. 9], [0139] In a selective embodiment of the present disclosure, the server 100 may generate a target voice from a text by using a trained second artificial neural network, where, [0086] the memory 330 may store training data [Gathering training data from a memory, i.e. data source, indicates that any training sample can be automatically selected as training data, text tracks to a vocabulary of words and sentences]); and,
	modifying lip movements of the single speaker that are originally present in the driving face video corresponding to the synthetic speech utterances ([0146] modifying the first lip image 542 according to a voice, as the voice 531 and the first lip image 542 are input, [Wherein the voice can be generated through input text 570, indicating synthesis, as disclosed in Fig. 9]).
	Song does not disclose:
	analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video.
	Qin discloses:
	analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video ([0128] Use the face detection model provided by the DLIB library to perform face detection on the video clips obtained in the first part one by one, delete the videos with two or more faces in each video clip [Deleting videos with more than one face detected indicates a similar determination to keep videos for processing when there is only one face, in view of the lip-sync method of Song]).
Song and Qin are considered analogous art within video frame analysis/modification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song to incorporate the teachings of Qin, because of the novel way to identify the number of people/faces within a video and discard videos accordingly (Qin, [0096]).
Song in view of Qin does not disclose:
wherein the face-detection model detects a single speaker in the driving face video by,
tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, each box being used to refine detection of the single speaker in each frame.
Yoo discloses:
wherein the face-detection model detects a single speaker in the driving face video by,
tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, each box being used to refine detection of the single speaker in each frame ([Figs. 9A-C, Boxes A1-A3], [0175] the video pre-processing unit 210 sets the recognition target area A including the eye recognition area A.sub.1, the nose recognition area A.sub.2, and the mouth recognition area A.sub.3 based on the face element location information AL of the recognizable eyes, nose, and mouth of the user, [Wherein the boxes on the face of the speaker in Fig. 9 clearly have different scales and aspect ratios demonstrated through the differing box sizes and change based on positions, i.e. frame, of the person in the video]).
 Song, Qin, and Yoo are considered analogous art within video feature monitoring/tracking, wherein the features being tracked are facial. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin to incorporate the teachings of Yoo, because of the novel way to implement multi-modal emotion recognition to received video, including mouth shape tracking to determine conversation states, improving the accuracy of emotion recognition (Yoo, [0005]-[0006]).
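For illustration only, the limitation "tiling a plurality of boxes on each frame ... with different scales and aspect ratios" can be read against a generic anchor-box grid of the kind used in common object detectors. The following is a minimal Python sketch under that assumption. It is not taken from the application, Song, Qin, or Yoo, and the function name, stride, scales, and aspect ratios are hypothetical values chosen for the example.

```python
from itertools import product


def tile_anchor_boxes(frame_w, frame_h, stride, scales, aspect_ratios):
    """Tile candidate boxes over a frame on a regular grid.

    Hypothetical helper: each box is (cx, cy, w, h), where w / h equals the
    aspect ratio and w * h equals scale ** 2.
    """
    boxes = []
    # Grid centers are spaced `stride` pixels apart, starting half a stride in.
    centers_y = range(stride // 2, frame_h, stride)
    centers_x = range(stride // 2, frame_w, stride)
    for cy, cx in product(centers_y, centers_x):
        # At every grid center, emit one box per (scale, aspect ratio) pair.
        for scale, ratio in product(scales, aspect_ratios):
            w = scale * ratio ** 0.5
            h = scale / ratio ** 0.5
            boxes.append((cx, cy, w, h))
    return boxes


# Assumed example values: a 64x64 frame, 2 scales, 3 aspect ratios.
boxes = tile_anchor_boxes(64, 64, stride=32, scales=(16, 32),
                          aspect_ratios=(0.5, 1.0, 2.0))
```

In this sketch every grid position carries the same set of box shapes, which is the sense in which the boxes are "tiled"; a detector would then score and refine each box, a step that is outside the scope of the example.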
	Song in view of Qin, further in view of Yoo does not disclose:
generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video, 
wherein each anchor represents a location of the single speaker, a shape of the single speaker, and a size of the single speaker; and 
classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors, and
aligning each synthetic speech utterance with a region in the driving face video by detecting mouth movements in each frame of the driving face video by detecting lip landmarks and tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames.
Mann discloses:
generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video ([0065] Once each character's boxed SD image sequence has been created, a precision face tracking process is undertaken to accurately analyze the facial features of each character in their training material [Face tracking indicates an anchor location in order to be able to distinguish and track changes within the facial features of the frames, see tracking facial landmarks [0093]]);
wherein each anchor represents a location of the single speaker ([0090] the X-Y coordinates… of the face is determined), a shape of the single speaker ([0090] creating an effective “box” around the face [Boxing faces gives the speaker’s “shape” in the form of a box dependent on location and size]), and a size of the single speaker ([0090] the size of the face is determined); and,
classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors to detect the single speaker ([0093] precise face tracking is run to give the most accurate analysis of the face. This tracks the key points of the face (facial landmarks) in the 2D image [Tracking points of facial landmarks indicates stored data to be used for the tracking in view of the initial facial landmark determinations ([0060]), which further indicates a comparison of original pre-set anchors, i.e. initial facial landmark determinations, to the plurality of anchors, i.e. facial landmarks after initial location is determined, in order to track and classify changes to the talking head. In view of the face detection model of Song in view of Qin]), and,
detecting mouth movements in each frame of the driving face video by detecting lip landmarks ([0060] Once each boxed TD image sequence has been created, a precision face tracking process is undertaken to accurately analyze the facial features as each character speaks. The tracking points identified on the faces are referred to as facial landmarks, where, [0070] In combination with the FL dialogue edit, the material can be assessed for image quality, lip sync, [0076] The face-tracking finds facial landmarks—key points on the face—for every frame of footage [Using facial landmarks to track speakers, wherein the produced material can be assessed for lip sync, indicates lip landmarks as part of the facial landmarks]).
Song, Qin, Yoo, and Mann are considered analogous art within video analysis for dubbing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin, further in view of Yoo to incorporate the teachings of Mann, because of the novel way to track specific facial features using facial landmarks when going through an audio dubbing process for improved audio/visual synchronization (Mann, [0005]).
Yoo further discloses:
tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames ([Figs. 10A-B], [Figs. 12A-12G], [0077] a landmark detection module 213, a location adjustment module 214, [0082] the landmark detection module 213 can confirm the location information of a important face element such as a face, eyes, a mouth, a nose, and a forehead in the recognition target area A, [0083] The location adjustment module 214 can adjust the location based on the face element location information, [0057] the system analyzes one or more video frames of the video to obtain a first set of indices (face feature data) indicating characteristic of one or more face elements of a person appearing in the video, [0059] the system analyzes a video comprising a first portion (frames F.sub.1, F.sub.2), a second portion (frames F.sub.3, F.sub.4, F.sub.5) following the first portion and a third portion (frames F.sub.6, F.sub.7) following the second portion, [0179] the conversation state inference module 310 determines whether or not it is in a conversation state through the mouth recognition area A.sub.3 based on the mouth video data, [0068] The preliminary inference unit 300 can generate situation determination data P as to whether or not the user's situation changes according to a temporal sequence based on the video data DV, [0071] the preliminary inference unit 300 can generate the conversation determination data P.sub.1 for determining whether or not the user is in a conversation state based on the face feature data DF.sub.1, [Each facial element bounding box location tracks to an anchor for each element in view of the previously disclosed anchors of Mann (consider the A and A' wherein A’ represents the anchor of Figs. 10A-B). Further, analyzing portions of video containing a defined number of frames, i.e. 2 or 3, indicates this number of frames to be representative of a threshold of frames. Further still, determining changes in mouth shape to be associated with a speaking state indicates the speaking state is determined based on a rate of change between two portions of mouth facial element locations, i.e. including lip landmarks. Determining whether or not a user is in a conversation state based on changes in mouth feature data indicates a required rate of change measurement of the facial feature data to determine changes in mouth shape for identifying conversational states, i.e. no mouth changes (rate of change of facial features close to zero) means the user is not in a conversational state]).
Song further discloses:
aligning each synthetic speech utterance with a region in the driving face video ([Figs. 10, 11, Generate Output Frame S645, S655], [0151] the user terminal 200 may generate an output frame by overlapping a lip image included in the first lip-sync data on the frame read in operation S644 based on the position information of the lip image included in the first lip-sync data (operation S645) and display the same [Obtaining a template video 610 and target voice 620 and then going through a frame-by-frame lip-syncing modification indicates the speech, i.e. lip-sync data, and driving face, i.e. lip image data (In view of the driving face video of Song in view of Qin), are aligned on a frame-by-frame basis as well considering the portion/frame analysis of Yoo]);
wherein portions of the synthetic speech utterance corresponding to regions with no detected mouth movements are padded with silence to determine an aligned speech utterance ([Fig. 7, 531, 542, 543], [0062] generate lip-sync data including frame identification information of a frame in the template video, the lip image, and position information regarding the lip image in a frame in the template video… [0080] lip-sync data including generated lip images [Generating lip sync data based on positions of lips in images of the video and associated synthesized speech 531, in view of the first lip image of Fig. 7, 542 which is “modified” to remain the same as shown in a second lip image 543, indicating a region with no mouth movement, further indicating the synthesized speech associated with the movement has been silence padded for this region, i.e. there is no speaking (see the end of the example signal 531), in view of the lip image 641 + target voice 620 used to generate frames of Figs. 10, 11 indicating a correspondence between the two pieces of data, in view of the previously disclosed synthesized speech and driving face video, for alignment on a frame, i.e. portion, basis. There is no limitation that this lip “modification” has to be at the end of a speaking interval, i.e. consider the plurality of frames being analyzed in Song, indicating zero padding in regions with no mouth movements. The example frame of Fig. 6 with no mouth movement could be representative of any time interval of the input signal 511]); and,
	generating, using the machine learning model ([In view of the previously disclosed machine learning model]), the at least one synthetic talking head video by providing the driving face video and the aligned speech utterances to a talking face video generator framework which modifies lip movements of a single speaker originally present in the driving face video ([0146] modifying the first lip image 542 according to a voice, as the voice 531 and the first lip image 542 are input… [0152] The server 100 according to an embodiment of the present disclosure may generate the lip-sync data for a plurality of template video frames on a frame-by-frame basis. In this case, the user terminal 200 may receive lip-sync data generated on the frame-by-frame basis and generate output frames for each lip-sync data [Generating lip-synced output frames, wherein that output comprises modification of lip movements, on a frame-by-frame basis indicates the combination of those frames will result in a synthetic talking head video, wherein the lip movements are modified based on synthetic speech and the video contains a single speaker, i.e. voice 531, as previously discussed]).
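For illustration only, the claimed alignment steps (determining a rate of change of lip landmarks over a window of frames, then padding the speech with silence where no mouth movement is detected) can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the applicant's implementation and not the method of Song, Yoo, or Mann. It assumes each frame has already been reduced to a single mouth-opening value measured relative to an anchor box, and the function names, window size, threshold, and sample counts are hypothetical.

```python
def moving_frames(openings, window, threshold):
    """Flag frames whose recent lip motion exceeds a threshold.

    `openings` holds one assumed mouth-opening value per frame. The rate of
    change at frame i is the mean absolute step over the preceding `window`
    frame-to-frame transitions.
    """
    flags = []
    for i in range(len(openings)):
        start = max(0, i - window)
        steps = [abs(openings[j + 1] - openings[j]) for j in range(start, i)]
        rate = sum(steps) / len(steps) if steps else 0.0
        flags.append(rate > threshold)
    return flags


def pad_with_silence(samples, flags, samples_per_frame):
    """Place speech samples in moving frames and zeros (silence) elsewhere."""
    out = []
    pos = 0
    for moving in flags:
        if moving:
            chunk = samples[pos:pos + samples_per_frame]
            pos += samples_per_frame
            # Zero-fill the tail if the speech runs out mid-frame.
            chunk = chunk + [0.0] * (samples_per_frame - len(chunk))
        else:
            chunk = [0.0] * samples_per_frame
        out.extend(chunk)
    return out


# Assumed example: the mouth is still, moves for a few frames, then stops.
openings = [0.0, 0.0, 0.0, 0.3, 0.1, 0.4, 0.4, 0.4, 0.4]
flags = moving_frames(openings, window=2, threshold=0.05)
aligned = pad_with_silence([1.0] * 8, flags, samples_per_frame=2)
```

With these assumed inputs, frames 3 through 6 are flagged as moving, so the eight speech samples land in those four frames and every other frame is filled with silence.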

Regarding claim 13, Song discloses: one or more non-transitory computer-readable storage medium storing the one or more sequence of instructions ([0167] may be implemented in the form of a computer program that can be executed through various components on a computer, such a computer program may be recorded on a computer readable medium…a hard disk, a floppy disk), which when executed by one or more processors ([Fig. 2, first processor 120]), causes to perform a method for automatically generating at least one synthetic talking head video using a machine learning model ([0120] the first artificial neural network 520 according to an embodiment of the present disclosure may refer to an artificial neural network that is trained (or learns) to output a second lip image 543, which is an image generated by modifying the first lip image 542 according to the voice 531), the method comprising:
extracting at least one feature from each frame of at least one video that is extracted from at least one data source ([0073] generate lip-sync data including identification information regarding frames in the template video, generated lip images, and information of positions of the lip images in template frames [lip-sync data including lip position information tracks to feature data], where, [0135] the server 100 may obtain a template video by loading the template video from the memory 130 [Memory maps to a data source]);
	generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the at least one data source ([Fig. 9], [0139] In a selective embodiment of the present disclosure, the server 100 may generate a target voice from a text by using a trained second artificial neural network, where, [0086] the memory 330 may store training data [Gathering training data from a memory, i.e. data source, indicates that any training sample can be automatically selected as training data, text tracks to a vocabulary of words and sentences]); and,
	modifying lip movements of the single speaker that are originally present in the driving face video corresponding to the synthetic speech utterances ([0146] modifying the first lip image 542 according to a voice, as the voice 531 and the first lip image 542 are input, [Wherein the voice can be generated through input text 570, indicating synthesis, as disclosed in Fig. 9]).
	Song does not disclose:
	analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video.
	Qin discloses:
	analysing, using a face-detection model, the at least one feature to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the at least one video ([0128] Use the face detection model provided by the DLIB library to perform face detection on the video clips obtained in the first part one by one, delete the videos with two or more faces in each video clip [Deleting videos with more than one face detected indicates a similar determination to keep videos for processing when there is only one face, in view of the lip-sync method of Song]).
Song and Qin are considered analogous art within video frame analysis/modification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song to incorporate the teachings of Qin, because of the novel way to identify the number of people/faces within a video and discard videos accordingly (Qin, [0096]).
Song in view of Qin does not disclose:
wherein the face-detection model detects a single speaker in the driving face video by,
tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, each box being used to refine detection of the single speaker in each frame.
Yoo discloses:
wherein the face-detection model detects a single speaker in the driving face video by,
tiling a plurality of boxes on each frame of the at least one video with different scales and aspect ratios, each box being used to refine detection of the single speaker in each frame ([Figs. 9A-C, Boxes A1-A3], [0175] the video pre-processing unit 210 sets the recognition target area A including the eye recognition area A.sub.1, the nose recognition area A.sub.2, and the mouth recognition area A.sub.3 based on the face element location information AL of the recognizable eyes, nose, and mouth of the user, [Wherein the boxes on the face of the speaker in Fig. 9 clearly have different scales and aspect ratios demonstrated through the differing box sizes and change based on positions, i.e. frame, of the person in the video]).
 Song, Qin, and Yoo are considered analogous art within video feature monitoring/tracking, wherein the features being tracked are facial. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Qin to incorporate the teachings of Yoo, because of the novel way to implement multi-modal emotion recognition to received video, including mouth shape tracking to determine conversation states, improving the accuracy of emotion recognition (Yoo, [0005]-[0006]).
	Song in view of Qin, further in view of Yoo does not disclose:
generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video, 
wherein each anchor represents a location of the single speaker, a shape of the single speaker, and a size of the single speaker; and 
classifying, using the face detection model, the plurality of anchors by correlating with a series of pre-set anchors, and
aligning each synthetic speech utterance with a region in the driving face video by detecting mouth movements in each frame of the driving face video by detecting lip landmarks and tracking changes of the lip landmarks relative to the plurality of anchors across successive frames to determine a rate of change of the lip-landmarks between a predefined threshold of frames.
Mann discloses:
generating a plurality of anchors based on the plurality of boxes that are tiled on each frame of the at least one video ([0065] Once each character's boxed SD image sequence has been created, a precision face tracking process is undertaken to accurately analyze the facial features of each character in their training material [Face tracking indicates an anchor location in order to be able to distinguish and track changes within the facial features of the frames, s
Prosecution Timeline

Mar 11, 2023: Application Filed
Mar 10, 2025: Non-Final Rejection §103
Jun 09, 2025: Response Filed
Jul 10, 2025: Final Rejection §103
Sep 05, 2025: Examiner Interview Summary
Sep 05, 2025: Applicant Interview (Telephonic)
Sep 11, 2025: Response after Non-Final Action
Oct 10, 2025: Request for Continued Examination
Oct 16, 2025: Response after Non-Final Action
Nov 12, 2025: Non-Final Rejection §103
Feb 17, 2026: Response Filed
Apr 08, 2026: Final Rejection §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/655,770, Patent 12591744: METHOD FOR TRAINING SEMANTIC REPRESENTATION MODEL, DEVICE AND STORAGE MEDIUM (2y 5m to grant, granted Mar 31, 2026)
18/113,192, Patent 12536994: APPARATUS FOR CLASSIFYING SOUNDS BASED ON NEURAL CODE IN SPIKING NEURAL NETWORK AND METHOD THEREOF (2y 5m to grant, granted Jan 27, 2026)
17/956,558, Patent 12475330: METHOD FOR IDENTIFYING NOISE SAMPLES, ELECTRONIC DEVICE, AND STORAGE MEDIUM (2y 5m to grant, granted Nov 18, 2025)
17/813,944, Patent 12417759: SPEECH RECOGNITION USING CADENCE PATTERNS (2y 5m to grant, granted Sep 16, 2025)
17/986,417, Patent 12412580: Sound Extraction System and Sound Extraction Method (2y 5m to grant, granted Sep 09, 2025)
Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 44%
With Interview (+46.9%): 90%
Median Time to Grant: 2y 11m
PTA Risk: High
Based on 23 resolved cases by this examiner. Grant probability derived from career allow rate.