DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 10/13/2025 was filed in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Response to Arguments
1. Regarding the rejection of claims 1-20 under 35 U.S.C. § 101, Applicant's arguments filed 01/13/2026 have been fully considered but they are not persuasive.
Applicant first argues that claims 1-20 are patent eligible under Step 2A Prong 2 analysis as the claims integrate the judicial exception into a practical application via an improvement in the functioning of a computer or to a technology or technical field, specifically an improvement to speech technology and speech processing technology. The Examiner respectfully disagrees with this argument.
Under the Step 2A Prong 2 analysis, additional elements which were not grouped under a judicial exception in the Step 2A Prong 1 analysis are considered alone or in combination to determine whether the claim integrates the judicial exception into a practical application. In amended claim 1, judicial exceptions are recited that fall under the mental processes grouping of abstract ideas. Under the broadest reasonable interpretation of the claim, claim 1 contains several limitations which can be performed mentally. The first obtaining step can be performed mentally by a person watching and listening to another person. The first determination step can be performed mentally by a person reading the speaker's lips and writing down the pronunciation data they observe. The conversion step can be performed mentally by a person writing down information representing the speech they hear the person speaking. The synthesis step and the outputting step can each be performed by a person, as a person can speak in a synchronized manner in order to replace corrupted audio portions with speech that contains the same words as the words in the corrupted portion. The only additional limitation remaining in the claim after the Step 2A Prong 1 analysis is the recitation “being presented by a display device”. This limitation is recited broadly and amounts to mere instructions to implement the judicial exception using a generic computer, which does not integrate the judicial exception into a practical application as it does not impose any meaningful limits on practicing the abstract idea. Claim 1 does not contain any further limitations which might integrate the judicial exception into a practical application. Further limitations specifying how these steps are performed in a technical manner, such that they reflect a technical improvement, would be needed in order to integrate the judicial exception into a practical application.
Hence, Applicant’s arguments are not persuasive.
2. Regarding the rejection of claims 1-20 under 35 U.S.C. § 102, Applicant's arguments filed 01/13/2026 have been fully considered but they are not persuasive.
Applicant argues that Kumar fails to teach or suggest “synthesizing, based on the encoded data, the speech to obtain synthesized speech” and “outputting the synthesized speech while playing or rendering the visual data, the synthesized speech is synchronized with a mouth movement associated with the person such that a segment of the audio data determined as a corrupted portion is replaced by a corresponding subset of the synthesized speech being output while the mouth movement, being presented by a display device, of the person is speaking corresponding words associated with a same plurality of words of the subset of the synthesized speech”, as recited in amended claim 1. The Examiner respectfully disagrees with these arguments. Under the broadest reasonable interpretation, Kumar teaches these additional limitations. First, Kumar teaches generation of digital audio corresponding to an electronic transcription representing speech portrayed within a digital video (para. 0019, para. 0066, para. 0071; see also Fig. 5, 508), and utilizing this generated audio content to repair noisy/missing/corrupted audio feeds during video conferences or to restore lost audio of a character speaking (see para. 0072-0073). Generating digital audio corresponding to missing/lost/noisy speech based on the electronic transcription reads on the BRI of “synthesizing, based on the encoded data, the speech to obtain synthesized speech”. Furthermore, Kumar teaches that this generated digital audio is aligned with digital frames of a digital video such that timestamps of the digital audio align with the corresponding frames of the digital video, and that the generated audio content is overlaid within the digital video in a target playback time indicated by the electronic transcription (para. 0071). By aligning the timestamps of the generated digital audio and the digital video, the digital audio is played back such that the synthesized speech is in sync with the mouth motion (para. 0071). These features read on the BRI of “outputting the synthesized speech while playing or rendering the visual data, the synthesized speech is synchronized with a mouth movement associated with the person”. In addition, this generated digital audio can be used for corrupted speech portions, such as in conference calls (para. 0072-0073), to repair the corrupted audio feeds. Kumar also teaches that a display device can be used for playing digital videos (para. 0037). Finally, Kumar teaches that this generated digital audio is based on the electronic transcription (para. 0066); therefore, the synthesized speech corresponds to the same plurality of words that were included in the electronic transcription. Therefore, Kumar reads on the BRI of the remaining limitations “such that a segment of the audio data determined as a corrupted portion is replaced by a corresponding subset of the synthesized speech being output while the mouth movement, being presented by a display device, of the person is speaking corresponding words associated with a same plurality of words of the subset of the synthesized speech”.
Hence, Applicant’s arguments are not persuasive.
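For illustration of the timestamp-alignment concept discussed in the response above, the following is a minimal sketch of overlaying synthesized audio onto a corrupted interval so that it plays back over the same video frames. It is not drawn from Kumar or from the claims; all function names, variable names, and values are hypothetical.

```python
# Illustrative sketch only (not Kumar's implementation and not the claimed
# method): replacing a corrupted audio interval with synthesized audio that
# shares the same time base, so the replacement plays during the same video
# playback timestamps. All names and values are hypothetical.
import numpy as np

def overlay_synthesized_audio(original_audio, synthesized_audio, sample_rate,
                              corrupt_start_s, corrupt_end_s):
    """Replace samples in [corrupt_start_s, corrupt_end_s) with the
    corresponding samples of the synthesized audio track."""
    start = int(corrupt_start_s * sample_rate)
    end = int(corrupt_end_s * sample_rate)
    repaired = original_audio.copy()
    # Both tracks are assumed to share one time base, so identical sample
    # indices correspond to identical video playback timestamps.
    repaired[start:end] = synthesized_audio[start:end]
    return repaired

# Hypothetical usage: 16 kHz audio with a corrupted portion from 2.0 s to 3.5 s.
sr = 16000
original = np.zeros(10 * sr)
synthesized = np.ones(10 * sr)
repaired = overlay_synthesized_audio(original, synthesized, sr, 2.0, 3.5)
```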
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
3. Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1, “A method” is recited, which is directed to one of the four statutory categories of invention (process) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes, which fall into the category of abstract ideas (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite mental processes:
obtaining audio-visual data comprising visual data associated with a person and audio data associated with the person: a person listens and watches a person speaking, writing down what they hear and what they see
determining, based on the visual data, pronunciation data associated with speech by the person: a person reads the lips of the speaking person, and determines what they are pronouncing, and writes down pronunciation data using pen and paper
converting the speech to encoded data: a person writes down encoded data representing the speech they heard using pen and paper
and synthesizing, based on the encoded data, the speech to obtain synthesized speech: a person reproduces speech based on the encoded data
outputting the synthesized speech while playing or rendering the visual data, the synthesized speech is synchronized with a mouth movement associated with the person such that a segment of the audio data determined as a corrupted portion is replaced by a corresponding subset of the synthesized speech being output while the mouth movement…of the person is speaking corresponding words associated with a same plurality of words of the subset of the synthesized speech: a person speaks while visual data (a video) is played, can speak at the same rate/time as the corresponding mouth movements in the video for segments determined by the person to be corrupted (e.g., noisy or hard to hear), and speaks the same words that the person in the original video would have been speaking.
Claim 1 does not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). The only additional limitation is “being presented by a display device”. This limitation is recited at a high level of generality and amounts to mere instructions to implement the judicial exception using a generic computer, which does not integrate the judicial exception into a practical application as it does not impose any meaningful limits on practicing the abstract idea. Therefore, the claim does not integrate the judicial exception into a practical application. Accordingly, claim 1 is directed to an abstract idea (Step 2A: YES).
Claim 1 does not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the only additional limitation amounts to mere instructions to implement the judicial exception using a generic computer, which does not amount to significantly more than the judicial exception as it does not provide an inventive concept. Therefore, the claim does not amount to significantly more than the judicial exception. Therefore, claim 1 is not patent eligible.
Regarding dependent claims 2-9, “The method” is recited, which is directed to one of the four statutory categories of invention (process) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite further mental processes, which fall into the category of abstract ideas (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite further mental processes:
Claim 2:
the outputting of the subset of synthesized speech is simultaneous with the mouth movement: a person outputs the speech and synchronizes its presentation with the movement of the person
Claim 2 contains the additional limitation “while being displayed by the display device”, which amounts to mere instructions to implement the judicial exception using a generic computer.
Claim 3:
further comprising in response to determining a corrupted portion of the audio data: determining a duration in which the corrupted portion occurs; and outputting the synthesized speech during the duration: a person produces the speech during a duration where it is determined that corrupted speech occurs.
Claim 3 contains no additional limitations.
Claim 4:
wherein the determining the pronunciation data associated with the speech comprises determining, by a first model using the visual data, a visual cue of the person: a person uses a model as a set of rules to determine a visual cue based on the visual data, using pen and paper.
Claim 4 contains no additional limitations.
Claim 5:
wherein converting the speech to the encoded data comprises converting, by the first model, the visual cue into the pronunciation data: a person uses the first model to convert visual cues into pronunciation data, using pen and paper.
Claim 5 contains no additional limitations.
Claim 6:
wherein the synthesizing the speech comprises generating the synthesized speech based on the pronunciation data determined based on the visual cue: a person produces the speech using the pronunciation data determined from the visual cue.
Claim 6 contains no additional limitations.
Claim 7:
wherein the converting the speech to the encoded data comprises converting, by a second model, the speech to the encoded data, wherein the second model is trained to encode the visual cue by assigning a code to the visual cue: a person uses a second model as a set of rules to assign a code to the visual cue to obtain the encoded data.
Claim 7 contains no additional limitations.
Claim 8:
removing background noise from the audio data, wherein the determining the pronunciation data is based on the visual data that comprises visual cues associated with the person: a person removes background noise from the audio data they write down, and determines pronunciation based on visual data comprising visual cues.
Claim 8 contains no additional limitations.
Claim 9:
wherein the visual cues comprise one or more mouth movements associated with the person: a person determines visual cues comprising mouth movements of a person.
Claim 9 contains no additional limitations.
Claims 2-9 do not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). The only additional limitations amount to mere instructions to implement the judicial exception using a generic computer, which do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract idea. Therefore, the claims do not integrate the judicial exception into a practical application. Accordingly, claims 2-9 are directed to an abstract idea (Step 2A: YES).
Claims 2-9 do not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the only additional limitations amount to mere instructions to implement the judicial exception using a generic computer, which do not amount to significantly more than the judicial exception as they do not provide an inventive concept. Therefore, the claims do not amount to significantly more than the judicial exception. Therefore, claims 2-9 are not patent eligible.
Regarding claim 10, “A device” is recited, which is directed to one of the four statutory categories of invention (machine) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes, which fall into the category of abstract ideas (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite mental processes:
obtain audio-visual data comprising visual data associated with a person and audio data associated with the person: a person listens and watches a person speaking, writing down what they hear and what they see
determine, by utilizing a first model, pronunciation data associated with speech by the person, based on the visual data: a person reads the lips of the speaking person, and uses a first model to determine what they are pronouncing, and writes down pronunciation data using pen and paper
convert, by utilizing a second model, the speech to encoded data: a person writes down encoded data representing the speech they heard using pen and paper and a second model.
synthesize, by utilizing the second model, the speech to obtain synthesized speech based on the encoded data: a person reproduces speech based on the encoded data using the second model.
output the synthesized speech while playing or rendering the visual data, the synthesized speech is synchronized with a mouth movement associated with the person such that a segment of the audio data determined as a corrupted portion is replaced by a corresponding subset of the synthesized speech being output while the mouth movement…of the person is speaking corresponding words associated with a same plurality of words of the subset of the synthesized speech: a person speaks while visual data (a video) is played, can speak at the same rate/time as the corresponding mouth movements in the video for segments determined by the person to be corrupted (e.g., noisy or hard to hear), and speaks the same words that the person in the original video would have been speaking.
Claim 10 does not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). The only additional limitations are “one or more processors”, “at least one memory storing instructions that when executed by the one or more processors, cause the device to”, and “being presented by a display device”. These limitations amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, these limitations do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract idea. Therefore, the claim does not integrate the judicial exception into a practical application. Accordingly, claim 10 is directed to an abstract idea (Step 2A: YES).
Claim 10 does not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the additional limitations amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, the additional limitations do not amount to significantly more, as mere instructions to implement the judicial exception using a generic computer do not provide an inventive concept. Therefore, the claim does not amount to significantly more than the judicial exception. Therefore, claim 10 is not patent eligible.
Regarding dependent claims 11-17, “The device” is recited, which is directed to one of the four statutory categories of invention (machine) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite further mental processes, which fall into the category of abstract ideas (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite further mental processes:
Claim 11:
present…the audio-visual data and the synthesized speech and wherein the synthesized speech being synchronized with the mouth movement of the person is synchronized while the audio-visual data is presented: a person outputs the speech and synchronizes its presentation with the movement of the person
Claim 11 contains the additional limitation “present, by the display and a speaker”, which amounts to mere instructions to implement the judicial exception using a generic computer.
Claim 12:
determine a duration in which the corrupted portion occurs; and present the synthesized speech during the duration: a person produces the speech during a duration where it is determined that corrupted speech occurs.
Claim 12 contains no additional limitations.
Claim 13:
determine the pronunciation data associated with the speech based on determining, by the first model using the visual data, a visual cue associated with the person: a person uses the first model to determine a visual cue based on the visual data, using pen and paper.
Claim 13 contains no additional limitations.
Claim 14:
convert the speech to the encoded data based on converting, by the first model, the visual cue into the pronunciation data: a person uses the first model to convert visual cues into pronunciation data, using pen and paper.
Claim 14 contains no additional limitations.
Claim 15:
generate the synthesized speech based on the pronunciation data determined based on the visual cue: a person produces the speech using the pronunciation data determined from the visual cue.
Claim 15 contains no additional limitations.
Claim 16:
wherein the second model is trained to encode the visual cue by assigning a code to the visual cue: a person uses a second model as a set of rules to assign a code to the visual cue to obtain the encoded data.
Claim 16 contains no additional limitations.
Claim 17:
remove background noise from the audio data; determine the pronunciation data based on the visual data that comprises visual cues of the person: a person removes background noise from the audio data they write down, and determines pronunciation based on visual data comprising visual cues.
Claim 17 contains no additional limitations.
Claims 11-17 do not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). As discussed above, the only additional limitations amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, the additional limitations do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract idea. Therefore, the claims do not integrate the judicial exception into a practical application. Accordingly, claims 11-17 are directed to an abstract idea (Step 2A: YES).
Claims 11-17 do not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the additional limitations amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, the additional limitations do not amount to significantly more as mere instructions to implement the judicial exception using a generic computer do not provide an inventive concept. Therefore, the claims do not amount to significantly more than the judicial exception. Therefore, claims 11-17 are not patent eligible.
Regarding claim 18, “A non-transitory computer-readable medium” is recited, which is directed to one of the four statutory categories of invention (article of manufacture) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes, which fall into the category of abstract ideas (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite mental processes:
obtaining audio-visual data comprising visual data associated with a person and audio data associated with the person: a person listens and watches a person speaking, writing down what they hear and what they see
determining, by utilizing a first model, pronunciation data associated with speech by the person, based on the visual data: a person reads the lips of the speaking person, and uses a first model to determine what they are pronouncing, and writes down pronunciation data using pen and paper
converting, by utilizing a second model, the speech to encoded data: a person writes down encoded data representing the speech they heard using pen and paper and a second model.
synthesizing, by utilizing the second model, the speech based on the encoded data to obtain synthesized speech: a person reproduces speech based on the encoded data using the second model.
outputting the synthesized speech while playing or rendering the visual data, the synthesized speech is synchronized with a mouth movement associated with the person such that a segment of the audio data determined as a corrupted portion is replaced by a corresponding subset of the synthesized speech being output while the mouth movement…of the person is speaking corresponding words associated with a same plurality of words of the subset of the synthesized speech: a person speaks while visual data (a video) is played, can speak at the same rate/time as the corresponding mouth movements in the video for segments determined by the person to be corrupted (e.g., noisy or hard to hear), and speaks the same words that the person in the original video would have been speaking.
Claim 18 does not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). The only additional limitations are “A non-transitory computer-readable medium storing instructions that, when executed, cause” and “being presented by a display device”. These limitations amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, these limitations do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract idea. Therefore, the claim does not integrate the judicial exception into a practical application. Accordingly, claim 18 is directed to an abstract idea (Step 2A: YES).
Claim 18 does not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the additional limitations amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, the additional limitations do not amount to significantly more, as mere instructions to implement the judicial exception using a generic computer do not provide an inventive concept. Therefore, the claim does not amount to significantly more than the judicial exception. Therefore, claim 18 is not patent eligible.
Regarding dependent claims 19-20, “The non-transitory computer-readable medium” is recited, which is directed to one of the four statutory categories of invention (article of manufacture) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite further mental processes, which fall into the category of abstract ideas (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite further mental processes:
Claim 19:
the outputting, by utilizing the second model, of the synthesized speech as computer-generated synthesized speech; and outputting the computer-generated synthesized speech while playing or rendering the audio-visual data, wherein the computer-generated synthesized speech is synchronized with one or more mouth movements associated with the person: a person outputs the speech and synchronizes its presentation with the audio-visual data and with the movement of the person
Claim 19 contains no additional limitations.
Claim 20:
determining a duration in which the corrupted portion occurs; and outputting the synthesized speech during the duration: a person produces the speech during a duration where it is determined that corrupted speech occurs.
Claim 20 contains no additional limitations.
Claims 19-20 do not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). As discussed above, the only additional limitations amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, the additional limitations do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract idea. Therefore, the claims do not integrate the judicial exception into a practical application. Accordingly, claims 19-20 are directed to an abstract idea (Step 2A: YES).
Claims 19-20 do not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the additional limitations amount to mere instructions to implement the judicial exception using a generic computer. Even when viewed in combination, the additional limitations do not amount to significantly more as mere instructions to implement the judicial exception using a generic computer do not provide an inventive concept. Therefore, the claims do not amount to significantly more than the judicial exception. Therefore, claims 19-20 are not patent eligible.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
4. Claims 1-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Kumar & Krishnamurthy (US 2023/0252993 A1, hereinafter Kumar).
Regarding claim 1, Kumar discloses A method comprising: obtaining audio-visual data (para. 0040 “To illustrate, the server device(s) 102 train a GAN, receive a digital video from the client device 110a…”; para. 0004 “In this manner, the disclosed systems can utilize unsupervised machine learning to efficiently, accurately, and flexibly recognize speech from digital video (with or without accompanying digital audio).”) comprising visual data associated with a person (para. 0028 “As further used herein, the term “digital video” refers to a series of visual representations (e.g., a series of digital images that, when rendered in sequence, portray a character speaking).) and audio data associated with the person (para. 0073 “Furthermore, in one or more instances, the visual speech recognition system 106 utilizes the visual speech recognition framework to generate audio content for digital videos that have noisy, missing, and/or corrupted audio.”); determining, based on the visual data, pronunciation data associated with speech by the person (Fig. 3; para. 0052 “Furthermore, as illustrated in FIG. 3, the visual speech recognition system 106 generates visual speech representation clusters 310 from the visual speech representations 306 utilizing a clustering model 308.”; para. 0054 “Moreover, as illustrated in FIG. 3, the visual speech recognition system 106 further determines segmented visemic speech representations 314 from the visual speech representation clusters 310.”); converting the speech to encoded data (para. 0062 “In relation to FIG. 4, the GAN 404 is trained to generate possible sequences of visemes from input visual speech embeddings (e.g., the segmented visemic speech representations). The GAN 404 (i.e., a generator G) maps the sequence of segmented visemic speech representations S=s.sub.1, s.sub.2, . . . , s.sub.T to viseme representations V=v, v.sub.2, . . . , v.sub.L.”); synthesizing, based on the encoded data, the speech to obtain synthesized speech (para. 0066 “Furthermore, as previously mentioned and as shown in FIG. 4, the visual speech recognition system 106 utilizes the GAN 404 to generate a viseme sequence for the segmented visemic speech representations and subsequently generate an electronic transcript 408 that represents speech portrayed within the digital video. Indeed, FIG. 5 illustrates the visual speech recognition system 106 utilizing a language decoder 504 to generate an electronic transcription 506 for a digital video 510 from a viseme sequence 502 that is generated for the digital video 510. Additionally, as shown in FIG. 5, in some embodiments, the visual speech recognition system 106 further utilizes the electronic transcription 506 to generate digital audio content 508 for the speech portrayed within the digital video 510.”; para. 0071 “In some cases, upon generating an electronic transcription for a digital video, the visual speech recognition system 106 also generates digital audio content for the digital video. In some cases, the visual speech recognition system 106 utilizes a computerized speech model (e.g., artificial intelligence, speech reading algorithm) to create audio content for the text of the electronic transcription.”) and outputting the synthesized speech while playing or rendering the visual data (synthesized speech (generated digital audio content based on the transcription) is aligned with the digital frames of digital video: para. 
0071 “Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript. The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”; para. 0038 “To access the functionalities of the visual speech recognition system 106 (as described above), in one or more implementations, a user interacts with one of the digital graphics applications 112a-112n on the client devices 110a-110n. For instance, the digital graphics applications 112a-112n include one or more software applications installed on the client devices 110a-110n (e.g., to capture, play, and/or modify digital videos in accordance with one or more implementations herein).”; Such synthesis is used, for example, in online conferences: para. 0073 “In some cases, the visual speech recognition system 106 utilizes the visual speech recognition framework to repair (via electronic transcriptions and audio content) noisy, missing, and/or corrupted audio feeds of video conference calls.”), the synthesized speech is synchronized with a mouth movement associated with the person (synthesized speech is synchronized with mouth motion (digital audio aligned with timestamps of video corresponding to when particular words are mouthed by the user utilizing viseme representations): para. 0071 “In some cases, upon generating an electronic transcription for a digital video, the visual speech recognition system 106 also generates digital audio content for the digital video…Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript. The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”) such that a segment of the audio data determined as a corrupted portion (para. 0073 “Furthermore, in one or more instances, the visual speech recognition system 106 utilizes the visual speech recognition framework to generate audio content for digital videos that have noisy, missing, and/or corrupted audio.”) is replaced by a corresponding subset of the synthesized speech being output while the mouth movement, being presented by a display device (para. 0037 “For example, the client devices 110a-110n perform functions such as, but not limited to, capturing digital videos, editing digital videos, playing digital videos, and/or requesting speech recognition for digital videos.”), of the person is speaking corresponding words associated with a same plurality of words of the subset of the synthesized speech (digital audio (synthesized speech) to replace the corrupted audio is based on an electronic transcription of words spoken in the corrupted audio; thus, the digital synthesized audio contains the same plurality of words that are being mouthed by the user: para. 0071).
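For context only, the following minimal sketch outlines the general data flow described in the cited portions of Kumar (digital frames, visual speech representations, viseme sequence, electronic transcription, synthesized audio). Every function name and placeholder value below is a hypothetical stand-in, not Kumar's implementation.

```python
# Illustrative sketch of the general data flow described in the cited portions
# of Kumar (frames -> visual speech representations -> viseme sequence ->
# electronic transcription -> synthesized audio). Placeholder logic only.
from typing import List

def extract_visual_representations(frames: List[str]) -> List[int]:
    # Stand-in for an image encoder producing one representation per frame;
    # here each frame is reduced to a toy integer feature.
    return [len(frame) % 5 for frame in frames]

def map_to_viseme_sequence(representations: List[int]) -> List[str]:
    # Stand-in for the clustering/segmentation/generator stages that yield visemes.
    codebook = {0: "v_pbm", 1: "v_td", 2: "v_fv", 3: "v_aa", 4: "v_iy"}
    return [codebook[r] for r in representations]

def decode_transcription(visemes: List[str]) -> str:
    # Stand-in for a language decoder that turns visemes into words.
    return " ".join(visemes)

def synthesize_audio(transcription: str) -> List[float]:
    # Stand-in for a text-to-speech step producing audio samples.
    return [float(ord(c)) for c in transcription]

frames = ["frame_a", "frame_bb", "frame_ccc"]
visemes = map_to_viseme_sequence(extract_visual_representations(frames))
audio = synthesize_audio(decode_transcription(visemes))
```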
Regarding claim 2, Kumar discloses the outputting of the subset of synthesized speech is simultaneous with the mouth movement (para. 0071 “Then, in some implementations, the visual speech recognition system 106 overlays the audio content within the digital video in a target playback time as indicated by the electronic transcription.”; para. 0071 “Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript. The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”), while being displayed by the display device, of the person (para. 0037 “For example, the client devices 110a-110n perform functions such as, but not limited to, capturing digital videos, editing digital videos, playing digital videos, and/or requesting speech recognition for digital videos.”; para. 0072-0073).
Regarding claim 3, Kumar discloses in response to determining a corrupted portion of the audio data (para. 0073 “Furthermore, in one or more instances, the visual speech recognition system 106 utilizes the visual speech recognition framework to generate audio content for digital videos that have noisy, missing, and/or corrupted audio.”): determining a duration in which the corrupted portion occurs (para. 0071 “Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript.”); and outputting the synthesized speech during the duration (para. 0071 “The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”).
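As a purely illustrative sketch of the claim 3 mapping above, the following shows one hypothetical way to determine a duration in which a corrupted portion occurs, here using a simple low-energy heuristic that neither Kumar nor the claims specify; the returned interval is where synthesized speech would then be output.

```python
# Illustrative sketch only: locating the duration of a corrupted (here,
# near-silent) portion of an audio signal. The heuristic, names, and values
# are hypothetical and not taken from Kumar or the claims.
import numpy as np

def find_corrupted_interval(audio: np.ndarray, sample_rate: int,
                            frame_len: int = 1024, threshold: float = 1e-4):
    """Return (start_s, end_s) of the first contiguous low-energy region."""
    n_frames = len(audio) // frame_len
    flags = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        flags.append(float(np.mean(frame ** 2)) < threshold)
    try:
        first = flags.index(True)
    except ValueError:
        return None  # no corrupted portion detected
    last = first
    while last + 1 < n_frames and flags[last + 1]:
        last += 1
    return (first * frame_len / sample_rate, (last + 1) * frame_len / sample_rate)

# Hypothetical usage: audio with a silent (treated as corrupted) region
# from roughly 1.0 s to 2.0 s.
sr = 16000
audio = np.random.randn(4 * sr) * 0.1
audio[sr:2 * sr] = 0.0
interval = find_corrupted_interval(audio, sr)
```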
Regarding claim 4, Kumar discloses wherein the determining the pronunciation data associated with the speech comprises determining, by a first model using the visual data, a visual cue of the person (para. 0050 “In one or more embodiments, the visual speech recognition system 106 utilizes an image encoder to generate deep embeddings (e.g., deep visual speech representations) from a digital video frame (e.g., an image) that capture facial features portrayed in the image.”; para. 0052 “Furthermore, as illustrated in FIG. 3, the visual speech recognition system 106 generates visual speech representation clusters 310 from the visual speech representations 306 utilizing a clustering model 308.”; para. 0054 “Moreover, as illustrated in FIG. 3, the visual speech recognition system 106 further determines segmented visemic speech representations 314 from the visual speech representation clusters 310. In particular, as shown in FIG. 3, the visual speech recognition system 106 utilizes a segmentation model 312 to identify speech segment boundaries from the visual speech representation clusters 310 to identify groupings of digital video frames.”).
Regarding claim 5, Kumar discloses wherein converting the speech to the encoded data comprises converting, by the first model, the visual cue into the pronunciation data (para. 0054 “Moreover, as illustrated in FIG. 3, the visual speech recognition system 106 further determines segmented visemic speech representations 314 from the visual speech representation clusters 310. In particular, as shown in FIG. 3, the visual speech recognition system 106 utilizes a segmentation model 312 to identify speech segment boundaries from the visual speech representation clusters 310 to identify groupings of digital video frames.”; para. 0062 “In relation to FIG. 4, the GAN 404 is trained to generate possible sequences of visemes from input visual speech embeddings (e.g., the segmented visemic speech representations). The GAN 404 (i.e., a generator G) maps the sequence of segmented visemic speech representations S=s.sub.1, s.sub.2, . . . , s.sub.T to viseme representations V=v, v.sub.2, . . . , v.sub.L.”).
Regarding claim 6, Kumar discloses wherein the synthesizing the speech comprises generating the synthesized speech based on the pronunciation data determined based on the visual cue (para. 0066 “Furthermore, as previously mentioned and as shown in FIG. 4, the visual speech recognition system 106 utilizes the GAN 404 to generate a viseme sequence for the segmented visemic speech representations and subsequently generate an electronic transcript 408 that represents speech portrayed within the digital video. Indeed, FIG. 5 illustrates the visual speech recognition system 106 utilizing a language decoder 504 to generate an electronic transcription 506 for a digital video 510 from a viseme sequence 502 that is generated for the digital video 510. Additionally, as shown in FIG. 5, in some embodiments, the visual speech recognition system 106 further utilizes the electronic transcription 506 to generate digital audio content 508 for the speech portrayed within the digital video 510.”; para. 0071 “In some cases, upon generating an electronic transcription for a digital video, the visual speech recognition system 106 also generates digital audio content for the digital video. In some cases, the visual speech recognition system 106 utilizes a computerized speech model (e.g., artificial intelligence, speech reading algorithm) to create audio content for the text of the electronic transcription.”).
Regarding claim 7, Kumar discloses wherein the converting the speech to the encoded data comprises converting, by a second model, the speech to the encoded data, wherein the second model is trained to encode the visual cue by assigning a code to the visual cue (para. 0062 “In relation to FIG. 4, the GAN 404 is trained to generate possible sequences of visemes from input visual speech embeddings (e.g., the segmented visemic speech representations). The GAN 404 (i.e., a generator G) maps the sequence of segmented visemic speech representations S=s.sub.1, s.sub.2, . . . , s.sub.T to viseme representations V=v, v.sub.2, . . . , v.sub.L.”).
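As an illustration of assigning a code to a visual cue, the following minimal sketch uses a nearest-centroid lookup over a toy codebook. The centroids and feature values are invented for illustration and are not taken from Kumar.

```python
# Illustrative sketch only: assigning a discrete code to a visual-cue feature
# vector by nearest-centroid lookup, one hypothetical way to "assign a code to
# the visual cue." All data below is invented for illustration.
import numpy as np

def assign_code(visual_cue: np.ndarray, centroids: np.ndarray) -> int:
    """Return the index of the centroid closest to the visual-cue vector."""
    distances = np.linalg.norm(centroids - visual_cue, axis=1)
    return int(np.argmin(distances))

centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy codebook
cue = np.array([0.9, 0.1])  # toy mouth-shape feature vector
code = assign_code(cue, centroids)  # -> 1
```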
Regarding claim 8, Kumar discloses removing background noise from the audio data (para. 0073 “Furthermore, in one or more instances, the visual speech recognition system 106 utilizes the visual speech recognition framework to generate audio content for digital videos that have noisy, missing, and/or corrupted audio. In some cases, the visual speech recognition system 106 utilizes the visual speech recognition framework to repair (via electronic transcriptions and audio content) noisy, missing, and/or corrupted audio feeds of video conference calls.”), wherein the determining the pronunciation data is based on the visual data that comprises visual cues associated with the person (para. 0050 “In one or more embodiments, the visual speech recognition system 106 utilizes an image encoder to generate deep embeddings (e.g., deep visual speech representations) from a digital video frame (e.g., an image) that capture facial features portrayed in the image.”; para. 0052 “Furthermore, as illustrated in FIG. 3, the visual speech recognition system 106 generates visual speech representation clusters 310 from the visual speech representations 306 utilizing a clustering model 308.”; para. 0054 “Moreover, as illustrated in FIG. 3, the visual speech recognition system 106 further determines segmented visemic speech representations 314 from the visual speech representation clusters 310. In particular, as shown in FIG. 3, the visual speech recognition system 106 utilizes a segmentation model 312 to identify speech segment boundaries from the visual speech representation clusters 310 to identify groupings of digital video frames.”).
Regarding claim 9, Kumar discloses wherein the visual cues comprise one or more mouth movements associated with the person (para. 0032 “As used herein, the term “viseme” refers to a decomposable unit of representation for visual speech. In particular, in one or more embodiments, the term “viseme” refers to a decomposable unit that represents one or more audible phonemes from a distinct mouth movement corresponding to a particular or a group of sounds. Indeed, in one or more embodiments, a viseme includes a visual movement (e.g., mouth shape or movement) that represents one or more phonemes. In some cases, a viseme represents a distinct mouth movement that maps to one or more particular phonemes (e.g., a first viseme that maps to the phonemes p, b, m and a second viseme that maps to the phonemes t, d). In addition, as used herein, the term “viseme sequence” refers to a set of one or more visemes in a specific order that represent a phrase. For example, the phrases “I love you” and “Elephant Juice” have a viseme sequence of “rstfi.” In certain instances, a viseme sequence can include one or more visemes in combination with one or more phonemes that represent the one or more visemes.”).
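For illustration of the viseme-to-phoneme relationship quoted from para. 0032 of Kumar (one viseme mapping to the phonemes p, b, m and another to t, d), the following is a minimal sketch; the dictionary structure and viseme labels are hypothetical.

```python
# Illustrative sketch only: a viseme-to-phoneme lookup consistent with the
# example quoted from para. 0032 of Kumar. Labels and structure are hypothetical.
VISEME_TO_PHONEMES = {
    "viseme_1": {"p", "b", "m"},   # bilabial closure mouth movement
    "viseme_2": {"t", "d"},        # alveolar mouth movement
}

def phonemes_for_viseme(viseme: str) -> set:
    """Return the set of phonemes represented by a given viseme label."""
    return VISEME_TO_PHONEMES.get(viseme, set())

assert phonemes_for_viseme("viseme_1") == {"p", "b", "m"}
```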
Regarding claim 10, Kumar discloses A device (Fig. 10), comprising: one or more processors (Fig. 10, 1002); and at least one memory storing instructions (Fig. 10, 1004; para. 0124 “The memory 1004 may be used for storing data, metadata, and programs for execution by the processor(s)…”), that when executed by the one or more processors, cause the device to (para. 0111 “In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions, from a non-transitory computer-readable medium, (e.g., memory), and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.”): obtain audio-visual data comprising visual data associated with a person (para. 0028 “As further used herein, the term “digital video” refers to a series of visual representations (e.g., a series of digital images that, when rendered in sequence, portray a character speaking).) and audio data associated with the person (para. 0073 “Furthermore, in one or more instances, the visual speech recognition system 106 utilizes the visual speech recognition framework to generate audio content for digital videos that have noisy, missing, and/or corrupted audio.”); determine, by utilizing a first model, pronunciation data associated with speech by the person, based on the visual data (para. 0050 “In one or more embodiments, the visual speech recognition system 106 utilizes an image encoder to generate deep embeddings (e.g., deep visual speech representations) from a digital video frame (e.g., an image) that capture facial features portrayed in the image.”; para. 0052 “Furthermore, as illustrated in FIG. 3, the visual speech recognition system 106 generates visual speech representation clusters 310 from the visual speech representations 306 utilizing a clustering model 308.”; para. 0054 “Moreover, as illustrated in FIG. 3, the visual speech recognition system 106 further determines segmented visemic speech representations 314 from the visual speech representation clusters 310. In particular, as shown in FIG. 3, the visual speech recognition system 106 utilizes a segmentation model 312 to identify speech segment boundaries from the visual speech representation clusters 310 to identify groupings of digital video frames.”); convert, by utilizing a second model, the speech to encoded data (para. 0062 “In relation to FIG. 4, the GAN 404 is trained to generate possible sequences of visemes from input visual speech embeddings (e.g., the segmented visemic speech representations). The GAN 404 (i.e., a generator G) maps the sequence of segmented visemic speech representations S=s.sub.1, s.sub.2, . . . , s.sub.T to viseme representations V=v, v.sub.2, . . . , v.sub.L.”); synthesize, by utilizing the second model, the speech to obtain synthesized speech based on the encoded data (para. 0066 “Furthermore, as previously mentioned and as shown in FIG. 4, the visual speech recognition system 106 utilizes the GAN 404 to generate a viseme sequence for the segmented visemic speech representations and subsequently generate an electronic transcript 408 that represents speech portrayed within the digital video. Indeed, FIG. 
5 illustrates the visual speech recognition system 106 utilizing a language decoder 504 to generate an electronic transcription 506 for a digital video 510 from a viseme sequence 502 that is generated for the digital video 510. Additionally, as shown in FIG. 5, in some embodiments, the visual speech recognition system 106 further utilizes the electronic transcription 506 to generate digital audio content 508 for the speech portrayed within the digital video 510.”; para. 0071 “In some cases, upon generating an electronic transcription for a digital video, the visual speech recognition system 106 also generates digital audio content for the digital video. In some cases, the visual speech recognition system 106 utilizes a computerized speech model (e.g., artificial intelligence, speech reading algorithm) to create audio content for the text of the electronic transcription.”) and output the synthesized speech while playing or rendering the visual data (synthesized speech (generated digital audio content based on the transcription) is aligned with the digital frames of digital video: para. 0071 “Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript. The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”; para. 0038 “To access the functionalities of the visual speech recognition system 106 (as described above), in one or more implementations, a user interacts with one of the digital graphics applications 112a-112n on the client devices 110a-110n. For instance, the digital graphics applications 112a-112n include one or more software applications installed on the client devices 110a-110n (e.g., to capture, play, and/or modify digital videos in accordance with one or more implementations herein).”; Such synthesis is used, for example, in online conferences: para. 0073 “In some cases, the visual speech recognition system 106 utilizes the visual speech recognition framework to repair (via electronic transcriptions and audio content) noisy, missing, and/or corrupted audio feeds of video conference calls.”), the synthesized speech is synchronized with a mouth movement associated with the person (synthesized speech is synchronized with mouth motion (digital audio aligned with timestamps of video corresponding to when particular words are mouthed by the user utilizing viseme representations): para. 0071 “In some cases, upon generating an electronic transcription for a digital video, the visual speech recognition system 106 also generates digital audio content for the digital video…Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript. The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. 
Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”) such that a segment of the audio data determined as a corrupted portion (para. 0073 “Furthermore, in one or more instances, the visual speech recognition system 106 utilizes the visual speech recognition framework to generate audio content for digital videos that have noisy, missing, and/or corrupted audio.”) is replaced by a corresponding subset of the synthesized speech being output while the mouth movement, being presented by a display device (para. 0037 “For example, the client devices 110a-110n perform functions such as, but not limited to, capturing digital videos, editing digital videos, playing digital videos, and/or requesting speech recognition for digital videos.”), of the person is speaking corresponding words associated with a same plurality of words of the subset of the synthesized speech (digital audio (synthesized speech) to replace the corrupted audio is based on an electronic transcription of words spoken in the corrupted audio; thus, the digital synthesized audio contains the same plurality of words that are being mouthed by the user: para. 0071).
Regarding claim 11, Kumar discloses present, by the display device and a speaker (para. 0094 “For example, the digital video manager 802 identifies, stores, and/or displays digital videos as described above (e.g., in relation to FIGS. 1 and 2).”; para. 0127 “The I/O interfaces 1008 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers…”), the audio-visual data and the synthesized speech (para. 0071 “Then, in some implementations, the visual speech recognition system 106 overlays the audio content within the digital video in a target playback time as indicated by the electronic transcription.”); and wherein the synthesized speech being synchronized with the mouth movement of the person is synchronized while the audio-visual data is presented (para. 0071 “Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript. The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”; para. 0072-0073).
Regarding claim 12, Kumar discloses determine a duration in which the corrupted portion occurs (para. 0071 “Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript.”); and present the synthesized speech during the duration (para. 0071 “The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”).
Regarding claim 13, Kumar discloses determine the pronunciation data associated with the speech based on determining, by the first model utilizing the visual data, a visual cue associated with the person (para. 0050 “In one or more embodiments, the visual speech recognition system 106 utilizes an image encoder to generate deep embeddings (e.g., deep visual speech representations) from a digital video frame (e.g., an image) that capture facial features portrayed in the image.”; para. 0052 “Furthermore, as illustrated in FIG. 3, the visual speech recognition system 106 generates visual speech representation clusters 310 from the visual speech representations 306 utilizing a clustering model 308.”; para. 0054 “Moreover, as illustrated in FIG. 3, the visual speech recognition system 106 further determines segmented visemic speech representations 314 from the visual speech representation clusters 310. In particular, as shown in FIG. 3, the visual speech recognition system 106 utilizes a segmentation model 312 to identify speech segment boundaries from the visual speech representation clusters 310 to identify groupings of digital video frames.”).
Regarding claim 14, Kumar discloses convert the speech to the encoded data based on converting, by the first model, the visual cue into the pronunciation data (para. 0054 “Moreover, as illustrated in FIG. 3, the visual speech recognition system 106 further determines segmented visemic speech representations 314 from the visual speech representation clusters 310. In particular, as shown in FIG. 3, the visual speech recognition system 106 utilizes a segmentation model 312 to identify speech segment boundaries from the visual speech representation clusters 310 to identify groupings of digital video frames.”; para. 0062 “In relation to FIG. 4, the GAN 404 is trained to generate possible sequences of visemes from input visual speech embeddings (e.g., the segmented visemic speech representations). The GAN 404 (i.e., a generator G) maps the sequence of segmented visemic speech representations S=s.sub.1, s.sub.2, . . . , s.sub.T to viseme representations V=v, v.sub.2, . . . , v.sub.L.”).
Regarding claim 15, Kumar discloses generate the synthesized speech based on the pronunciation data determined based on the visual cue (para. 0066 “Furthermore, as previously mentioned and as shown in FIG. 4, the visual speech recognition system 106 utilizes the GAN 404 to generate a viseme sequence for the segmented visemic speech representations and subsequently generate an electronic transcript 408 that represents speech portrayed within the digital video. Indeed, FIG. 5 illustrates the visual speech recognition system 106 utilizing a language decoder 504 to generate an electronic transcription 506 for a digital video 510 from a viseme sequence 502 that is generated for the digital video 510. Additionally, as shown in FIG. 5, in some embodiments, the visual speech recognition system 106 further utilizes the electronic transcription 506 to generate digital audio content 508 for the speech portrayed within the digital video 510.”; para. 0071 “In some cases, upon generating an electronic transcription for a digital video, the visual speech recognition system 106 also generates digital audio content for the digital video. In some cases, the visual speech recognition system 106 utilizes a computerized speech model (e.g., artificial intelligence, speech reading algorithm) to create audio content for the text of the electronic transcription.”).
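For illustration only (the `synthesize` callable below is a hypothetical stand-in for the computerized speech model; nothing here is taken from Kumar), generating audio content for the text of the electronic transcription and placing it at the target playback times, as described in para. 0066 and para. 0071, can be sketched as:

```python
# Hypothetical sketch (not Kumar's implementation): render each transcribed span with a
# text-to-speech callable and place it on an audio track at its frame-derived timestamps.
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class TranscriptSpan:
    text: str        # words recovered from the viseme sequence
    start_s: float   # playback time derived from the aligned video frames
    end_s: float

def render_transcript_audio(spans: list[TranscriptSpan],
                            synthesize: Callable[[str], np.ndarray],
                            sample_rate: int, total_s: float) -> np.ndarray:
    """Build a track the length of the video with each span's audio at its target time."""
    track = np.zeros(int(total_s * sample_rate), dtype=np.float32)
    for span in spans:
        audio = synthesize(span.text).astype(np.float32)
        start = int(span.start_s * sample_rate)
        end = min(start + audio.shape[0], int(span.end_s * sample_rate), track.shape[0])
        track[start:end] = audio[: end - start]
    return track
```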
Regarding claim 16, Kumar discloses wherein the second model is trained to encode the visual cue by assigning a code to the visual cue (para. 0062 “In relation to FIG. 4, the GAN 404 is trained to generate possible sequences of visemes from input visual speech embeddings (e.g., the segmented visemic speech representations). The GAN 404 (i.e., a generator G) maps the sequence of segmented visemic speech representations S=s.sub.1, s.sub.2, . . . , s.sub.T to viseme representations V=v, v.sub.2, . . . , v.sub.L.”).
Regarding claim 17, Kumar discloses remove background noise from the audio data (para. 0073 “Furthermore, in one or more instances, the visual speech recognition system 106 utilizes the visual speech recognition framework to generate audio content for digital videos that have noisy, missing, and/or corrupted audio. In some cases, the visual speech recognition system 106 utilizes the visual speech recognition framework to repair (via electronic transcriptions and audio content) noisy, missing, and/or corrupted audio feeds of video conference calls.”), and determine the pronunciation data based on the visual data that comprises visual cues of the person (para. 0050 “In one or more embodiments, the visual speech recognition system 106 utilizes an image encoder to generate deep embeddings (e.g., deep visual speech representations) from a digital video frame (e.g., an image) that capture facial features portrayed in the image.”; para. 0052 “Furthermore, as illustrated in FIG. 3, the visual speech recognition system 106 generates visual speech representation clusters 310 from the visual speech representations 306 utilizing a clustering model 308.”; para. 0054 “Moreover, as illustrated in FIG. 3, the visual speech recognition system 106 further determines segmented visemic speech representations 314 from the visual speech representation clusters 310. In particular, as shown in FIG. 3, the visual speech recognition system 106 utilizes a segmentation model 312 to identify speech segment boundaries from the visual speech representation clusters 310 to identify groupings of digital video frames.”).
Regarding claim 18, Kumar discloses A non-transitory computer-readable medium storing instructions that, when executed, cause (para. 0111-0112): obtaining audio-visual data comprising visual data associated with a person (para. 0028 “As further used herein, the term “digital video” refers to a series of visual representations (e.g., a series of digital images that, when rendered in sequence, portray a character speaking).) and audio data associated with the person (para. 0073 “Furthermore, in one or more instances, the visual speech recognition system 106 utilizes the visual speech recognition framework to generate audio content for digital videos that have noisy, missing, and/or corrupted audio.”); determining, by utilizing a first model, pronunciation data associated with speech by the person, based on the visual data (para. 0050 “In one or more embodiments, the visual speech recognition system 106 utilizes an image encoder to generate deep embeddings (e.g., deep visual speech representations) from a digital video frame (e.g., an image) that capture facial features portrayed in the image.”; para. 0052 “Furthermore, as illustrated in FIG. 3, the visual speech recognition system 106 generates visual speech representation clusters 310 from the visual speech representations 306 utilizing a clustering model 308.”; para. 0054 “Moreover, as illustrated in FIG. 3, the visual speech recognition system 106 further determines segmented visemic speech representations 314 from the visual speech representation clusters 310. In particular, as shown in FIG. 3, the visual speech recognition system 106 utilizes a segmentation model 312 to identify speech segment boundaries from the visual speech representation clusters 310 to identify groupings of digital video frames.”); converting, by utilizing a second model, the speech to encoded data (para. 0062 “In relation to FIG. 4, the GAN 404 is trained to generate possible sequences of visemes from input visual speech embeddings (e.g., the segmented visemic speech representations). The GAN 404 (i.e., a generator G) maps the sequence of segmented visemic speech representations S=s.sub.1, s.sub.2, . . . , s.sub.T to viseme representations V=v, v.sub.2, . . . , v.sub.L.”); synthesizing, by utilizing the second model, the speech based on the encoded data to obtain synthesized speech (para. 0066 “Furthermore, as previously mentioned and as shown in FIG. 4, the visual speech recognition system 106 utilizes the GAN 404 to generate a viseme sequence for the segmented visemic speech representations and subsequently generate an electronic transcript 408 that represents speech portrayed within the digital video. Indeed, FIG. 5 illustrates the visual speech recognition system 106 utilizing a language decoder 504 to generate an electronic transcription 506 for a digital video 510 from a viseme sequence 502 that is generated for the digital video 510. Additionally, as shown in FIG. 5, in some embodiments, the visual speech recognition system 106 further utilizes the electronic transcription 506 to generate digital audio content 508 for the speech portrayed within the digital video 510.”; para. 0071 “In some cases, upon generating an electronic transcription for a digital video, the visual speech recognition system 106 also generates digital audio content for the digital video. 
In some cases, the visual speech recognition system 106 utilizes a computerized speech model (e.g., artificial intelligence, speech reading algorithm) to create audio content for the text of the electronic transcription.”) and outputting the synthesized speech while playing or rendering the visual data (synthesized speech (generated digital audio content based on the transcription) is aligned with the digital frames of digital video: para. 0071 “Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript. The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”; para. 0038 “To access the functionalities of the visual speech recognition system 106 (as described above), in one or more implementations, a user interacts with one of the digital graphics applications 112a-112n on the client devices 110a-110n. For instance, the digital graphics applications 112a-112n include one or more software applications installed on the client devices 110a-110n (e.g., to capture, play, and/or modify digital videos in accordance with one or more implementations herein).”; Such synthesis is used, for example, in online conferences: para. 0073 “In some cases, the visual speech recognition system 106 utilizes the visual speech recognition framework to repair (via electronic transcriptions and audio content) noisy, missing, and/or corrupted audio feeds of video conference calls.”), the synthesized speech is synchronized with a mouth movement associated with the person (synthesized speech is synchronized with mouth motion (digital audio aligned with timestamps of video corresponding to the when particular words are mouthed by user utilizing viseme representations): para. 0071 “In some cases, upon generating an electronic transcription for a digital video, the visual speech recognition system 106 also generates digital audio content for the digital video…Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript. The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”) such that a segment of the audio data determined as a corrupted portion (para. 0073 “Furthermore, in one or more instances, the visual speech recognition system 106 utilizes the visual speech recognition framework to generate audio content for digital videos that have noisy, missing, and/or corrupted audio.”) is replaced by a corresponding subset of the synthesized speech being output while the mouth movement, being presented by a display device (para. 
0037 “For example, the client devices 110a-110n perform functions such as, but not limited to, capturing digital videos, editing digital videos, playing digital videos, and/or requesting speech recognition for digital videos.”), of the person is speaking corresponding words associated with a same plurality of words of the subset of the synthesized speech (digital audio (synthesized speech) to replace the corrupted audio is based on an electronic transcription of the words spoken in the corrupted audio; thus, the synthesized digital audio contains the same plurality of words that are being mouthed by the user: para. 0071).
Regarding claim 19, Kumar discloses the outputting, by utilizing the second model, of the synthesized speech as computer-generated synthesized speech (para. 0066 “Furthermore, as previously mentioned and as shown in FIG. 4, the visual speech recognition system 106 utilizes the GAN 404 to generate a viseme sequence for the segmented visemic speech representations and subsequently generate an electronic transcript 408 that represents speech portrayed within the digital video. Indeed, FIG. 5 illustrates the visual speech recognition system 106 utilizing a language decoder 504 to generate an electronic transcription 506 for a digital video 510 from a viseme sequence 502 that is generated for the digital video 510. Additionally, as shown in FIG. 5, in some embodiments, the visual speech recognition system 106 further utilizes the electronic transcription 506 to generate digital audio content 508 for the speech portrayed within the digital video 510.”; para. 0071 “In some cases, upon generating an electronic transcription for a digital video, the visual speech recognition system 106 also generates digital audio content for the digital video. In some cases, the visual speech recognition system 106 utilizes a computerized speech model (e.g., artificial intelligence, speech reading algorithm) to create audio content for the text of the electronic transcription.”); and outputting the computer-generated synthesized speech while playing or rendering the audio-visual data (para. 0071 “Then, in some implementations, the visual speech recognition system 106 overlays the audio content within the digital video in a target playback time as indicated by the electronic transcription.”), wherein the computer-generated synthesized speech is synchronized with one or more mouth movements associated with the person (para. 0071 “Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript. The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”).
Regarding claim 20, Kumar discloses determining a duration in which the corrupted portion occurs (para. 0071 “Specifically, the visual speech recognition system 106 can determine timestamps corresponding to the audio and electronic transcript based on digital frames utilized to generate the segmented visemic speech representations mapped to the visemes and corresponding words of the electronic transcript.”); and outputting the synthesized speech during the duration (para. 0071 “The visual speech recognition system 105 can align the timestamps of the audio/electronic transcript with timestamps of digital frames of the digital video. Thus, the visual speech recognition system 106 can align generated digital audio to digital frames of the digital video.”).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Khalilia & Mishra (US 11,404,087 B1): audio frame replacement based on approximating a portion of speech in the replaced audio frame using location data of the lips (Fig. 14, Col. 26, Lines 30-56).
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CODY DOUGLAS HUTCHESON whose telephone number is (703)756-1601. The examiner can normally be reached M-F 8:00AM-5:00PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CODY DOUGLAS HUTCHESON/ Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/ Supervisory Patent Examiner, Art Unit 2659