DETAILED ACTION
This communication is in response to the Amendments and Arguments filed on 01/14/2026.
Claims 1-20 are pending and have been examined.
All previous objections/rejections not mentioned in this Office Action have been withdrawn by the examiner.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments with respect to claim 1 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Please see the updated mappings below, citing the art of Gilson, for further detail.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1-14 are rejected on the ground of nonstatutory double patenting as being unpatentable over claim 4 of U.S. Patent No. 11,942,093. Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of the issued patent/co-pending application anticipate the claims of the instant application. Please see the mapping in the table below, where the bolded limitations indicate the corresponding limitations between the issued patent/co-pending application and the instant application. With respect to the dependent claims, each claim maps to a corresponding dependent claim of the issued patent/co-pending application or falls within the scope of the independent claim.
With respect to each of the independent and dependent claims, the mapping is as follows, where (I) denotes a claim of the instant application and (P) denotes a claim of the issued patent/co-pending application: claims 1-14 (I) each correspond to claim 4 (P).
Instant Application: 18/403,829
Claim 1: A method for generating captions for audiovisual media, comprising the steps of:
converting a speech component of an audio portion of audiovisual media into at least one text string, wherein the at least one text string comprises at least one word;
determining a temporal start point and a temporal end point for the at least one word;
visually inserting the at least one word in a video portion of the audiovisual media such that the temporal start point and the temporal end point for the at least one word are synchronized with corresponding temporal start and end points of the speech component of the audio portion of the audiovisual media; and
selectively inserting a latency period into broadcast of the audiovisual media such that the synchronization may be selectively adjusted by a user during the latency period.
Issued Patent: US 11,942,093
Claim 1: A system that performs dubbing automatically for multiple languages simultaneously using speech-to-text transcriptions and language translation comprising:
a. a first device that captures an original video program further comprising video image frames and synchronized audio speech by one or more speakers recorded in a source language;
b. a first transmitter that transmits the original video program;
c. a second device that processes the original video program and transmits it to a transcription service that converts the synchronized audio speech to text strings, wherein each text string further comprises a plurality of words;
ii. determines the temporal start and end points for each of the plurality of words;
iii. from the temporal start and end points for each of the plurality of words, determines timing of pauses between each of the plurality of words;
iv. from the timing of the pauses, determines which words in each text string form phrases and which words in each text string form sentences;
v. assigns temporal anchors to each phrase and sentence;
vi. assigns parameters to each word, phrase and sentence, wherein said parameters determine: a speaker identifier; a gender of the speaker; whether the speaker is an adult or a child; an inflection and emphasis of each word in the phrase; a volume of each word in the phrase; a tonality of each word in the phrase; a raspness of each word in the phrase; and an emotional indicator for the phrase, wherein the speaker identifier and the emotional indicator are each determined using artificial intelligence;
vii. synchronizes the assigned parameters of each word, phrase and sentence using the temporal anchors within each text string;
d. a translation engine that produces a plurality of text script in various target languages from each phrase, wherein each of plurality of text scripts contains a series of concatenated text strings along with associated inflection, tonality, emphasis, raspness, emotional indication, and volume indicators as well as timing and speaker identifiers for each word, phrase, and sentence that is derived from the synchronized audio speech recorded in the source language;
e. a dubbing engine that creates audio strings in the various target languages that are time synchronized to their source language audio strings by utilizing the temporal anchors;
f. an analysis module that analyzes the optional placement and superposition of subtitles comprising the text strings in either the source language or the various target languages onto the original video program, wherein the analysis of the optional placement and the superposition of the subtitles is performed using artificial intelligence; and
g. a second transmitter that transmits the original video program containing the created audio strings in the various target languages, and which may also optionally comprise the subtitles.
Claim 4: The system of claim 1 wherein transmission of the original video program containing the created audio strings is delayed.
Regarding the differences between claim 1 of the instant application and system claims 1 and 4 of the issued patent/co-pending application, it would have been obvious to one of ordinary skill in the art that the system limitations of the issued patent/co-pending application could be applied to perform the method as presented in the instant application.
As to claim 15, this claim is rejected over claim 4 of the issued patent/co-pending application in view of McCartney, Jr. et al. (U.S. PG Pub No. 2020/0404386), as found in the IDS, hereinafter McCartney.
Please see the respective claim mappings below for further detail.
As to claim 16, this claim is rejected over claim 4 of the issued patent/co-pending application in view of McCartney, and further in view of Gilson (U.S. PG Pub No. 2020/0051582), hereinafter Gilson.
Please see the respective claim mappings below for further detail.
As to claims 17 and 18, these claims are rejected over claim 4 of the issued patent/co-pending application in view of Gilson.
Please see the respective claim mapping below for further detail.
As to claims 19 and 20, these claims are rejected over claim 4 of the issued patent/co-pending application in view of McCartney, in view of Gilson, and further in view of Kim et al. (U.S. PG Pub No. 2020/0342852), as found in the IDS, hereinafter Kim.
Please see the respective claim mapping below for further detail.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Chaudhuri et al. (U.S. PG Pub No. 2017/0316792), hereinafter Chaudhuri, in view of Gilson.
Regarding claim 1, Chaudhuri teaches
A method for generating captions for audiovisual media (a method [0004]), comprising the steps of:
converting a speech component of an audio portion of audiovisual media into at least one text string, wherein the at least one text string comprises at least one word (a speech recognition engine may transcribe the speech in the content item, such as a video with audio, into caption text, i.e. converting a speech component of an audio portion of audiovisual media into at least one text string, and the text may be a phrase including a number of words, i.e. the at least one text string comprises at least one word [0023-4],[0078]);
determining a temporal start point and a temporal end point for the at least one word (a timing window is determined with a start timestamp and an end timestamp, i.e. determining a temporal start point and a temporal end point, based on the time when the speech sounds, for which the text is a transcription, start and end, i.e. for the at least one word [0024],[0030]);
visually inserting the at least one word in a video portion of the audiovisual media such that the temporal start point and the temporal end point for the at least one word are synchronized with corresponding temporal start and end points of the speech component of the audio portion of the audiovisual media (the text for a caption entry is visually presented during playback, i.e. visually inserting the at least one word in a video portion of the audiovisual media, between the start timestamp and the end timestamp of the timing window associated with that caption entry, as determined by analyzing the beginning and ending of the speech sounds, i.e. such that the temporal start point and the temporal end point for the at least one word are synchronized with corresponding temporal start and end points of the speech component of the audio portion of the audiovisual media [0023-4],[0030],[0034-5],[0078]).
While Chaudhuri provides for adjusting the timing of captions according to audio, Chaudhuri does not specifically teach inserting a latency period into a broadcast, and thus does not teach
selectively inserting a latency period into broadcast of the audiovisual media such that the synchronization may be selectively adjusted by a user during the latency period.
Gilson, however, teaches selectively inserting a latency period into broadcast of the audiovisual media such that the synchronization may be selectively adjusted by a user during the latency period (for live media content, the broadcaster may use a time delay between receiving the live media content, such as a video program, and transmitting the media content, which gives time to generate and synchronize captions, i.e. selectively inserting a latency period into broadcast of the audiovisual media, where a human transcriber may correct the words in the initial transcript, and where the first and second transcripts are synchronized to create the final captions [0024-5],[0028-9],[0031],[0055],[0068-9],[0085-6]).
Chaudhuri and Gilson are analogous art because they are from a similar field of endeavor in providing text captions for video with audio. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Chaudhuri of adjusting the timing of captions according to audio with the use of a transmission delay to allow correcting transcripts for captions as taught by Gilson. It would have been obvious to combine the references to enable the improvement of the accuracy of transcripts by using multiple transcribers (Gilson [0004]).
Regarding claim 2, Chaudhuri in view of Gilson teaches claim 1, and Chaudhuri further teaches
the at least one word comprises a plurality of words (a speech recognition engine may transcribe the speech in the content item, such as a video with audio, into caption text, where the text may be a phrase including a number of words, i.e. a plurality of words [0023-4],[0078]), and wherein the method further comprises selectively adjusting visual segmentation of the plurality of words (there may be short gaps in the speech timing where the speech sounds essentially continuous, but the timing window may be split at the gap to allow for generation of caption boxes that are split naturally at the gap point to create a more visually pleasing result [0023-4],[0068]).
Regarding claim 3, Chaudhuri in view of Gilson teaches claim 2, and Gilson further teaches
the selective adjustment of the visual segmentation of the plurality of words is performed by the user during the latency period (for live media content, the broadcaster may use a time delay between receiving the live media content, such as a video program, and transmitting the media content, which gives time to generate and synchronize captions, i.e. during the latency period, where a human transcriber may correct the words in the initial transcript by adding, deleting, or modifying words, or metadata can be added to the transcript, such as the speaker’s name, and where the first and second transcripts are synchronized to create the final captions, including adding caption text to indicate who is speaking, i.e. the selective adjustment of the visual segmentation of the plurality of words is performed by the user [0024-5],[0028-9],[0031],[0046],[0053],[0055],[0068-9],[0085-6]).
Where Chaudhuri further teaches that the alignment includes splitting timing windows for split caption boxes, i.e. selective adjustment of the visual segmentation of the plurality of words [0023-4],[0068].
And where the motivation to combine is the same as previously presented.
Regarding claim 4, Chaudhuri in view of Gilson teaches claim 2, and Chaudhuri further teaches
the selective adjustment of the visual segmentation of the plurality of words is performed by a machine learning-based system (there may be short gaps in the speech timing where the speech sounds essentially continuous, but the timing window may be split at the gap to allow for generation of caption boxes that are split naturally at the gap point to create a more visually pleasing result, i.e. the selective adjustment of the visual segmentation of the plurality of words, where the presence of speech is determined by a machine learning model, i.e. performed by a machine learning-based system [0023-4],[0030-1],[0046],[0068]).
Regarding claim 17, Chaudhuri in view of Gilson teaches claim 1, and Gilson further teaches
displaying a countdown to the user, wherein the countdown indicates a remaining time during the latency period (the GUI may display a transmit time deadline as a countdown indication, i.e. displaying a countdown to the user, which shows whether there is enough time to generate captions for transmission during the delay, i.e. the countdown indicates a remaining time during the latency period Fig. 8,[0068-9],[0086]).
Regarding claim 18, Chaudhuri in view of Gilson teaches claim 1, and Gilson further teaches
an output stream including the audiovisual media is locked after expiration of the latency period (once the transmit deadline is reached, i.e. after expiration of the latency period, the system proceeds with a selected transcriber output, and the transcriber device displays the next speech to text output and/or plays back another audio segment, where captions are generated from the output, and the captions and synchronized media content are transmitted, i.e. an output stream including the audiovisual media is locked Fig. 7,[0052],[0055],[0068-9],[0077],[0085-6]).
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Chaudhuri, in view of Gilson, and further in view of Gonzales et al. (U.S. PG Pub No. 2003/0216922), hereinafter Gonzales.
Regarding claim 5, Chaudhuri in view of Gilson teaches claim 1.
While Chaudhuri in view of Gilson provides that a transcriber can be translation software, Chaudhuri in view of Gilson does not specifically teach translating at least one word into a selected language, and thus does not teach
the step of translating the at least one word into a selected language prior to the step of visually inserting the at least one word in the video portion of the audiovisual media.
Gonzales, however, teaches the step of translating the at least one word into a selected language prior to the step of visually inserting the at least one word in the video portion of the audiovisual media (the subtitles are translated with a text-to-text machine translation block into the target language chosen by the viewer, i.e. translating the at least one word into a selected language, which can then be combined with the video data from the video delay buffer and multiplexed with the delayed audio data, i.e. prior to the step of visually inserting the at least one word in the video portion of the audiovisual media [0026],[0032]).
Chaudhuri, Gilson, and Gonzales are analogous art because they are from a similar field of endeavor in providing text captions for video with audio. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Chaudhuri, as modified by Gilson, of using translation software as a transcriber with the translation of subtitles into a target language chosen by the viewer as taught by Gonzales. It would have been obvious to combine the references to enable a user to choose text and/or audio translations and to enable full synchronization of video, audio, and translated subtitles as chosen by the user at a later time (Gonzales [0026],[0034-5]).
Claims 6-9 are rejected under 35 U.S.C. 103 as being unpatentable over Chaudhuri, in view of Gilson, in view of Gonzales, and further in view of McCartney.
Regarding claim 6, Chaudhuri in view of Gilson and Gonzales teaches claim 5, and Chaudhuri further teaches
the at least one word comprises a plurality of words (a speech recognition engine may transcribe the speech in the content item, such as a video with audio, into caption text, where the text may be a phrase including a number of words, i.e. a plurality of words [0023-4],[0078]).
While Chaudhuri in view of Gilson and Gonzales provides identifying gaps in speech, Chaudhuri in view of Gilson and Gonzales does not specifically teach identifying the timing of pauses between each of the words, and thus does not teach
determining a timing of pauses between each of the words.
McCartney, however, teaches determining a timing of pauses between each of the words (speech recognition data includes a plurality of generated character strings, i.e. text string, where a phrase, such as 'hello world' may be recognized as a character string where there is no time between when 'hello' ends and 'world' begins, but there may be a gap between the words “is” and “fried”, i.e. timing of pauses between each of the words Fig. 5C,[0039],[0048],[0057:1-10]).
Where Chaudhuri further teaches the determination of gaps in speech [0023-4],[0068].
Chaudhuri, Gilson, Gonzales, and McCartney are analogous art because they are from a similar field of endeavor in processing speech to determine subtitle information. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Chaudhuri of identifying gaps in speech, as modified by Gilson and Gonzales, with the recognition of the timing of specific words as taught by McCartney. It would have been obvious to combine the references to improve the syncing of translated audio for dubbed speech so that it is more appealing to users (McCartney [0001]).
Regarding claim 7, Chaudhuri in view of Gilson, Gonzales, and McCartney teaches claim 6, and McCartney further teaches
determining groups of the plurality of words which form phrases and sentences from the temporal start and end points for each of the words (speech recognition data includes a plurality of generated character strings, i.e. text string, where a phrase, such as 'hello world' may be recognized as a character string where there is no time between when 'hello' ends and 'world' begins, i.e. determining groups of the plurality of words which form phrases…from the temporal start and end points for each of the words, where sentences may be generated from consecutive sentence fragments that are close in time with short or no gaps between the end time of one sentence fragment and the start time of another sentence fragment, i.e. determining groups of the plurality of words which form…sentences from the temporal start and end points for each of the words Fig. 5C,[0026],[0039],[0041-2],[0048],[0095],[0104]).
Where the motivation to combine is the same as previously presented.
Regarding claim 8, Chaudhuri in view of Gilson, Gonzales, and McCartney teaches claim 7, and McCartney further teaches
assigning temporal anchors to each of the words, phrases and sentences (speech recognition data includes a plurality of generated character strings, where the words “hello” and “world” have a start and end time, i.e. assigning temporal anchors to each of the words, where a phrase, such as 'hello world' may be recognized as a character string where there is no time between when 'hello' ends and 'world' begins, i.e. assigning temporal anchors to each of the…phrases, and where sentences may be generated from consecutive sentence fragments that are close in time with short or no gaps between the end time of one sentence fragment and the start time of another sentence fragment, i.e. assigning temporal anchors to each of the…sentences Fig. 5C,[0026],[0039],[0041-2],[0048],[0095],[0104]).
Where the motivation to combine is the same as previously presented.
Regarding claim 9, Chaudhuri in view of Gilson, Gonzales, and McCartney teaches claim 8, and McCartney further teaches
determining at least one parameter associated with each of the words, phrases and sentences (audio voice identification may be used to determine the speaker ID information for each speaker in the video, where the speaker ID information can include the identity of the person who spoke the dialogue, as well as demographic information related to the speaker's gender, age, screen position, or any other relevant information related to the speaker's identity, i.e. determining at least one parameter, where each sentence fragment is associated with a speaker ID, and sentences are a combination of sentence fragments associated with the same speaker ID, i.e. associated with each of the words, phrases and sentences [0039],[0041],[0048],[0085],[0094-5],[0103-4]).
Where the motivation to combine is the same as previously presented.
Claims 10-16, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chaudhuri, in view of Gilson, in view of Gonzales, in view of McCartney, and further in view of Kim.
Regarding claim 10, Chaudhuri in view of Gilson, Gonzales, and McCartney teaches claim 9, and McCartney further teaches
the at least one parameter is selected from the group consisting of identification of a speaker, a gender of the speaker, an age of the speaker, …and combinations thereof (audio voice identification may be used to determine the speaker ID information for each speaker in the video, where the speaker ID information can include the identity of the person who spoke the dialogue, as well as demographic information related to the speaker's gender, age, or any other relevant information related to the speaker's identity [0094-5],[0103-4]).
Gilson teaches a volume, a raspness, an emotional indicator (time-coded meta information characterizing the audio can include volume designations, i.e. a volume, sentiment, i.e. emotional indicator, and accent, i.e. raspness [0026],[0041],[0043]).
While Chaudhuri in view of Gilson, Gonzales, and McCartney provides speaker information, volume, and sentiment associated with the captions, Chaudhuri in view of Gonzales and McCartney does not specifically teach the parameters of inflection and emphasis, tonality, and raspness, and thus does not teach
an inflection and emphasis,…a tonality.
Kim, however, teaches an inflection and emphasis,…a tonality (an articulatory feature, such as tone, i.e. tonality, and pitch, i.e. inflection; a prosody feature, such as accentuation, i.e. raspness; and features such as emphasis, may be extracted from the speech data, where the speech feature may be related to a phoneme pronunciation [0056],[0058],[0060],[0065],[0132],[0151]).
Chaudhuri, Gilson, Gonzales, McCartney, and Kim are analogous art because they are from a similar field of endeavor in automatic dubbing of video content into multiple languages. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Chaudhuri of identifying prosody recommendations based on age or voice characteristics, as modified by Gilson, Gonzales, and McCartney, with the specific recognition of an accent as taught by Kim. The motivation to do so would have been to achieve a predictable result of generating output speech data for text in a second language that simulates a speaker's speech (Kim [0058]).
Regarding claim 11, Chaudhuri in view of Gilson, Gonzales, McCartney, and Kim teaches claim 10, and Kim further teaches
the at least one parameter is determined using a machine learning-based system (the speaker identification network may extract feature information from the speech of the speaker using a machine learning model [0083]).
Where the motivation to combine is the same as previously presented.
Regarding claim 12, Chaudhuri in view of Gilson, Gonzales, McCartney, and Kim teaches claim 10, and McCartney further teaches
synchronizing the at least one parameter of each of the words, phrases and sentences with the temporal anchor associated therewith (audio voice identification may be used to determine the speaker ID information for each speaker in the video, where the speaker ID information can include the identity of the person who spoke the dialogue, as well as demographic information related to the speaker's gender, age, screen position, or any other relevant information related to the speaker's identity, i.e. the at least one parameter, where each sentence fragment is associated with a speaker ID, and sentences are a combination of sentence fragments associated with the same speaker ID, and sentence fragments have specific start and end times, i.e. synchronizing…each of the words, phrases and sentences with the temporal anchor associated therewith [0039],[0041],[0048],[0085],[0094-5],[0103-4]).
Where the motivation to combine is the same as previously presented.
Regarding claim 13, Chaudhuri in view of Gilson, Gonzales, McCartney, and Kim teaches claim 12, and McCartney further teaches
converting each of the words, phrases and sentences into corresponding dubbed audio (the translated sentences, i.e. each of the words, phrases and sentences, are transformed into translated audio speech using a voice synthesizer to be overlaid onto the video, i.e. converting…into corresponding dubbed audio [0103-6]).
Where the motivation to combine is the same as previously presented.
Regarding claim 14, Chaudhuri in view of Gilson, Gonzales, McCartney, and Kim teaches claim 13, and McCartney further teaches
embedding the dubbed audio in the audio portion of the audiovisual media corresponding to the temporal anchors associated therewith (the translated audio speech is adjusted to ensure that the overlay of the translated audio speech matches the timing of the original audio speech [0103-6],[0112-4]).
Where the motivation to combine is the same as previously presented.
Regarding claim 15, Chaudhuri in view of Gilson, Gonzales, McCartney, and Kim teaches claim 14, and McCartney further teaches
applying the at least one parameter to the words, phrases and sentences of the dubbed audio prior to the step of embedding the dubbed audio in the audio portion of audiovisual media (the translated sentences, i.e. words, phrases and sentences, are transformed into translated audio speech using a voice synthesizer to be overlaid onto the video, i.e. step of embedding the dubbed audio in the audio portion of audiovisual media, where the machine generated audio speech matches the corresponding speaker ID properties for each translated sentence, i.e. applying the at least one parameter to the words, phrases and sentences of the dubbed audio prior to [0103-7]).
Where the motivation to combine is the same as previously presented.
Regarding claim 16, Chaudhuri in view of Gilson, Gonzales, McCartney, and Kim teaches claim 15, and Gilson further teaches
selectively adjusting at least one quality factor associated with the dubbed audio during the latency period, the selective adjustment of the at least one quality factor being performed by the user (for live media content, the broadcaster may use a time delay between receiving the live media content, such as a video program, and transmitting the media content, which gives time to generate and synchronize captions, i.e. during the latency period, where a human transcriber may correct the words in the initial transcript by adding, deleting, or modifying words, or metadata can be added to the transcript, such as the speaker’s name, sentiment, accent, language, and volume, i.e. selectively adjusting at least one quality factor associated with the … audio during the latency period…being performed by the user [0024-6],[0028-9],[0031],[0041],[0043],[0046],[0053],[0055],[0068-9],[0085-6]).
Where McCartney teaches that the translated audio speech is transformed using a voice synthesizer and the speaker properties for each translated sentence [0103-7].
And where the motivation to combine is the same as previously presented.
Regarding claim 19, Chaudhuri in view of Gilson, Gonzales, McCartney, and Kim teaches claim 16, and Gilson further teaches
the at least one quality factor is selected from the group consisting of volume, tonal quality, pauses, and combinations thereof (time-coded meta information characterizing the audio, i.e. the at least one quality factor is selected from the group consisting of, can include volume designations, i.e. volume [0026],[0041],[0043]).
Where Kim further teaches tonal quality, pauses, and combinations thereof (an articulatory feature, such as tone, i.e. tonality, and prosody features such as information on pause duration, i.e. pauses, may be extracted from the speech data, where the speech feature may be related to a phoneme pronunciation and used by a speech synthesizer [0056],[0058-60],[0065],[0132],[0151]).
And where the motivation to combine is the same as previously presented.
Regarding claim 20, Chaudhuri in view of Gilson, Gonzales, McCartney, and Kim teaches claim 19, and McCartney further teaches
the at least one quality factor is selected from the group consisting of synchronization of the dubbed audio (the audio rate of the translated audio speech, i.e. the at least one quality factor is selected from the group, may be adjusted in order to match the duration of the translated speech segment to the corresponding video segment, i.e. synchronization of the dubbed audio [0112-4]).
Where the motivation to combine is the same as previously presented.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICOLE A K SCHMIEDER whose telephone number is (571)270-1474. The examiner can normally be reached 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NICOLE A K SCHMIEDER/Primary Examiner, Art Unit 2659