DETAILED ACTION
This office action is in response to Applicant’s Amendment/Request for Reconsideration, received on 02/17/2026. Claims 1-2, 4, 9-11, 13, and 19-20 have been amended. Claims 8 and 17 have been cancelled. Claims 21 and 22 have been added. Claims 1-7, 9-16, and 18-22 are pending and have been considered.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments, see pg. 11, filed 02/17/2026, with respect to the rejection(s) of independent claim(s) 1, 10, and 19 under 35 U.S.C. 103 (Moreno in view of Serry) have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Thomson et al. (US-20200175961-A1), hereinafter Thomson. Thomson discloses “obtaining first audio data of a first communication session between a first and second device and during the first communication session, obtaining a first text string that is a transcription of the first audio data and training a model of an automatic speech recognition system using the first text string and the first audio data” (abstract). See updated rejections below.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5, 7, 9-14, 16, and 18-22 are rejected under 35 U.S.C. 103 as being unpatentable over Moreno et al. (US-8131545-B1), hereinafter Moreno, in view of Serry et al. (US-20250156642-A1), hereinafter Serry, further in view of Thomson et al. (US-20200175961-A1), hereinafter Thomson.
Regarding claim 1, Moreno discloses: a method ([Col. 4, Line 18] methods described here) comprising:
receiving a training sample comprising an audio file ([Fig. 1, Audio Data 106], [Col. 5, Lines 25-28] audio data 106 stored at the service client system 104 can later be accessed by speech recognition systems for training language models);
receiving a transcript of the audio file ([Fig. 1, Transcript 108], [Col. 4, Line 40] model builder 112 receives the transcript 108);
generating a predicted tokens sequence from the audio file ([Col. 6, Lines 1-5] The speech recognizer 212 can use a dictionary 214 to identify candidates for the recognized words. For example, if a particular recognized text is similar to a word in the dictionary 214, the dictionary word can be chosen as the recognized text, [Candidates of recognized words track to predictions for the recognized words, wherein words are reasonably understood to represent tokens, see [0034] of instant app]),
generating predicted timing labels ([Col. 6, Lines 15-17] The speech recognizer 212 may provide the recognized words and the times at which the recognized words occur in the audio), wherein each predicted token has an associated predicted timing label ([Col. 4, Lines 34-36] the speech recognizer 110 can output recognized words and a start time and stop time for each of the recognized words, [In view of the previously disclosed candidate recognized words indicating that each predicted, i.e. recognized, token, i.e. word, has an associated predicted timing label]);
predicting a ground truth tokens sequence from the transcript ([Col. 5, Lines 38-40] the factor automaton 208 can be navigated, or explored, to retrieve all possible substrings of the text in the transcript 204, [Wherein substrings of text (including a substring representing the entire string) are reasonably understood to be representative of a token sequence, i.e. if the tokens consist of phonemes and/or words]);
mapping the ground truth tokens, generated from the transcript to the predicted tokens, generated from the audio file, finding matched tokens ([Col. 6, Lines 18-20] The text aligner 216 may receive the recognized words and locate the recognized words in the factor automaton 208, [Locating recognized words, i.e. predicted tokens, in the factor automaton, i.e. containing the ground truth tokens, indicates the locating to be a mapping/matching operation between the two sources]);
assigning, to the ground truth tokens, the timing labels of the matched tokens ([Col. 9, Lines 29-37] the text aligner 216 can identify the times at which those recognized words occur and can associate the identified times with the corresponding words from the automaton arcs…the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204, [Associating times to words in automaton arcs, i.e. the ground truth tokens, indicates an assignment of timing based on the timings of the recognized words, i.e. predicted tokens, indicating the times to match when the tokens do, i.e. having corresponding words with corresponding times]);
dividing the audio file into chunks, based at least in part on the assigned timing labels ([Col. 9, Lines 30-35] The process 400 can align (410) a portion of the transcript with a portion of the audio data using the identified times, [Aligning a portion of a transcript based on identified times indicates the portion represents a division of the larger audio file based on the times of recognized words having assigned timing labels]);
determining portions of the transcript matching the audio file chunks, based at least in part on the assigned timing labels to the matched ground truth tokens ([Col. 9, Lines 45-51] a block of audio data may be sent back to the speech recognizer 212 and/or the text aligner 216 if a degree in which the transcript 204 matches the recognized words from the audio data 202 does not meet a threshold level, [Sending portions of audio back into an alignment process based on a poor degree of matching indicates that there is a determination of matching being made, i.e. portions which do not require a second round of alignment are a match]);
training a model with the audio chunks and the matching transcript portions ([Col. 4, Lines 14-20] These aligned transcripts can, in turn, provide a large audio corpus for training a speech recognizer so that the recognizer improves its accuracy in text recognition. In other implementations, the systems and methods described here may permit the alignment of audio books to their transcriptions, [Training using an aligned transcript for speech recognition indicates it to be trained based on audio chunks to be recognized with matching transcript portions]);
selecting a segment size ([Col. 5, Lines 40-45] The audio segmenter 210 segments or divides the audio data 202 into portions of audio that may be easily processed by a speech recognizer 212, [Col. 5, Lines 1-5] For example, the section of recognized audio being matched may represent a single sentence, [Segmenting audio for easy processing indicates a selected segment size, i.e. sentence, for ease of processing]);
determining a number of predicted tokens, in an alignment window of the segment size, in the predicted tokens sequence ([Col. 5, Lines 55-57] The speech recognizer 212 can analyze the segmented audio data to determine text or words that represent the audio data 202, [Determining words that represent segmented audio indicates a number of identified words, i.e. predicted tokens, for the selected segment having a segment size to facilitate ease of processing]);
aligning, within the alignment window of the segment size, a corresponding number of ground truth tokens from the ground truth tokens sequence equal to the determined number of predicted tokens, to the predicted tokens in the alignment window ([Col. 5, Lines 15-25] The text aligner 114 may align the transcript 108 with the audio data 106 by combining time indicators derived from the speech recognizer 110 with the transcript 108 to form an aligned transcript 116. The time indicators may then specify when text in the transcript occurs relative to the corresponding utterance in the audio data. The text aligner 114 can output the aligned transcript 116, [Aligning transcript with audio data based on times of text, i.e. aligning recognized words with words of the transcript, indicates the alignment is based on time matching. Further, consider Fig. 5 which displays a transcript 504 and a timing of each word 506 indicating the number of words to be equal between the ground truth tokens, i.e. transcription, and the predicted tokens, i.e. the timings on timeline 506]);
assigning timing labels, from the aligned predicted tokens to a selection of the aligned ground truth tokens in the alignment window ([Col. 9, Lines 29-37] the text aligner 216 can identify the times at which those recognized words occur and can associate the identified times with the corresponding words from the automaton arcs…the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204, [Associating times to words in automaton arcs, i.e. the ground truth tokens, indicates an assignment of timing based on the timings of the recognized words, i.e. predicted tokens, indicating the times to match when the tokens do, i.e. having corresponding words with corresponding times]);
advancing the alignment window along the predicted tokens sequence and the ground truth tokens sequence, with a selected overlap, until at least one of the sequences is exhausted ([Col. 4, Lines 48-55] In some implementations, the factor automaton includes a starting state with an arc corresponding to a first word (or other language element unit, such as a phone, phoneme, or syllable) in the transcript 108 (or a first word in a selected portion of the transcript). The factor automaton may also include states with arcs for each of the other words in the transcript 108 (or in the selected portion), [Including states with arcs for other words in the transcript, i.e. ground truth tokens outside of the selected portion, to later be compared/aligned to the audio file recognized words, i.e. predicted tokens, indicates the words outside of the selected portion to be overlapping between each individual sentence analysis when the additional words are not in the sentence currently being aligned, wherein the sliding is performed on a sentence-basis, i.e. a window size of a sentence with additional words outside of the analyzed sentence “overlapping” compared to when the sentence they are actually a part of is aligned]);
performing the determining, the aligning and the assigning until at least one of the sequences is exhausted ([Col. 8, Lines 65-67]-[Col. 9, Lines 1-3] In general, a model builder can generate a factor automaton much larger than the single sentence in this example. For example, an audio data and corresponding transcript may represent an entire television program, movie, theatrical product, radio program, or audio book, [In view of the previous disclosure of Moreno indicating a portion to be representative of a sentence, building a factor automaton, i.e. ground truth tokens, larger than one sentence indicates the previous align/keep/slide operations using these ground truth tokens on a sentence-basis to be repeated and/or performed until the entire document, consisting of ground truth tokens, is considered/exhausted. Further, portion/segment analysis as disclosed in Moreno indicates a combination of segments results in a repeated aligning/keeping/sliding operation for each sentence]);
outputting first stage timings, comprising the selected aligned ground truth tokens and the assigned timing labels ([Col. 6, Lines 24-26] The text aligner 216 can output an aligned transcript 218 that includes the transcript text and the associated times at which the words occur in the audio data 202); and,
generating token frequency for each ground truth token, based on the first stage timings ([Fig. 5, 506a-b, 506h-i], [Col. 10, Lines 29-50] The graph 502 includes a weighting curve 510 for the second occurrence of the word "let's" in the transcript 504. The weighting curve 510 indicates that this instance of the word "let's" has a high probability of occurring at a location between three and four seconds. In one example, a particular comparison operation may begin with a comparison of this second instance of the word "let's" in the transcript to the first recognized word "Let's." The first recognized word occurs between zero second and one second at the times 506a-b. While the first recognized word does match the transcript word, the first recognized word has a very low probability of representing the second occurrence of the word "let's" in the transcript 504 due to the weighting curve 510 having a value at or near zero between the times 506a-b. The eighth recognized word also matches the transcript word "let's." The eighth recognized word occurs between three seconds and four seconds at the times 506h-i. The eighth recognized word has a high probability of representing the second occurrence of the word "let's" in the transcript 504 due to the weighting curve 510 having a value at or near one between three and four seconds, [Performing probability evaluations based on word timings, i.e. first stage timings, wherein the evaluation is also considering a number of occurrences of the same word (comparing instances of the same word suggests a knowledge that more than one copy of the same word exists), indicates a token frequency value which is affecting the probability as displayed in the weighting curve of Fig. 5. The second occurrence of “let’s” can be in one of two positions (token frequency), wherein the more likely position is determined based on timings]).
Moreno does not disclose:
receiving an artificial intelligence training sample; and,
training a supervised artificial intelligence model.
Serry discloses:
receiving an artificial intelligence training sample ([0028] the text classification model 202 backpropagates the output of the text segment classification as a “ground truth” (e.g., a known good value) to update (e.g., train) the text classification model 202, [0068] In some examples, text classification model 202, text segmentation model 204, text encoder 402, fusion-layer transformer 404, and/or segmenter/classifier 406 include or are implemented as a large language model (LLM). Example models may include the GPT models from OpenAI, BARD from Google, and/or Large Language Model Meta AI (LLaMA) from Meta, among other types of artificial intelligence (AI) models.); and,
training a supervised artificial intelligence model ([0070] The LLM is generally trained using supervised learning based on large amounts of annotated text data, [In view of [0068] which disclosed the LLM to be implemented as an artificial intelligence LLM]).
Moreno and Serry are considered analogous art within automatic speech recognition based on associated textual representations. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Moreno to incorporate the teachings of Serry, because Serry performs semantic text segmentation before or in conjunction with text classification to improve the accuracy of the text classification (Serry, [0016]), which would be relevant to the audio-to-text matching operation of Moreno, i.e. the classifications of Serry could be applied to the inputs of Moreno for determining the matches as disclosed in Moreno.
Moreno in view of Serry does not disclose:
generating alignment paths for the ground truth tokens;
scoring the alignment paths, based on token frequency;
selecting a final alignment, based on the scoring; and,
generating second stage alignment timings, based on the final alignment.
Thomson discloses:
generating alignment paths for the ground truth tokens ([0417] the align text process 1406 may find a path that best meets a selected set of performance criteria by constructing a two-dimensional grid representing the first sequence in a first dimension and the second sequence in a second dimension… where N is the number of words in the reference, [Wherein a reference transcription tracks to a ground truth token sequence (see [0235] comparing reference transcription to new audio transcription)]);
scoring the alignment paths, based on token frequency ([0417] The performance criteria may include the lowest cost or the highest score. For example, the cost may be a function of the number of deletions “D,” substitutions “S,” and insertions “I.” If all errors receive the same weight, the cost may be represented by D+S+I. The Viterbi path may then chose the alignment between the first and second sequence that results in the lowest cost as represented by D+S+I. The highest score may represent the Viterbi path that aligns the first and second sequences such that a score such as the number of matching words, the total path probability, or N-(D+S+I), [A perfect transcription with the lowest cost function would be one with no deletions, substitutions, or insertions, indicating a scoring operation based on a comparison of token frequency to a ground truth, i.e. reference, token frequency. The perfect transcription will have the same token frequency as the ground truth because it consists of the same tokens]);
selecting a final alignment, based on the scoring ([As previously disclosed, Thomson selects an alignment path which minimizes the cost function, representing a final alignment as compared to alignments with higher costs]); and,
generating second stage alignment timings, based on the final alignment ([0527] the synchronizer 1902 may use a Viterbi search or other dynamic programming method to align and identify segment matches in the first and second transcriptions. In some embodiments, the synchronizer 1902 may use information from the transcription units 1914 to align the first and second transcriptions. For example, the synchronizer 1902 may use word endpoints from ASR systems in the transcription units 1914 to align the first and second transcriptions, [In view of the previously disclosed Viterbi cost algorithm of Thomson used for alignment, the aligned tokens resulting from a second alignment as currently defined will result in second stage alignment timings, wherein these are “generated” when the output transcription of the final alignment is generated at 1410 after the final alignment is generated at 1408/1409 (see fusing of alignments based on endpoints, [0708])]).
Moreno, Serry, and Thomson are considered analogous art within speech-to-text alignment. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Moreno in view of Serry to incorporate the teachings of Thomson, because Thomson uses additional criteria, including determining speaking portions of the audio, for alignment of tokens, which improves the quality of the output produced from a voting process over several transcripts (Thomson, [0375]-[0376]).
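For illustration only, the lowest-cost alignment cited from Thomson at [0417] can be sketched as a standard dynamic-programming word alignment in which deletions, substitutions, and insertions each carry a weight of one, so that the cost is D+S+I and the score is N-(D+S+I). This is a minimal sketch under that equal-weight assumption; the function name `align_cost` and the example word lists are hypothetical and are not taken from Thomson, Moreno, Serry, or the instant application.

```python
def align_cost(reference, hypothesis):
    """Return the minimum D+S+I cost of aligning hypothesis words to reference words.

    Assumption: every deletion, substitution, and insertion has weight 1.
    """
    n, m = len(reference), len(hypothesis)
    # cost[i][j] is the lowest cost of aligning the first i reference words
    # with the first j hypothesis words (a two-dimensional grid).
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # i deletions
    for j in range(1, m + 1):
        cost[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j] + 1,                 # deletion
                cost[i][j - 1] + 1,                 # insertion
                cost[i - 1][j - 1] + substitution,  # match or substitution
            )
    return cost[n][m]


# Hypothetical example: one deleted word ("to") and one substituted word.
reference = ["let's", "go", "to", "the", "park"]
hypothesis = ["let's", "go", "the", "dark"]
cost = align_cost(reference, hypothesis)
score = len(reference) - cost  # N - (D + S + I)
```

Under these assumptions, an alignment identical to the reference has a cost of zero and the highest possible score of N.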
Regarding claim 2, Moreno in view of Serry, further in view of Thomson discloses: the method of claim 1.
Moreno further discloses:
determining number of predicted tokens in a segment of the predicted tokens sequence of the selected segment size ([Col. 5, Lines 55-57] The speech recognizer 212 can analyze the segmented audio data to determine text or words that represent the audio data 202, [Determining words that represent segmented audio indicates a number of identified words, i.e. predicted tokens, for the selected segment having a segment size to facilitate ease of processing]);
selecting the same number of ground truth tokens from the ground truth tokens sequence ([Col. 5, Lines 3-5] At the end of the sentence, the text aligner 114 stops the recognition at the last automaton state that has transcription text that matches the last recognized word in the audio, [Stopping recognition at the last automaton state, i.e. representing the last ground truth token, indicates stopping at the same number of tokens as the predicted tokens, requiring the same number of tokens to be selected, i.e. the same sentence]);
aligning the selected ground truth tokens in the segment with the predicted tokens in the segment, finding the matched tokens ([Col. 5, Lines 15-25] The text aligner 114 may align the transcript 108 with the audio data 106 by combining time indicators derived from the speech recognizer 110 with the transcript 108 to form an aligned transcript 116. The time indicators may then specify when text in the transcript occurs relative to the corresponding utterance in the audio data. The text aligner 114 can output the aligned transcript 116, [Aligning transcript with audio data based on times of text, i.e. aligning recognized words with words of the transcript, indicates the alignment is based on time matching]);
keeping a selection of the matched tokens in a segment ([Col. 6, Lines 20-26] The text aligner 216 can associate the times of the recognized words that were matched to paths in the automaton. The text aligner 216 can output an aligned transcript 218 that includes the transcript text and the associated times at which the words occur in the audio data 202, [Outputting an aligned transcript indicates a required keeping of matched tokens which form the alignment]);
sliding the segment along the predicted tokens sequence and the ground truth sequence by an amount of overlap ([Col. 4, Lines 48-55] In some implementations, the factor automaton includes a starting state with an arc corresponding to a first word (or other language element unit, such as a phone, phoneme, or syllable) in the transcript 108 (or a first word in a selected portion of the transcript). The factor automaton may also include states with arcs for each of the other words in the transcript 108 (or in the selected portion), [Including states with arcs for other words in the transcript, i.e. ground truth tokens outside of the selected portion, to later be compared/aligned to the audio file recognized words, i.e. predicted tokens, indicates the words outside of the selected portion to be overlapping between each individual sentence analysis when the additional words are not in the sentence currently being aligned, wherein the sliding is performed on a sentence-basis, i.e. a window size of a sentence with additional words outside of the analyzed sentence “overlapping” compared to when the sentence they are actually a part of is aligned]); and,
performing the aligning, the keeping and the sliding until the predicted tokens sequence, or the ground truth token sequence is exhausted ([Col. 8, Lines 65-67]-[Col. 9, Lines 1-3] In general, a model builder can generate a factor automaton much larger than the single sentence in this example. For example, an audio data and corresponding transcript may represent an entire television program, movie, theatrical product, radio program, or audio book, [In view of the previous disclosure of Moreno indicating a portion to be representative of a sentence, building a factor automaton, i.e. ground truth tokens, larger than one sentence indicates the previous align/keep/slide operations using these ground truth tokens on a sentence-basis to be repeated and/or performed until the entire document, consisting of ground truth tokens, is considered/exhausted. Further, portion/segment analysis as disclosed in Moreno indicates a combination of segments results in a repeated aligning/keeping/sliding operation for each sentence]).
Regarding claim 3, Moreno in view of Serry, further in view of Thomson discloses: the method of claim 1.
Moreno further discloses:
generating and assigning synthetic times to unmatched tokens ([Col. 7, Lines 39-45] In some implementations, times for words in the transcript having no matching or similar words in the recognized words can be extrapolated. For example, a start time and stop time for the transition 304c representing the transcript word "to" can be estimated using the end time of the transition 304b and the start time of the transition 304d, [An extrapolated time indicates it is synthetic, i.e. not based on an actual time associated with the unmatched word]),
wherein determining portions of the transcript matching the audio file chunks is further based on the assigned synthetic times ([Col. 9, Lines 30-40] align (410) a portion of the transcript with a portion of the audio data using the identified times. For example, the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204. In some implementations, the text aligner 216 may estimate times for words in the transcript 204 that have no corresponding recognized words and times, [Aligning portions, i.e. chunks, based on times and/or estimated times, tracking to extrapolated times for unmatched tokens, indicates the alignment is based on assigned synthetic times as would be required to align estimated times of unmatched/unrecognized words]).
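For illustration only, the extrapolation cited from Moreno at Col. 7, Lines 39-45 (estimating the times of an unmatched transcript word from the end time of the preceding matched word and the start time of the following matched word) can be sketched as below. This is a minimal sketch, not Moreno's implementation; the function name `fill_synthetic_times`, the tuple layout, and the even division of a gap among several consecutive unmatched words are assumptions made for the example.

```python
def fill_synthetic_times(words):
    """Fill in start/stop times for unmatched words.

    words: list of (text, start, stop); start and stop are None for a word
    that has no matching recognized word.
    Assumption: a run of unmatched words shares the gap between its matched
    neighbors evenly; runs at the very start or end are left untimed.
    """
    filled = list(words)
    i = 0
    while i < len(filled):
        if filled[i][1] is not None:
            i += 1
            continue
        # Find the end of the run of unmatched words starting at i.
        j = i
        while j < len(filled) and filled[j][1] is None:
            j += 1
        if i > 0 and j < len(filled):
            gap_start = filled[i - 1][2]  # end time of previous matched word
            gap_stop = filled[j][1]       # start time of next matched word
            step = (gap_stop - gap_start) / (j - i)
            for k in range(i, j):
                start = gap_start + step * (k - i)
                filled[k] = (filled[k][0], start, start + step)
        i = j
    return filled


# Hypothetical example: "to" has no recognized counterpart.
words = [("go", 1.0, 1.5), ("to", None, None), ("the", 2.0, 2.25)]
timed = fill_synthetic_times(words)
```

In this example the unmatched word receives the interval between the end of "go" and the start of "the".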
Regarding claim 4, Moreno in view of Serry, further in view of Thomson discloses: the method of claim 3.
Moreno further discloses:
generating and assigning synthetic times to unmatched tokens ([Col. 7, Lines 39-45] In some implementations, times for words in the transcript having no matching or similar words in the recognized words can be extrapolated. For example, a start time and stop time for the transition 304c representing the transcript word "to" can be estimated using the end time of the transition 304b and the start time of the transition 304d, [An extrapolated time indicates it is synthetic, i.e. not based on an actual time associated with the unmatched word]),
wherein determining portions of the transcript matching the audio file chunks is further based on the assigned synthetic times ([Col. 9, Lines 30-40] align (410) a portion of the transcript with a portion of the audio data using the identified times. For example, the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204. In some implementations, the text aligner 216 may estimate times for words in the transcript 204 that have no corresponding recognized words and times, [Aligning portions, i.e. chunks, based on times and/or estimated times, tracking to extrapolated times for unmatched tokens, indicates the alignment is based on assigned synthetic times as would be required to align estimated times of unmatched/unrecognized words]);
generating an alignment confidence for the matched tokens ([Col. 9, Lines 50-55] the speech recognizer 212 may provide confidence levels indicating an amount of confidence in the accuracy of an alignment of one or more words); and,
generating the alignment confidence for the unmatched tokens ([In view of the previously disclosed synthetic times for unmatched tokens, the confidence of alignment for unmatched tokens can be generated using the same method disclosed for matched tokens without a change in functionality to Moreno]),
wherein dividing the audio file into chunks is further based on the alignment confidence ([Col. 9, Lines 45-50] a block of audio data may be sent back to the speech recognizer 212 and/or the text aligner 216 if a degree in which the transcript 204 matches the recognized words from the audio data 202 does not meet a threshold level, [Determining to realign a block, i.e. chunk, based on alignment confidence not meeting a threshold value indicates dividing the chunk, i.e. for reprocessing/realignment, based on alignment confidence]).
Regarding claim 5, Moreno in view of Serry, further in view of Thomson discloses: the method of claim 4.
Moreno further discloses:
wherein the audio is not divided at a timestamp adjacent to a token having low alignment confidence ([Col. 9, Lines 45-50] a block of audio data may be sent back to the speech recognizer 212 and/or the text aligner 216 if a degree in which the transcript 204 matches the recognized words from the audio data 202 does not meet a threshold level, [Disclosure of only sending blocks for realignment which do not reach a confidence threshold indicates that tokens adjacent to these low-confidence portions are not realigned, i.e. not divided at a timestamp adjacent to a token having a low alignment confidence, but instead, at the timestamp of the first token of the low confidence portion/block/chunk]).
Regarding claim 7, Moreno in view of Serry, further in view of Thomson discloses: the method of claim 1.
Moreno further discloses:
wherein tokens comprise phonemes ([Col. 5, Lines 5-10] the text aligner 114 compares a unit of language other than a word, such as phones, phonemes, or syllables from the recognized text and from the substrings in the factor automaton), and the method further comprises:
determining word timings in the transcript, based at least in part on the assigned timing labels to the ground truth tokens ([Fig. 5, Timeline 500, Transcript 504], [Col. 10, Lines 9-17] The timeline 500 includes a number of times 506a-l ranging from zero seconds to five seconds representing the start and end times of the recognized words. For example, the time 506a is the start of the first recognized word. The time 506b is the end of the first recognized word and the start of the second recognized word. While the recognized words are shown here as sharing start and end times, pauses or other breaks between words can be represented separate from the recognized words, [Generating a timeline representing the start and end times of each word associated with transcript 504 indicates a clear timing for each word in the transcript between any two neighboring times 506a-l]),
wherein dividing the audio file into chunks comprises dividing at timestamps, flanked by whole words ([Col. 5, Lines 1-5] the section of recognized audio being matched may represent a single sentence. At the end of the sentence, the text aligner 114 stops the recognition at the last automaton state that has transcription text that matches the last recognized word in the audio, [Dividing audio into sentence-based analysis indicates that each division will be flanked by whole words, i.e. the last word of a previous sentence and the first word of a next sentence]).
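For illustration only, dividing audio at timestamps flanked by whole words can be sketched as grouping timed words into chunks and cutting only between one word's end and the next word's start. This is a minimal sketch, not the Applicant's method or Moreno's sentence-based segmentation; the function name `chunk_at_word_boundaries` and the `max_duration` limit are hypothetical and are introduced only to give the example a division criterion.

```python
def chunk_at_word_boundaries(words, max_duration):
    """Group timed words into chunks without splitting any word.

    words: list of (text, start, stop) sorted by time.
    max_duration: assumed upper bound, in seconds, on a chunk's length.
    Returns a list of chunks, each a list of (text, start, stop).
    """
    chunks, current = [], []
    for word in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and word[2] - current[0][1] > max_duration:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks


# Hypothetical word timings.
words = [
    ("let's", 0.0, 1.0),
    ("go", 1.0, 1.5),
    ("to", 1.5, 2.0),
    ("the", 2.0, 2.5),
    ("park", 2.5, 3.0),
]
chunks = chunk_at_word_boundaries(words, max_duration=2.0)
```

Each chunk begins at the start time of its first word and ends at the stop time of its last word, so every division point falls between two whole words.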
Regarding claim 9, Moreno in view of Serry, further in view of Thomson discloses: the method of claim 1.
Moreno further discloses:
generating synthetic timing labels for unaligned ground truth tokens ([Col. 7, Lines 39-45] In some implementations, times for words in the transcript having no matching or similar words in the recognized words can be extrapolated. For example, a start time and stop time for the transition 304c representing the transcript word "to" can be estimated using the end time of the transition 304b and the start time of the transition 304d, [An extrapolated time indicates it is synthetic, i.e. not based on an actual time associated with the unmatched word]);
generating an alignment confidence for each timing label assigned to an aligned ground truth token and for each synthetic timing label assigned to an unaligned ground truth token ([Col. 9, Lines 50-55] the speech recognizer 212 may provide confidence levels indicating an amount of confidence in the accuracy of an alignment of one or more words, [In view of the previously disclosed synthetic times for unmatched tokens, the confidence of alignment for unmatched tokens can be performed using the same method disclosed for matched tokens without a change in functionality to Moreno]);
outputting the alignment confidence ([Fig. 2, Speech Recognizer 212 Providing Input to Text Aligner 216], [In view of the above cited excerpt regarding the speech recognizer providing confidence levels, the confidence level is output from the speech recognizer and input into the text aligner]).
Regarding claim 10, Moreno discloses: a non-transitory computer storage medium that stores executable program instructions that ([Col. 11, Lines 18-20] the memory 620 is a computer-readable medium. In one implementation, the memory 620 is a volatile memory unit. In another implementation, the memory 620 is a non-volatile memory unit, [Non-volatile and non-transitory forms of memory are synonymous]), when executed by one or more computing devices ([Col. 11, Lines 25-28] the storage device 630 is a computer-readable medium. In various different implementations, the storage device 630 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, [Computer-readable storage indicates it to be executed by a computing device]), configure the one or more computing devices to perform operations comprising:
receiving a training sample comprising an audio file ([Fig. 1, Audio Data 106], [Col. 5, Lines 25-28] audio data 106 stored at the service client system 104 can later be accessed by speech recognition systems for training language models);
receiving a transcript of the audio file ([Fig. 1, Transcript 108], [Col. 4, Line 40] model builder 112 receives the transcript 108);
generating a predicted tokens sequence from the audio file ([Col. 5, Lines 55-57] The speech recognizer 212 can analyze the segmented audio data to determine text or words that represent the audio data 202, [Col. 6, Lines 1-5] The speech recognizer 212 can use a dictionary 214 to identify candidates for the recognized words. For example, if a particular recognized text is similar to a word in the dictionary 214, the dictionary word can be chosen as the recognized text, [Candidates of recognized words track to predictions for the recognized words, wherein words are reasonably understood to represent tokens, see [0034] of instant app]),
generating predicted timing labels ([Col. 6, Lines 15-17] The speech recognizer 212 may provide the recognized words and the times at which the recognized words occur in the audio), wherein each predicted token has an associated predicted timing label ([Col. 4, Lines 34-36] the speech recognizer 110 can output recognized words and a start time and stop time for each of the recognized words, [In view of the previously disclosed candidate recognized words, each predicted, i.e. recognized, token, i.e. word, has an associated predicted timing label]);
predicting a ground truth tokens sequence from the transcript ([Col. 5, Lines 38-40] the factor automaton 208 can be navigated, or explored, to retrieve all possible substrings of the text in the transcript 204, [Wherein substrings of text (including a substring representing the entire string) are reasonably understood to be representative of a token sequence, i.e. if the tokens are consisting of phonemes and/or words]);
mapping the ground truth tokens, generated from the transcript to the predicted tokens, generated from the audio file, finding matched tokens ([Col. 6, Lines 18-20] The text aligner 216 may receive the recognized words and locate the recognized words in the factor automaton 208, [Locating recognized words, i.e. predicted tokens, in the factor automaton, i.e. containing the ground truth tokens, indicates the locating to be a mapping/matching operation between the two sources]);
assigning, to the ground truth tokens, the timing labels of the matched tokens ([Col. 9, Lines 29-37] the text aligner 216 can identify the times at which those recognized words occur and can associate the identified times with the corresponding words from the automaton arcs…the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204, [Associating times to words in automaton arcs, i.e. the ground truth tokens, indicates an assignment of timing based on the timings of the recognized words, i.e. predicted tokens, indicating the times to match when the tokens do, i.e. having corresponding words with corresponding times]);
dividing the audio file into chunks, based at least in part on the assigned timing labels ([Col. 9, Lines 30-35] The process 400 can align (410) a portion of the transcript with a portion of the audio data using the identified times, [Aligning a portion of a transcript based on identified times indicates the portion represents a division of the larger audio file based on the times of recognized words having assigned timing labels]);
determining portions of the transcript matching the audio file chunks, based at least in part on the assigned timing labels to the matched ground truth tokens ([Col. 9, Lines 45-51] a block of audio data may be sent back to the speech recognizer 212 and/or the text aligner 216 if a degree in which the transcript 204 matches the recognized words from the audio data 202 does not meet a threshold level, [Sending portions of audio back into an alignment process based on a poor degree of matching indicates that there is a determination of matching being made, i.e. portions which do not require a second round of alignment are a match]); and,
training a model with the audio chunks and the matching transcript portions ([Col. 4, Lines 14-20] These aligned transcripts can, in turn, provide a large audio corpus for training a speech recognizer so that the recognizer improves its accuracy in text recognition. In other implementations, the systems and methods described here may permit the alignment of audio books to their transcriptions, [Training using an aligned transcript for speech recognition indicates it to be trained based on audio chunks to be recognized with matching transcript portions]);
selecting a segment size ([Col. 5, Lines 40-45] The audio segmenter 210 segments or divides the audio data 202 into portions of audio that may be easily processed by a speech recognizer 212, [Col. 5, Lines 1-5] For example, the section of recognized audio being matched may represent a single sentence, [Segmenting audio for easy processing indicates a selected segment size, i.e. sentence, for ease of processing]);
determining a number of predicted tokens, in an alignment window of the segment size, in the predicted tokens sequence ([Col. 5, Lines 55-57] The speech recognizer 212 can analyze the segmented audio data to determine text or words that represent the audio data 202, [Determining words that represent segmented audio indicates a number of identified words, i.e. predicted tokens, for the selected segment having a segment size to facilitate ease of processing]);
aligning, within the alignment window of the segment size, a corresponding number of ground truth tokens from the ground truth tokens sequence equal to the determined number of predicted tokens, to the predicted tokens in the alignment window ([Col. 5, Lines 15-25] The text aligner 114 may align the transcript 108 with the audio data 106 by combining time indicators derived from the speech recognizer 110 with the transcript 108 to form an aligned transcript 116. The time indicators may then specify when text in the transcript occurs relative to the corresponding utterance in the audio data. The text aligner 114 can output the aligned transcript 116, [Aligning transcript with audio data based on times of text, i.e. aligning recognized words with words of the transcript, indicates the alignment is based on time matching. Further, consider Fig. 5 which displays a transcript 504 and a timing of each word 506 indicating the number of words to be equal between the ground truth tokens, i.e. transcription, and the predicted tokens, i.e. the timings on timeline 506]);
assigning timing labels, from the aligned predicted tokens to a selection of the aligned ground truth tokens in the alignment window ([Col. 9, Lines 29-37] the text aligner 216 can identify the times at which those recognized words occur and can associate the identified times with the corresponding words from the automaton arcs…the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204, [Associating times to words in automaton arcs, i.e. the ground truth tokens, indicates an assignment of timing based on the timings of the recognized words, i.e. predicted tokens, indicating the times to match when the tokens do, i.e. having corresponding words with corresponding times]);
advancing the alignment window along the predicted tokens sequence and the ground truth tokens sequence, with a selected overlap, until at least one of the sequences is exhausted ([Col. 4, Lines 48-55] In some implementations, the factor automaton includes a starting state with an arc corresponding to a first word (or other language element unit, such as a phone, phoneme, or syllable) in the transcript 108 (or a first word in a selected portion of the transcript). The factor automaton may also include states with arcs for each of the other words in the transcript 108 (or in the selected portion), [Including states with arcs for other words in the transcript, i.e. ground truth tokens outside of the selected portion, to later be compared/aligned to the audio file recognized words, i.e. predicted tokens, indicates the words outside of the selected portion to be overlapping between each individual sentence analysis when the additional words are not in the sentence currently being aligned, wherein the sliding is performed on a sentence-basis, i.e. a window size of a sentence with additional words outside of the analyzed sentence “overlapping” compared to when the sentence they are actually a part of is aligned]);
performing the determining, the aligning and the assigning until at least one of the sequences is exhausted ([Col. 8, Lines 65-67]-[Col. 9, Lines 1-3] In general, a model builder can generate a factor automaton much larger than the single sentence in this example. For example, an audio data and corresponding transcript may represent an entire television program, movie, theatrical product, radio program, or audio book, [In view of the previous disclosure of Moreno indicating a portion to be representative of a sentence, building a factor automaton, i.e. ground truth tokens, larger than one sentence indicates the previous align/keep/slide operations using these ground truth tokens on a sentence-basis to be repeated and/or performed until the entire document, consisting of ground truth tokens, is considered/exhausted. Further, portion/segment analysis as disclosed in Moreno indicates a combination of segments results in a repeated aligning/keeping/sliding operation for each sentence]);
outputting first stage timings, comprising the selected aligned ground truth tokens and the assigned timing labels ([Col. 6, Lines 24-26] The text aligner 216 can output an aligned transcript 218 that includes the transcript text and the associated times at which the words occur in the audio data 202); and,
generating token frequency for each ground truth token, based on the first stage timings ([Fig. 5, 506a-b, 506h-i], [Col. 10, Lines 29-50] The graph 502 includes a weighting curve 510 for the second occurrence of the word "let's" in the transcript 504. The weighting curve 510 indicates that this instance of the word "let's" has a high probability of occurring at a location between three and four seconds. In one example, a particular comparison operation may begin with a comparison of this second instance of the word "let's" in the transcript to the first recognized word "Let's." The first recognized word occurs between zero second and one second at the times 506a-b. While the first recognized word does match the transcript word, the first recognized word has a very low probability of representing the second occurrence of the word "let's" in the transcript 504 due to the weighting curve 510 having a value at or near zero between the times 506a-b. The eighth recognized word also matches the transcript word "let's." The eighth recognized word occurs between three seconds and four seconds at the times 506h-i. The eighth recognized word has a high probability of representing the second occurrence of the word "let's" in the transcript 504 due to the weighting curve 510 having a value at or near one between three and four seconds, [Performing probability evaluations based on word timings, i.e. first stage timings, wherein the evaluation is also considering a number of occurrences of the same word (comparing instances of the same word suggests a knowledge that more than one copy of the same word exists), indicates a token frequency value which is affecting the probability as displayed in the weighting curve of Fig. 5. The second occurrence of “let’s” can be in one of two positions (token frequency), wherein the more likely position is determined based on timings]).
Moreno does not disclose:
receiving an artificial intelligence training sample; and,
training a supervised artificial intelligence model.
Serry discloses:
receiving an artificial intelligence training sample ([0028] the text classification model 202 backpropagates the output of the text segment classification as a “ground truth” (e.g., a known good value) to update (e.g., train) the text classification model 202, [0068] In some examples, text classification model 202, text segmentation model 204, text encoder 402, fusion-layer transformer 404, and/or segmenter/classifier 406 include or are implemented as a large language model (LLM). Example models may include the GPT models from OpenAI, BARD from Google, and/or Large Language Model Meta AI (LLaMA) from Meta, among other types of artificial intelligence (AI) models.); and,
training a supervised artificial intelligence model ([0070] The LLM is generally trained using supervised learning based on large amounts of annotated text data, [In view of [0068] which disclosed the LLM to be implemented as an artificial intelligence LLM]).
Moreno and Serry are considered analogous art within automatic speech recognition based on associated textual representations. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Moreno to incorporate the teachings of Serry, because of the novel way to perform semantic text segmentation before or in conjunction with text classification to improve accuracy of the text classification as would be relevant to the matching audio to text operation of Moreno, i.e. the classifications of Serry could be applied to the inputs of Moreno for determining the matches as disclosed in Moreno (Serry, [0016]).
Moreno in view of Serry does not disclose:
generating alignment paths for the ground truth tokens;
scoring the alignment paths, based on token frequency;
selecting a final alignment, based on the scoring; and,
generating second stage alignment timings, based on the final alignment.
Thomson discloses:
generating alignment paths for the ground truth tokens ([0417] the align text process 1406 may find a path that best meets a selected set of performance criteria by constructing a two-dimensional grid representing the first sequence in a first dimension and the second sequence in a second dimension… where N is the number of words in the reference, [Wherein a reference transcription tracks to a ground truth token sequence (see [0235] comparing reference transcription to new audio transcription)]);
scoring the alignment paths, based on token frequency ([0417] The performance criteria may include the lowest cost or the highest score. For example, the cost may be a function of the number of deletions “D,” substitutions “S,” and insertions “I.” If all errors receive the same weight, the cost may be represented by D+S+I. The Viterbi path may then chose the alignment between the first and second sequence that results in the lowest cost as represented by D+S+I. The highest score may represent the Viterbi path that aligns the first and second sequences such that a score such as the number of matching words, the total path probability, or N-(D+S+I), [A perfect transcription with the lowest cost function would be one with no deletions, substitutions, or insertions, indicating a scoring operation based on a comparison of token frequency to a ground truth, i.e. reference, token frequency. The perfect transcription will have the same token frequency as the ground truth because it consists of the same tokens]);
selecting a final alignment, based on the scoring ([As previously disclosed, Thomson selects an alignment path which minimizes the cost function, representing a final alignment as compared to alignments with higher costs]); and,
generating second stage alignment timings, based on the final alignment ([0527] the synchronizer 1902 may use a Viterbi search or other dynamic programming method to align and identify segment matches in the first and second transcriptions. In some embodiments, the synchronizer 1902 may use information from the transcription units 1914 to align the first and second transcriptions. For example, the synchronizer 1902 may use word endpoints from ASR systems in the transcription units 1914 to align the first and second transcriptions, [In view of the previously disclosed Viterbi cost algorithm of Thomson used for alignment, aligning the tokens in a second alignment as currently defined results in second stage alignment timings, wherein these are “generated” when the output transcription of the final alignment is generated at 1410 after the final alignment is generated at 1408/1409 (see fusing of alignments based on endpoints, [0708])]).
Moreno, Serry, and Thomson are considered analogous art within speech-to-text alignment. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Moreno in view of Serry to incorporate the teachings of Thomson, because of the novel way to use additional criteria including determining speaking portions of audio for alignment of tokens, improving quality of output produced from a voting process of several transcripts (Thomson, [0375]-[0376]).
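For illustration only, the lowest-cost path selection quoted from Thomson [0417] can be sketched as follows. This is a minimal sketch that assumes all errors receive the same weight, so that the cost is D+S+I. The function name `align_cost` and the example word lists are hypothetical and do not appear in Thomson.

```python
def align_cost(reference, hypothesis):
    """Return the minimum D+S+I cost of aligning hypothesis to reference."""
    n, m = len(reference), len(hypothesis)
    # grid[i][j] holds the lowest cost of aligning reference[:i]
    # with hypothesis[:j] (the two-dimensional grid of the sketch).
    grid = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        grid[i][0] = i  # i reference words deleted
    for j in range(1, m + 1):
        grid[0][j] = j  # j hypothesis words inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            grid[i][j] = min(
                grid[i - 1][j] + 1,        # deletion
                grid[i][j - 1] + 1,        # insertion
                grid[i - 1][j - 1] + sub,  # match or substitution
            )
    return grid[n][m]


# Hypothetical example: one reference word is missing from the hypothesis.
reference = ["let's", "go", "to", "dinner"]
hypothesis = ["let's", "go", "dinner"]
cost = align_cost(reference, hypothesis)
score = len(reference) - cost  # N-(D+S+I), with N words in the reference
```

Under these assumptions, a hypothesis identical to the reference has a cost of zero and a score of N.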
Regarding claim 11, Moreno in view of Serry, further in view of Thomson discloses: the non-transitory computer storage of claim 10.
Moreno further discloses:
determining number of predicted tokens in a segment of the predicted tokens sequence of the selected segment size ([Col. 5, Lines 55-57] The speech recognizer 212 can analyze the segmented audio data to determine text or words that represent the audio data 202, [Determining words that represent segmented audio indicates a number of identified words, i.e. predicted tokens, for the selected segment having a segment size to facilitate ease of processing]);
selecting the same number of ground truth tokens from the ground truth tokens sequence ([Col. 5, Lines 3-5] At the end of the sentence, the text aligner 114 stops the recognition at the last automaton state that has transcription text that matches the last recognized word in the audio, [Stopping recognition at the last automaton state, i.e. representing the last ground truth token, indicates stopping at the same number of tokens as the predicted tokens, requiring the same amount of tokens to be selected, i.e. the same sentence]);
aligning the selected ground truth tokens in the segment with the predicted tokens in the segment, finding the matched tokens ([Col. 5, Lines 15-25] The text aligner 114 may align the transcript 108 with the audio data 106 by combining time indicators derived from the speech recognizer 110 with the transcript 108 to form an aligned transcript 116. The time indicators may then specify when text in the transcript occurs relative to the corresponding utterance in the audio data. The text aligner 114 can output the aligned transcript 116, [Aligning transcript with audio data based on times of text, i.e. aligning recognized words with words of the transcript, indicates the alignment is based on time matching]);
keeping a selection of the matched tokens in a segment ([Col. 6, Lines 20-26] The text aligner 216 can associate the times of the recognized words that were matched to paths in the automaton. The text aligner 216 can output an aligned transcript 218 that includes the transcript text and the associated times at which the words occur in the audio data 202, [Outputting an aligned transcript indicates a required keeping of matched tokens which form the alignment]);
sliding the segment along the predicted tokens sequence and the ground truth sequence by an amount of overlap ([Col. 4, Lines 48-55] In some implementations, the factor automaton includes a starting state with an arc corresponding to a first word (or other language element unit, such as a phone, phoneme, or syllable) in the transcript 108 (or a first word in a selected portion of the transcript). The factor automaton may also include states with arcs for each of the other words in the transcript 108 (or in the selected portion), [Including states with arcs for other words in the transcript, i.e. ground truth tokens outside of the selected portion, to later be compared/aligned to the audio file recognized words, i.e. predicted tokens, indicates the words outside of the selected portion to be overlapping between each individual sentence analysis when the additional words are not in the sentence currently being aligned, wherein the sliding is performed on a sentence-basis, i.e. a window size of a sentence with additional words outside of the analyzed sentence “overlapping” compared to when the sentence they are actually a part of is aligned]); and
performing the aligning, the keeping and the sliding until the predicted tokens sequence, or the ground truth token sequence is exhausted ([Col. 8, Lines 65-67]-[Col. 9, Lines 1-3] In general, a model builder can generate a factor automaton much larger than the single sentence in this example. For example, an audio data and corresponding transcript may represent an entire television program, movie, theatrical product, radio program, or audio book, [In view of the previous disclosure of Moreno indicating a portion to be representative of a sentence, building a factor automaton, i.e. ground truth tokens, larger than one sentence indicates the previous align/keep/slide operations using these ground truth tokens on a sentence-basis to be repeated and/or performed until the entire document, consisting of ground truth tokens, is considered/exhausted. Further, portion/segment analysis as disclosed in Moreno indicates a combination of segments results in a repeated aligning/keeping/sliding operation for each sentence]).
Regarding claim 12, Moreno in view of Serry, further in view of Thomson discloses: the non-transitory computer storage of claim 10.
Moreno further discloses:
generating and assigning synthetic times to unmatched tokens ([Col. 7, Lines 39-45] In some implementations, times for words in the transcript having no matching or similar words in the recognized words can be extrapolated. For example, a start time and stop time for the transition 304c representing the transcript word "to" can be estimated using the end time of the transition 304b and the start time of the transition 304d, [An extrapolated time indicates it is synthetic, i.e. not based on an actual time associated with the unmatched word]),
wherein determining portions of the transcript matching the audio file chunks is further based on the assigned synthetic times ([Col. 9, Lines 30-40] align (410) a portion of the transcript with a portion of the audio data using the identified times. For example, the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204. In some implementations, the text aligner 216 may estimate times for words in the transcript 204 that have no corresponding recognized words and times, [Aligning portions, i.e. chunks, based on times and/or estimated times, tracking to extrapolated times for unmatched tokens, indicates the alignment is based on assigned synthetic times as would be required to align estimated times of unmatched/unrecognized words]).
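For illustration only, the estimation of times for unmatched transcript words quoted from Moreno (Col. 7, Lines 39-45) can be sketched as follows. This is a minimal sketch, not Moreno's implementation. The function name `fill_estimated_times`, the tuple layout, and the even division of a gap among several consecutive unmatched words are assumptions made for this example.

```python
def fill_estimated_times(words):
    """Estimate times for unmatched words.

    words: list of (text, start, end); start and end are None when the
    transcript word has no matching recognized word.
    Returns a new list in which each run of unmatched words evenly shares
    the interval between its matched neighbors (assumption of this sketch).
    """
    filled = list(words)
    i = 0
    while i < len(filled):
        if filled[i][1] is not None:
            i += 1
            continue
        # Find the end of the run of unmatched words starting at i.
        j = i
        while j < len(filled) and filled[j][1] is None:
            j += 1
        if i == 0 or j == len(filled):
            # No matched neighbor on one side: leave the run unestimated.
            i = j
            continue
        gap_start = filled[i - 1][2]  # end time of previous matched word
        gap_end = filled[j][1]        # start time of next matched word
        step = (gap_end - gap_start) / (j - i)
        for k in range(i, j):
            start = gap_start + (k - i) * step
            filled[k] = (filled[k][0], start, start + step)
        i = j
    return filled


# Hypothetical example: "to" has no matching recognized word.
aligned = fill_estimated_times(
    [("go", 0.0, 1.0), ("to", None, None), ("dinner", 2.0, 3.0)]
)
```

In this example, the unmatched word receives a start time equal to the end time of the preceding matched word and an end time equal to the start time of the following matched word.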
Regarding claim 13, Moreno in view of Serry, further in view of Thomson discloses: the non-transitory computer storage of claim 12.
Moreno further discloses:
generating and assigning synthetic times to unmatched tokens ([Col. 7, Lines 39-45] In some implementations, times for words in the transcript having no matching or similar words in the recognized words can be extrapolated. For example, a start time and stop time for the transition 304c representing the transcript word "to" can be estimated using the end time of the transition 304b and the start time of the transition 304d, [An extrapolated time indicates it is synthetic, i.e. not based on an actual time associated with the unmatched word]),
wherein determining portions of the transcript matching the audio file chunks is further based on the assigned synthetic times ([Col. 9, Lines 30-40] align (410) a portion of the transcript with a portion of the audio data using the identified times. For example, the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204. In some implementations, the text aligner 216 may estimate times for words in the transcript 204 that have no corresponding recognized words and times, [Aligning portions, i.e. chunks, based on times and/or estimated times, tracking to extrapolated times for unmatched tokens, indicates the alignment is based on assigned synthetic times as would be required to align estimated times of unmatched/unrecognized words]);
generating an alignment confidence for the matched tokens ([Col. 9, Lines 50-55] the speech recognizer 212 may provide confidence levels indicating an amount of confidence in the accuracy of an alignment of one or more words); and,
generating the alignment confidence for the unmatched tokens ([In view of the previously disclosed synthetic times for unmatched tokens, the confidence of alignment for unmatched tokens can be performed using the same method disclosed for matched tokens without a change in functionality to Moreno]),
wherein dividing the audio file into chunks is further based on the alignment confidence ([Col. 9, Lines 45-50] a block of audio data may be sent back to the speech recognizer 212 and/or the text aligner 216 if a degree in which the transcript 204 matches the recognized words from the audio data 202 does not meet a threshold level, [Determining to realign a block, i.e. chunk, based on alignment confidence not meeting a threshold value indicates dividing the chunk, i.e. for reprocessing/realignment, based on alignment confidence]).
Regarding claim 14, Moreno in view of Serry, further in view of Thomson discloses: the non-transitory computer storage of claim 10.
Moreno further discloses:
wherein the audio is not divided at a timestamp adjacent to a token having low alignment confidence ([Col. 9, Lines 45-50] a block of audio data may be sent back to the speech recognizer 212 and/or the text aligner 216 if a degree in which the transcript 204 matches the recognized words from the audio data 202 does not meet a threshold level, [Disclosure of only sending blocks for realignment which do not reach a confidence threshold indicates that tokens adjacent to these low-confidence portions are not realigned, i.e. not divided at a timestamp adjacent to a token having a low alignment confidence, but instead, at the timestamp of the first token of the low confidence portion/block/chunk]).
Regarding claim 16, Moreno in view of Serry, further in view of Thomson discloses: the non-transitory computer storage of claim 10.
Moreno further discloses:
wherein tokens comprise phonemes ([Col. 5, Lines 5-10] the text aligner 114 compares a unit of language other than a word, such as phones, phonemes, or syllables from the recognized text and from the substrings in the factor automaton), and the operations further comprises:
determining word timings in the transcript, based at least in part on the assigned timing labels to the ground truth tokens ([Fig. 5, Timeline 500, Transcript 504], [Col. 10, Lines 9-17] The timeline 500 includes a number of times 506a-l ranging from zero seconds to five seconds representing the start and end times of the recognized words. For example, the time 506a is the start of the first recognized word. The time 506b is the end of the first recognized word and the start of the second recognized word. While the recognized words are shown here as sharing start and end times, pauses or other breaks between words can be represented separate from the recognized words, [Generating a timeline representing the start and end times of each word associated with transcript 504 indicates a clear timing for each word in the transcript between any two neighboring times 506a-i]),
wherein dividing the audio file into chunks comprises dividing at timestamps, flanked by whole words ([Col. 5, Lines 1-5] the section of recognized audio being matched may represent a single sentence. At the end of the sentence, the text aligner 114 stops the recognition at the last automaton state that has transcription text that matches the last recognized word in the audio, [Dividing audio into sentence-based analysis indicates that each division will be flanked by whole words, i.e. the last word of a previous sentence and the first word of a next sentence]).
Regarding claim 18, Moreno in view of Serry, further in view of Thomson discloses: the non-transitory computer storage of claim 10.
Moreno further discloses:
generating synthetic timing labels for unaligned ground truth tokens ([Col. 7, Lines 39-45] In some implementations, times for words in the transcript having no matching or similar words in the recognized words can be extrapolated. For example, a start time and stop time for the transition 304c representing the transcript word "to" can be estimated using the end time of the transition 304b and the start time of the transition 304d, [An extrapolated time indicates it is synthetic, i.e. not based on an actual time associated with the unmatched word]);
generating an alignment confidence for each timing label assigned to an aligned ground truth token and for each synthetic timing label assigned to an unaligned ground truth token ([Col. 9, Lines 50-55] the speech recognizer 212 may provide confidence levels indicating an amount of confidence in the accuracy of an alignment of one or more words, [In view of the previously disclosed synthetic times for unmatched tokens, the confidence of alignment for unmatched tokens can be performed using the same method disclosed for matched tokens without a change in functionality to Moreno]);
outputting the alignment confidence ([Fig. 2, Speech Recognizer 212 Providing Input to Text Aligner 216], [In view of the above cited excerpt regarding the speech recognizer providing confidence levels, the confidence level is output from the speech recognizer and input into the text aligner]).
Regarding claim 19, Moreno discloses: a system comprising one or more processors ([Fig. 6, Processor 610]), wherein the one or more processors are configured to perform operations comprising:
receiving a training sample comprising an audio file ([Fig. 1, Audio Data 106], [Col. 5, Lines 25-28] audio data 106 stored at the service client system 104 can later be accessed by speech recognition systems for training language models);
receiving a transcript of the audio file ([Fig. 1, Transcript 108], [Col. 4, Line 40] model builder 112 receives the transcript 108);
generating a predicted tokens sequence from the audio file ([Col. 5, Lines 55-57] The speech recognizer 212 can analyze the segmented audio data to determine text or words that represent the audio data 202, [Col. 6, Lines 1-5] The speech recognizer 212 can use a dictionary 214 to identify candidates for the recognized words. For example, if a particular recognized text is similar to a word in the dictionary 214, the dictionary word can be chosen as the recognized text, [Candidates of recognized words track to predictions for the recognized words, wherein words are reasonably understood to represent tokens, see [0034] of instant app]),
generating predicted timing labels ([Col. 6, Lines 15-17] The speech recognizer 212 may provide the recognized words and the times at which the recognized words occur in the audio), wherein each predicted token has an associated predicted timing label ([Col. 4, Lines 34-36] the speech recognizer 110 can output recognized words and a start time and stop time for each of the recognized words, [In view of the previously disclosed candidate recognized words indicating that each predicted, i.e. recognized, token, i.e. word, has an associated predicted timing label]);
predicting a ground truth tokens sequence from the transcript ([Col. 5, Lines 38-40] the factor automaton 208 can be navigated, or explored, to retrieve all possible substrings of the text in the transcript 204, [Wherein substrings of text (including a substring representing the entire string) are reasonably understood to be representative of a token sequence, i.e. if the tokens are consisting of phonemes and/or words]);
mapping the ground truth tokens, generated from the transcript to the predicted tokens, generated from the audio file, finding matched tokens ([Col. 6, Lines 18-20] The text aligner 216 may receive the recognized words and locate the recognized words in the factor automaton 208, [Locating recognized words, i.e. predicted tokens, in the factor automaton, i.e. containing the ground truth tokens, indicates the locating to be a mapping/matching operation between the two sources]);
assigning, to the ground truth tokens, the timing labels of the matched tokens ([Col. 9, Lines 29-37] the text aligner 216 can identify the times at which those recognized words occur and can associate the identified times with the corresponding words from the automaton arcs…the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204, [Associating times to words in automaton arcs, i.e. the ground truth tokens, indicates an assignment of timing based on the timings of the recognized words, i.e. predicted tokens, indicating the times to match when the tokens do, i.e. having corresponding words with corresponding times]);
dividing the audio file into chunks, based at least in part on the assigned timing labels ([Col. 9, Lines 30-35] The process 400 can align (410) a portion of the transcript with a portion of the audio data using the identified times, [Aligning a portion of a transcript based on identified times indicates the portion represents a division of the larger audio file based on the times of recognized words having assigned timing labels]);
determining portions of the transcript matching the audio file chunks, based at least in part on the assigned timing labels to the matched ground truth tokens ([Col. 9, Lines 45-51] a block of audio data may be sent back to the speech recognizer 212 and/or the text aligner 216 if a degree in which the transcript 204 matches the recognized words from the audio data 202 does not meet a threshold level, [Sending portions of audio back into an alignment process based on a poor degree of matching indicates that there is a determination of matching being made, i.e. portions which do not require a second round of alignment are a match]);
training a model with the audio chunks and the matching transcript portions ([Col. 4, Lines 14-20] These aligned transcripts can, in turn, provide a large audio corpus for training a speech recognizer so that the recognizer improves its accuracy in text recognition. In other implementations, the systems and methods described here may permit the alignment of audio books to their transcriptions, [Training using an aligned transcript for speech recognition indicates it to be trained based on audio chunks to be recognized with matching transcript portions]);
selecting a segment size ([Col. 5, Lines 40-45] The audio segmenter 210 segments or divides the audio data 202 into portions of audio that may be easily processed by a speech recognizer 212, [Col. 5, Lines 1-5] For example, the section of recognized audio being matched may represent a single sentence, [Segmenting audio for easy processing indicates a selected segment size, i.e. sentence, for ease of processing]);
determining a number of predicted tokens, in an alignment window of the segment size, in the predicted tokens sequence ([Col. 5, Lines 55-57] The speech recognizer 212 can analyze the segmented audio data to determine text or words that represent the audio data 202, [Determining words that represent segmented audio indicates a number of identified words, i.e. predicted tokens, for the selected segment having a segment size to facilitate ease of processing]);
aligning, within the alignment window of the segment size, a corresponding number of ground truth tokens from the ground truth tokens sequence equal to the determined number of predicted tokens, to the predicted tokens in the alignment window ([Col. 5, Lines 15-25] The text aligner 114 may align the transcript 108 with the audio data 106 by combining time indicators derived from the speech recognizer 110 with the transcript 108 to form an aligned transcript 116. The time indicators may then specify when text in the transcript occurs relative to the corresponding utterance in the audio data. The text aligner 114 can output the aligned transcript 116, [Aligning transcript with audio data based on times of text, i.e. aligning recognized words with words of the transcript, indicates the alignment is based on time matching. Further, consider Fig. 5 which displays a transcript 504 and a timing of each word 506 indicating the number of words to be equal between the ground truth tokens, i.e. transcription, and the predicted tokens, i.e. the timings on timeline 506]);
assigning timing labels, from the aligned predicted tokens to a selection of the aligned ground truth tokens in the alignment window ([Col. 9, Lines 29-37] the text aligner 216 can identify the times at which those recognized words occur and can associate the identified times with the corresponding words from the automaton arcs…the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204, [Associating times to words in automaton arcs, i.e. the ground truth tokens, indicates an assignment of timing based on the timings of the recognized words, i.e. predicted tokens, indicating the times to match when the tokens do, i.e. having corresponding words with corresponding times]);
advancing the alignment window along the predicted tokens sequence and the ground truth tokens sequence, with a selected overlap, until at least one of the sequences is exhausted ([Col. 4, Lines 48-55] In some implementations, the factor automaton includes a starting state with an arc corresponding to a first word (or other language element unit, such as a phone, phoneme, or syllable) in the transcript 108 (or a first word in a selected portion of the transcript). The factor automaton may also include states with arcs for each of the other words in the transcript 108 (or in the selected portion), [Including states with arcs for other words in the transcript, i.e. ground truth tokens outside of the selected portion, to later be compared/aligned to the audio file recognized words, i.e. predicted tokens, indicates the words outside of the selected portion to be overlapping between each individual sentence analysis when the additional words are not in the sentence currently being aligned, wherein the sliding is performed on a sentence-basis, i.e. a window size of a sentence with additional words outside of the analyzed sentence “overlapping” compared to when the sentence they are actually a part of is aligned]);
performing the determining, the aligning and the assigning until at least one of the sequences is exhausted ([Col. 8, Lines 65-67]-[Col. 9, Lines 1-3] In general, a model builder can generate a factor automaton much larger than the single sentence in this example. For example, an audio data and corresponding transcript may represent an entire television program, movie, theatrical product, radio program, or audio book, [In view of the previous disclosure of Moreno indicating a portion to be representative of a sentence, building a factor automaton, i.e. ground truth tokens, larger than one sentence indicates the previous align/keep/slide operations using these ground truth tokens on a sentence-basis to be repeated and/or performed until the entire document, consisting of ground truth tokens, is considered/exhausted. Further, portion/segment analysis as disclosed in Moreno indicates a combination of segments results in a repeated aligning/keeping/sliding operation for each sentence]);
outputting first stage timings, comprising the selected aligned ground truth tokens and the assigned timing labels ([Col. 6, Lines 24-26] The text aligner 216 can output an aligned transcript 218 that includes the transcript text and the associated times at which the words occur in the audio data 202); and,
generating token frequency for each ground truth token, based on the first stage timings ([Fig. 5, 506a-b, 506h-i], [Col. 10, Lines 29-50] The graph 502 includes a weighting curve 510 for the second occurrence of the word "let's" in the transcript 504. The weighting curve 510 indicates that this instance of the word "let's" has a high probability of occurring at a location between three and four seconds. In one example, a particular comparison operation may begin with a comparison of this second instance of the word "let's" in the transcript to the first recognized word "Let's." The first recognized word occurs between zero second and one second at the times 506a-b. While the first recognized word does match the transcript word, the first recognized word has a very low probability of representing the second occurrence of the word "let's" in the transcript 504 due to the weighting curve 510 having a value at or near zero between the times 506a-b. The eighth recognized word also matches the transcript word "let's." The eighth recognized word occurs between three seconds and four seconds at the times 506h-l. The eighth recognized word has a high probability of representing the second occurrence of the word "let's" in the transcript 504 due to the weighting curve 510 having a value at or near one between three and four seconds, [Performing probability evaluations based on word timings, i.e. first stage timings, wherein the evaluation is also considering a number of occurrences of the same word (comparing instances of the same word suggests a knowledge that more than one copy of the same word exists), indicates a token frequency value which is affecting the probability as displayed in the weighting curve of Fig. 5. The second occurrence of “let’s” can be in one of two positions (token frequency), wherein the more likely position is determined based on timings]).
Moreno does not disclose:
receiving an artificial intelligence training sample; and,
training a supervised artificial intelligence model.
Serry discloses:
receiving an artificial intelligence training sample ([0028] the text classification model 202 backpropagates the output of the text segment classification as a “ground truth” (e.g., a known good value) to update (e.g., train) the text classification model 202, [0068] In some examples, text classification model 202, text segmentation model 204, text encoder 402, fusion-layer transformer 404, and/or segmenter/classifier 406 include or are implemented as a large language model (LLM). Example models may include the GPT models from OpenAI, BARD from Google, and/or Large Language Model Meta AI (LLaMA) from Meta, among other types of artificial intelligence (AI) models.); and,
training a supervised artificial intelligence model ([0070] The LLM is generally trained using supervised learning based on large amounts of annotated text data, [In view of [0068] which disclosed the LLM to be implemented as an artificial intelligence LLM]).
Moreno and Serry are considered analogous art within automatic speech recognition based on associated textual representations. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Moreno to incorporate the teachings of Serry, because of the novel way to perform semantic text segmentation before or in conjunction with text classification to improve accuracy of the text classification as would be relevant to the matching audio to text operation of Moreno, i.e. the classifications of Serry could be applied to the inputs of Moreno for determining the matches as disclosed in Moreno (Serry, [0016]).
Moreno in view of Serry does not disclose:
generating alignment paths for the ground truth tokens;
scoring the alignment paths, based on token frequency;
selecting a final alignment, based on the scoring; and,
generating second stage alignment timings, based on the final alignment.
Thomson discloses:
generating alignment paths for the ground truth tokens ([0417] the align text process 1406 may find a path that best meets a selected set of performance criteria by constructing a two-dimensional grid representing the first sequence in a first dimension and the second sequence in a second dimension… where N is the number of words in the reference, [Wherein a reference transcription tracks to a ground truth token sequence (see [0235] comparing reference transcription to new audio transcription)]);
scoring the alignment paths, based on token frequency ([0417] The performance criteria may include the lowest cost or the highest score. For example, the cost may be a function of the number of deletions “D,” substitutions “S,” and insertions “I.” If all errors receive the same weight, the cost may be represented by D+S+I. The Viterbi path may then chose the alignment between the first and second sequence that results in the lowest cost as represented by D+S+I. The highest score may represent the Viterbi path that aligns the first and second sequences such that a score such as the number of matching words, the total path probability, or N-(D+S+I), [A perfect transcription with the lowest cost function would be one with no deletions, substitutions, or insertions, indicating a scoring operation based on a comparison of token frequency to a ground truth, i.e. reference, token frequency. The perfect transcription will have the same token frequency as the ground truth because it consists of the same tokens]);
selecting a final alignment, based on the scoring ([As previously disclosed, Thomson selects an alignment path which minimizes the cost function, representing a final alignment as compared to alignments with higher costs]); and,
generating second stage alignment timings, based on the final alignment ([0527] the synchronizer 1902 may use a Viterbi search or other dynamic programming method to align and identify segment matches in the first and second transcriptions. In some embodiments, the synchronizer 1902 may use information from the transcription units 1914 to align the first and second transcriptions. For example, the synchronizer 1902 may use word endpoints from ASR systems in the transcription units 1914 to align the first and second transcriptions, [In view of the previously disclosed Viterbi cost algorithm of Thomson used for alignment, the aligned tokens of the final alignment, i.e. a second alignment as currently defined, carry second stage alignment timings, wherein these are “generated” when the output transcription of the final alignment is generated at 1410 after the final alignment is generated at 1408/1409 (see fusing of alignments based on endpoints, [0708])]).
Moreno, Serry, and Thomson are considered analogous art within speech-to-text alignment. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Moreno in view of Serry to incorporate the teachings of Thomson, because of the novel way to use additional criteria including determining speaking portions of audio for alignment of tokens, improving quality of output produced from a voting process of several transcripts (Thomson, [0375]-[0376]).
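For illustration only, the D+S+I cost cited from Thomson [0417] can be sketched as a standard dynamic-programming (edit-distance) alignment over words. This is a minimal sketch and not Thomson's implementation; the function name align_cost, the equal weighting of errors, and the tie-breaking order in the backtrace are assumptions made for this example.

```python
def align_cost(reference, hypothesis):
    """Return (cost, D, S, I) for one lowest-cost word alignment.

    Hypothetical helper: all errors are weighted equally, so cost = D + S + I.
    """
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum errors aligning reference[:i] with hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # only deletions remain
    for j in range(1, m + 1):
        cost[0][j] = j  # only insertions remain
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = reference[i - 1] != hypothesis[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion
                cost[i][j - 1] + 1,             # insertion
            )
    # Backtrace one optimal path and count each error type along it.
    i, j = n, m
    deletions = substitutions = insertions = 0
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and reference[i - 1] != hypothesis[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            substitutions += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            deletions += 1
            i -= 1
        else:
            insertions += 1
            j -= 1
    return cost[n][m], deletions, substitutions, insertions


reference = "let's go to dinner".split()
hypothesis = "let's go dinner now".split()
cost, d, s, ins = align_cost(reference, hypothesis)
score = len(reference) - (d + s + ins)  # the N-(D+S+I) score from [0417]
```

Several paths may share the lowest cost; the sketch returns the counts for whichever one the backtrace reaches first.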
Regarding claim 20, Moreno in view of Serry, further in view of Thomson discloses: the system of claim 19.
Moreno further discloses:
determining number of predicted tokens in a segment of the predicted tokens sequence of the selected segment size ([Col. 5, Lines 55-57] The speech recognizer 212 can analyze the segmented audio data to determine text or words that represent the audio data 202, [Determining words that represent segmented audio indicates a number of identified words, i.e. predicted tokens, for the selected segment having a segment size to facilitate ease of processing]);
selecting the same number of ground truth tokens from the ground truth tokens sequence ([Col. 5, Lines 3-5] At the end of the sentence, the text aligner 114 stops the recognition at the last automaton state that has transcription text that matches the last recognized word in the audio, [Stopping recognition at the last automaton state, i.e. representing the last ground truth token, indicates stopping at the same number of tokens as the predicted tokens, requiring the same amount of tokens to be selected, i.e. the same sentence]);
aligning the selected ground truth tokens in the segment with the predicted tokens in the segment, finding the matched tokens ([Col. 5, Lines 15-25] The text aligner 114 may align the transcript 108 with the audio data 106 by combining time indicators derived from the speech recognizer 110 with the transcript 108 to form an aligned transcript 116. The time indicators may then specify when text in the transcript occurs relative to the corresponding utterance in the audio data. The text aligner 114 can output the aligned transcript 116, [Aligning transcript with audio data based on times of text, i.e. aligning recognized words with words of the transcript, indicates the alignment is based on time matching]);
keeping a selection of the matched tokens in a segment ([Col. 6, Lines 20-26] The text aligner 216 can associate the times of the recognized words that were matched to paths in the automaton. The text aligner 216 can output an aligned transcript 218 that includes the transcript text and the associated times at which the words occur in the audio data 202, [Outputting an aligned transcript indicates a required keeping of matched tokens which form the alignment]);
sliding the segment along the predicted tokens sequence and the ground truth sequence by an amount of overlap ([Col. 4, Lines 48-55] In some implementations, the factor automaton includes a starting state with an arc corresponding to a first word (or other language element unit, such as a phone, phoneme, or syllable) in the transcript 108 (or a first word in a selected portion of the transcript). The factor automaton may also include states with arcs for each of the other words in the transcript 108 (or in the selected portion), [Including states with arcs for other words in the transcript, i.e. ground truth tokens outside of the selected portion, to later be compared/aligned to the audio file recognized words, i.e. predicted tokens, indicates the words outside of the selected portion to be overlapping between each individual sentence analysis when the additional words are not in the sentence currently being aligned, wherein the sliding is performed on a sentence-basis, i.e. a window size of a sentence with additional words outside of the analyzed sentence “overlapping” compared to when the sentence they are actually a part of is aligned]); and
performing the aligning, the keeping and the sliding until the predicted tokens sequence, or the ground truth token sequence is exhausted ([Col. 8, Lines 65-67]-[Col. 9, Lines 1-3] In general, a model builder can generate a factor automaton much larger than the single sentence in this example. For example, an audio data and corresponding transcript may represent an entire television program, movie, theatrical product, radio program, or audio book, [In view of the previous disclosure of Moreno indicating a portion to be representative of a sentence, building a factor automaton, i.e. ground truth tokens, larger than one sentence indicates the previous align/keep/slide operations using these ground truth tokens on a sentence-basis to be repeated and/or performed until the entire document, consisting of ground truth tokens, is considered/exhausted. Further, portion/segment analysis as disclosed in Moreno indicates a combination of segments results in a repeated aligning/keeping/sliding operation for each sentence]).
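For illustration only, the claimed align/keep/slide loop with an amount of overlap can be sketched as a generator that advances a fixed-size window over both token sequences until the shorter sequence is exhausted. This is a hypothetical sketch of the claim language as characterized above, not a description of Moreno's factor automaton; the name sliding_windows and the use of equal window positions in both sequences are assumptions.

```python
def sliding_windows(predicted, ground_truth, size, overlap):
    """Yield paired windows over both sequences, advancing by size - overlap.

    Hypothetical helper: iteration stops once either sequence is exhausted.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    limit = min(len(predicted), len(ground_truth))
    start = 0
    while start < limit:
        end = min(start + size, limit)
        yield predicted[start:end], ground_truth[start:end]
        if end == limit:
            break  # the shorter sequence has been fully covered
        start += step


predicted = ["let's", "go", "to", "dinner", "now"]
ground_truth = ["let's", "go", "to", "dinner", "now", "please"]
windows = list(sliding_windows(predicted, ground_truth, size=3, overlap=1))
```

With a segment size of 3 and an overlap of 1, the second window begins on the last token of the first window, so each boundary token is considered twice.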
Regarding claim 21, Moreno in view of Serry, further in view of Thomson discloses: the system of claim 19.
Moreno further discloses:
wherein the operations further comprise:
generating and assigning synthetic times to unmatched tokens ([Col. 7, Lines 39-45] In some implementations, times for words in the transcript having no matching or similar words in the recognized words can be extrapolated. For example, a start time and stop time for the transition 304c representing the transcript word "to" can be estimated using the end time of the transition 304b and the start time of the transition 304d, [An extrapolated time indicates it is synthetic, i.e. not based on an actual time associated with the unmatched word]),
wherein determining portions of the transcript matching the audio file chunks is further based on the assigned synthetic times ([Col. 9, Lines 30-40] align (410) a portion of the transcript with a portion of the audio data using the identified times. For example, the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204. In some implementations, the text aligner 216 may estimate times for words in the transcript 204 that have no corresponding recognized words and times, [Aligning portions, i.e. chunks, based on times and/or estimated times, tracking to extrapolated times for unmatched tokens, indicates the alignment is based on assigned synthetic times as would be required to align estimated times of unmatched/unrecognized words]).
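For illustration only, the estimation cited from Moreno (Col. 7, Lines 39-45), in which the times for the unmatched word "to" are taken from the end time of the preceding transition and the start time of the following transition, can be sketched as below. The function name fill_unmatched_times and the even division of a gap among several consecutive unmatched words are assumptions of this sketch; Moreno does not specify how a gap is shared.

```python
def fill_unmatched_times(words):
    """Fill (start, end) for unmatched words from their matched neighbors.

    words: list of (text, start, end); start and end are None when unmatched.
    Hypothetical helper: a gap is split evenly among the unmatched words in it.
    Unmatched words at the very beginning or end are left unchanged.
    """
    filled = list(words)
    i = 0
    while i < len(filled):
        if filled[i][1] is not None:
            i += 1
            continue
        j = i
        while j < len(filled) and filled[j][1] is None:
            j += 1  # j is the index just past the run of unmatched words
        if i > 0 and j < len(filled):
            gap_start = filled[i - 1][2]  # end time of the previous matched word
            gap_end = filled[j][1]        # start time of the next matched word
            width = (gap_end - gap_start) / (j - i)
            for k in range(i, j):
                start = gap_start + (k - i) * width
                filled[k] = (filled[k][0], start, start + width)
        i = j
    return filled


words = [("go", 1.0, 1.5), ("to", None, None), ("dinner", 2.0, 2.5)]
timed = fill_unmatched_times(words)
```

In this example the unmatched word "to" receives the interval between the end of "go" and the start of "dinner".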
Regarding claim 22, Moreno in view of Serry, further in view of Thomson discloses: the system of claim 21.
Moreno further discloses:
wherein the operations further comprise:
generating and assigning synthetic times to unmatched tokens ([Col. 7, Lines 39-45] In some implementations, times for words in the transcript having no matching or similar words in the recognized words can be extrapolated. For example, a start time and stop time for the transition 304c representing the transcript word "to" can be estimated using the end time of the transition 304b and the start time of the transition 304d, [An extrapolated time indicates it is synthetic, i.e. not based on an actual time associated with the unmatched word]),
wherein determining portions of the transcript matching the audio file chunks is further based on the assigned synthetic times ([Col. 9, Lines 30-40] align (410) a portion of the transcript with a portion of the audio data using the identified times. For example, the text aligner 216 associates the identified times for decoded words with the corresponding words in the transcript 204. In some implementations, the text aligner 216 may estimate times for words in the transcript 204 that have no corresponding recognized words and times, [Aligning portions, i.e. chunks, based on times and/or estimated times, tracking to extrapolated times for unmatched tokens, indicates the alignment is based on assigned synthetic times as would be required to align estimated times of unmatched/unrecognized words]);
generating an alignment confidence for the matched tokens ([Col. 9, Lines 50-55] the speech recognizer 212 may provide confidence levels indicating an amount of confidence in the accuracy of an alignment of one or more words); and,
generating the alignment confidence for the unmatched tokens ([In view of the previously disclosed synthetic times for unmatched tokens, the confidence of alignment for unmatched tokens can be performed using the same method disclosed for matched tokens without a change in functionality to Moreno]),
wherein dividing the audio file into chunks is further based on the alignment confidence ([Col. 9, Lines 45-50] a block of audio data may be sent back to the speech recognizer 212 and/or the text aligner 216 if a degree in which the transcript 204 matches the recognized words from the audio data 202 does not meet a threshold level, [Determining to realign a block, i.e. chunk, based on alignment confidence not meeting a threshold value indicates dividing the chunk, i.e. for reprocessing/realignment, based on alignment confidence]).
Claim(s) 6, 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Moreno in view of Serry, further in view of Thomson, further in view of Liao et al. (US-20210103635-A1), hereinafter Liao.
Regarding claim 6, Moreno in view of Serry, further in view of Thomson discloses: the method of claim 1.
Moreno in view of Serry, further in view of Thomson does not disclose:
wherein mapping the ground truth tokens to the predicted tokens comprises one or more of calculating minimum Levenshtein distance, longest common subsequence distance, minimum Damerau-Levenshtein distance, and a modified Levenshtein maximal match criterion.
Liao discloses:
wherein mapping the ground truth tokens to the predicted tokens comprises one or more of calculating minimum Levenshtein distance ([0092] a metric such as the Levenshtein edit distance between the script text (or a portion thereof) and the spoken phrase can be used to align the spoken phrase and the script text, [Determining a distance between script and speech for alignment indicates the distance should be a minimum between the two sources for best alignment]), longest common subsequence distance, minimum Damerau-Levenshtein distance ([The examiner would like to note that these elements do not require a mapping due to the disjunctive nature of the claimed elements]), and a modified Levenshtein maximal match criterion ([0097] As another example, a masked convolutional or recurrent convolutional neural network can be trained to evaluate the spoken phrases and/or script text and filter out those words that have little impact on determining a probability of a match. The match metric can then be calculated in another way such as using the Levenshtein edit distance, [Determining a match metric using distance indicates that the lowest distance corresponds to a maximal match]).
Moreno, Serry, Thomson, and Liao are considered analogous art within speech-text alignment. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Moreno in view of Serry, further in view of Thomson to incorporate the teachings of Liao, because of the novel way to detect when a user is simply repeating what is displayed on a screen in the context of public speaking, i.e. determining an alignment/match of text to speech, and providing feedback to the speaker based on a variety of text-speech scenarios, resulting in increased accuracy and efficiency of the model providing feedback based on the text-speech comparison(s) (Liao, [0017]-[0019], [0026]-[0027]).
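For illustration only, the use of a Levenshtein edit distance to align a spoken phrase with a portion of script text, as cited from Liao [0092], can be sketched as a search for the script portion with the minimum distance to the phrase. This is a minimal sketch and not Liao's implementation; the names levenshtein and best_script_window, the word-level comparison, and the restriction to script portions of the same length as the phrase are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def best_script_window(script, phrase):
    """Return (start, distance) of the script portion closest to the phrase.

    Hypothetical helper: only portions as long as the phrase are compared.
    """
    n = len(phrase)
    best = None
    for start in range(max(1, len(script) - n + 1)):
        distance = levenshtein(script[start:start + n], phrase)
        if best is None or distance < best[1]:
            best = (start, distance)
    return best


script = "we will now review the quarterly results".split()
phrase = "review the quartely results".split()  # one misrecognized word
start, distance = best_script_window(script, phrase)
```

The minimum distance identifies the best-matching portion, which corresponds to the examiner's reading that the lowest distance represents a maximal match.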
Regarding claim 15, Moreno in view of Serry, further in view of Thomson discloses: the non-transitory computer storage of claim 10.
Moreno in view of Serry, further in view of Thomson does not disclose:
wherein mapping the ground truth tokens to the predicted tokens comprises one or more of calculating minimum Levenshtein distance, longest common subsequence distance, minimum Damerau-Levenshtein distance, and a modified Levenshtein maximal match criterion.
Liao discloses:
wherein mapping the ground truth tokens to the predicted tokens comprises one or more of calculating minimum Levenshtein distance ([0092] a metric such as the Levenshtein edit distance between the script text (or a portion thereof) and the spoken phrase can be used to align the spoken phrase and the script text, [Determining a distance between script and speech for alignment indicates the distance should be a minimum between the two sources for best alignment]), longest common subsequence distance, minimum Damerau-Levenshtein distance ([The examiner would like to note that these elements do not require a mapping due to the disjunctive nature of the claimed elements]), and a modified Levenshtein maximal match criterion ([0097] As another example, a masked convolutional or recurrent convolutional neural network can be trained to evaluate the spoken phrases and/or script text and filter out those words that have little impact on determining a probability of a match. The match metric can then be calculated in another way such as using the Levenshtein edit distance, [Determining a match metric using distance indicates that the lowest distance corresponds to a maximal match]).
Moreno, Serry, Thomson, and Liao are considered analogous art within speech-text alignment. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Moreno in view of Serry, further in view of Thomson to incorporate the teachings of Liao, because of the novel way to detect when a user is simply repeating what is displayed on a screen in the context of public speaking, i.e. determining an alignment/match of text to speech, and providing feedback to the speaker based on a variety of text-speech scenarios, resulting in increased accuracy and efficiency of the model providing feedback based on the text-speech comparison(s) (Liao, [0017]-[0019], [0026]-[0027]).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Chen et al. (US-20210050001-A1) discloses “a user reads a written transcript and the user's voice is recorded. Characters of the transcript are then represented as pinyins with tone markings. The voice recording is sectioned into individual phonemes that are aligned with the phonemes of the pinyins. For each character of the transcript, a tone is determined for the phonemes in the voice recording corresponding to that character. That tone is scored as correct or incorrect by comparison to the tone marking associated with the pinyins for that character. The pronunciation of each phoneme of the voice recording is also scored relative to the corresponding phonemes of the pinyins of the characters of the transcript. Further scores for words and sentences can be developed from the tone and pronunciation scores and provided to the user with feedback” (abstract). Specifically, [0035] discloses forced alignment of phonemes to graphemes based on the timings of the phonemes. See entire document.
Shir (US-20210233535-A1) discloses “a computer implemented method of aligning an automatically generated transcription of an audio recording to a manually generated transcription of the audio recording comprising: identifying non-aligned text fragments, each located between respective two non-continuous aligned text-fragments of the automatically generated transcription, each aligned text-fragment matching words of the manually generated transcription, for each respective non-aligned text fragment: mapping a target keyword of the manually generated transcription to phonemes, mapping the respective non-aligned text fragment to a corresponding audio-fragment of the audio recording, mapping the audio-fragment to phonemes, identifying at least some of the phonemes of the audio-fragment that correspond to the phonemes of the target keyword, and mapping the identified at least some of the phonemes of the audio-fragment to a corresponding word of the automatically generated transcript, wherein the corresponding word is an incorrect automated transcription of the target word appearing in the manually generated transcription” (abstract). See entire document.
Li (US-20230386472-A1) discloses “A search query of a text transcription is received. The search query includes a word or words having a specified spelling. A sequence of search phonemes corresponding to the specified spelling is generated. A sequence of transcript phonemes corresponding to the text transcription is generated from the text transcription. A search alignment in which the sequence of search phonemes is aligned to a transcript phoneme fragment is generated. Based at least on the search alignment having a quality score exceeding a quality score threshold, the transcript phoneme fragment and an associated portion of the text transcription is determined to result from an utterance of the specified spelling in an audio session corresponding to the text transcription. A search result indicating that the transcript phoneme fragment and the associated portion of the text transcription is determined to have resulted from the utterance is output” (abstract). See entire document.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to THEODORE JOHN WITHEY whose telephone number is (703)756-1754. The examiner can normally be reached Monday - Friday, 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/THEODORE WITHEY/
Examiner, Art Unit 2655
/ANDREW C FLANDERS/
Supervisory Patent Examiner, Art Unit 2655