DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Claims 1, 5, 7, 11, and 20 are amended. Claims 1-20 are presented for examination.
Response to Arguments
Rejection under 35 U.S.C. 101
Applicant’s arguments have been fully considered and are persuasive. The amended independent claims recite receiving an input speech from a user device, determining a recognition hypothesis by converting input speech into digital data and comparing/matching acoustic features of the input speech to reference words in a pronunciation dictionary to determine a sequence of words in the recognition hypothesis, comparing the recognition hypothesis with an expected response to determine a match, generating a phoneme sequence for each word of the recognition hypothesis, and updating phoneme sequences in a pronunciation dictionary. The claims recite a significant level of detail in the computerized realm, and thus cannot be considered a mental process under step 2A, prong 1.
Rejection under 35 U.S.C. 102/103
Applicant’s arguments have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 8, 11-14, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Adams et al. (US 20150255069 A1; hereinafter referred to as Adams) in view of Stahl (US 20180182385 A1) and Braho et al. (US 20060178886 A1; hereinafter referred to as Braho).
Regarding claim 1, Adams teaches: a computer-implemented method for adapting speech recognition pronunciations to one or more users, the method comprising: receiving, by a processor, input speech ([0011] a user may speak a command to a computing device to "play" a certain item of music. The spoken command may be referred to as an utterance) from a device associated with a user ([0037] the speech storage 320 may be customized for an individual user based on his/her individualized speech input);
generating, by the processor based on the input speech and using a search algorithm, a recognition hypothesis, wherein (i) the search algorithm is informed by a pronunciation dictionary ([0012] When the ASR system receives an utterance, it may match the sound of the utterance to the stored expected pronunciations to match the utterance with one or more content items for retrieval), (ii) the pronunciation dictionary comprises one or more pronunciations for each word of a plurality of words ([0049] the lexicon also may include one or more expected pronunciations of each textual identifier, which allows the user to access associate content items through a speech command. For example, the user may attempt to play a song stored in the music catalog by saying the name of the artist, album or song title. The expected pronunciation may be determined based on a spelling of the word. The process of determining the expected pronunciation of the word based on the spelling is defined as grapheme to phoneme (G2P) conversion or pronunciation guessing) and corresponding sets of phoneme sequences ([0035] The speech storage 320 includes a variety of information for speech recognition such as data matching pronunciations of phonemes to particular words. This data may be referred to as an acoustic model. The speech storage may also include a dictionary of words or a lexicon. The speech storage may also include a lexicon matching textual identifiers to expected pronunciations of those identifiers), and (iii) the recognition hypothesis comprises a sequence of one or more words ([0011] The textual identifier may be text that identifies an item of content such as a song, video, etc. Example textual identifiers include a name of an artist, a band name, an album title, a song title, or some other label that identifies the music to be played. The textual identifier represents a recognition hypothesis.),
wherein generating the recognition hypothesis further comprises: converting the input speech into a digital stream of data, dividing the digital stream of data into a sequence of frames ([0032] The AFE 316 may divide the digitized audio data into frames or audio segments), extracting one or more acoustic features from each frame of the sequence of frames… ([0032] During that frame, the AFE 316 determines a set of values, called a feature vector, representing the features/qualities of the utterance portion within the frame. Feature vectors may contain a varying number of values, for example forty. The feature vector may represent different qualities of the audio data within the frame);
in response to determining that the recognition hypothesis matches the at least one expected response: ([0015] Different expected pronunciations of textual identifiers may be added to the lexicon and to accommodate different pronunciations from different users. The expected pronunciations may be linked to content items, such as a song stored in a music catalog. When the computing device receives a spoken utterance including a textual identifier, the computing device determines whether the spoken utterance includes a textual identifier by matching the utterance to the modified lexicon of expected pronunciations) generating, by the processor, a phoneme sequence for each word in the recognition hypothesis based on the input speech ([0014] The expected pronunciation may include a combination of expected pronunciation based on language of origin, for example an expected pronunciation having certain phonemes of a textual identifier expected as if having one language of origin and other phonemes of the textual identifier expected as if having a different language of origin. Further, multiple expected pronunciations may be determined for each textual identifier… An expected pronunciation can be a phoneme sequence.);
and updating, by the processor, the set of phoneme sequences in the pronunciation dictionary associated with at least one word of the recognition hypothesis ([0037] the speech storage 320 may be customized for an individual user based on his/her individualized speech input. To improve performance, the ASR module 314 may revise/update the contents of the speech storage 320 based on feedback of the results of ASR processing, thus enabling the ASR module 314 to improve speech recognition beyond the capabilities provided in the training corpus).
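For illustration only, the framing and per-frame feature extraction cited from Adams [0032] can be sketched as follows. This is a minimal sketch, not Adams's implementation: the function names, the frame length and hop parameters, and the two toy features (mean energy and zero-crossing count) are assumptions chosen for brevity, whereas Adams describes feature vectors that may contain, for example, forty values.

```python
from typing import List, Tuple


def split_into_frames(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Divide digitized audio into fixed-length, possibly overlapping frames."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def frame_features(frame: List[float]) -> Tuple[float, int]:
    """Return a toy feature vector: (mean energy, zero-crossing count)."""
    energy = sum(x * x for x in frame) / len(frame)
    zero_crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return energy, zero_crossings


# Hypothetical digitized input: six samples, frames of 4 samples with a hop of 2.
SAMPLES = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5]
FRAMES = split_into_frames(SAMPLES, frame_len=4, hop=2)
FEATURES = [frame_features(f) for f in FRAMES]
```

Each entry of `FEATURES` corresponds to one frame, which is the per-frame granularity the claim limitation recites.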
Adams does not explicitly teach, but Stahl discloses: comparing the one or more acoustic features with reference representations of a plurality of words in the pronunciation dictionary ([0027] comparing acoustic features to phonemes in an acoustic model trained on numerous labeled speech utterances, the acoustic model trained to output one or more hypothesized probability-scored possible phoneme sequences in response to the acoustic features; (2) comparing the hypothesized phoneme sequences to words in a phonetic dictionary), identifying the one or more words from the plurality of words in the pronunciation dictionary that match the one or more acoustic features of the input speech ([0046] A speech engine 54 consumes the phoneme hypotheses and produces transcription hypotheses. It maps the set of hypothesized phoneme sequences to a set of hypothesized word sequences by matching in all possible pronunciations from the phonetic word vocabulary with contiguous subsequences of the phoneme sequences… This involves comparing the ordered sequence of phonemes in each hypothesis to the phonetic spelling of words in a phonetic dictionary. A phonetic dictionary is a list of words and their phonetic spellings), determining the sequence of the one or more words in the recognition hypothesis based on the identifying step… ([0046] Speech engine 54 fits orders of phonemes to possible orders of words that would have the same sequence of phonemes).
Adams and Stahl are considered analogous in the field of speech analysis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Adams to combine the teachings of Stahl because doing so would allow for more accurate recognition hypotheses by using comparisons between acoustic features and a phonetic dictionary to map a word sequence representing the recognition hypothesis, leading to fewer errors in speech recognition (Stahl [0051] by eliminating transcription hypotheses, or reducing the weight of unlikely transcription hypotheses, reduce the number of NLP parsing operations that computer processors need to perform to provide users with satisfactory accuracy. While the benefits are small on a per-utterance basis, at the scale of a cloud server farm, the resulting reduction in transcription hypotheses results in significantly lower power, higher throughput, better accuracy, or a combination of benefits).
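For illustration only, the mapping of a hypothesized phoneme sequence to word sequences described in Stahl [0046] can be sketched as an exhaustive segmentation against a phonetic dictionary. This is a minimal sketch under stated assumptions: the function name `segment`, the ARPAbet-style phoneme labels, and the three-word lexicon are hypothetical, and a practical speech engine would score and prune hypotheses rather than enumerate all of them.

```python
from typing import Dict, List, Tuple

Phonemes = Tuple[str, ...]


def segment(phonemes: Phonemes, lexicon: Dict[str, List[Phonemes]]) -> List[List[str]]:
    """Return every word sequence whose concatenated pronunciations equal `phonemes`."""
    if not phonemes:
        return [[]]  # one way to cover an empty sequence: no words
    results: List[List[str]] = []
    for word, pronunciations in lexicon.items():
        for pron in pronunciations:
            n = len(pron)
            # Match this pronunciation against the leading contiguous subsequence.
            if n and phonemes[:n] == pron:
                for rest in segment(phonemes[n:], lexicon):
                    results.append([word] + rest)
    return results


# Hypothetical phonetic dictionary: each word lists its phonetic spellings.
LEXICON: Dict[str, List[Phonemes]] = {
    "three": [("TH", "R", "IY")],
    "five": [("F", "AY", "V")],
    "tree": [("T", "R", "IY")],
}

WORDS = segment(("TH", "R", "IY", "F", "AY", "V"), LEXICON)
```

A phoneme sequence with no covering segmentation yields an empty list, which corresponds to eliminating that transcription hypothesis.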
The combination of Adams and Stahl does not explicitly teach, but Braho teaches: determining, by the processor, an expected response based at least on an operation being performed by the user ([0032] The user will then speak the two-digit check digit, for example "three five" (3, 5). The system or terminal, pursuant to an aspect of the invention, knows that the expected response from the user for the desired check digits at that shelf or bin are the words "three five"), wherein the expected response comprises a sequence of one or more words ([0032] the speech recognizer only compares the observed features of the spoken words to the model associated with the expected response of "three five". That is, effectively a single response model is used in the analysis);
comparing, by the processor, the recognition hypothesis with the at least one expected response ([0023] this most probable sequence, or the hypothesis with the highest confidence factor, is compared, in step 210, to an expected response that was known beforehand. Then, based upon such a comparison, the acceptance algorithm is modified. If the comparison shows that the most probable speech hypothesis matches an expected response, the hypothesis is more favorably treated) to determine if the recognition hypothesis matches the at least one expected response… ([0040] when the hypothesis resulting from the input speech compares favorably with the expected response, the results of such a comparison are utilized as a feedback to provide adaptation of the acoustic models of the speech recognition system).
Adams, Stahl, and Braho are considered analogous in the field of speech analysis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Adams and Stahl to combine the teachings of Braho because doing so would improve the speed of speech recognition by allowing a model to quickly compare a recognition hypothesis to an expected response determined by a user, reducing the need to determine all possible user responses and leading to faster response times in speech recognition (Braho [0028] FIG. 3 illustrates a flowchart of an embodiment of the invention that uses knowledge about an expected response to improve the speed of a speech recognition system. As explained above, a speech recognizer searches through an HMM, or other type of model, to find a list of probable matches and then analyzes the potential matches accordingly).
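For illustration only, the comparison of a recognition hypothesis with an expected response, and the use of a match to gate a pronunciation dictionary update, can be sketched as follows. This is a minimal sketch of the combination as applied above, not code from any cited reference: the function names, the case-insensitive comparison, and the set-based dictionary layout are assumptions.

```python
from typing import Dict, List, Set, Tuple

Phonemes = Tuple[str, ...]


def matches_expected(hypothesis: List[str], expected: List[str]) -> bool:
    """Compare the hypothesized word sequence with the expected word sequence."""
    return [w.lower() for w in hypothesis] == [w.lower() for w in expected]


def adapt_if_matched(
    dictionary: Dict[str, Set[Phonemes]],
    hypothesis: List[str],
    expected: List[str],
    observed: List[Phonemes],
) -> bool:
    """Add each word's observed phoneme sequence only when the hypothesis matches."""
    if not matches_expected(hypothesis, expected):
        return False  # no match: leave the dictionary untouched
    for word, pron in zip(hypothesis, observed):
        dictionary.setdefault(word.lower(), set()).add(pron)
    return True


# Hypothetical check-digit task: the system expects the user to say "three five".
DICTIONARY: Dict[str, Set[Phonemes]] = {"three": {("TH", "R", "IY")}}
UPDATED = adapt_if_matched(
    DICTIONARY,
    hypothesis=["three", "five"],
    expected=["three", "five"],
    observed=[("T", "R", "IY"), ("F", "AY", "V")],
)
```

Here the accented variant of "three" is stored alongside the original pronunciation because the expected response confirms which words were spoken.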
Regarding claim 2, the combination of Adams, Stahl, and Braho teaches: the computer-implemented method of claim 1. Adams further teaches: wherein updating the set of phoneme sequences in the pronunciation dictionary associated with the at least one word comprises adding the phoneme sequence for the at least one word to the set of phoneme sequences ([0050] The ASR system may determine multiple expected pronunciations for a particular textual identifier, each with an associated likelihood. The expected pronunciations (and/or their associated likelihoods) may also be adjusted based on the pronunciation tendency of a user or group of users. The expected pronunciations may be added to the lexicon and linked to their respective content items for eventual retrieval by the ASR system. The expected pronunciations include phoneme sequences.).
Regarding claim 3, the combination of Adams, Stahl, and Braho teaches: the computer-implemented method of claim 1. Adams further teaches: further comprising storing the phoneme sequence for each word in the recognition hypothesis in a data repository ([0035] The speech storage 320 includes a variety of information for speech recognition such as data matching pronunciations of phonemes to particular words. This data may be referred to as an acoustic model. The speech storage may also include a dictionary of words or a lexicon. The speech storage may also include a lexicon matching textual identifiers to expected pronunciations of those identifiers).
Regarding claim 4, the combination of Adams, Stahl, and Braho teaches: the computer-implemented method of claim 1. Adams further teaches: further comprising for each word in the recognition hypothesis, updating an occurrence count for the phoneme sequence ([0064] The pronunciation pattern of the user may be determined based on a history pronunciations of a same or different words by the user. Based on the pronunciation pattern or history, the ASR device may anticipate future pronunciation of a same or different word by the user. The ASR device may also learn whether a user is familiar with a pronunciation of one or more languages based on the pronunciation pattern of the user. The pronunciation pattern/history keeps track of occurrences of a phoneme sequence.).
Regarding claim 8, the combination of Adams, Stahl, and Braho teaches: the computer-implemented method of claim 1. Adams further teaches: further comprising for each word in the recognition hypothesis: adding the phoneme sequence for the word to training data for a model configured to generate phoneme sequences ([0036] The training corpus may include a number of sample utterances with associated feature vectors and associated correct text that may be used to create, for example, acoustic models and language models. The sample utterances may be used to create mathematical models corresponding to expected audio for particular speech units. Those speech units may include a phoneme, syllable, part of a syllable, word, etc.);
and generating, using the model, a plurality of sampled phoneme sequences for the word ([0058] For example, a lexicon including German words and corresponding German pronunciations may be analyzed to determine an association between letter sequences, phoneme sequences and sounds of each word. For example, an expectation maximization algorithm may learn that letters P-H in English may be pronounced as F barring some exceptions. The expectation maximization algorithm may also learn when E is pronounced "eh" versus "ee" and so on. A model may be developed based on the analysis of the expectation maximization algorithm and used to predict a new phoneme sequence and subsequently an expected pronunciation of a new word).
Regarding claim 11, Adams teaches: an apparatus for adapting speech recognition pronunciations to one or more users, the apparatus comprising at least one processor and at least one non-transitory memory comprising program code stored thereon, wherein the at least one non-transitory memory and the program code are configured to, with the at least one processor, cause the apparatus to… ([0071] Aspects of the present disclosure may be implemented as a computer implemented method, a system, or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure). The rest of the claim recites similar limitations as claim 1 and therefore is rejected similarly.
Regarding claim 12, it recites similar limitations as claim 2 and therefore is rejected similarly.
Regarding claim 13, it recites similar limitations as claim 3 and therefore is rejected similarly.
Regarding claim 14, it recites similar limitations as claim 4 and therefore is rejected similarly.
Regarding claim 18, it recites similar limitations as claim 8 and therefore is rejected similarly.
Regarding claim 20, Adams teaches: a computer program product for adapting speech recognition pronunciations to one or more users, the computer program product comprising at least one non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising an executable portion configured to ([0071] Aspects of the present disclosure may be implemented as a computer implemented method, a system, or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure). The rest of the claim recites similar limitations as claim 1 and therefore is rejected similarly.
Claims 5-6, 9-10, 15-16, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Adams in view of Stahl and Braho, as applied to claims 1-4, 8, 11-14, 18, and 20 above, and further in view of Relin (US 20210312901 A1).
Regarding claim 5, the combination of Adams, Stahl, and Braho teaches: the computer-implemented method of claim 4. The combination of Adams, Stahl, and Braho does not explicitly teach, but Relin teaches: wherein the set of phoneme sequences in the pronunciation dictionary associated with the at least one word is updated in response to determining that the phoneme sequence for the at least one word satisfies updating criteria ([0074] FIG. 10 illustrates a method for automatically enhancing natural language recognition based on adding a new token to a pronunciation dictionary. A phoneme sequence can be recognized from speech audio at step 1010. The phoneme sequence can be tokenized into a token sequence of tokens from a pronunciation dictionary at step 1020. A phoneme subsequence within the phoneme sequence does not match a token in the pronunciation dictionary. A new token is then added to the pronunciation dictionary at step 1030), wherein the updating criteria comprises whether the occurrence count associated with the phoneme sequence satisfies an occurrence count threshold ([0073] The ASR system might compare the phoneme subsequence to a list of previously detected unknown phoneme subsequences. If it finds the phoneme subsequence in the list, the system increments a count of occurrences of the phoneme subsequence across many speech audio segments. If the phoneme subsequence is not in the list, the ASR system may add it to a list with an occurrence count of 1. After an occurrence count that satisfies (e.g., exceeds a threshold or becomes less than a threshold, based on design preference), a condition is met for an ASR system may automatically add the phoneme subsequence to the pronunciation dictionary as a new token).
Adams, Stahl, Braho, and Relin are considered analogous in the field of speech analysis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Adams, Stahl, and Braho to combine the teachings of Relin because doing so would allow for different phoneme sequences in a dictionary to be quickly updated and stored based on the frequency of occurrence, leading to better speech recognition by keeping track of most frequent pronunciations of words (Relin [0094] The above approaches allow an ASR system to automatically learn usages of tokens as alternate parts of speech and learn the part of speech of new tokens within a language all with minimal human effort and improved speed of system improvement).
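For illustration only, the occurrence-count criterion cited from Relin [0073] can be sketched as a counter that admits a phoneme sequence into the pronunciation dictionary once its count satisfies a threshold. This is a minimal sketch under stated assumptions: the class name `CountGate`, the per-word keying, and the choice of "greater than or equal to" as the satisfying condition are hypothetical, since Relin leaves the direction of the comparison to design preference.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Phonemes = Tuple[str, ...]


class CountGate:
    """Track occurrences of (word, phoneme sequence) pairs and gate dictionary updates."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # assumed criterion: count >= threshold
        self.counts: Counter = Counter()
        self.dictionary: Dict[str, Set[Phonemes]] = {}

    def observe(self, word: str, pron: Phonemes) -> bool:
        """Increment the count; add to the dictionary once the threshold is met."""
        self.counts[(word, pron)] += 1  # a first sighting starts at a count of 1
        if self.counts[(word, pron)] >= self.threshold:
            self.dictionary.setdefault(word, set()).add(pron)
            return True
        return False


gate = CountGate(threshold=3)
RESULTS = [gate.observe("route", ("R", "AW", "T")) for _ in range(3)]
```

The first two observations only increment the count; the third satisfies the assumed threshold and triggers the dictionary update.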
Regarding claim 6, the combination of Adams, Stahl, Braho, and Relin teaches: the computer-implemented method of claim 5. Adams further teaches: wherein the phoneme sequence for the at least one word satisfies the updating criteria if the phoneme sequence is one of top N occurring phoneme sequences for the word ([0064] The assignment of the higher scores allows these paths of the graph to become more likely to represent an expected pronunciation of a foreign word by the user. Thus, the expected pronunciations may be associated with a graph of expected pronunciations, an N-best list of expected pronunciations, or some other organization of expected pronunciations).
Regarding claim 9, the combination of Adams, Stahl, and Braho teaches: the computer-implemented method of claim 8. The combination of Adams, Stahl, and Braho does not explicitly teach, but Relin teaches: further comprising for each word in the recognition hypothesis: determining top M occurring sampled phoneme sequences ([0073] If the phoneme subsequence is not in the list, the ASR system may add it to a list with an occurrence count of 1. After an occurrence count that satisfies (e.g., exceeds a threshold or becomes less than a threshold, based on design preference), a condition is met for an ASR system may automatically add the phoneme subsequence to the pronunciation dictionary as a new token) of the plurality of sampled phoneme sequences ([0057] The acoustic analysis 31 may run repeatedly on windows of speech audio data, for example at intervals around 5 to 25 milliseconds. The time windows can be selected to be frequent enough to capture transitions between even short phonemes of fast speech but not to waste processing power. The acoustic analysis 31 produces at least one phoneme sequence hypothesis. In some examples, the acoustic analysis 31 produces many phoneme sequence hypotheses and might, also, produce a score for each one);
and adding the top M occurring sampled phoneme sequences to the pronunciation dictionary ([0073] After an occurrence count that satisfies (e.g., exceeds a threshold or becomes less than a threshold, based on design preference), a condition is met for an ASR system may automatically add the phoneme subsequence to the pronunciation dictionary as a new token).
Regarding claim 10, the combination of Adams, Stahl, and Braho teaches: the computer-implemented method of claim 8. The combination of Adams, Stahl, and Braho does not explicitly teach, but Relin teaches: further comprising for each word in the recognition hypothesis: determining one or more of (i) if an occurrence count associated with a sampled phoneme sequence satisfies an occurrence count threshold, or (ii) if an occurrence ratio for the sampled phoneme sequence satisfies an occurrence ratio threshold ([0073] After an occurrence count that satisfies (e.g., exceeds a threshold or becomes less than a threshold, based on design preference), a condition is met for an ASR system may automatically add the phoneme subsequence to the pronunciation dictionary as a new token).
Regarding claim 15, it recites similar limitations as claim 5 and therefore is rejected similarly.
Regarding claim 16, it recites similar limitations as claim 6 and therefore is rejected similarly.
Regarding claim 19, it recites similar limitations as claim 9 and therefore is rejected similarly.
Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Adams in view of Stahl, Braho, and Relin, as applied to claims 5-6, 9-10, 15-16, and 19 above, and further in view of Lokeswarappa et al. (US 20180190269 A1).
Regarding claim 7, the combination of Adams, Stahl, Braho, and Relin teaches: the computer-implemented method of claim 5. The combination of Adams, Stahl, Braho, and Relin does not explicitly teach, but Lokeswarappa teaches: wherein the phoneme sequence for the at least one word satisfies the updating criteria ([0127] after recognizing the word a significant number of times in speech input, the system learns the frequency of each pronunciation. As a result, the system produces the preferred pronunciation in its speech synthesis of the word) if an occurrence ratio for the phoneme sequence satisfies an occurrence ratio threshold ([0125] Some embodiments crowdsource the order or weights of the different pronunciations of words that have multiple pronunciations. In some embodiments, speech engines recognize each of the pronunciations, and output the word in the transcription and an indication of which pronunciation was recognized. The embodiment accumulates counts of each pronunciation, and sorts or scores the pronunciation entries in the phonetic dictionary based on the counts for each pronunciation. This favors the pronunciations preferred by users who use a word frequently. Some embodiments count the preferred pronunciation across all users' profile word lists. The counts are a ratio based on the total counts for every pronunciation.), wherein the occurrence ratio includes a ratio of the occurrence count for a pronunciation for a word relative to the most occurring pronunciation for the word ([0140] At step 1814, the system calculates a correlation between the preferred pronunciation of the profile word and the various pronunciations of the text word. At step 1816, the system is choosing one of the various pronunciations of the text word based on the correlation).
Adams, Stahl, Braho, Relin, and Lokeswarappa are considered analogous in the field of speech analysis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Adams, Stahl, Braho, and Relin to combine the teachings of Lokeswarappa because doing so would allow for different pronunciations (phoneme sequences) to be compared and chosen based on an occurrence count, allowing a preferred pronunciation to be updated for speech analysis (Lokeswarappa [0047] Speech synthesis uses a phonetic dictionary of preferred pronunciations in order to produce speech output. The present invention, accordingly, is not abstract, but rather a specific improvement in the field of speech synthesis given the details provided with respect to the system and methods outlined. More specifically, in some embodiments, the preferred pronunciation phonetic dictionary has generally preferred pronunciations. When the system captures speech through ASR, it responds to the user with the preferred pronunciation).
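For illustration only, the occurrence ratio recited in claim 7, that is, the count for a pronunciation relative to the count for the most occurring pronunciation of the same word, can be worked through as follows. This is a minimal sketch of the claim language as read above, not code from Lokeswarappa: the function names, the example counts, and the 0.5 ratio threshold are assumptions.

```python
from typing import Dict, List, Tuple

Phonemes = Tuple[str, ...]


def occurrence_ratios(counts: Dict[Phonemes, int]) -> Dict[Phonemes, float]:
    """Divide each pronunciation's count by the count of the most frequent one."""
    if not counts:
        return {}
    top = max(counts.values())
    if top == 0:
        return {}
    return {pron: count / top for pron, count in counts.items()}


def pronunciations_meeting_ratio(counts: Dict[Phonemes, int], ratio_threshold: float) -> List[Phonemes]:
    """Return, in sorted order, the pronunciations whose ratio meets the threshold."""
    ratios = occurrence_ratios(counts)
    return sorted(pron for pron, ratio in ratios.items() if ratio >= ratio_threshold)


# Hypothetical accumulated counts for three pronunciations of one word.
COUNTS: Dict[Phonemes, int] = {
    ("T", "AH", "M", "EY", "T", "OW"): 8,
    ("T", "AH", "M", "AA", "T", "OW"): 4,
    ("T", "OW"): 1,
}
RATIOS = occurrence_ratios(COUNTS)
KEPT = pronunciations_meeting_ratio(COUNTS, ratio_threshold=0.5)
```

With these counts the ratios are 8/8 = 1.0, 4/8 = 0.5, and 1/8 = 0.125, so the first two pronunciations satisfy the assumed threshold and the third does not.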
Regarding claim 17, it recites similar limitations as claim 7 and therefore is rejected similarly.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Nathan Tengbumroong whose telephone number is (703) 756-1725. The examiner can normally be reached Monday - Friday, 11:30 am - 8:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan, can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NATHAN TENGBUMROONG/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654