Prosecution Insights
Last updated: April 19, 2026
Application No. 18/118,502

METHOD AND DEVICE FOR PROCESSING VOICE INPUT OF USER

Non-Final OA (§102, §103)
Filed: Mar 07, 2023
Examiner: KIM, JONATHAN C
Art Unit: 2655
Tech Center: 2600 (Communications)
Assignee: Samsung Electronics Co., Ltd.
OA Round: 3 (Non-Final)
Grant Probability: 74% (Favorable)
OA Rounds: 3-4
To Grant: 2y 7m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 74% (261 granted / 355 resolved), +11.5% vs TC avg. Grants above average.
Interview Lift: +40.6% for resolved cases with an interview vs. without. Strong lift.
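
The interview lift above is the gap between this examiner's allowance rate among resolved cases that included an interview and the rate among those that did not. A minimal sketch of that arithmetic, using hypothetical counts rather than the examiner's actual case data:

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Share of resolved applications that were allowed."""
    return granted / resolved if resolved else 0.0


def interview_lift(with_interview: tuple[int, int],
                   without_interview: tuple[int, int]) -> float:
    """Percentage-point gap between with-interview and without-interview allowance rates."""
    return 100.0 * (allowance_rate(*with_interview) - allowance_rate(*without_interview))


# Hypothetical (granted, resolved) counts for illustration only.
print(f"Interview lift: {interview_lift((90, 100), (150, 255)):+.1f} points")
```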
Typical Timeline: 2y 7m average prosecution; 20 applications currently pending.
Career History: 375 total applications across all art units.

Statute-Specific Performance

§101: 17.6% (-22.4% vs TC avg)
§102: 11.8% (-28.2% vs TC avg)
§103: 47.5% (+7.5% vs TC avg)
§112: 15.0% (-25.0% vs TC avg)
Deltas are measured against an estimated Tech Center average. Based on career data from 355 resolved cases.
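
Each statute line pairs the examiner's own rate with its offset from the estimated Tech Center average; the "vs TC avg" figures are a plain subtraction. Backing the averages out of the deltas above gives roughly 40% in every row, so a short sketch of the calculation looks like this (the TC averages here are inferred estimates, not values stated in the report):

```python
# (examiner rate, estimated TC-average rate) per statute, in percent.
# The TC averages are backed out from the deltas above, so treat them as estimates.
rates = {
    "§101": (17.6, 40.0),
    "§102": (11.8, 40.0),
    "§103": (47.5, 40.0),
    "§112": (15.0, 40.0),
}

for statute, (examiner, tc_avg) in rates.items():
    delta = examiner - tc_avg
    print(f"{statute}: {examiner:.1f}% ({delta:+.1f}% vs TC avg)")
```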

Office Action

§102, §103
DETAILED ACTION This Office Action is in response to the correspondence filed by the applicant on 12/4/2025. Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Priority Receipt is acknowledged of certified copies of papers submitted under 35 U.S.C. 119(a)-(d), which papers have been placed of record in the file. Continued Examination Under 37 CFR 1.114 A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/4/2025 has been entered. Response to Arguments Regarding the claim rejections under 102(a)(2), Applicant’s arguments, pages 9-11, have been fully considered, but they are not persuasive. SHARIFI teaches a similarity between the first and the second audio signals. (Par 41 – “The similarly engine 141 can compare acoustic features of the spoken utterance to additional acoustic features of the additional spoken utterance to identify a portion of the spoken utterance that is similar to an additional portion of the additional spoken utterance. The acoustic features can include, for example, MFCCs, audio waveforms, and/or other acoustic features of audio data that can be compared. Continuing with the above example, the similarity engine 141 can compare acoustic features of the spoken utterance of “send a text message to Ron” to additional acoustic features of the additional spoken utterance of “No, Ron”, and can identify that the portion of the spoken utterance corresponding to “Ron” is acoustically similar to the portion of the additional spoken utterance corresponding to “Ron”.”). SHARIFI teaches both a vocal characteristic and the voice pattern corresponds to a preset voce pattern. When method/system of SHARIFI determines the similarity between the two signals, the method/system determines/identifies the acoustic features (Par 56 – “For example, the system can compare the waveforms of the spoken utterance and the additional spoken utterance, MFCCs of the spoken utterance and the additional spoken utterance, mel filterbank features of the of the spoken utterance and the additional spoken utterance, and/or other of the spoken utterance and the additional spoken utterance.”). The acoustic features are the identified vocal characteristic of the second voice. SHARIFI also teaches a voice pattern corresponding to a preset voice pattern (e.g., preset utterances such as “no,” “wait,” “I said,” etc.) to correct misrecognized words) in the same way the specification describes the voice pattern (e.g., pg. 12 3rd paragraph of the specification – “In detail, based on whether the second audio signal “Not ‘지양’ but ‘지(…)향’” 141 corresponds to at least one preset voice pattern, the electronic device 200 may identify whether the second audio signal is for correcting the first audio signal.”). For at least the reasons above, the Examiner maintains the rejections. Please see the rejections below for more details. Claim Rejections - 35 USC § 102 The following is a quotation of the appropriate paragraphs of 35 U.S.C. 
102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention. Claims 1-2, 4-5, 8, 10, 12-13, 15-16, and 18-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by SHARIFI (US 2022/0068261 A1). REGARDING CLAIM 1, SHARIFI discloses a method, performed by an electronic device, of processing a voice input, the method comprising: obtaining a first audio signal from a first user voice input of a user (Par 51 – “At block 352, the system receives audio data corresponding to a spoken utterance of a user via microphone(s) of a computing device of a user (e.g., “Play a song by John Doe”, “Turn on the bedroom lights”, etc.). At block 354, the system processes, using a speech recognition model, the audio data corresponding to the spoken utterance to generate a plurality of speech hypotheses based on output generated using the speech recognition model. The plurality of speech hypotheses can include, for example, term hypotheses for each term included in the spoken utterance (e.g., “Doe”, “Roe”, “Row”, “Dough”, etc.) and/or transcription hypotheses that include various combinations of the term hypotheses (e.g., “Play a song by John Doe”, “Play a song by John Roe”, and so on). Each of the plurality of speech hypotheses may be associated with a respective confidence level. For example, each term hypothesis may be associated with a respective confidence level. In some implementations, block 354 may include optional sub-block 354A. If included, at sub-block 354A, the system can generate a transcription corresponding to the spoken utterance. For example, the system can generate a transcription associated with a given transcription hypothesis, of the plurality of transcription hypotheses, that has the highest respective confidence level.”); obtaining a second audio signal from a second user voice input of the user subsequent to the first audio signal (Par 54 – “At block 360, the system determines whether additional audio data corresponding to an additional spoken utterance is received via the microphone(s) of the computing device of the user.”); identifying whether the second audio signal is an audio signal for correcting the obtained first audio signal (Par 54 – “As another example, word detection may be kept active such that a word detection model is utilized to determine whether a specific term or phrase is included in the additional audio data (e.g., correction term(s) or phrase(s), different speech hypotheses, and/or other term(s) or phrase(s)).”; Par 75 – “The automated assistant can identify the misrecognition based on processing the additional spoken utterance 656A. … For instance, the automated assistant can process the additional spoken utterance 656A to determine that “John Doe” from the additional spoken utterance 656A is acoustically similar to “John Doe” from the original spoken utterance 652A. As another example, the automated assistant can identify correction terms or phrases included in the spoken utterance (e.g. 
“No”, “I said”, “What I said is”, and/or other terms that are indicative of the misrecognition).”) based on a similarity between the first audio signal and the second audio signal (Par 41 – “The similarly engine 141 can compare acoustic features of the spoken utterance to additional acoustic features of the additional spoken utterance to identify a portion of the spoken utterance that is similar to an additional portion of the additional spoken utterance. The acoustic features can include, for example, MFCCs, audio waveforms, and/or other acoustic features of audio data that can be compared. Continuing with the above example, the similarity engine 141 can compare acoustic features of the spoken utterance of “send a text message to Ron” to additional acoustic features of the additional spoken utterance of “No, Ron”, and can identify that the portion of the spoken utterance corresponding to “Ron” is acoustically similar to the portion of the additional spoken utterance corresponding to “Ron”.”), and at least one of weather the obtained second audio signal has at least one vocal characteristic (Par 41 – “The similarly engine 141 can compare acoustic features of the spoken utterance to additional acoustic features of the additional spoken utterance to identify a portion of the spoken utterance that is similar to an additional portion of the additional spoken utterance. The acoustic features can include, for example, MFCCs, audio waveforms, and/or other acoustic features of audio data that can be compared. Continuing with the above example, the similarity engine 141 can compare acoustic features of the spoken utterance of “send a text message to Ron” to additional acoustic features of the additional spoken utterance of “No, Ron”, and can identify that the portion of the spoken utterance corresponding to “Ron” is acoustically similar to the portion of the additional spoken utterance corresponding to “Ron”.”; In other words, the acoustic similarity is also determined. Thus, the vocal characteristic (e.g., MFCCs, mel-filerbank filters, etc.) are identified to determine whether they are similar to the ones in the first audio signal.) or whether a voice pattern of the obtained second audio signal corresponds to at least one preset voice pattern (Par 54 – “At block 360, the system determines whether additional audio data corresponding to an additional spoken utterance is received via the microphone(s) of the computing device of the user. … As another example, word detection may be kept active such that a word detection model is utilized to determine whether a specific term or phrase is included in the additional audio data (e.g., correction term(s) or phrase(s), different speech hypotheses, and/or other term(s) or phrase(s)).”; Par 78 – “However, in contrast with FIG. 6A, assume the additional spoken utterance 656B corresponds to “Wait, John Smith”. In this example, the automated assistant can process acoustic features and/or textual features of the additional spoken utterance 656B to determine whether a portion of the additional spoken utterance 656A is intended to correct a portion of the interpretation of the spoken utterance 652B. 
For instance, the automated assistant can process the additional spoken utterance 656B to determine that “John Smith” from the additional spoken utterance 656B is not acoustically similar to “John Doe” from the original spoken utterance 652B.”; Par 49 – “Further, a second word detection model, of the word detection model(s) 140A, may a correction term or phrase model that is trained to detect words and/or phrases that are indicative of an occurrence of a misrecognition (e.g., “no”, “wait”, “I said”, and/or any other term or phrase that is indicative of the misrecognition). In some versions of those implementations, the speech misrecognition system 160 may be implemented responsive to determining a given spoken utterance includes the correction term or phrase. As another example, a word detection model trained to identify specific term(s) or phrase(s) included in the plurality of speech hypotheses may be included (e.g., acoustically similar speech hypotheses).”); based on the second audio signal being identified as the audio signal for correcting the first audio signal, obtaining at least one of one or more corrected words or one or more corrected syllables through processing of content of the second audio signal (Par 76 – “Further, the automated assistant can process, using the speech recognition model(s), additional audio data corresponding to the additional spoken utterance 656A to generate a plurality of additional speech hypotheses. The plurality of additional speech hypotheses can be restricted to those that overlap with the original plurality of speech hypotheses generated in processing the spoken utterance 652A, and optionally omit the given speech hypothesis, of the plurality of speech hypotheses, that was selected as corresponding to the spoken utterance 652A. … However, since the automated assistant previously selected the term “Roe” as corresponding to “Doe” based on the term hypotheses in table the table 620A and the automated assistant determined the additional spoken utterance 656A is intended to correct a misrecognition, the automated assistant may select the next term hypothesis with the highest confidence level, such as “Doe”.”; Par 31 – “For instance, with such models the predicted phoneme(s) (and/or other representations) are then utilized by the speech recognition engine(s) 120A1 and/or 120A2 to determine the plurality of speech hypotheses that conform to the predicted phoneme(s). In doing so, the speech recognition engine(s) 120A1 and/or 120A2 can optionally employ a decoding graph, a lexicon, and/or other resource(s).”); based on the obtained at least one of the one or more corrected words or the one or more corrected syllables (Par 13 – “In this example (and assuming the automated assistant is confident the term hypotheses for “Play”, “a”, “song”, “by”, “John” are correct), the automated assistant compares acoustic features of the spoken utterances, and determines that “Roe” and “Doe” are similar (e.g., phonetically, acoustically, and/or textually). 
The additional spoken utterance of “No, John Doe” can be processed, using the speech recognition model(s), to generate the additional speech hypotheses (e.g., transcription hypotheses and/or term hypotheses) that are restricted to those considered in previously processing the prior spoken utterance of “Play a song by John Doe”.”), correcting at least one misrecognized word in the first audio signal obtained from the first user voice input according to the at least one of the corrected words or corrected syllables (Par 57 – “At block 454, the system identifies, based on comparing the acoustic features to the additional acoustic features, at least a portion of the spoken utterance that is similar to an additional portion of the additional spoken utterance. For instance, assume the original spoken utterance included the term “Doe”, and assume speech recognition misrecognized “Doe” as “Roe”. Further assume the additional spoken utterance included the term “Doe” again. In these instances, the acoustic features indicate that the portion of the original spoken utterance corresponding to “Doe” is similar to the portion of the additional spoken utterance corresponding to “Doe”.”); processing the identified at least one corrected audio signal (Par 59 – “At block 458, the system identifies an alternate speech hypothesis, of the plurality of speech hypotheses, based on the additional audio data. The alternate speech hypothesis may include an alternate transcription hypothesis and/or alternate term hypotheses generated based on processing the additional audio data using the speech recognition model(s). … block 458 may include optional sub-block 458A. If included, at sub-block 458A, the system may generate an additional transcription corresponding to the additional spoken utterance. The additional transcription may be based on the alternate speech hypothesis, and optionally the original speech hypothesis. For example, the additional transcription may be based on an alternate transcription hypothesis, a plurality of alternate term hypotheses, and/or an alternate term hypothesis and the original transcription hypothesis.”); and outputting search information for the at least one corrected audio signal (Fig. 6A; Par 77 – “In some versions of these implementations, the automated assistant can cause the computing device 610 to audibly render a further spoken response 658A of “Right, John Doe” that acknowledges the correction of the misrecognition, and can perform the alternate action as indicated by 660A.”; Par 38 – “For example, if the action determined based on the audio data 201 includes a slot value that is a proper noun (e.g., “Give me directions to Ron's house”, “Play a song by John Doe”, etc.)”; Par 72 – “The computing device 610 may include microphone(s) to generate audio data based on spoken utterances and/or other audible input, speaker(s) to audibly render synthesized speech and/or other audible output, and/or a display 680 to visually render transcriptions and/or other visual output.”). 
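
The claim 1 mapping turns on two signals: acoustic similarity between a portion of the follow-up utterance and a portion of the original, and the presence of a preset correction phrase such as "no", "wait", or "I said". The sketch below illustrates that decision in isolation, assuming precomputed MFCC matrices and a plain transcript; the function names, threshold, and cosine comparison are illustrative choices, not the method of SHARIFI or of the application under examination.

```python
import numpy as np

# Example preset correction patterns of the kind the Office Action points to.
CORRECTION_PHRASES = ("no", "wait", "i said", "what i said is")


def acoustic_similarity(mfcc_a: np.ndarray, mfcc_b: np.ndarray) -> float:
    """Cosine similarity between time-averaged MFCC vectors of two audio segments."""
    a, b = mfcc_a.mean(axis=1), mfcc_b.mean(axis=1)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def is_correction(first_mfcc: np.ndarray, second_mfcc: np.ndarray,
                  second_transcript: str, threshold: float = 0.8) -> bool:
    """Treat the second utterance as a correction of the first if it is acoustically
    similar to the first or starts with a preset correction phrase."""
    acoustically_similar = acoustic_similarity(first_mfcc, second_mfcc) >= threshold
    has_pattern = any(second_transcript.lower().startswith(p) for p in CORRECTION_PHRASES)
    return acoustically_similar or has_pattern
```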
REGARDING CLAIM 2, SHARIFI discloses the method of claim 1, wherein the identifying of whether the obtained second audio signal is the audio signal for correcting the first audio signal comprises: based on the similarity between the obtained first audio signal and the obtained second audio signal (Par 41 – “The similarly engine 141 can compare acoustic features of the spoken utterance to additional acoustic features of the additional spoken utterance to identify a portion of the spoken utterance that is similar to an additional portion of the additional spoken utterance. The acoustic features can include, for example, MFCCs, audio waveforms, and/or other acoustic features of audio data that can be compared. Continuing with the above example, the similarity engine 141 can compare acoustic features of the spoken utterance of “send a text message to Ron” to additional acoustic features of the additional spoken utterance of “No, Ron”, and can identify that the portion of the spoken utterance corresponding to “Ron” is acoustically similar to the portion of the additional spoken utterance corresponding to “Ron”.”), identifying the at least one of whether the obtained second audio signal has the at least one vocal characteristic (Par 41 – “The similarly engine 141 can compare acoustic features of the spoken utterance to additional acoustic features of the additional spoken utterance to identify a portion of the spoken utterance that is similar to an additional portion of the additional spoken utterance. The acoustic features can include, for example, MFCCs, audio waveforms, and/or other acoustic features of audio data that can be compared. Continuing with the above example, the similarity engine 141 can compare acoustic features of the spoken utterance of “send a text message to Ron” to additional acoustic features of the additional spoken utterance of “No, Ron”, and can identify that the portion of the spoken utterance corresponding to “Ron” is acoustically similar to the portion of the additional spoken utterance corresponding to “Ron”.”; In other words, the acoustic similarity is also determined. Thus, the vocal characteristic (e.g., MFCCs, mel-filerbank filters, etc.) are identified to determine whether they are similar to the ones in the first audio signal.) or whether the voice pattern of the obtained second audio signal corresponds to at least one preset voice pattern (Par 78 – “However, in contrast with FIG. 6A, assume the additional spoken utterance 656B corresponds to “Wait, John Smith”. In this example, the automated assistant can process acoustic features and/or textual features of the additional spoken utterance 656B to determine whether a portion of the additional spoken utterance 656A is intended to correct a portion of the interpretation of the spoken utterance 652B. For instance, the automated assistant can process the additional spoken utterance 656B to determine that “John Smith” from the additional spoken utterance 656B is not acoustically similar to “John Doe” from the original spoken utterance 652B.”; Par 49 – “Further, a second word detection model, of the word detection model(s) 140A, may a correction term or phrase model that is trained to detect words and/or phrases that are indicative of an occurrence of a misrecognition (e.g., “no”, “wait”, “I said”, and/or any other term or phrase that is indicative of the misrecognition). 
In some versions of those implementations, the speech misrecognition system 160 may be implemented responsive to determining a given spoken utterance includes the correction term or phrase. As another example, a word detection model trained to identify specific term(s) or phrase(s) included in the plurality of speech hypotheses may be included (e.g., acoustically similar speech hypotheses).”). REGARDING CLAIM 4, SHARIFI discloses the method of claim 1, wherein the identifying of the at least one of whether the obtained second audio signal has the at least one vocal characteristic, and whether the voice pattern of the obtained second audio signal corresponds to the at least one preset voice pattern (see the rejection of Claim 1) comprises, based on the similarity between the obtained first audio signal and the obtained second audio signal being greater than or equal to a preset second threshold, identifying whether the obtained second audio signal has the at least one vocal characteristic (Par 14 – “The similarity between the transcriptions can be determined based on, for example, an edit distance between one or more recognized terms included in the transcriptions and/or other textual or phonetic similarities. Continuing with the above examples, a portion of the transcription that corresponds that to the given speech hypothesis (e.g., “Roe”) can be compared to an additional portion of the additional transcription that corresponds to the additional given speech hypothesis (e.g., “Doe” or “Smith”). Based on an edit distance between “Roe” and “Doe” determined based on the comparison, the automated assistant can determine the similarity satisfies a similarity threshold. In this example, the edit distance may be relatively small, and the misrecognition (e.g., selecting “Roe” instead of “Doe”) can be identified. In contrast, based on an edit distance between “Roe” and “Smith” determined based on the comparison, the automated assistant can determine the similarity does not satisfy a similarity threshold. In this example, the edit distance may be relatively larger, and that the additional spoken utterance was not provided due to misrecognition.”; Par 13 – “In this example (and assuming the automated assistant is confident the term hypotheses for “Play”, “a”, “song”, “by”, “John” are correct), the automated assistant compares acoustic features of the spoken utterances, and determines that “Roe” and “Doe” are similar (e.g., phonetically, acoustically, and/or textually).”; In other words, the acoustic similarity is also determined as well as the phonetic similarity. Thus, the vocal characteristic (e.g., MFCCs, mel-filerbank filters, etc.) are identified to determine whether they are similar to the ones in the first audio signal.), and based on the obtained similarity being less than the preset second threshold, identifying whether the voice pattern of the obtained second audio signal corresponds to the at least one preset voice pattern (Par 14 – “Based on an edit distance between “Roe” and “Doe” determined based on the comparison, the automated assistant can determine the similarity satisfies a similarity threshold. In this example, the edit distance may be relatively small, and the misrecognition (e.g., selecting “Roe” instead of “Doe”) can be identified. In contrast, based on an edit distance between “Roe” and “Smith” determined based on the comparison, the automated assistant can determine the similarity does not satisfy a similarity threshold. 
In this example, the edit distance may be relatively larger, and that the additional spoken utterance was not provided due to misrecognition.”; Par 49 – “Further, a second word detection model, of the word detection model(s) 140A, may a correction term or phrase model that is trained to detect words and/or phrases that are indicative of an occurrence of a misrecognition (e.g., “no”, “wait”, “I said”, and/or any other term or phrase that is indicative of the misrecognition). In some versions of those implementations, the speech misrecognition system 160 may be implemented responsive to determining a given spoken utterance includes the correction term or phrase.”; Par 58 –“In these instances, the acoustic features may not indicate that the portion of the original spoken utterance corresponding to “Doe” is similar to the portion of the additional spoken utterance corresponding to “Smith”. Further, a distinct action may be performed by the automated assistant based on the additional audio data, but the distinct action is not an alternate action in that it is very unlikely a term hypothesis for “Smith” or transcription hypothesis including the term hypothesis “Smith” was generated in originally processing the portion of the spoken utterance corresponding to “Doe”.”; Par 75 – “For instance, the automated assistant can process the additional spoken utterance 656A to determine that “John Doe” from the additional spoken utterance 656A is acoustically similar to “John Doe” from the original spoken utterance 652A. As another example, the automated assistant can identify correction terms or phrases included in the spoken utterance (e.g. “No”, “I said”, “What I said is”, and/or other terms that are indicative of the misrecognition). In response to determining that the additional spoken utterance corrects the misrecognition, the automated assistant can cause the original action to be cancelled if initiated and/or being performed (e.g., stop playing the song by John Roe if it had started).”; Par 78 – “However, in contrast with FIG. 6A, assume the additional spoken utterance 656B corresponds to “Wait, John Smith”. In this example, the automated assistant can process acoustic features and/or textual features of the additional spoken utterance 656B to determine whether a portion of the additional spoken utterance 656A is intended to correct a portion of the interpretation of the spoken utterance 652B. For instance, the automated assistant can process the additional spoken utterance 656B to determine that “John Smith” from the additional spoken utterance 656B is not acoustically similar to “John Doe” from the original spoken utterance 652B.”; Par 79 – “For example, assume the automated assistant is highly confident in term hypotheses corresponding to “Wait” and “John”.”; Par 80 – “In these implementations, and with respect to the action, the automated assistant can determine a distinct action to be performed by the automated assistant. As shown in FIG. 6B, the automated assistant can still utilize the “play a song” intent from the original spoken utterance 652, but utilize a distinct slot value of “John Smith” for the “artist” parameter.”; In other words, the method/system of SHARIFI still corrects the misrecognized term when the similarity is less than a threshold, e.g., Fig. 3 Step 364->No-> the distinct action, wherein the distinct action also corrects the term by replacing the previously recognized term. Since the explicit correction terms (e.g., wait, no, I said, etc.) 
clearly indicate that the user’s intention is to correct the recognition result, the system/method still correct the result despite of the low similarity score.). REGARDING CLAIM 5, SHARIFI discloses the method of claim 4, wherein the identifying of whether the obtained second audio signal has the at least one vocal characteristic comprises: obtaining second pronunciation information for each of at least one syllable included in the obtained second audio signal (Par 4 – “As one example, the additional spoken utterance can be processed to determine that the “Doe” portion of the additional spoken utterance is similar to the “Doe” portion of the prior spoken utterance (i.e., that was misrecognized as “Roe”). This similarity determination can include, for example, determining the portions are phonetically similar, acoustically similar, and/or have one or more overlapping term hypotheses from speech recognition.”; Par 12 – “The similarity between the portion of the spoken utterance and the additional portion of the additional spoken utterance can be, for example, phonetic similarity. In some further versions of those implementations, acoustic features of the features of the spoken utterance can be compared to additional acoustic features of the additional spoken utterance. Comparing the acoustic features and the additional acoustic can include, for example, and with respect to the similar portions of the spoken utterance and the additional spoken utterance, comparing an audio waveform of the spoken utterance and the additional spoken utterance, comparing mel-frequency Cepstral coefficients (MFCCs) of the spoken utterance and the additional spoken utterance, comparing mel filterbank features of the spoken utterance and the additional spoken utterance and/or comparing other acoustic features, such as an embedding derived from a machine learning model that processes or recognizes the speech (e.g., a voice activity detector, word detection model(s) 150A, speech recognition model(s) 120A, and/or other machine learning models). Based on comparing the acoustic features and the additional acoustic features, the automated assistant can identify whether at least a portion of the spoken utterance and an additional portion of the additional spoken are similar.”; Par 14 – “The similarity between the transcriptions can be determined based on, for example, an edit distance between one or more recognized terms included in the transcriptions and/or other textual or phonetic similarities.”; In other words, the phonemes, MFCCs, mel-filterbank features of the first and the second audio signals are compared for similarities. Thus, the pronunciation information (e.g., phonemes, MFCCs, mel-filterank features, etc.) is obtained.); and based on the second pronunciation information, identifying whether the at least one syllable included in the obtained second audio signal has the at least one vocal characteristic (Par 12 – “The similarity between the portion of the spoken utterance and the additional portion of the additional spoken utterance can be, for example, phonetic similarity. In some further versions of those implementations, acoustic features of the features of the spoken utterance can be compared to additional acoustic features of the additional spoken utterance. 
Comparing the acoustic features and the additional acoustic can include, for example, and with respect to the similar portions of the spoken utterance and the additional spoken utterance, comparing an audio waveform of the spoken utterance and the additional spoken utterance, comparing mel-frequency Cepstral coefficients (MFCCs) of the spoken utterance and the additional spoken utterance, comparing mel filterbank features of the spoken utterance and the additional spoken utterance and/or comparing other acoustic features, such as an embedding derived from a machine learning model that processes or recognizes the speech (e.g., a voice activity detector, word detection model(s) 150A, speech recognition model(s) 120A, and/or other machine learning models). Based on comparing the acoustic features and the additional acoustic features, the automated assistant can identify whether at least a portion of the spoken utterance and an additional portion of the additional spoken are similar.”; Par 13 – “In this example (and assuming the automated assistant is confident the term hypotheses for “Play”, “a”, “song”, “by”, “John” are correct), the automated assistant compares acoustic features of the spoken utterances, and determines that “Roe” and “Doe” are similar (e.g., phonetically, acoustically, and/or textually).”; In other words, the phonetic/acoustic similarities are determined using pronunciation information (e.g., phonemes, MFCCs, mel-filerbank filters, etc.). Thus, the vocal characteristics (e.g., similar or not similar phonetic/acoustic features) are identified.). REGARDING CLAIM 8, SHARIFI discloses the method of claim 4, wherein the identifying of whether the voice pattern of the obtained second audio signal corresponds to the at least one preset voice pattern comprises, based on a natural language processing (NLP) model (Par 33 – “The action can include, for example, an intent and slot value(s) for parameter(s) associated with the intent. As described below (e.g., with respect to FIGS. 2A and 2B), the automated assistant can process, using a natural language understanding (“NLU”) engine, the given transcription hypothesis to determine the intent. Continuing with the above example, assume the automated assistant 180 selected “send a text message to Don” as the given transcription hypothesis corresponding to the spoken utterance (instead of “Ron” as intended by the user). In this example, the intent of the action may be “send a text message”, and the slot value for a “recipient” parameter may be “Don” (and optionally a slot value for a “text message content” parameter if included in the spoken utterance).”), identifying that the voice pattern of the obtained second audio signal corresponds to the at least one preset voice pattern (Par 75 – “For instance, the automated assistant can process the additional spoken utterance 656A to determine that “John Doe” from the additional spoken utterance 656A is acoustically similar to “John Doe” from the original spoken utterance 652A. As another example, the automated assistant can identify correction terms or phrases included in the spoken utterance (e.g. “No”, “I said”, “What I said is”, and/or other terms that are indicative of the misrecognition).”), and the obtaining of the at least one of the one or more corrected words or the one or more corrected syllables comprises, based on the voice pattern of the second audio signal (Par 78 – “However, in contrast with FIG. 
6A, assume the additional spoken utterance 656B corresponds to “Wait, John Smith”.”), obtaining the at least one of the one or more corrected words or the one or more corrected syllables, by using the NLP model (Par 77 – “In these implementations, and with respect to the action, the automated assistant can supplant the slot value of “John Roe” for the artist parameter with an alternate slot value of “John Doe”, resulting in an alternate action to be performed by the automated assistant. In some versions of these implementations, the automated assistant can cause the computing device 610 to audibly render a further spoken response 658A of “Right, John Doe” that acknowledges the correction of the misrecognition, and can perform the alternate action as indicated by 660A.”). REGARDING CLAIM 10, SHARIFI discloses the method of claim 8, wherein the correcting of the first audio signal to be corrected comprises: based on the at least one of the one or more corrected words or the one or more corrected syllables, obtaining at least one of one or more misrecognized words or one or more misrecognized syllables included in the obtained first audio signal (Par 76 – “Further, the automated assistant can process, using the speech recognition model(s), additional audio data corresponding to the additional spoken utterance 656A to generate a plurality of additional speech hypotheses. The plurality of additional speech hypotheses can be restricted to those that overlap with the original plurality of speech hypotheses generated in processing the spoken utterance 652A, and optionally omit the given speech hypothesis, of the plurality of speech hypotheses, that was selected as corresponding to the spoken utterance 652A. … However, since the automated assistant previously selected the term “Roe” as corresponding to “Doe” based on the term hypotheses in table the table 620A and the automated assistant determined the additional spoken utterance 656A is intended to correct a misrecognition, the automated assistant may select the next term hypothesis with the highest confidence level, such as “Doe”.”; Par 31 – “For instance, with such models the predicted phoneme(s) (and/or other representations) are then utilized by the speech recognition engine(s) 120A1 and/or 120A2 to determine the plurality of speech hypotheses that conform to the predicted phoneme(s). In doing so, the speech recognition engine(s) 120A1 and/or 120A2 can optionally employ a decoding graph, a lexicon, and/or other resource(s).”); and based on the at least one of the one or more corrected words or the one or more corrected syllables, and the at least one of the one or more misrecognized words or the one or more misrecognized syllables included in the obtained first audio signal, identifying the at least one corrected audio signal (Par 57 – “At block 454, the system identifies, based on comparing the acoustic features to the additional acoustic features, at least a portion of the spoken utterance that is similar to an additional portion of the additional spoken utterance. For instance, assume the original spoken utterance included the term “Doe”, and assume speech recognition misrecognized “Doe” as “Roe”. Further assume the additional spoken utterance included the term “Doe” again. In these instances, the acoustic features indicate that the portion of the original spoken utterance corresponding to “Doe” is similar to the portion of the additional spoken utterance corresponding to “Doe”.”). 
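
Paragraph 76 of SHARIFI, cited for claims 3, 10, and 19, describes discarding the term hypothesis that was originally selected and promoting the best remaining hypothesis that also appears among the correction utterance's hypotheses. A small sketch of that re-selection, with hypotheses modeled as (term, confidence) pairs; the data structures are assumptions for illustration only:

```python
def reselect_term(original_hyps: list[tuple[str, float]],
                  correction_hyps: list[tuple[str, float]],
                  previously_selected: str) -> str | None:
    """Pick the best original hypothesis that also appears among the correction
    utterance's hypotheses, excluding the term that was already (mis)selected."""
    overlap = {term for term, _ in correction_hyps}
    candidates = [(term, score) for term, score in original_hyps
                  if term in overlap and term != previously_selected]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]


# "Doe" was misrecognized as "Roe"; the follow-up utterance repeats "Doe".
original = [("Roe", 0.62), ("Doe", 0.58), ("Row", 0.31)]
correction = [("Doe", 0.71), ("Dough", 0.22)]
print(reselect_term(original, correction, previously_selected="Roe"))  # -> Doe
```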
REGARDING CLAIM 12, SHARIFI discloses an electronic device for processing a voice input of a user, the electronic device comprising: a memory storing one or more instructions; and at least one processor configured to execute the one or more instructions (Par 88 – “processor .. memory …”) to perform the steps of claim 1; thus, it is rejected under the same rationale. REGARDING CLAIM 13, SHARIFI discloses the electronic device of claim 12, wherein the at least one processor is further configured to execute the one or more instructions to, based on the similarity between the obtained first audio signal and the obtained second audio signal (Par 41 – “The similarly engine 141 can compare acoustic features of the spoken utterance to additional acoustic features of the additional spoken utterance to identify a portion of the spoken utterance that is similar to an additional portion of the additional spoken utterance. The acoustic features can include, for example, MFCCs, audio waveforms, and/or other acoustic features of audio data that can be compared. Continuing with the above example, the similarity engine 141 can compare acoustic features of the spoken utterance of “send a text message to Ron” to additional acoustic features of the additional spoken utterance of “No, Ron”, and can identify that the portion of the spoken utterance corresponding to “Ron” is acoustically similar to the portion of the additional spoken utterance corresponding to “Ron”.”), identify at least one of whether the second audio signal has at least one vocal characteristic (Par 41 – “The similarly engine 141 can compare acoustic features of the spoken utterance to additional acoustic features of the additional spoken utterance to identify a portion of the spoken utterance that is similar to an additional portion of the additional spoken utterance. The acoustic features can include, for example, MFCCs, audio waveforms, and/or other acoustic features of audio data that can be compared. Continuing with the above example, the similarity engine 141 can compare acoustic features of the spoken utterance of “send a text message to Ron” to additional acoustic features of the additional spoken utterance of “No, Ron”, and can identify that the portion of the spoken utterance corresponding to “Ron” is acoustically similar to the portion of the additional spoken utterance corresponding to “Ron”.”; In other words, the acoustic similarity is also determined. Thus, the vocal characteristic (e.g., MFCCs, mel-filerbank filters, etc.) are identified to determine whether they are similar to the ones in the first audio signal.) or whether the voice pattern of the obtained second audio signal corresponds to the at least one preset voice pattern (Par 78 – “However, in contrast with FIG. 6A, assume the additional spoken utterance 656B corresponds to “Wait, John Smith”. In this example, the automated assistant can process acoustic features and/or textual features of the additional spoken utterance 656B to determine whether a portion of the additional spoken utterance 656A is intended to correct a portion of the interpretation of the spoken utterance 652B. 
For instance, the automated assistant can process the additional spoken utterance 656B to determine that “John Smith” from the additional spoken utterance 656B is not acoustically similar to “John Doe” from the original spoken utterance 652B.”; Par 49 – “Further, a second word detection model, of the word detection model(s) 140A, may a correction term or phrase model that is trained to detect words and/or phrases that are indicative of an occurrence of a misrecognition (e.g., “no”, “wait”, “I said”, and/or any other term or phrase that is indicative of the misrecognition). In some versions of those implementations, the speech misrecognition system 160 may be implemented responsive to determining a given spoken utterance includes the correction term or phrase. As another example, a word detection model trained to identify specific term(s) or phrase(s) included in the plurality of speech hypotheses may be included (e.g., acoustically similar speech hypotheses).”). REGARDING CLAIM 15, SHARIFI discloses the electronic device of claim 12, wherein the at least one processor is further configured to execute the one or more instructions to, based on the similarity being greater than or equal to a preset second threshold, identify whether the obtained second audio signal has at least one vocal characteristic (Par 14 – “The similarity between the transcriptions can be determined based on, for example, an edit distance between one or more recognized terms included in the transcriptions and/or other textual or phonetic similarities. Continuing with the above examples, a portion of the transcription that corresponds that to the given speech hypothesis (e.g., “Roe”) can be compared to an additional portion of the additional transcription that corresponds to the additional given speech hypothesis (e.g., “Doe” or “Smith”). Based on an edit distance between “Roe” and “Doe” determined based on the comparison, the automated assistant can determine the similarity satisfies a similarity threshold. In this example, the edit distance may be relatively small, and the misrecognition (e.g., selecting “Roe” instead of “Doe”) can be identified. In contrast, based on an edit distance between “Roe” and “Smith” determined based on the comparison, the automated assistant can determine the similarity does not satisfy a similarity threshold. In this example, the edit distance may be relatively larger, and that the additional spoken utterance was not provided due to misrecognition.”; Par 13 – “In this example (and assuming the automated assistant is confident the term hypotheses for “Play”, “a”, “song”, “by”, “John” are correct), the automated assistant compares acoustic features of the spoken utterances, and determines that “Roe” and “Doe” are similar (e.g., phonetically, acoustically, and/or textually).”; In other words, the acoustic similarity is also determined as well as the phonetic similarity. Thus, the vocal characteristic (e.g., MFCCs, mel-filerbank filters, etc.) are identified to determine whether they are similar to the ones in the first audio signal.), and when the similarity is less than the preset second threshold, identify whether a voice pattern of the obtained second audio signal corresponds to the at least one preset voice pattern (Par 14 – “Based on an edit distance between “Roe” and “Doe” determined based on the comparison, the automated assistant can determine the similarity satisfies a similarity threshold. 
In this example, the edit distance may be relatively small, and the misrecognition (e.g., selecting “Roe” instead of “Doe”) can be identified. In contrast, based on an edit distance between “Roe” and “Smith” determined based on the comparison, the automated assistant can determine the similarity does not satisfy a similarity threshold. In this example, the edit distance may be relatively larger, and that the additional spoken utterance was not provided due to misrecognition.”; Par 49 – “Further, a second word detection model, of the word detection model(s) 140A, may a correction term or phrase model that is trained to detect words and/or phrases that are indicative of an occurrence of a misrecognition (e.g., “no”, “wait”, “I said”, and/or any other term or phrase that is indicative of the misrecognition). In some versions of those implementations, the speech misrecognition system 160 may be implemented responsive to determining a given spoken utterance includes the correction term or phrase.”; Par 58 –“In these instances, the acoustic features may not indicate that the portion of the original spoken utterance corresponding to “Doe” is similar to the portion of the additional spoken utterance corresponding to “Smith”. Further, a distinct action may be performed by the automated assistant based on the additional audio data, but the distinct action is not an alternate action in that it is very unlikely a term hypothesis for “Smith” or transcription hypothesis including the term hypothesis “Smith” was generated in originally processing the portion of the spoken utterance corresponding to “Doe”.”; Par 75 – “For instance, the automated assistant can process the additional spoken utterance 656A to determine that “John Doe” from the additional spoken utterance 656A is acoustically similar to “John Doe” from the original spoken utterance 652A. As another example, the automated assistant can identify correction terms or phrases included in the spoken utterance (e.g. “No”, “I said”, “What I said is”, and/or other terms that are indicative of the misrecognition). In response to determining that the additional spoken utterance corrects the misrecognition, the automated assistant can cause the original action to be cancelled if initiated and/or being performed (e.g., stop playing the song by John Roe if it had started).”; Par 78 – “However, in contrast with FIG. 6A, assume the additional spoken utterance 656B corresponds to “Wait, John Smith”. In this example, the automated assistant can process acoustic features and/or textual features of the additional spoken utterance 656B to determine whether a portion of the additional spoken utterance 656A is intended to correct a portion of the interpretation of the spoken utterance 652B. For instance, the automated assistant can process the additional spoken utterance 656B to determine that “John Smith” from the additional spoken utterance 656B is not acoustically similar to “John Doe” from the original spoken utterance 652B.”; Par 79 – “For example, assume the automated assistant is highly confident in term hypotheses corresponding to “Wait” and “John”.”; Par 80 – “In these implementations, and with respect to the action, the automated assistant can determine a distinct action to be performed by the automated assistant. As shown in FIG. 
6B, the automated assistant can still utilize the “play a song” intent from the original spoken utterance 652, but utilize a distinct slot value of “John Smith” for the “artist” parameter.”; In other words, the method/system of SHARIFI still corrects the misrecognized term when the similarity is less than a threshold, e.g., Fig. 3 Step 364->No-> the distinct action, wherein the distinct action also corrects the term by replacing the previously recognized term. Since the explicit correction terms (e.g., wait, no, I said, etc.) clearly indicate that the user’s intention is to correct the recognition result, the system/method still correct the result despite of the low similarity score.). REGARDING CLAIM 16, SHARIFI discloses the electronic device of claim 15, wherein the at least one processor is further configured to execute the one or more instructions to: obtain second pronunciation information for each of at least one syllable included in the obtained second audio signal (Par 4 – “As one example, the additional spoken utterance can be processed to determine that the “Doe” portion of the additional spoken utterance is similar to the “Doe” portion of the prior spoken utterance (i.e., that was misrecognized as “Roe”). This similarity determination can include, for example, determining the portions are phonetically similar, acoustically similar, and/or have one or more overlapping term hypotheses from speech recognition.”; Par 14 – “The similarity between the transcriptions can be determined based on, for example, an edit distance between one or more recognized terms included in the transcriptions and/or other textual or phonetic similarities.”; In other words, the phonemes of the first and the second audio signals are compared for phonetic similarities. Thus, the pronunciation information is obtained.), and based on the second pronunciation information, identify whether the at least one syllable included in the obtained second audio signal has the at least one vocal characteristic (Par 12 – “The similarity between the portion of the spoken utterance and the additional portion of the additional spoken utterance can be, for example, phonetic similarity. In some further versions of those implementations, acoustic features of the features of the spoken utterance can be compared to additional acoustic features of the additional spoken utterance. Comparing the acoustic features and the additional acoustic can include, for example, and with respect to the similar portions of the spoken utterance and the additional spoken utterance, comparing an audio waveform of the spoken utterance and the additional spoken utterance, comparing mel-frequency Cepstral coefficients (MFCCs) of the spoken utterance and the additional spoken utterance, comparing mel filterbank features of the spoken utterance and the additional spoken utterance and/or comparing other acoustic features, such as an embedding derived from a machine learning model that processes or recognizes the speech (e.g., a voice activity detector, word detection model(s) 150A, speech recognition model(s) 120A, and/or other machine learning models). 
Based on comparing the acoustic features and the additional acoustic features, the automated assistant can identify whether at least a portion of the spoken utterance and an additional portion of the additional spoken are similar.”; Par 13 – “In this example (and assuming the automated assistant is confident the term hypotheses for “Play”, “a”, “song”, “by”, “John” are correct), the automated assistant compares acoustic features of the spoken utterances, and determines that “Roe” and “Doe” are similar (e.g., phonetically, acoustically, and/or textually).”; In other words, the acoustic similarity is also determined as well as the phonetic similarity. Thus, the vocal characteristic (e.g., MFCCs, mel-filerbank filters, etc.) are identified to determine whether they are similar to the ones in the first audio signal.). REGARDING CLAIM 18, SHARIFI discloses the electronic device of claim 12, wherein the at least one processor is further configured to execute the one or more instructions to: based on a natural language processing (NLP) model stored in the memory (Par 33 – “The action can include, for example, an intent and slot value(s) for parameter(s) associated with the intent. As described below (e.g., with respect to FIGS. 2A and 2B), the automated assistant can process, using a natural language understanding (“NLU”) engine, the given transcription hypothesis to determine the intent. Continuing with the above example, assume the automated assistant 180 selected “send a text message to Don” as the given transcription hypothesis corresponding to the spoken utterance (instead of “Ron” as intended by the user). In this example, the intent of the action may be “send a text message”, and the slot value for a “recipient” parameter may be “Don” (and optionally a slot value for a “text message content” parameter if included in the spoken utterance).”), identify whether a voice pattern of the obtained second audio signal corresponds to at least one preset voice pattern (Par 75 – “For instance, the automated assistant can process the additional spoken utterance 656A to determine that “John Doe” from the additional spoken utterance 656A is acoustically similar to “John Doe” from the original spoken utterance 652A. As another example, the automated assistant can identify correction terms or phrases included in the spoken utterance (e.g. “No”, “I said”, “What I said is”, and/or other terms that are indicative of the misrecognition).”), and based on the voice pattern of the obtained second audio signal (Par 78 – “However, in contrast with FIG. 6A, assume the additional spoken utterance 656B corresponds to “Wait, John Smith”.”), obtain the at least one of the one or more corrected words or the one or more corrected syllables, by using the NLP model (Par 77 – “In these implementations, and with respect to the action, the automated assistant can supplant the slot value of “John Roe” for the artist parameter with an alternate slot value of “John Doe”, resulting in an alternate action to be performed by the automated assistant. In some versions of these implementations, the automated assistant can cause the computing device 610 to audibly render a further spoken response 658A of “Right, John Doe” that acknowledges the correction of the misrecognition, and can perform the alternate action as indicated by 660A.”). 
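
Paragraph 14, relied on for the threshold limitations of claims 4 and 15, frames the similarity test as an edit distance between the originally recognized term and the term in the follow-up utterance: small for "Roe" versus "Doe", larger for "Roe" versus "Smith". A sketch of that comparison; the normalization and the 0.5 threshold are assumptions made for illustration:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similar_enough(original: str, candidate: str, threshold: float = 0.5) -> bool:
    """Normalize the edit distance by the longer term and compare to a threshold."""
    dist = edit_distance(original.lower(), candidate.lower())
    return dist / max(len(original), len(candidate), 1) <= threshold


print(similar_enough("Roe", "Doe"))    # small distance -> likely a misrecognition
print(similar_enough("Roe", "Smith"))  # large distance -> likely a new request
```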
REGARDING CLAIM 19, SHARIFI discloses the electronic device of claim 12, wherein the at least one processor is further configured to execute the one or more instructions to: based on the at least one of the one or more corrected words or the one or more corrected syllables, obtain at least one of one or more misrecognized words or one or more misrecognized syllables included in the obtained first audio signal (Par 76 – “Further, the automated assistant can process, using the speech recognition model(s), additional audio data corresponding to the additional spoken utterance 656A to generate a plurality of additional speech hypotheses. The plurality of additional speech hypotheses can be restricted to those that overlap with the original plurality of speech hypotheses generated in processing the spoken utterance 652A, and optionally omit the given speech hypothesis, of the plurality of speech hypotheses, that was selected as corresponding to the spoken utterance 652A. … However, since the automated assistant previously selected the term “Roe” as corresponding to “Doe” based on the term hypotheses in table the table 620A and the automated assistant determined the additional spoken utterance 656A is intended to correct a misrecognition, the automated assistant may select the next term hypothesis with the highest confidence level, such as “Doe”.”; Par 31 – “For instance, with such models the predicted phoneme(s) (and/or other representations) are then utilized by the speech recognition engine(s) 120A1 and/or 120A2 to determine the plurality of speech hypotheses that conform to the predicted phoneme(s). In doing so, the speech recognition engine(s) 120A1 and/or 120A2 can optionally employ a decoding graph, a lexicon, and/or other resource(s).”), and based on the at least one of the one or more corrected words or the one or more corrected syllables, and the at least one of the one or more misrecognized words or the one or more misrecognized syllables included in the obtained first audio signal, identify the at least one corrected audio signal (Par 57 – “At block 454, the system identifies, based on comparing the acoustic features to the additional acoustic features, at least a portion of the spoken utterance that is similar to an additional portion of the additional spoken utterance. For instance, assume the original spoken utterance included the term “Doe”, and assume speech recognition misrecognized “Doe” as “Roe”. Further assume the additional spoken utterance included the term “Doe” again. In these instances, the acoustic features indicate that the portion of the original spoken utterance corresponding to “Doe” is similar to the portion of the additional spoken utterance corresponding to “Doe”.”). REGARDING CLAIM 20, SHARIFI discloses a non-transitory computer-readable recording medium having recorded thereon instructions for causing a processor of an electronic device to perform the method of claim 1; thus, it is rejected under the same rationale as explained in the rejection of claim 1. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 3, 9, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over SHARIFI, and in further view of BEAUFAYS (US 2021/0327410 A1). REGARDING CLAIM 3, SHARIFI discloses the method of claim 1, wherein the correcting of the first audio signal comprises: based on the obtained at least one of one or more corrected words or one or more corrected syllables, obtaining the at least one misrecognized word included in the obtained first audio signal (Par 76 – “Further, the automated assistant can process, using the speech recognition model(s), additional audio data corresponding to the additional spoken utterance 656A to generate a plurality of additional speech hypotheses. The plurality of additional speech hypotheses can be restricted to those that overlap with the original plurality of speech hypotheses generated in processing the spoken utterance 652A, and optionally omit the given speech hypothesis, of the plurality of speech hypotheses, that was selected as corresponding to the spoken utterance 652A. … However, since the automated assistant previously selected the term “Roe” as corresponding to “Doe” based on the term hypotheses in table the table 620A and the automated assistant determined the additional spoken utterance 656A is intended to correct a misrecognition, the automated assistant may select the next term hypothesis with the highest confidence level, such as “Doe”.”; Par 31 – “For instance, with such models the predicted phoneme(s) (and/or other representations) are then utilized by the speech recognition engine(s) 120A1 and/or 120A2 to determine the plurality of speech hypotheses that conform to the predicted phoneme(s). In doing so, the speech recognition engine(s) 120A1 and/or 120A2 can optionally employ a decoding graph, a lexicon, and/or other resource(s).”); obtaining, from among at least one word included in a [named entity (NE)] dictionary, at least one word (Par 48 – “The biasing engine 143 can identify term(s) that were misrecognized and/or term(s) that supplant the misrecognized term(s), and can store these term(s) in the biased term(s) database 143A. Continuing with the above example, the biasing engine 143 can cause the term(s) “Ron” and “Don” to be stored in the biased term(s) database. Further assume that the client device 110 receives a subsequent spoken utterance of “call Ron”. In this example, the biasing engine 143 can cause the automated assistant to bias speech recognition of the subsequent spoken utterance towards the term hypothesis of “Ron” and/or away from the term hypothesis of “Don”. In some implementations, the biasing engine 143 can cause the biasing engine 143 to bias the subsequent spoken utterance towards and/or away from the biased term(s) for only the current human-to-computer dialog session. 
In some additional and/or alternative implementations, the biasing engine 143 can cause the biasing engine 143 to bias the subsequent spoken utterance to and/or away from the biased term(s) for subsequent human-to-computer dialog sessions. For instance, the biasing may be limited to a particular user interaction and/or a particular duration of time. By limiting the biasing, the speech misrecognition system 160 can combat over biasing towards and/or away from the biased term(s).”; Par 81 – “In various implementations, the automated assistant can bias future speech recognition towards the alternate speech hypothesis, and away from the originally selected speech hypothesis when a misrecognition is corrected. For example, in the dialog of FIG. 6A, if a subsequent spoken utterance is received at the client device 610 (e.g., during the current dialog session and/or during a subsequent dialog session) where “Doe” is a term hypothesis and/or “Roe” is a term hypothesis, then the speech recognition can be biased towards the term hypothesis for “Doe” and/or away from the term hypothesis for “Roe”.”), a similarity of which to the one or more corrected words is greater than or equal to a preset first threshold (Par 14 – “Based on an edit distance between “Roe” and “Doe” determined based on the comparison, the automated assistant can determine the similarity satisfies a similarity threshold. In this example, the edit distance may be relatively small, and the misrecognition (e.g., selecting “Roe” instead of “Doe”) can be identified.”); and correcting the obtained at least one misrecognized word, to at least one of the at least one word corresponding to the obtained at least one misrecognized word, or the at least one corrected word (Par 57 – “At block 454, the system identifies, based on comparing the acoustic features to the additional acoustic features, at least a portion of the spoken utterance that is similar to an additional portion of the additional spoken utterance. For instance, assume the original spoken utterance included the term “Doe”, and assume speech recognition misrecognized “Doe” as “Roe”. Further assume the additional spoken utterance included the term “Doe” again. In these instances, the acoustic features indicate that the portion of the original spoken utterance corresponding to “Doe” is similar to the portion of the additional spoken utterance corresponding to “Doe”.”; In other words, because the method/system of SHARIFI identifies the corrected word “Doe” that corresponds to the misrecognized word “Roe” and further corrects the error, the portion of the original spoken utterance is identified and corrected to produce a correct recognition result.). SHARIFI does not explicitly teach the [square-bracketed] limitation. In other words, SHARIFI teaches looking up the biased terms stored in a memory (143A) or a table (630A) for obtaining a word with a high similarity, but does not explicitly teach that the biased terms stored in a memory or a table constitute a [named entity] dictionary. 
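For illustration of the edit-distance similarity test cited in Par 14, combined with the lookup over stored candidate terms cited in Par 48, a minimal sketch follows. The term store, the threshold value, and the function names are assumptions for illustration and are not drawn from SHARIFI or the claims.

```python
# Illustrative sketch only: keep stored candidate terms whose edit-distance-style
# similarity to the corrected word meets a preset first threshold, so a
# misrecognized word can be replaced by a matching stored term.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Edit-distance-style similarity in [0, 1] (1.0 = identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidates_for_correction(corrected_word: str, stored_terms: list[str],
                              first_threshold: float = 0.6) -> list[str]:
    """Return stored terms similar enough to the corrected word."""
    return [t for t in stored_terms if similarity(corrected_word, t) >= first_threshold]


# Hypothetical usage mirroring the "Doe"/"Don"/"Ron" examples:
# candidates_for_correction("Doe", ["Doe", "Don", "Ron", "Smith"]) -> ["Doe", "Don"]
```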
BEAUFAYS discloses a method/system for correcting misrecognized utterances comprising: obtaining, from among at least one word included in a [named entity (NE)] dictionary, at least one word (BEAUFAYS Par 29 – “In some implementations, after updating of an on-device speech recognition model according to implementations disclosed herein, biasing of the on-device speech recognition based on textual segment(s) can also be utilized, when the on-device speech recognition model is processing audio data corresponding to spoken utterances to generate corresponding predicted textual segments. For example, the on-device speech recognition can be biased toward one or more textual segments stored on the device, such as contact alias(es), road name(s), media name(s), and/or other textual segment(s). The textual segment(s) to which on-device speech recognition is biased can be the same across multiple iterations of speech recognition and/or can vary amongst iterations.”), a similarity of which to the one or more corrected words is greater than or equal to a preset first threshold (Par 24 – “Determining that a correction is directed to performance of the on-device speech recognition can, in some versions of those implementations, be based on determining a measure of similarity between term(s) of the predicted textual segment(s) and modified and/or replaced term(s) of the alternate textual segment(s), and based on determining the measure of similarity satisfies a threshold.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of SHARIFI to include a named entity dictionary, as taught by BEAUFAYS. One of ordinary skill would have been motivated to include a named entity dictionary, in order to accurately recognize the user’s utterance according to the user context. REGARDING CLAIM 9, SHARIFI discloses the method of claim 8, wherein the correcting of the first audio signal to be corrected comprises: identifying, by using the NLP model, whether the voice pattern of the obtained second audio signal is a complete voice pattern among the at least one preset voice pattern (Fig. 6B – “Wait, John Smith 646B”; Par 49 – “Further, a second word detection model, of the word detection model(s) 140A, may a correction term or phrase model that is trained to detect words and/or phrases that are indicative of an occurrence of a misrecognition (e.g., “no”, “wait”, “I said”, and/or any other term or phrase that is indicative of the misrecognition). In some versions of those implementations, the speech misrecognition system 160 may be implemented responsive to determining a given spoken utterance includes the correction term or phrase. As another example, a word detection model trained to identify specific term(s) or phrase(s) included in the plurality of speech hypotheses may be included (e.g., acoustically similar speech hypotheses).”); based on the voice pattern of the obtained second audio signal being identified as the complete voice pattern, obtaining at least one of one or more misrecognized words and one or more misrecognized syllables included in the obtained first audio signal (Par 78 – “However, in contrast with FIG. 6A, assume the additional spoken utterance 656B corresponds to “Wait, John Smith”. 
In this example, the automated assistant can process acoustic features and/or textual features of the additional spoken utterance 656B to determine whether a portion of the additional spoken utterance 656A is intended to correct a portion of the interpretation of the spoken utterance 652B. For instance, the automated assistant can process the additional spoken utterance 656B to determine that “John Smith” from the additional spoken utterance 656B is not acoustically similar to “John Doe” from the original spoken utterance 652B.”; Par 80 – “In these implementations, and with respect to the action, the automated assistant can determine a distinct action to be performed by the automated assistant. As shown in FIG. 6B, the automated assistant can still utilize the “play a song” intent from the original spoken utterance 652, but utilize a distinct slot value of “John Smith” for the “artist” parameter.”; Par 58 – “In these instances, the acoustic features may not indicate that the portion of the original spoken utterance corresponding to “Doe” is similar to the portion of the additional spoken utterance corresponding to “Smith”. Further, a distinct action may be performed by the automated assistant based on the additional audio data, but the distinct action is not an alternate action in that it is very unlikely a term hypothesis for “Smith” or transcription hypothesis including the term hypothesis “Smith” was generated in originally processing the portion of the spoken utterance corresponding to “Doe”. If, at an iteration of block 456, the system determines the similarity between the portion of the spoken utterance that is similar to the additional portion of the additional spoken utterance satisfies the similarity threshold, then the system may proceed to block 458.”); and identifying the at least one corrected audio signal by correcting the obtained at least one of the one or more misrecognized words or the one or more misrecognized syllables, to the at least one of the one or more corrected words or the one or more corrected syllables corresponding thereto (Par 80 – “In these implementations, and with respect to the action, the automated assistant can determine a distinct action to be performed by the automated assistant. As shown in FIG. 6B, the automated assistant can still utilize the “play a song” intent from the original spoken utterance 652, but utilize a distinct slot value of “John Smith” for the “artist” parameter. In some versions of these implementations, the automated assistant can cause the computing device 610 to audibly render a further spoken response 658B of “Okay, John Smith” that acknowledges the distinct action, and can perform the distinct action as indicated by 660B.”), and the complete voice pattern is a voice pattern including [at least one of one or more misrecognized words or one or more misrecognized syllables of an audio signal], and at least one of one or more corrected words or one or more corrected syllables, among the at least one preset voice pattern (Par 78 – “For example, and referring specifically to FIG. 6B, assume an additional spoken utterance 656B is received at the computing device 610 in response to the computing device 610 audibly rendering the spoken response 654A. However, in contrast with FIG. 6A, assume the additional spoken utterance 656B corresponds to “Wait, John Smith”.”). SHARIFI does not explicitly teach the [square-bracketed] limitation. 
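For illustration of the preset voice-pattern check discussed above (correction cues such as “no”, “wait”, “I said”, and the “complete” form that names both the misrecognized and corrected word), a minimal sketch follows. The pattern list, the regular expressions, and the function name are assumptions for illustration only.

```python
# Illustrative sketch only: check whether a second utterance's transcript matches
# a preset correction pattern and, if so, extract the replacement text (and, for
# the "complete" form, the misrecognized word as well).
import re

PRESET_PATTERNS = [
    r"^(no|wait|i said|what i said is)[,.!]?\s+(?P<corrected>.+)$",      # cue + correction
    r"^not\s+(?P<misrecognized>\S+)\s+but\s+(?P<corrected>.+)$",         # "complete" form
]


def match_correction_pattern(transcript: str) -> dict | None:
    """Return named groups if the transcript matches a preset voice pattern."""
    text = transcript.strip().lower()
    for pattern in PRESET_PATTERNS:
        m = re.match(pattern, text)
        if m:
            return {k: v for k, v in m.groupdict().items() if v}
    return None


# Hypothetical usage:
# match_correction_pattern("Wait, John Smith") -> {"corrected": "john smith"}
# match_correction_pattern("Not Roe but Doe")  -> {"misrecognized": "roe", "corrected": "doe"}
```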
BEAUFAYS discloses a method/system for correcting misrecognized utterances comprising: the complete voice pattern is a voice pattern including [at least one of one or more misrecognized words or one or more misrecognized syllables of an audio signal] (BEAUFAYS Par 100 – “For example, in FIG. 5E, the state of the user 501 can be based on, for example, predicted textual segments included in the additional spoken utterance “NO! Call Ron, NOT Don!””), and at least one of one or more corrected words or one or more corrected syllables, among the at least one preset voice pattern (BEAUFAYS Par 100 – “For example, in FIG. 5E, the state of the user 501 can be based on, for example, predicted textual segments included in the additional spoken utterance “NO! Call Ron, NOT Don!””). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of SHARIFI to include a misrecognized word, as taught by BEAUFAYS. One of ordinary skill would have been motivated to include a misrecognized word, in order to clarify what word is to be corrected. CLAIM 14 is similar to Claim 3; thus, it is rejected under the same rationale. Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over SHARIFI, and in further view of AGARWAL (US 2020/0184956 A1). REGARDING CLAIM 11, SHARIFI discloses the method of claim 1, wherein the processing of the at least one corrected audio signal comprises receiving, from the user, a response signal related to misrecognition (Par 49 – “For example, a first word detection model, of the word detection model(s) 150A, may be a hotword or phrase detection model that is trained to detect words and/or phrases that invoke the automated assistant 180 (e.g., “Assistant”, “Hey Assistant”, “Okay Assistant”, and so on). Further, a second word detection model, of the word detection model(s) 140A, may a correction term or phrase model that is trained to detect words and/or phrases that are indicative of an occurrence of a misrecognition (e.g., “no”, “wait”, “I said”, and/or any other term or phrase that is indicative of the misrecognition).”), as search information for the at least one corrected audio signal is output to the user (Par 74 – “For example, in response to receiving the spoken utterance 652A, the automated assistant can cause the computing device 610 to audibly render a spoken response 654A of “Okay, playing a song by John Roe”. The automated assistant can optionally cause the computing device to visually render a transcription corresponding to the spoken response 654A via the display 680.”; Par 75 – “As another example, the automated assistant can identify correction terms or phrases included in the spoken utterance (e.g. “No”, “I said”, “What I said is”, and/or other terms that are indicative of the misrecognition). In response to determining that the additional spoken utterance corrects the misrecognition, the automated assistant can cause the original action to be cancelled if initiated and/or being performed (e.g., stop playing the song by John Roe if it had started).”), and [requesting] the user to perform reutterance according to the response signal (Par 75 – “For example, assume the additional spoken utterance 656A is received at the computing device 610 in response to the computing device 610 audibly rendering the spoken response 654A. 
In this example, the automated assistant can process acoustic features and/or textual features of the additional spoken utterance 656A to determine whether a portion of the additional spoken utterance 656A is intended to correct a portion of the interpretation of the spoken utterance 652A. For instance, the automated assistant can process the additional spoken utterance 656A to determine that “John Doe” from the additional spoken utterance 656A is acoustically similar to “John Doe” from the original spoken utterance 652A.”). SHARIFI does not explicitly teach the [square-bracketed] limitation. In other words, SHARIFI teaches receiving a correction term or phrase (e.g., wait, no, I said, etc.) and also receiving reutterance from the user (e.g., “John Doe” as shown in Fig. 6A). The method/system of SHARIFI receives the correction term/phrase and the reutterance simultaneously, whereas the claim requires the system to request the user to reutter. AGARWAL discloses the [square-bracketed] limitation. AGARWAL discloses a method/system for voice interaction with a user comprising: receiving, from the user, a response signal related to misrecognition (AGARWAL Par 75 – “Assume, instead, that the user says “no” in response to the confirmation prompt. The URFC 624 and SRFC 622 can respond to this event using different environment-specific strategies.”), as search information for the at least one corrected audio signal is output to the user (AGARWAL Par 43 – “The prediction system 106 terminates the transaction by sending a confirmation prompt to the user which notifies him that it is calling James Smith. The prompt invites the user to cancel this operation if the user did not intend to call James Smith.”; Par 74 – “Another kind of confirmation template notifies the user that the digital assistant 104 is currently acting on, or has already acted on, the predicted response; the template can then invite the user to cancel or reverse the action.”), and [requesting] the user to perform reutterance according to the response signal (AGARWAL Par 75 – “Assume, instead, that the user says “no” in response to the confirmation prompt. The URFC 624 and SRFC 622 can respond to this event using different environment-specific strategies. In one approach, the URFC 624 instructs the SRFC 622 to issue the original system prompt, rather than the confirmation prompt. The original prompt reads, “At which theater should I book the tickets?” In another approach, the SRFC 622 can offer another confirmation prompt to the user based on another predicted response provided by the predictor component 610 (e.g., which may correspond to the record entry having the second-best matching score, the user having already rejected the record entry having the best matching score).”; In other words, AGARWAL teaches inviting a user to correct the misrecognition by cancelling the action (e.g., Fig. 2 – “Calling James Smith, cancel if you’d like to stop”). Upon receiving the cancel request (e.g., “No”), the method/system further issues the original system prompt (e.g., “Which one would you like to call, James Brown, James Smith, or James Johnson” in Fig. 2) to the user to reutter the request.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of SHARIFI to include requesting a user to reutter, as taught by AGARWAL. 
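For illustration of the confirmation-and-reprompt flow the rejection reads from AGARWAL (confirm the predicted action; on a negative response, cancel and reissue the original prompt to solicit a reutterance), a simplified sketch follows. The response list, prompts, and function name are assumptions for illustration and are not drawn from AGARWAL.

```python
# Simplified illustration (not AGARWAL's implementation) of the cited flow:
# if the user's response signal is negative, cancel the predicted action and
# reissue the original prompt so the user can reutter the request.
NEGATIVE_RESPONSES = {"no", "nope", "cancel", "stop"}


def handle_confirmation(response: str, original_prompt: str, predicted_action: str) -> str:
    """Return the assistant's next prompt after the user's response signal."""
    if response.strip().lower() in NEGATIVE_RESPONSES:
        # Misrecognition indicated: request a reutterance via the original prompt.
        return original_prompt
    return f"Okay, {predicted_action}."


# Hypothetical dialog mirroring the cited example:
# assistant: "Calling James Smith, cancel if you'd like to stop."
# user:      "No"
# handle_confirmation("No", "Which one would you like to call?", "calling James Smith")
#   -> "Which one would you like to call?"
```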
One of ordinary skill would have been motivated to include requesting a user to reutter, in order to provide a clear instruction to the user for correcting a speech recognition. Claims 6-7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over SHARIFI, and in further view of MOSLEY (US 2018/0315415 A1) and YAMADA (US 2011/0313773 A1). REGARDING CLAIM 6, SHARIFI discloses the method of claim 5. SHARIFI does not teach a voice change. MOSLEY teaches a method/system for correcting misrecognized speech, wherein the identifying of whether the obtained second audio signal has the at least one vocal characteristic (MOSLEY Par 96 – “Some embodiments detect word emphasis, and strengthen the hypothesis of a word sequence error if the second occurrence of the phoneme sequence has significantly greater emphasis.”) comprises: based on the at least one syllable being included in the obtained second audio signal that has the at least one vocal characteristic, obtaining first pronunciation information for each of at least one syllable included in the obtained first audio signal (MOSLEY Par 99 – “In one scenario, the text string, “gone within WITH THE wind” is a text transcription of the phoneme <G AO N W IH TH IH N W IH N D ##W IH TH DH AH##>, including the emphasized phonemes subsequence ##W IH TH DH AH##. Some embodiments identify that a significant part of the emphasized phoneme subsequence (in this case the first three phonemes) match a recent phoneme subsequence. This indicates a user repeating a portion of an incorrectly transcribed word sequence. Some embodiments therefor hypothesize that the matching previous phoneme sequence (<W IH TH IH N> in this scenario) should be replaced by the emphasized phoneme subsequence (<W IH TH DH AH> in this scenario).”); obtaining a score for a voice change in the at least one syllable included in the obtained second audio signal, by comparing the obtained first pronunciation information with the obtained second pronunciation information (MOSLEY Par 99 – “In one scenario, the text string, “gone within WITH THE wind” is a text transcription of the phoneme <G AO N W IH TH IH N W IH N D ##W IH TH DH AH##>, including the emphasized phonemes subsequence ##W IH TH DH AH##. Some embodiments identify that a significant part of the emphasized phoneme subsequence (in this case the first three phonemes) match a recent phoneme subsequence. This indicates a user repeating a portion of an incorrectly transcribed word sequence. Some embodiments therefor hypothesize that the matching previous phoneme sequence (<W IH TH IH N> in this scenario) should be replaced by the emphasized phoneme subsequence (<W IH TH DH AH> in this scenario).”); and identifying at least one syllable, the obtained score of which is [greater than or equal to a preset third threshold] significantly great (MOSLEY Par 96 – “Some embodiments detect word emphasis, and strengthen the hypothesis of a word sequence error if the second occurrence of the phoneme sequence has significantly greater emphasis.”; Par 99 – “In one scenario, the text string, “gone within WITH THE wind” is a text transcription of the phoneme <G AO N W IH TH IH N W IH N D ##W IH TH DH AH##>, including the emphasized phonemes subsequence ##W IH TH DH AH##. Some embodiments identify that a significant part of the emphasized phoneme subsequence (in this case the first three phonemes) match a recent phoneme subsequence. This indicates a user repeating a portion of an incorrectly transcribed word sequence. 
Some embodiments therefor hypothesize that the matching previous phoneme sequence (<W IH TH IH N> in this scenario) should be replaced by the emphasized phoneme subsequence (<W IH TH DH AH> in this scenario).”), and identifying, as the one or more corrected syllables and the one or more corrected words, the identified at least one syllable and at least one word corresponding to the identified at least one syllable, respectively (MOSLEY Figs. 15A and 15B; Par 99 – “In one scenario, the text string, “gone within WITH THE wind” is a text transcription of the phoneme <G AO N W IH TH IH N W IH N D ##W IH TH DH AH##>, including the emphasized phonemes subsequence ##W IH TH DH AH##. Some embodiments identify that a significant part of the emphasized phoneme subsequence (in this case the first three phonemes) match a recent phoneme subsequence. This indicates a user repeating a portion of an incorrectly transcribed word sequence. Some embodiments therefor hypothesize that the matching previous phoneme sequence (<W IH TH IH N> in this scenario) should be replaced by the emphasized phoneme subsequence (<W IH TH DH AH> in this scenario).”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of SHARIFI to include a voice change, as taught by MOSLEY. One of ordinary skill would have been motivated to include a voice change, in order to strengthen the hypothesis of a word sequence error (Par 96). SHARIFI in view of MOSLEY does not explicitly teach the [square-bracketed] limitations, and teaches the underlined feature instead. YAMADA discloses the [square-bracketed] limitations. YAMADA discloses a method/system for voice interaction with a user comprising: obtaining a score for a voice change in the at least one syllable included in the obtained second audio signal, by comparing the obtained first pronunciation information with the obtained second pronunciation information (YAMADA Fig. 14 – “Voice pitch” and “Voice volume” of “No Emphasis” and “Emphasized Sound”; Par 351 – “In addition, the paralinguistic information extracting unit 58B can encourage the user to sequentially perform the same utterance with the normal sound and with the emphasized sound, compare the pitch, the power, and the utterance speed in the normal sound of the first utterance with the pitch, the power, and the utterance speed in the emphasized sound of the second utterance, respectively, and estimate the section, in which a significant difference (a difference which is equal to or greater than a threshold value) is present in the pitch, the power, or the utterance speed in the comparison, as an emphasized section.”); and when the obtained score of which is [greater than or equal to a preset third threshold], the score is significantly great (YAMADA Par 351 – “In addition, the paralinguistic information extracting unit 58B can encourage the user to sequentially perform the same utterance with the normal sound and with the emphasized sound, compare the pitch, the power, and the utterance speed in the normal sound of the first utterance with the pitch, the power, and the utterance speed in the emphasized sound of the second utterance, respectively, and estimate the section, in which a significant difference (a difference which is equal to or greater than a threshold value) is present in the pitch, the power, or the utterance speed in the comparison, as an emphasized section.”). 
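For illustration of the per-syllable comparison the rejection reads from MOSLEY and YAMADA (compare pitch, amplitude/power, and duration of each syllable in the second utterance against the first, and flag syllables whose change meets a preset third threshold), a minimal sketch follows. The feature representation, the scoring formula, the index-based syllable pairing, and the threshold value are assumptions for illustration only.

```python
# Minimal sketch (assumptions throughout, not the cited references' code): score
# the "voice change" of each syllable by comparing its pitch, amplitude, and
# duration across the two utterances, and keep syllables whose score meets a
# preset third threshold (i.e., significantly emphasized).
from dataclasses import dataclass


@dataclass
class SyllableInfo:
    pitch_hz: float      # accent/pitch information
    amplitude: float     # amplitude/power information
    duration_s: float    # duration information


def voice_change_score(first: SyllableInfo, second: SyllableInfo) -> float:
    """Average relative increase in pitch, amplitude, and duration (0.0 = no change)."""
    ratios = (
        second.pitch_hz / max(first.pitch_hz, 1e-6),
        second.amplitude / max(first.amplitude, 1e-6),
        second.duration_s / max(first.duration_s, 1e-6),
    )
    return sum(ratios) / len(ratios) - 1.0


def emphasized_syllables(first: list[SyllableInfo], second: list[SyllableInfo],
                         third_threshold: float = 0.3) -> list[int]:
    """Indices of syllables whose change score meets the preset threshold."""
    return [i for i, (a, b) in enumerate(zip(first, second))
            if voice_change_score(a, b) >= third_threshold]
```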
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of SHARIFI in view of MOSLEY to include a threshold, as taught by YAMADA. One of ordinary skill would have been motivated to include a threshold, in order to accurately determine whether or not the difference between the signals is significant. REGARDING CLAIM 7, SHARIFI in view of MOSLEY and YAMADA discloses the method of claim 6. MOSLEY further teaches wherein the first pronunciation information comprises at least one of accent information, amplitude information, or duration information for each of the at least one syllable included in the obtained first audio signal (MOSLEY Par 96 – “Some embodiments detect word emphasis, and strengthen the hypothesis of a word sequence error if the second occurrence of the phoneme sequence has significantly greater emphasis.”; Par 99 – “In one scenario, the text string, “gone within WITH THE wind” is a text transcription of the phoneme <G AO N W IH TH IH N W IH N D ##W IH TH DH AH##>, including the emphasized phonemes subsequence ##W IH TH DH AH##. Some embodiments identify that a significant part of the emphasized phoneme subsequence (in this case the first three phonemes) match a recent phoneme subsequence. This indicates a user repeating a portion of an incorrectly transcribed word sequence. Some embodiments therefor hypothesize that the matching previous phoneme sequence (<W IH TH IH N> in this scenario) should be replaced by the emphasized phoneme subsequence (<W IH TH DH AH> in this scenario).”), and the second pronunciation information comprises at least one of accent information, amplitude information, or duration information for each of the at least one syllable included in the obtained second audio signal (MOSLEY Par 99 – “In one scenario, the text string, “gone within WITH THE wind” is a text transcription of the phoneme <G AO N W IH TH IH N W IH N D ##W IH TH DH AH##>, including the emphasized phonemes subsequence ##W IH TH DH AH##. Some embodiments identify that a significant part of the emphasized phoneme subsequence (in this case the first three phonemes) match a recent phoneme subsequence. This indicates a user repeating a portion of an incorrectly transcribed word sequence. Some embodiments therefor hypothesize that the matching previous phoneme sequence (<W IH TH IH N> in this scenario) should be replaced by the emphasized phoneme subsequence (<W IH TH DH AH> in this scenario).”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of SHARIFI to include amplitude/accent information, as taught by MOSLEY. One of ordinary skill would have been motivated to include amplitude/accent information, in order to strengthen the hypothesis of a word sequence error (Par 96). REGARDING CLAIM 17, SHARIFI discloses the electronic device of claim 16. SHARIFI does not teach a voice change. 
MOSLEY teaches a method/system for correcting misrecognized speech, wherein the at least one processor is further configured to execute the one or more instructions to, based on the at least one syllable included in the obtained second audio signal having the at least one vocal characteristic, obtain first pronunciation information for each of at least one syllable included in the obtained first audio signal (MOSLEY Par 99 – “In one scenario, the text string, “gone within WITH THE wind” is a text transcription of the phoneme <G AO N W IH TH IH N W IH N D ##W IH TH DH AH##>, including the emphasized phonemes subsequence ##W IH TH DH AH##. Some embodiments identify that a significant part of the emphasized phoneme subsequence (in this case the first three phonemes) match a recent phoneme subsequence. This indicates a user repeating a portion of an incorrectly transcribed word sequence. Some embodiments therefor hypothesize that the matching previous phoneme sequence (<W IH TH IH N> in this scenario) should be replaced by the emphasized phoneme subsequence (<W IH TH DH AH> in this scenario).”), obtain a score for a voice change in the at least one syllable included in the obtained second audio signal by comparing the obtained first pronunciation information with the obtained second pronunciation information (MOSLEY Par 99 – “In one scenario, the text string, “gone within WITH THE wind” is a text transcription of the phoneme <G AO N W IH TH IH N W IH N D ##W IH TH DH AH##>, including the emphasized phonemes subsequence ##W IH TH DH AH##. Some embodiments identify that a significant part of the emphasized phoneme subsequence (in this case the first three phonemes) match a recent phoneme subsequence. This indicates a user repeating a portion of an incorrectly transcribed word sequence. Some embodiments therefor hypothesize that the matching previous phoneme sequence (<W IH TH IH N> in this scenario) should be replaced by the emphasized phoneme subsequence (<W IH TH DH AH> in this scenario).”), and identify at least one syllable, the obtained score of which is [greater than or equal to a preset third threshold] significantly great (MOSLEY Par 96 – “Some embodiments detect word emphasis, and strengthen the hypothesis of a word sequence error if the second occurrence of the phoneme sequence has significantly greater emphasis.”; Par 99 – “In one scenario, the text string, “gone within WITH THE wind” is a text transcription of the phoneme <G AO N W IH TH IH N W IH N D ##W IH TH DH AH##>, including the emphasized phonemes subsequence ##W IH TH DH AH##. Some embodiments identify that a significant part of the emphasized phoneme subsequence (in this case the first three phonemes) match a recent phoneme subsequence. This indicates a user repeating a portion of an incorrectly transcribed word sequence. Some embodiments therefor hypothesize that the matching previous phoneme sequence (<W IH TH IH N> in this scenario) should be replaced by the emphasized phoneme subsequence (<W IH TH DH AH> in this scenario).”), and identify, as the one or more corrected syllables and the one or more corrected words, the identified at least one syllable and at least one word corresponding to the identified at least one syllable, respectively (MOSLEY Figs. 15A and 15B; Par 99 – “In one scenario, the text string, “gone within WITH THE wind” is a text transcription of the phoneme <G AO N W IH TH IH N W IH N D ##W IH TH DH AH##>, including the emphasized phonemes subsequence ##W IH TH DH AH##. 
Some embodiments identify that a significant part of the emphasized phoneme subsequence (in this case the first three phonemes) match a recent phoneme subsequence. This indicates a user repeating a portion of an incorrectly transcribed word sequence. Some embodiments therefor hypothesize that the matching previous phoneme sequence (<W IH TH IH N> in this scenario) should be replaced by the emphasized phoneme subsequence (<W IH TH DH AH> in this scenario).”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of SHARIFI to include a voice change, as taught by MOSLEY. One of ordinary skill would have been motivated to include a voice change, in order to strengthen the hypothesis of a word sequence error (Par 96). SHARIFI in view of MOSLEY does not explicitly teach the [square-bracketed] limitations, and teaches the underlined feature instead. YAMADA discloses the [square-bracketed] limitations. YAMADA discloses a method/system for voice interaction with a user comprising: obtain a score for a voice change in the at least one syllable included in the obtained second audio signal, by comparing the obtained first pronunciation information with the obtained second pronunciation information (YAMADA Fig. 14 – “Voice pitch” and “Voice volume” of “No Emphasis” and “Emphasized Sound”; Par 351 – “In addition, the paralinguistic information extracting unit 58B can encourage the user to sequentially perform the same utterance with the normal sound and with the emphasized sound, compare the pitch, the power, and the utterance speed in the normal sound of the first utterance with the pitch, the power, and the utterance speed in the emphasized sound of the second utterance, respectively, and estimate the section, in which a significant difference (a difference which is equal to or greater than a threshold value) is present in the pitch, the power, or the utterance speed in the comparison, as an emphasized section.”); and when the obtained score of which is [greater than or equal to a preset third threshold], the score is significantly great (YAMADA Par 351 – “In addition, the paralinguistic information extracting unit 58B can encourage the user to sequentially perform the same utterance with the normal sound and with the emphasized sound, compare the pitch, the power, and the utterance speed in the normal sound of the first utterance with the pitch, the power, and the utterance speed in the emphasized sound of the second utterance, respectively, and estimate the section, in which a significant difference (a difference which is equal to or greater than a threshold value) is present in the pitch, the power, or the utterance speed in the comparison, as an emphasized section.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of SHARIFI in view of MOSLEY to include a threshold, as taught by YAMADA. One of ordinary skill would have been motivated to include a threshold, in order to accurately determine whether or not the difference between the signals is significant. Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN C KIM whose telephone number is (571)272-3327. The examiner can normally be reached Monday to Friday 8:00 AM thru 4:00 PM EST. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C Flanders can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /JONATHAN C KIM/Primary Examiner, Art Unit 2655

Prosecution Timeline

Mar 07, 2023
Application Filed
Apr 17, 2025
Non-Final Rejection — §102, §103
Jun 30, 2025
Interview Requested
Jul 08, 2025
Applicant Interview (Telephonic)
Jul 11, 2025
Examiner Interview Summary
Jul 22, 2025
Response Filed
Sep 02, 2025
Final Rejection — §102, §103
Dec 04, 2025
Request for Continued Examination
Dec 19, 2025
Response after Non-Final Action
Jan 23, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12573391
Generating Contextual Responses for Out-of-coverage Requests for Assistant Systems
2y 5m to grant Granted Mar 10, 2026
Patent 12561110
AUDIO PLAYBACK METHOD AND APPARATUS, COMPUTER READABLE STORAGE MEDIUM, AND ELECTRONIC DEVICE
2y 5m to grant Granted Feb 24, 2026
Patent 12555578
METHOD AND SYSTEM OF AUDIO FALSE KEYPHRASE REJECTION USING SPEAKER RECOGNITION
2y 5m to grant Granted Feb 17, 2026
Patent 12547372
DISPLAY APPARATUS AND DISPLAY METHOD
2y 5m to grant Granted Feb 10, 2026
Patent 12537000
METHOD OF IDENTIFYING TARGET DEVICE AND ELECTRONIC DEVICE THEREFOR
2y 5m to grant Granted Jan 27, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
74%
Grant Probability
99%
With Interview (+40.6%)
2y 7m
Median Time to Grant
High
PTA Risk
Based on 355 resolved cases by this examiner. Grant probability derived from career allow rate.
