Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/27/2026 has been entered.
Response to Amendment
3. In response to the Office action mailed on 10/28/2025, applicant filed an amendment on 01/27/2026, amending claims 1, 3, 5, 9, 11-13, 17, 19, and 20. Claim 21 is newly added. Claims 4, 10, and 18 are cancelled. The pending claims are 1-3, 5-9, 11-17, and 19-21.
Response to Arguments
4. Applicant’s arguments with respect to the pending claims have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
5. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 5-7, 9, 11-15, 17, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Cheng (CN 112951208) in view of Kakkar (US 2023/0306196), and further in view of Arel (US 10,573,296).
As per claim 5, Cheng teaches receiving first audio data corresponding to a first spoken input in a first language ([0006], [0047], obtaining speech to be recognized from a user learning a foreign language);
determining, using a second machine learning model trained to recognize phonemes of a first language and phonemes of a second language, first phonemes corresponding to the first spoken input in the first audio data ([0006], [0048], obtaining the classification result of the phoneme of the speech to be identified according to the neural network model; the classification result indicates which phoneme in the mixed phoneme set the phoneme corresponds to; the mixed phoneme set comprises all phonemes of the first language and the second language; the first language is the target language; the second language is the mother tongue of the speaker of the speech to be identified);
determining, from stored data, second phonemes corresponding to at least a first word included in the first spoken input ([0007], [0049], the neural network model is used to identify which phoneme in the speech to be recognized is the phoneme of the target language or the phoneme of the native language);
based at least in part on the first phonemes and the second phonemes, determining first output data indicating pronunciation feedback with respect to the first spoken input ([0053]-[0078], [0100], determining whether similarity conditions are met to identify mispronounced phonemes, and determining mispronounced phonemes based on the first and second phonemes and the similarity data); and
causing presentation of the first output data ([0043], [0049], displaying processing results, and providing richer feedback to users so that users can correct and improve their pronunciation in a targeted manner).
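For illustration of the limitations mapped above (a minimal sketch only; the lexicon, function names, and phoneme labels are hypothetical, not drawn from Cheng), the compare-and-feedback step can be expressed as:

```python
# Hypothetical sketch: flag mispronounced phonemes by comparing the
# model's recognized phonemes against stored reference phonemes.
REFERENCE_PHONEMES = {"nice": ["n", "ai", "s"]}  # hypothetical stored data

def pronunciation_feedback(word, recognized):
    """Return (position, expected, heard) for each mismatched phoneme."""
    reference = REFERENCE_PHONEMES[word]
    return [
        (i, ref, got)
        for i, (ref, got) in enumerate(zip(reference, recognized))
        if ref != got
    ]

# A learner pronounces the "n" of "nice" as a Chinese-biased phoneme:
print(pronunciation_feedback("nice", ["na", "ai", "s"]))  # [(0, 'n', 'na')]
```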
Cheng may not explicitly disclose receiving a plurality of machine-generated words that are generated using at least one word and a first machine learning model, the plurality of machine-generated words representing at least one of grammatical errors, mispronunciations, and spelling errors corresponding to the at least one word; and training a second machine learning model to recognize phonemes of a first language and phonemes of a second language, the training using training data corresponding to the plurality of machine-generated words and the at least one word. However, Kakkar, in the same field of endeavor, teaches a system and method for machine-translation-based spelling correction and, more particularly, machine-learning-assisted spelling corrections, wherein one or more spelling errors may be generated and stored as spelling error training data, and the processor is further configured to train the attention model with the synthetically generated training data ([0044]). Therefore, it would have been obvious before the effective filing date of the claimed invention to use the above feature of Kakkar with the system of Cheng, in order to improve accuracy and minimize errors. As to the plurality of words being generated by a first machine learning model and training a second machine learning model using the generated words, Arel, in the same field of endeavor, teaches using a machine learning model to process a synthetic training data item, wherein the synthetic training data item is generated by a first machine learning model and may be used to train a second machine learning model that processes data output by the acoustic model (Abstract). Therefore, it would have been obvious before the effective filing date of the claimed invention to use the above feature of Arel with the system of Cheng in view of Kakkar, in order to improve accuracy and minimize errors.
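As a minimal sketch of the synthetic-training-data idea attributed to Kakkar and Arel (the adjacent-character-swap generator below is a purely hypothetical stand-in for the "first machine learning model"; it is not taken from either reference):

```python
import random

def generate_error_words(word, n=3, seed=0):
    """Hypothetical stand-in for a generative 'first model': produce
    spelling-error variants by swapping adjacent characters.
    n must not exceed the number of distinct adjacent swaps."""
    rng = random.Random(seed)
    variants = set()
    while len(variants) < n:
        i = rng.randrange(len(word) - 1)
        chars = list(word)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        variants.add("".join(chars))
    return sorted(variants)

# (error, correct) pairs usable as training data for a second model:
pairs = [(variant, "nice") for variant in generate_error_words("nice")]
print(pairs)
```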
As per claim 6, Cheng teaches determining that a first portion of the first phonemes correspond to the second language, the first portion representing a second portion of the first word; determining second audio data representing the second portion of the first word in the first language; and causing presentation of the second audio data ([0050], the word "nice" includes the three phonemes "n", "ai" and "s". In Chinglish, the phoneme "n" may be pronounced as the Chinese phoneme "na", and the phoneme "ai" may be pronounced as the Chinese phoneme "~". The neural network model in the embodiment of the present application can identify whether each phoneme in the speech is pronounced as an English phoneme or a Chinese phoneme).
As per claim 7, Cheng teaches determining that a first portion of the first phonemes are different than a second portion of the second phonemes; and based at least in part on the first portion of the first phonemes being different than the second portion of the second phonemes, determining the first output data to include at least a representation of the second portion of the second phonemes ([0113], wherein said, the evaluation result of whether the pronunciation of phonemes is biased towards Chinese phonemes or English phonemes is fed back to the user, so that the user can correct the pronunciation that is biased towards Chinese phonemes in a targeted manner, thereby effectively improving the user's pronunciation level).
As per claim 9, Cheng teaches determining training data including: second words corresponding to the first language, the second words labeled with third phonemes, and third words corresponding to the second language, the third words labeled with fourth phonemes ([0013], training the neural network model based on phoneme samples with labels, where the labels are similarities with respect to phonemes in the mixed phoneme set). As to receiving audio data corresponding to the first data, the plurality of machine-generated words, and the at least one word, wherein the training data includes the audio data, Cheng teaches a speech recognition system (Abstract). Therefore, it would have been obvious before the effective filing date of the claimed invention for the system of Cheng in view of Kakkar and Arel to receive the claimed audio data, in order to improve accuracy in speech recognition results.
As per claim 11, Cheng teaches determining similarity data representing a difference between the first phonemes and the second phonemes; determining that the similarity data satisfies a condition; and in response to the similarity data satisfying the condition, determining the first output data ([0100], determining mispronounced phonemes based on the similarity/difference between the first and second phonemes; [0114]-[0116], the score of the phoneme relative to the correct phoneme can also be obtained through a neural network model, and this score serves as a score for the phoneme, used to feed back the pronunciation quality of the phoneme. See also the rejection of claim 5).
As per claim 12, Cheng teaches determining a first value representing a difference between the first phonemes and the second phonemes; determining that the first value satisfies a first condition; in response to the first value satisfying the first condition, determining the first output data ([0057]-[0078], [0100]); receiving second audio data corresponding to a second spoken input in the first language and including at least the first word; determining, using the machine learning model, third phonemes corresponding to the second audio data; determining a second value representing a difference between the third phonemes and the second phonemes; and based at least in part on the second spoken input succeeding the first spoken input, determining that the second value satisfies a second condition different than the first condition ([0038], wherein input device 120 can input a single section of voice or multiple sections of voice; and [0050], [0057]-[0078], [0100] for determining mispronounced phonemes based on the similarity of the first and second phonemes).
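The claim 12 distinction between a first condition and a different second condition applied to a later spoken attempt can be illustrated as follows (a sketch only; the distance measure and threshold values are hypothetical, not from the references):

```python
def phoneme_distance(reference, recognized):
    """Toy distance: fraction of phoneme positions that differ."""
    mismatches = sum(a != b for a, b in zip(reference, recognized))
    return mismatches / max(len(reference), len(recognized))

def satisfies_condition(attempt, distance):
    """A later attempt is held to a stricter (hypothetical) threshold."""
    threshold = 0.5 if attempt == 1 else 0.25
    return distance <= threshold

reference = ["n", "ai", "s"]
heard = ["na", "ai", "s"]            # one phoneme off -> distance 1/3
d = phoneme_distance(reference, heard)
print(satisfies_condition(1, d))     # True  (first, looser condition)
print(satisfies_condition(2, d))     # False (second, stricter condition)
```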
As per claims 13-15, 17, 19, and 20, system claims 13-15, 17, 19, and 20 and method claims 5-7, 9, 11, and 12 are related as an apparatus and the method of using the same, with each claimed element's function corresponding to a claimed method step. Accordingly, claims 13-15, 17, 19, and 20 are similarly rejected under the same rationale as applied above with respect to method claims 5-7, 9, 11, and 12. Furthermore, Cheng teaches one or more processors and memory storing instructions thereon, as claimed ([0039]-[0040]).
Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over Cheng in view of Kakkar and Arel, and further in view of Cobo Rus (US 2022/0254330).
As per claim 21, Cheng in view of Kakkar and Arel may not explicitly disclose wherein the second machine learning model is an autoregressive model. Cobo Rus, in the same field of endeavor, teaches a speech processing system training an autoregressive model ([0035], [0087]). Therefore, it would have been obvious before the effective filing date of the claimed invention to use the above feature of Cobo Rus with the system of Cheng in view of Kakkar and Arel, in order to provide efficient results.
Claims 1-3, 8, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Cheng in view of Kakkar and Arel, and further in view of Shpiro (US 2004/0176960).
As per claim 1, Cheng teaches causing presentation of first output data requesting a user to speak a first word in a first language (Abstract);
receiving, in response to the first output data, first audio data corresponding to a first spoken input including the first word ([0006], [0047], obtaining speech to be recognized from a user learning a foreign language);
determining, using a machine learning model, first phonemes corresponding to the first word in the first audio data ([0006], [0048], obtaining the classification result of the phoneme of the speech to be identified according to the neural network model; the classification result indicates which phoneme in the mixed phoneme set the phoneme corresponds to; the mixed phoneme set comprises all phonemes of the first language and the second language; the first language is the target language; the second language is the mother tongue of the speaker of the speech to be identified);
determining, from stored data, second phonemes corresponding to the first word as represented in the first output data, the second phonemes being pronunciation reference phonemes for the first word ([0007], [0049], the neural network model is used to identify whether each phoneme in the speech to be recognized is a phoneme of the target language or a phoneme of the native language);
determining similarity data representing a similarity between the first phonemes and the second phonemes ([0053]-[0054], determining similarity between phonemes of the speech to be identified and phoneme samples representing reference data used by the neural network model to identify whether an input phoneme is an English phoneme or a Chinese phoneme. The similarity is determined as 0 for phonemes with an obvious difference, and for other differing phonemes, the intersection ratio of the pronunciation-mode feature sets can be used as their similarity ([0076]). Furthermore, paragraph [0105] of Cheng teaches that, after training the neural network model, during speech recognition the phoneme characteristics of the speech to be identified are first obtained; then, according to the characteristics of the phoneme and the neural network model, the classification result of the phoneme is obtained. The characteristics of the phoneme can be obtained in the same way as in the aforementioned training stage);
determining that the similarity data satisfies a condition indicating that at least a first phoneme of the first word is mispronounced in the first spoken input (in addition to paragraphs [0057]-[0078], [0100], wherein Cheng teaches setting a similarity condition, applicant is referred to paragraph [0108], wherein the evaluation result of the speech to be identified comprises whether the pronunciation of the phoneme is correct and whether the pronunciation of the phoneme is biased toward Chinese phonemes or English phonemes. Moreover, applicant is referred to paragraphs [0114]-[0119], wherein the similarity score of each phoneme in the speech to be recognized is used to determine whether the pronunciation is correct or incorrect, biased toward the target-language phonemes or the native-language phonemes);
determining second output data indicating that the at least first phoneme of the first word is mispronounced in the first spoken input ([0100], determining, based on first and second phonemes similarity, mispronounced phonemes); and
causing presentation of the second output data ([0043], [0049], displaying processing results, and providing richer feedback to users so that users can correct and improve their pronunciation in a targeted manner).
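One plausible reading of the "intersection ratio of the pronunciation mode feature set" similarity cited from Cheng [0076] in the mapping above is a Jaccard-style ratio; the sketch below illustrates that reading only, and the articulatory feature names are hypothetical:

```python
def feature_set_similarity(features_a, features_b):
    """Jaccard-style 'intersection ratio': shared features over union."""
    a, b = set(features_a), set(features_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical articulatory feature sets for two phonemes:
sim = feature_set_similarity({"alveolar", "nasal"}, {"alveolar", "plosive"})
print(round(sim, 3))  # 0.333
```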
Cheng may not explicitly disclose causing presentation of first output data representing a prompt for a user to speak a first word in a first language. Shpiro, in the same field of endeavor, teaches causing presentation of first output data representing a prompt for a user to speak a first word in a first language ([0019], FIG. 3 shows the display screen of the FIG. 1 system providing a prompt for a user to speak a word and thereby provide the system with a user utterance for analysis). Therefore, it would have been obvious before the effective filing date of the claimed invention to use the above feature of Shpiro with the system of Cheng, in order to improve articulation and sound production and benefit individuals learning foreign languages.
Cheng in view of Shpiro may not explicitly disclose receiving a plurality of machine-generated words that are generated using at least one word and a first machine learning model, the plurality of machine-generated words representing at least one of grammatical errors, mispronunciations, and spelling errors corresponding to the at least one word; and training a second machine learning model to recognize phonemes of a first language and phonemes of a second language, the training using training data corresponding to the plurality of machine-generated words and the at least one word. However, Kakkar, in the same field of endeavor, teaches a system and method for machine-translation-based spelling correction and, more particularly, machine-learning-assisted spelling corrections, wherein one or more spelling errors may be generated and stored as spelling error training data, and the processor is further configured to train the attention model with the synthetically generated training data ([0044]). Therefore, it would have been obvious before the effective filing date of the claimed invention to use the above feature of Kakkar with the system of Cheng in view of Shpiro, in order to improve accuracy and minimize errors. As to the plurality of words being generated by a first machine learning model and training a second machine learning model using the generated words, Arel, in the same field of endeavor, teaches using a machine learning model to process a synthetic training data item, wherein the synthetic training data item is generated by a first machine learning model and may be used to train a second machine learning model that processes data output by the acoustic model (Abstract). Therefore, it would have been obvious before the effective filing date of the claimed invention to use the above feature of Arel with the system of Cheng in view of Shpiro and Kakkar, in order to improve accuracy and minimize errors.
As per claim 2, Cheng teaches determining a second portion of the first phonemes corresponding to a first portion of the first word; determining that the second portion of the first phonemes correspond to the second language; determining a third portion of the second phonemes corresponding to the first portion of the first word; based at least in part on determining that the second portion of the first phonemes correspond to the second language, determining second output data indicating pronunciation of the third portion of the second phonemes; and causing presentation of the second output data ([0050], the word "nice" includes the three phonemes "n", "ai" and "s". In Chinglish, the phoneme "n" may be pronounced as the Chinese phoneme "na", and the phoneme "ai" may be pronounced as the Chinese phoneme "~". The neural network model in the embodiment of the present application can identify whether each phoneme in the speech is pronounced as an English phoneme or a Chinese phoneme).
As per claim 3, Cheng teaches training the machine learning model using: a first plurality of words corresponding to the first language, the first plurality of words labeled with third phonemes, and a second plurality of words corresponding to the second language, the second plurality of words labeled with fourth phonemes ([0013], training the neural network model based on phoneme samples with labels, where the labels are similarities with respect to phonemes in the mixed phoneme set. See also rejection of claim 1).
As per claims 8 and 16, Cheng may not explicitly disclose determining the first output data indicating a first portion of the second phonemes to be stressed during pronunciation. Shpiro, in the same field of endeavor, teaches determining the first output data indicating a first portion of the second phonemes to be stressed during pronunciation ([0015], [0039]). Therefore, it would have been obvious before the effective filing date of the claimed invention to use the above feature of Shpiro with the system of Cheng, in order to improve the speech recognition system by identifying, quantifying, and correcting pronunciation errors that may affect the accuracy and reliability of speech recognition results.
Conclusion
6. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDELALI SERROU whose telephone number is (571)272-7638. The examiner can normally be reached M-F, 9 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ABDELALI SERROU/Primary Examiner, Art Unit 2659