Prosecution Insights
Last updated: April 19, 2026
Application No. 18/724,673

SPEECH RECOGNIZING SYSTEM, AND SPEECH RECOGNIZING METHOD

Non-Final OA: §102, §103

Filed: Jun 27, 2024
Examiner: LAM, PHILIP HUNG FAI
Art Unit: 2656
Tech Center: 2600 — Communications
Assignee: NEC Corporation
OA Round: 1 (Non-Final)

Grant Probability: 83% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 8m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 83% (above average; +20.9% vs TC avg), 107 granted / 129 resolved
Interview Lift: +45.5% (strong), comparing resolved cases with vs. without an examiner interview
Typical Timeline: 2y 8m average prosecution; 29 currently pending
Career History: 158 total applications across all art units
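
The headline figures above are simple ratios. A minimal Python sketch of the arithmetic, assuming the allow rate is granted over resolved and the interview lift is the with-minus-without allow-rate difference (variable names are illustrative, not the analytics platform's actual API):

```python
# Sketch of the arithmetic behind the examiner stats above.
# Assumptions (not confirmed by the tool): allow rate is a plain ratio over
# resolved cases, and "interview lift" is the allow-rate difference between
# resolved cases with and without an interview. Names are illustrative.

granted = 107
resolved = 129

career_allow_rate = granted / resolved                 # ~0.829, shown as 83%
print(f"Career allow rate: {career_allow_rate:.1%}")   # -> 82.9%

allow_with_interview = 0.99                            # "99% with interview"
interview_lift = 0.455                                 # "+45.5% interview lift"

# Under the lift = with - without assumption, the implied allow rate for
# resolved cases without an interview is:
allow_without_interview = allow_with_interview - interview_lift
print(f"Implied allow rate without interview: {allow_without_interview:.1%}")  # -> 53.5%
```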

Statute-Specific Performance

§101: 23.7% (-16.3% vs TC avg)
§103: 53.7% (+13.7% vs TC avg)
§102: 11.1% (-28.9% vs TC avg)
§112: 5.3% (-34.7% vs TC avg)

Tech Center averages are estimates. Based on career data from 129 resolved cases.
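
The "vs TC avg" deltas are internally consistent with a single Tech Center baseline. A back-of-the-envelope check, assuming delta = examiner rate minus TC average (all values copied from the table above):

```python
# Check of the "vs TC avg" deltas in the table above.
# Assumption: delta = examiner's statute-specific rate - Tech Center average.
examiner_rate = {"101": 0.237, "103": 0.537, "102": 0.111, "112": 0.053}
delta = {"101": -0.163, "103": 0.137, "102": -0.289, "112": -0.347}

for statute, rate in examiner_rate.items():
    implied_tc_avg = rate - delta[statute]
    print(f"§{statute}: implied TC average = {implied_tc_avg:.1%}")
# Each statute works out to 40.0%, suggesting the tool uses one TC-wide
# baseline estimate rather than per-statute averages.
```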

Office Action

Rejections: §102, §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Introduction

This Office action is in response to Applicant's submission filed on 6/27/2024. As such, claims 1-9 have been examined.

Drawings

The drawings are objected to because fig. 21 contains a typo: in reference box S852, "READING COMVERSION MODEL" should read "READING CONVERSION MODEL". Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as "amended." If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either "Replacement Sheet" or "New Sheet" pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Objections

Claim 1 is objected to because of the following informality: in line 14, "speech recognize" should read "recognize speech from". Appropriate correction is required.

Claim 3 is objected to because of the following informality: in line 6, "speech recognize" should read "recognize speech". Appropriate correction is required.

Claim 4 is objected to because of the following informalities: in line 2, "executing the instructions to" should be followed by a ":", and in line 5, "speech recognizing" should read "recognized speech". Appropriate correction is required.

Claim 8 is objected to because of the following informality: in line 13, "speech recognize" should read "recognize speech from". Appropriate correction is required.

Claim 9 is objected to because of the following informality: in line 8, "speech recognize" should read "recognize speech from". Appropriate correction is required.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-5, 7 and 9 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Biadsy (US 20220068257).

Regarding Claim 1, Biadsy discloses:

A speech recognizing system comprising: at least one memory that is configured to store instructions; ([0027] FIG. 4 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.) [Fig. 4 contains memory hardware 420.]

and at least one processor that is configured to execute the instructions to: ([0027], cited above.) [Fig. 4 contains processor 410; see also para 0081 for computing device configuration details.]

acquire real utterance data uttered by a speaker; ([0032] Specifically, implementations include sampling initial personalized seed data corresponding to transcribed acoustic data of recorded utterances spoken by the target speaker with atypical speech and using the sampled seed data to adapt/tune a baseline text-to-speech (TTS) model. Here, the "baseline TTS model" simply refers to a reference/existing TTS model previously trained to convert input text into synthesized canonical speech in the voice of one or more predefined speakers. Here, the personalized seed data sampled from the target speaker tunes/adapt the baseline TTS model to convert input text into output synthesized speech in the voice of the target speaker and having the atypical speech pattern of the target speaker.)

convert the real utterance data into text data; ([0032], quoted above.) See also [0011]: A first portion of the plurality of training text utterances includes a plurality of transcriptions in a set of spoken training utterances.

generate corresponding synthesis speech corresponding to the real utterance data by speech synthesizing using the text data; ([0032], quoted above.)

generate a conversion model converting input speech into synthesis speech using the real utterance data and the corresponding synthesis speech; ([0032] The pre-trained baseline TTS model includes an encoder portion and a decoder portion, whereby adapting the TTS model may include tuning/re-training parameters of the decoder portion while parameters of the encoder portion remain fixed. By using the personalized seed data to adapt the TTS model in this manner, the adapted TTS model may be used to convert text utterances, including terms or phrases associated with the specific domain, into synthetic training utterances that include synthesized speech in the voice of the target speaker and having the associated atypical speech patterns of the target speaker.) [This describes creating the adapted TTS model (the conversion model) that converts text (input) into synthesized speech (output) in the target voice.]

and speech recognize the synthesis speech converted using the conversion model. ([0032], quoted above.) [This portion describes the purpose of the adapted model: to create data that can then be used for speech recognition (training an ASR system to understand atypical speech), fulfilling the recognition aspect by generating the necessary test data.]

Regarding Claim 2, Biadsy discloses all of claim 1. Biadsy further discloses: wherein the at least one processor is configured to execute the instructions to: adjust parameters of the conversion model using the input speech and a recognition result of the speech recognizing. ([0033] The synthetic training utterances produced by the adapted TTS model and corresponding transcriptions are used to adapt/tune a baseline speech conversion model. Here, a "baseline speech conversion model" refers to either a reference/existing ASR model, pre-trained on a general corpus of transcribed acoustic data to recognize typical/canonical speech, or a reference/existing speech-to-speech conversion model, trained to map input audio waveforms (or spectrograms) for each of a plurality of utterances from a corpus spanning a variety of speakers and recording conditions to corresponding output audio waveforms (or spectrograms) in a voice of a predefined canonical speaker. Accordingly, the synthetic training utterances provide linguistic diversity and acoustic diversity sufficient for adapting/tuning the general speech conversion model to recognize and/or convert atypical speech spoken by the target speaker, and targeting a specific domain, into canonical text and/or canonical fluent synthesized speech.)

Regarding Claim 3, Biadsy discloses all of claim 1. Biadsy further discloses: wherein the at least one processor is configured to execute the instructions to: generate a speech recognition model using data including the corresponding synthesis speech, and speech recognize using the speech recognition model. ([0015] the speech conversion model includes an automated speech recognition model configured to convert speech into corresponding text. In these implementations, after training the speech conversion model, the method may also include receiving audio data corresponding to an utterance spoken by the target speaker associated with atypical speech; and converting, using the trained speech conversion model, the audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into a canonical textual representation of the utterance spoken by the target speaker.) See also para 0013. [The reference describes a specific use case for a speech recognition system handling atypical speech.]

Regarding Claim 4, Biadsy discloses all of claim 3. Biadsy further discloses: wherein the at least one processor is configured to execute the instructions to adjust parameters of the speech recognition model using the synthesis speech converted by using the conversion model and a recognition result of the speech recognizing. ([0072] As with the S2S speech conversion model 300a, training the ASR model 300b may include adapting a reference ASR model 300b that was previously trained on a general corpus of training utterances spoken by a variety of different speakers with different speaking styles. Here, the reference ASR model 300b may be adapted on the filtered set of synthetic speech representations 306A each paired with a corresponding one of the unspoken training text utterances 302b, and then further adapted/tuned on the non-synthetic speech representations 304 from the set of spoken training utterances 305 collected from the target speaker 104 during the personalized seed data collection stage 200a of FIG. 2A. On the other hand, the ASR model 300b may be trained from scratch using a mixture of the filtered set of synthetic speech representations 306A, each paired with a corresponding one of the unspoken training text utterances 302b, and the non-synthetic speech representations 304 in the set of spoken training utterances 305, each paired with a corresponding transcription 302a.) [This describes a specific, multi-stage method for personalizing an ASR model to a target speaker, which aligns with the general process outlined in the claim.]

Regarding Claim 5, Biadsy discloses all of claim 1. Biadsy further discloses: wherein the at least one processor is configured to execute the instructions to: acquire attribute information indicating attribute of the speaker, and generate the corresponding synthesis speech by performing speech synthesizing using the attribute information. ([0006] when the speech conversion model is not previously trained to convert audio waveforms of input utterances spoken by speakers having a same type of atypical speech as the atypical speech associated with the target speaker, adapting, by the data processing hardware, using the set of spoken training utterances, the speech conversion model to convert audio waveforms of input utterances spoken by the target speaker with atypical speech into audio waveforms of synthesized canonical fluent speech. Here, generating the corresponding audio waveform of synthesized canonical fluent speech includes generating, as output from the adapted speech conversation model, the corresponding audio waveform of synthesized canonical fluent speech in the voice of the target speaker. In some examples, the text decoder resides on the speech conversion model. In other examples, the text decoder resides on a reference automated speech recognition model separate from the speech conversion model.)

Regarding Claim 7, Biadsy discloses all of claim 1. Biadsy further discloses: wherein the at least one processor is configured to execute the instructions to: give noise at least one of the text data and the corresponding synthesis speech. ([0056] Similarly, the reference S2S conversion model 301 is pre-trained on input audio data corresponding to a multitude of utterances spoken by various different speakers into corresponding output audio data that captures the same content in the voice of a single predefined speaker. Notably, the utterances from the various different speakers may include typical speech patterns, a variety of different types of atypical speech patterns (e.g., heavy accents spanning different dialects, irregular speech spanning different neurological conditions), as well as background noise.) [The reference discloses adding background noise to the input audio data (which is the corresponding synthesis speech in the context of the model's output).]

Claim 9 is a method claim that corresponds to claim 1 and is rejected under similar rationale.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Biadsy, in view of Applicant-supplied reference Lin (US 20200312302).

Regarding Claim 6, Biadsy discloses all of claim 1 but does not disclose the following feature. Lin discloses: further comprising a plurality of real uttered speech corpus storing the real utterance data for each predetermined condition, wherein the at least one processor is configured to execute the instructions to acquire the real utterance data by selecting one from the plurality of real uttered speech corpus. ([0035] The speech disordering module 110 can receive a reference corpus 111 formed by a reference speaker's speech signal and a patient corpus 112 formed by a patient speaker's speech signal. For example, the patient speaker can be a dysarthria patient. The speech disordering module 110 can convert the set of paired corpus, including the reference corpus 111 and the patient corpus 112 that correspond to each other, into a synchronous corpus 113.) [The speaker type (reference or patient) serves as a predetermined condition by which the data is categorized and potentially selected. The module receives a specific corpus, implying a selection or access mechanism based on the need for either a reference or patient signal.] Also see para 0040.

Biadsy and Lin are considered analogous art.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Biadsy to combine the teaching of Lin for the above-mentioned teachings, because the described training corpus can be used to complete training of a voice conversion model, thereby improving model training and conversion qualities (Lin, [0008]).

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Biadsy, in view of Lee (US 20110274311).

Regarding Claim 8, Biadsy discloses:

A speech recognizing system comprising: at least one memory that is configured to store instructions; and at least one processor that is configured to execute the instructions to: ([0027] FIG. 4 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.) [Fig. 4 contains memory hardware 420 and processor 410; see also para 0081 for computing device configuration details.]

acquire [...]; ([0032], quoted in the claim 1 rejection above.)

convert the [...] into text data; ([0032], quoted above.) See also [0011]: A first portion of the plurality of training text utterances includes a plurality of transcriptions in a set of spoken training utterances.

generate corresponding synthesis speech corresponding to [...]; ([0032], quoted above.)

generate a conversion model converting input speech into synthesis speech using the [...]; ([0032] The pre-trained baseline TTS model includes an encoder portion and a decoder portion, whereby adapting the TTS model may include tuning/re-training parameters of the decoder portion while parameters of the encoder portion remain fixed. By using the personalized seed data to adapt the TTS model in this manner, the adapted TTS model may be used to convert text utterances, including terms or phrases associated with the specific domain, into synthetic training utterances that include synthesized speech in the voice of the target speaker and having the associated atypical speech patterns of the target speaker.) [This describes creating the adapted TTS model (the conversion model) that converts text (input) into synthesized speech (output) in the target voice.]

and speech recognize the synthesis speech converted using the conversion model. ([0032], quoted above.) [This portion describes the purpose of the adapted model: to create data that can then be used for speech recognition (training an ASR system to understand atypical speech), fulfilling the recognition aspect by generating the necessary test data.]

Biadsy does not explicitly disclose sign language data, but it discloses in paras 0029 and 0036 deaf speech and other impaired speech due to physical or neurological conditions, such as ALS disease.

Lee, in the related field, discloses:

acquire sign language data; ([0014] The storage unit 12 includes a sign language system setting module 122, a sign language identification module 123, a recognition module 125, a voice conversion module 126, and a gesture storing module 128. The sign language system setting module 122, the sign language identification module 123, the recognition module 125, and the voice conversion module 126 may include one or more computerized instructions executed by the processor 15. [0027] In step S5, the sign language identification module 123 compares the gesture of the signer captured by the camera 10 with the plurality of types of gestures, to determine which type the gesture of the signer belongs to, and sets the work mode accordingly. For example, if the gesture of the signer captured by the camera 10 belongs to the first type of gestures, the sign language identification module 123 sets the work mode of the sign language recognition system 1 as the first work mode.)

convert the sign language data into text data; ([0014] a sign language identification module 123, a recognition module 125,)

generate corresponding synthesis speech corresponding to the sign language data by speech synthesizing using the text data; ([0014] a voice conversion module 126,)

Biadsy and Lee are considered analogous art. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Biadsy to combine the teaching of Lee for the above-mentioned teachings, because the described method/system could enhance the experience of hearing-impaired users (Lee, [Abstract]).
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Geng (US 20230072352) discloses an AI-based speech recognition method for obtaining a target speech signal. See Abstract, paras 0042, 0075, 0098, 0100, 011, 0123-4, 0133, 0147, 0161, 0205, 0227 and fig. 7 for additional details.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Philip H Lam, whose telephone number is (571) 272-1721. The examiner can normally be reached 9 AM-3 PM Pacific time. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PHILIP H LAM/
Examiner, Art Unit 2656
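
For orientation, the claim 1 flow that the rejection maps onto Biadsy can be read as a five-step pipeline. Below is a minimal, hypothetical Python sketch of that flow; every function and type here is a placeholder for illustration, not code from the application, from Biadsy, or from any real ASR/TTS library:

```python
# Hypothetical sketch of the claim 1 pipeline as characterized in the
# rejection: (1) acquire real utterance data, (2) convert it to text,
# (3) synthesize corresponding speech from that text, (4) train a
# conversion model on the (real, synthetic) pairs, (5) recognize speech
# after converting it with that model. All callables are placeholders.

from typing import Callable, List, Tuple

Audio = List[float]  # a waveform as a plain sample sequence (placeholder)

def build_recognizer(
    acquire_utterances: Callable[[], List[Audio]],          # real utterance data
    asr: Callable[[Audio], str],                            # speech -> text
    tts: Callable[[str], Audio],                            # text -> synthesized speech
    train_conversion: Callable[[List[Tuple[Audio, Audio]]],
                               Callable[[Audio], Audio]],   # pairs -> conversion model
) -> Callable[[Audio], str]:
    """Return a recognizer that converts input speech before recognizing it."""
    real = acquire_utterances()                             # step 1
    texts = [asr(utt) for utt in real]                      # step 2
    synthetic = [tts(t) for t in texts]                     # step 3
    convert = train_conversion(list(zip(real, synthetic)))  # step 4
    return lambda speech: asr(convert(speech))              # step 5
```

On this reading, the rejection treats Biadsy's adapted TTS model as the step 4 conversion model and its ASR adaptation on the resulting synthetic utterances as supplying step 5.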

Prosecution Timeline

Jun 27, 2024
Application Filed
Jan 07, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591626: SEARCH STRING ENHANCEMENT (granted Mar 31, 2026; 2y 5m to grant)
Patent 12572735: DOMAIN-SPECIFIC DOCUMENT VALIDATION (granted Mar 10, 2026; 2y 5m to grant)
Patent 12572747: MULTI-TURN DIALOGUE RESPONSE GENERATION WITH AUTOREGRESSIVE TRANSFORMER MODELS (granted Mar 10, 2026; 2y 5m to grant)
Patent 12562158: ELECTRONIC APPARATUS AND CONTROLLING METHOD THEREOF (granted Feb 24, 2026; 2y 5m to grant)
Patent 12561194: ROOT CAUSE PATTERN RECOGNITION BASED MODEL TRAINING (granted Feb 24, 2026; 2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 83%
With Interview: 99% (+45.5%)
Median Time to Grant: 2y 8m
PTA Risk: Low

Based on 129 resolved cases by this examiner. Grant probability derived from career allow rate.
