Prosecution Insights
Last updated: April 19, 2026
Application No. 18/644,894

DEVICE AND METHOD FOR EVALUATING SPEECH RECOGNITION SYSTEM

Non-Final OA: §103, §DP

Filed: Apr 24, 2024
Examiner: LEE, EUNICE SOMIN
Art Unit: 2656
Tech Center: 2600 — Communications
Assignee: Kia Corporation
OA Round: 1 (Non-Final)

Grant Probability: 89% (Favorable)
OA Rounds: 1-2
To Grant: 2y 10m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 89% (above average); 24 granted / 27 resolved; +26.9% vs TC avg
Interview Lift: +27.3% among resolved cases with interview (strong)
Typical Timeline: 2y 10m avg prosecution; 20 applications currently pending
Career History: 47 total applications across all art units

Statute-Specific Performance

§101: 18.7% (-21.3% vs TC avg)
§103: 53.0% (+13.0% vs TC avg)
§102: 7.3% (-32.7% vs TC avg)
§112: 2.7% (-37.3% vs TC avg)
Comparison baseline is a Tech Center average estimate • Based on career data from 27 resolved cases
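As a quick sanity check on these figures, here is a minimal Python sketch, assuming the "vs TC avg" values are simple percentage-point differences between the examiner's rate and the Tech Center average estimate (the page does not state its exact formula):

```python
# Hypothetical reconstruction of the chart data above, assuming
# delta = examiner_rate - tc_average, in percentage points.
examiner_rate = {"§101": 18.7, "§103": 53.0, "§102": 7.3, "§112": 2.7}
reported_delta = {"§101": -21.3, "§103": 13.0, "§102": -32.7, "§112": -37.3}

for statute, rate in examiner_rate.items():
    implied_tc_average = rate - reported_delta[statute]
    print(f"{statute}: examiner {rate:.1f}%, implied TC average {implied_tc_average:.1f}%")

# Under this assumption every statute implies the same 40.0% baseline,
# i.e. the comparison line appears to be a single flat estimate.
```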

Office Action

§103, §DP
DETAILED ACTION

This communication is in response to the Application filed on April 24, 2024. Claims 1 - 16 are pending and have been examined. Claims 1 and 9 are independent. Foreign priority: September 22, 2023.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings

The drawings filed on April 24, 2024 have been accepted and considered by the Examiner.

Double Patenting Note

The Examiner notes that previously published patent application publications U.S. 2023/0267923, 2023/0076888 and 2025/0124915 were analyzed for Double Patenting. However, based on the current claim scope no double patenting was found.

Claim Rejections - 35 USC § 103

The following is a quotation of pre-AIA 35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the manner in which the invention was made.

Claims 1, 3 - 5, 9, 11 - 13 are rejected under 35 U.S.C. 103(a) as being unpatentable over Jin et al., (U.S. Patent Application Publication 2022/0399006), hereinafter referred to as Jin, in view of Tian, (CN115240632A).

Regarding Claims 1 and 9, Jin teaches:

1. A device for evaluating a speech recognition system including a plurality of speech recognition engines and a plurality of natural language understanding (NLU) engines, the device comprising, and 9. A computer implemented method for evaluating a speech recognition system including a plurality of speech recognition engines and a plurality of natural language understanding (NLU) engines, the method comprising: one or more processors; and [Jin, “Speech recognition system 102 has a processor 301 connected to various other components by system bus 302..” Par. 0073] at least one storage device storing a program to be executed by the one or more processors, the program including instructions to: [Jin, “The computer readable storage medium can be a tangible device (i.e., the claimed “storage device”) that can retain and store instructions for use by an instruction execution device (i.e., the claimed “executed by the one or more processors”).” Par. 0036] obtain speech recognition results in which the plurality of speech recognition engines recognize input audio; [Jin, “Furthermore, the method comprises reprocessing cached customer speech data (i.e., the claimed “input audio”) with a plurality of speech-to-text models (i.e., the claimed “speech recognition engines”) to perform speech recognition of the customer's spoken words in response to a confidence rate of a speech-to-text result (i.e., the claimed “speech recognition results”) performed by the first speech-to-text model not exceeding a threshold value.” Par.
0004] evaluate the plurality of speech recognition engines based on a comparison between the speech recognition results; [Jin, “Furthermore, in one embodiment, analyzer 204 performs similarity analysis of the results (i.e., the claimed “speech recognition results”) of the speech-to-text models (i.e., the claimed “plurality of speech recognition engines”) with respect to the reference speech-to-text result. As discussed above, if the confidence rate of the speech-to-text processing of the customer's spoken words performed by the selected speech-to-text model is not satisfactory, such as not exceeding a threshold value, then the customer's speech data, which has been cached, is reprocessed by multiple speech-to-text models to perform speech recognition of the customer's spoken words. The results (i.e., the claimed “speech recognition results”) of such speech recognition performed by such speech-to-text models (i.e., the claimed “plurality of speech recognition engines”) is compared against the reference speech-to-text result.” Par. 0058] obtain NLU results in which the plurality of NLU engines understand each speech recognition result; and [Jin, “As discussed above, in one embodiment, such an analysis (i.e., the claimed “speech recognition result”) is performed by classifier 206 using natural language processing (i.e., the claimed NLU engine”)”,” Par. 0136] Jin fails to teach plurality of NLU engines. However, Tian teaches: obtain NLU results in which the plurality of NLU engines understand each speech recognition result; and [Tian, “Multiple single models are used to determine the intent classification corpus data of the target object’s speech input information (i.e., the claimed “speech recognition”).” Par. n0010; “The multiple single models are different models (i.e., the claimed “plurality of NLU engines”) used for intent classification in the natural language processing process (i.e., the claimed “NLU”).” Par. n0010] evaluate the plurality of NLU engines based on a comparison between the NLU results. [Tian, “Multiple model combinations are determined based on multiple single models (i.e., the claimed “plurality of NLU engines”), and multiple model evaluation indicators are determined (i.e., the claimed “comparison between NLU results”) based on multiple model combinations and multiple single model evaluation indicators (i.e., the claimed “comparison between NLU results”), each corresponding to one of the multiple model combinations.” Par. n0011; “For NLP (Natural Language Processing) technology, since the processing involves many individual models (i.e., the claimed “plurality of NLU engines”), it is necessary to first evaluate each individual model separately, then use multiple models to complete the combined evaluation, and finally determine the language evaluation index (i.e., the claimed “comparison between NLU results”) of the entire NLP natural language processing process based on the combined evaluation results (i.e., the claimed “comparison between NLU results”).” Par. n0072] Jin and Tian pertain to integration of artificial intelligence technologies and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the integration of artificial intelligence technologies art to modify Jin’s teachings of “plurality of speech-to-text models (i.e., the claimed “speech recognition engines”)” (Jin, Par. 
0004) with the teachings of “multiple single models (i.e., the claimed “plurality of NLU engines”) used for intent classification in the natural language processing process (i.e., the claimed “NLU”)” (Tian, Par. n0010) taught by Tian in order to “improve accuracy of AI evaluation” (Tian, Par. n0004). Regarding Claims 3 and 11, Jin in view of Tian has been discussed above. The combination further teaches: wherein the program further includes instructions to evaluate each NLU engine by comparing each NLU result with an NLU label for the speech recognition results. [Jin, see mapping applied to claim 1; Tian, see mapping applied to claim 1; “It can be provided to researchers, enabling them to collect more data of the same intent and type, label (i.e., the claimed “NLU label”) the data, and finally use it for model training to improve recognition performance.” Par. n0101; “The application comprehensive evaluation module is used to determine a comprehensive index based on the speech synthesis evaluation index, the speech recognition evaluation index (i.e., the claimed “speech recognition results”), and the language processing evaluation index (i.e., the claimed “NLU results”).” Par. n0040] Regarding Claims 4 and 12, Jin in view of Tian has been discussed above. The combination further teaches: wherein the program further includes instructions to detect recognition failure of the input audio by comparing NLU results of one NLU engine understanding the speech recognition results with an NLU label for the speech recognition results. [Jin, see mapping applied to claim 1; Tian, see mapping applied to claim 1; Tian, see mapping applied to claim 3; “percentage of correct/incorrect readings by the robot (i.e., the claimed “recognition failure of the input audio”).” Par. n0105; “If the error rate exceeds the first error rate threshold, return the first test failure (i.e., the claimed recognition failure”) flag.” Par. n0108] Regarding Claims 5 and 13, Jin in view of Tian has been discussed above. The combination further teaches: wherein the program further includes instructions to determine that the recognition failure was caused by the one NLU engine in a case in which the speech recognition results are the same. [Jin, see mapping applied to claim 1; Tian, see mapping applied to claims 1, 3 - 4; Similarity score of 1 indicates an exact match (i.e., the claimed “speech recognition results are the same”).” Par. 0128] Claims 2 and 10 are rejected under 35 U.S.C. 103(a) as being unpatentable over Jin in view of Tian and Graham, (U.S. Patent Application Publication 2016/0064008). Regarding Claims 2 and 10, Jin in view of Tian has been discussed above. The combination further teaches: wherein the program further includes instructions to: [Jin, see mapping applied to claim 1] convert a pre-stored text sentence into a plurality of voice audios that differ in at least one of a speaking speed, a pitch, or an additional noise; and [Jin, “Due to having customers, perhaps many thousands of customers, with different dialects (i.e., the claimed “pitch”) and different background environments (i.e., the claimed “additional noise”), a speech recognition system may need to pre-build thousands of speech-to-text models to translate speech into text to handle such scenarios.” Par. 
0087; “By performing speech recognition on the customer and/or agent cached speech data and then comparing the outputted text with the reference speech-to-text result, such models (e.g., IBM Watson® Speech-to-Text) may identify such discrepancies and learn from such discrepancies. In one embodiment, such discrepancies correspond to corrections in the transcription, which may be stored in a file (i.e., the claimed “pre-stored text sentence”) and used by the model (e.g., IBM Watson® Speech-to-Text) to improve its accuracy.” Par. 0061; Tian, “The application comprehensive evaluation module is used to determine a comprehensive index based on the speech synthesis evaluation index, the speech recognition evaluation index (i.e., the claimed “speech recognition results”), and the language processing evaluation index (i.e., the claimed “NLU results”).” Par. n0040] identify the text sentence as a dangerous text based on a degree of discrepancy between voice audio recognition results in which one of the plurality of speech recognition engines recognizes the plurality of voice audios. [Jin, “Furthermore, the method comprises reprocessing cached customer speech data with a plurality of speech-to-text models (i.e., the claimed “speech recognition engines”) to perform speech recognition of the customer's spoken words,” Par. 0004; “By performing speech recognition on the customer and/or agent cached speech data and then comparing the outputted text with the reference speech-to-text result, such models (e.g., IBM Watson® Speech-to-Text) may identify such discrepancies (i.e., the claimed “identify the text sentence as a dangerous text based on a degree of discrepancy”) and learn from such discrepancies. In one embodiment, such discrepancies correspond to corrections in the transcription, which may be stored in a file and used by the model (e.g., IBM Watson® Speech-to-Text) to improve its accuracy.” Par. 0061; Referring to the Specification Par. 0074 of the instant Application, “dangerous text” refers to text “that is difficult to recognize based on a degree of discrepancy”.] The combination fails to teach converting a pre-stored text sentence into a plurality of voice audios. However, Graham teaches: convert a pre-stored text sentence into a plurality of voice audios that differ in at least one of a speaking speed, a pitch, or an additional noise; and [Graham, “Pitch and timing (i.e., the claimed “speed”) modifications (i.e., the claimed “plurality”) may be included to make the speech (i.e., the claimed “plurality of voice audios”) sound more natural. Additionally, the synthetic speech module 212 may generate synthetic speech using the converted text (i.e., the claimed “convert a pre-stored text sentence”) of the received speech audio signal stored in the stored data repository 208 for the received speech audio signal.” Par. 0041; “The system additionally includes a noise reduction device,” Par. 0005; “In one embodiment, the speech data corpus for various service subscribers may be stored as recorded speech plus transcribed text in the stored data repository 208. (i.e., the claimed “pre-stored text sentence”);” Par. 0061] Jin, Tian and Graham pertain to integration of artificial intelligence technologies and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the integration of artificial intelligence technologies art to modify Jin’s teachings of “plurality of speech-to-text models (i.e., the claimed “speech recognition engines”)” (Jin, Par. 
0004) with the teachings of “multiple single models (i.e., the claimed “plurality of NLU engines”) used for intent classification in the natural language processing process (i.e., the claimed “NLU”)” (Tian, Par. n0010) taught by Tian, and the teachings of “generating synthetic speech using the converted text” (Graham, Par. 0041) taught by Graham in order to “improve accuracy of AI evaluation” (Tian, Par. n0004) and “enhance speech containing background noise in a diversity of applications” (Graham, Par. 0003). Claims 8 and 16 are rejected under 35 U.S.C. 103(a) as being unpatentable over Jin in view of Tian and Schornig et al., (U.S. Patent Application Publication 2025/0291554), hereinafter referred to as Schornig. Regarding Claims 8 and 16, Jin in view of Tian has been discussed above. The combination further teaches: wherein the program further includes instructions to generate an NLU label for the speech recognition results by applying a language model to the NLU results. [Jin, see mapping applied to claim 1; Tian, see mapping applied to claims 1, 3 – 5] The combination fails to explicitly teach language model. However, Schornig teaches: wherein the program further includes instructions to generate an NLU label for the speech recognition results by applying a language model to the NLU results. [Schornig, “According to various implementations, troubleshooting agent 502 may leverage one or more LLMs (i.e., the claimed “language model”) to troubleshoot an issue, find the actual root cause for the issue, and/or suggest a set of one or more actions to fix the issue.” Par. 0070; “For instance, troubleshooting agent 502 may solve some tasks with objective metrics such as reducing the processing time or improve accuracy even at the risk of involving more steps and tokens (cost). In the context of the techniques herein, the issue criticality may also drive the optimization criteria (e.g., time versus reliability versus cost). In one implementation, the optimization criteria may be unique and decided according to policy and criticality. In another implementation, troubleshooting agent 502 may trigger multiple actions in parallel, each with different optimization criterion. For example, for a given issue I, troubleshooting agent 502 may send a request to a first LLM (i.e., the claimed “language model”) with a first criteria (e.g., solve as quickly as possible, optimizing time) and send the same request to a second LLM (i.e., the claimed “language model”) with different optimization criteria (e.g., efficiency). In such a case, troubleshooting agent 502 may use the reply to the first request (set of resolution action Ai) to quickly fix the network, followed by using the second set of actions to optimize the resolution of the issue.” Par. 0073; “For instance, in the context of classification, the model M may be a straight line that separates the data into two classes (e.g., labels) (i.e., the claimed “NLU labels”), Par. 0039] Jin, Tian and Schornig pertain to integration of artificial intelligence technologies and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the integration of artificial intelligence technologies art to modify Jin’s teachings of “plurality of speech-to-text models (i.e., the claimed “speech recognition engines”)” (Jin, Par. 
0004) with the teachings of “multiple single models (i.e., the claimed “plurality of NLU engines”) used for intent classification in the natural language processing process (i.e., the claimed “NLU”)” (Tian, Par. n0010) taught by Tian, and the teachings of “language model” (Schornig, Par. 0070) taught by Schornig in order to “improve accuracy of AI evaluation” (Tian, Par. n0004) and overcome the current problem of “repeating the same mistakes over and over again” (Schornig, Par. 0002).

Allowable Subject Matter

Claims 6 - 7 and 14 - 15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Regarding Claim 6, although Jin teaches integrating a plurality of speech-to-text models (i.e., the claimed “speech recognition engines”) to perform speech recognition (Jin, Par. 0004) and Tian teaches integrating “multiple single models are different models (i.e., the claimed “plurality of NLU engines”) used for intent classification in the natural language processing process (i.e., the claimed “NLU”)” (Tian, Par. n0010), the combination does not teach the case in which the NLU results of the one NLU engine are the same as the NLU label for the speech recognition results but at least two of the speech recognition results are different, wherein the program further includes instructions to determine that the NLU results of the one NLU engine contain a defect. Claim 14 is recited similarly to Claim 6 and also contains similar allowable subject matter. Claims 7 and 15 depend on Claims 6 and 14, respectively, and therefore are allowable by virtue of their dependency.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Stolfo et al., (“Speech recognition in parallel,” Speech and Natural Language: Proceedings of a Workshop Held at Cape Cod, Massachusetts, 1989) teaches combining multiple independent speech recognizers to improve recognition accuracies. Lee et al., (KR20240057946A), teaches multiple natural language processing technologies. Liu et al., (“An adversarial bidirectional serial–parallel LSTM-based QTD framework for product quality prediction,” Journal of Intelligent Manufacturing, 2020), teaches multiple natural language processing technologies.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to EUNICE LEE whose telephone number is 571-272-1886. The examiner can normally be reached M-F 8:00 AM - 5:00 PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/EUNICE LEE/
Examiner, Art Unit 2656

/BHAVESH M MEHTA/
Supervisory Patent Examiner, Art Unit 2656
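To make the claim logic at issue easier to follow, here is a minimal Python sketch of the evaluation flow as the office action characterizes claims 1, 3 - 6, 9, and 11 - 14. The engine interfaces, names, and data structures are hypothetical illustrations for discussion, not the applicant's implementation, and claims 2/10 (voice-audio variants and "dangerous text" detection) are not shown.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable

# Hypothetical interfaces: an ASR engine maps audio to text, an NLU engine maps text to an intent.
AsrEngine = Callable[[bytes], str]
NluEngine = Callable[[str], str]


@dataclass
class EvaluationReport:
    asr_results: dict[str, str]               # ASR engine name -> recognized text
    nlu_results: dict[tuple[str, str], str]   # (ASR name, NLU name) -> intent
    asr_agreement: float                      # share of ASR engine pairs that agree
    findings: list[str] = field(default_factory=list)


def pairwise_agreement(texts: list[str]) -> float:
    """Claims 1/9: evaluate the ASR engines by comparing their results with each other."""
    pairs = list(combinations(texts, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0


def evaluate(audio: bytes,
             asr_engines: dict[str, AsrEngine],
             nlu_engines: dict[str, NluEngine],
             nlu_label: str) -> EvaluationReport:
    # Claims 1/9: obtain speech recognition results from every ASR engine for the same input audio.
    asr_results = {name: engine(audio) for name, engine in asr_engines.items()}
    asr_all_same = len(set(asr_results.values())) == 1

    # Claims 1/9: obtain NLU results in which every NLU engine understands each recognition result.
    nlu_results = {(a, n): nlu(text)
                   for a, text in asr_results.items()
                   for n, nlu in nlu_engines.items()}

    report = EvaluationReport(asr_results, nlu_results,
                              pairwise_agreement(list(asr_results.values())))

    for n in nlu_engines:
        intents = {a: nlu_results[(a, n)] for a in asr_results}
        # Claims 3/11 and 4/12: detect recognition failure by comparing this NLU engine's
        # results against the NLU label for the speech recognition results.
        wrong = [a for a, intent in intents.items() if intent != nlu_label]
        if wrong and asr_all_same:
            # Claims 5/13: identical ASR results but a wrong intent points at the NLU engine.
            report.findings.append(f"recognition failure caused by NLU engine '{n}'")
        elif wrong:
            report.findings.append(f"recognition failure on ASR outputs {wrong} with NLU engine '{n}'")
        elif not asr_all_same:
            # Claims 6/14 (indicated allowable): NLU results match the label even though at least
            # two ASR results differ, so the NLU results are treated as containing a defect.
            report.findings.append(f"NLU engine '{n}' matched the label despite divergent ASR output (defect)")

    return report
```

The last branch is the fact pattern the examiner indicates as allowable subject matter for claims 6 and 14: the NLU results agree with the label even though at least two speech recognition results differ.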

Prosecution Timeline

Apr 24, 2024
Application Filed
Jan 09, 2026
Non-Final Rejection — §103, §DP (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603078
GENERATING SPEECH DATA USING ARTIFICIAL INTELLIGENCE TECHNIQUES
2y 5m to grant • Granted Apr 14, 2026
Patent 12597365
AUTOMATIC TRANSLATION BETWEEN SIGN LANGUAGE AND SPOKEN LANGUAGE
2y 5m to grant • Granted Apr 07, 2026
Patent 12585876
METHOD OF TRAINING POS TAGGING MODEL, COMPUTER-READABLE RECORDING MEDIUM AND POS TAGGING METHOD
2y 5m to grant • Granted Mar 24, 2026
Patent 12579385
EMBEDDED TRANSLATE, SUMMARIZE, AND AUTO READ
2y 5m to grant • Granted Mar 17, 2026
Patent 12566928
READABILITY BASED CONFIDENCE SCORE FOR LARGE LANGUAGE MODELS
2y 5m to grant • Granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 89%
With Interview: 99% (+27.3%)
Median Time to Grant: 2y 10m
PTA Risk: Low
Based on 27 resolved cases by this examiner. Grant probability derived from career allow rate.
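For reference, a minimal sketch of the arithmetic behind these headline numbers, assuming the grant probability is the raw career allow rate and the with-interview figure is the baseline plus the reported lift, capped at 99% (the page does not disclose its exact model):

```python
# 24 granted out of 27 resolved cases -> the displayed 89% career allow rate.
granted, resolved = 24, 27
grant_probability = granted / resolved            # 0.888... -> rendered as 89%

# Reported interview lift of +27.3 percentage points; adding it to the baseline
# and capping at 99% is one way to reproduce the displayed with-interview figure.
interview_lift = 0.273
with_interview = min(grant_probability + interview_lift, 0.99)

print(f"Grant probability: {grant_probability:.0%}")   # 89%
print(f"With interview:    {with_interview:.0%}")      # 99%
```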
