Prosecution Insights
Last updated: April 19, 2026
Application No. 18/225,991

ELECTRONIC DEVICE FOR TRAINING SPEECH RECOGNITION MODEL AND CONTROL METHOD THEREOF

Non-Final OA — §103, §112
Filed: Jul 25, 2023
Examiner: SMITH, SEAN THOMAS
Art Unit: 2659
Tech Center: 2600 — Communications
Assignee: Samsung Electronics Co., Ltd.
OA Round: 3 (Non-Final)

Grant Probability: 83% (Favorable)
OA Rounds: 3-4
To Grant: 2y 8m
With Interview: 99%

Examiner Intelligence

Grants 83% — above average

Career Allow Rate: 83% (5 granted / 6 resolved; +21.3% vs TC avg)
Interview Lift: +33.3% (resolved cases with interview) — a strong effect
Typical Timeline: 2y 8m avg prosecution; 37 applications currently pending
Career History: 43 total applications across all art units

Statute-Specific Performance

§101: 27.9% (-12.1% vs TC avg)
§103: 50.7% (+10.7% vs TC avg)
§102: 12.9% (-27.1% vs TC avg)
§112: 8.6% (-31.4% vs TC avg)

Tech Center averages are estimates • Based on career data from 6 resolved cases

Office Action

Grounds for rejection: §103, §112
DETAILED ACTION

This communication is in response to Amendments and Arguments filed December 9th, 2025. Claims 1, 10, 15 and 16 are amended; claims 8, 9 and 20 are cancelled; claims 1, 3-7, 10 and 12-19 are pending and have been examined. All previous objections/rejections not mentioned in this Office Action have been withdrawn by the Examiner.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Acknowledgment is made of applicant's claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed in parent Application No. KR10-2022-0113508, filed on September 7th, 2022. Claims 1, 3-10 and 12-20 have been afforded the benefit of this filing date.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on July 25th and December 22nd, 2023, and November 19th, 2024 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Response to Amendments and Arguments

With respect to rejections made under 35 U.S.C. 103, Applicant argues, "the cited references, alone or in combination, do not disclose 'ignoring the detected EOS label based on a token comprising a text symbol being output during the threshold time' where the EOS label is detected in a 'text sequence' obtained 'by inputting the second speech sequence into the trained speech recognition model' as recited in the independent claims" (page 11 of Remarks). Applicant's argument is persuasive as to the rejections that rely on Cui and Faizakof; however, the limitations not taught by that combination can be found in reference Vaidya, which, in combination with Cui and Faizakof, teaches all the limitations of the claims. Further details are provided below.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. The independent claims 1, 10 and 16 recite "obtaining a second speech sequence…" without disclosing a first speech sequence. Other limitations of the claims disclose a "first learning speech sequence" and a "second learning speech sequence" but do not clearly indicate whether the "second speech sequence" is the same element as the "second learning speech sequence".

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-4, 6-7, 10, 12-13, 15-17 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent 11,972,754 to Cui et al. (hereinafter, "Cui") in view of U.S. Patent Application Publication 2022/0208176 to Faizakof et al. (hereinafter, "Faizakof"), further in view of U.S. Patent Application Publication 2022/0246167 to Vaidya et al. (hereinafter, "Vaidya").

Regarding claims 1, 10 and 16, Cui teaches a method, system and computer readable medium comprising:

obtaining a first loss value by inputting, into a speech recognition model, a first learning speech sequence… (column 2, lines 15-21: "According to an aspect of the present disclosure, there is provided a method of performing sequence to sequence (Seq2Seq) speech recognition training by at least one processor, the Seq2Seq speech recognition training method comprising: acquiring, by the at least one processor, a training set comprising a plurality of pairs of input data and target data corresponding to the input data…");

obtaining a second loss value by inputting, into the speech recognition model, a second learning speech sequence… (column 2, lines 15-21, same passage as quoted above);

training the speech recognition model based on the first loss value and the second loss value… (column 2, lines 1-6: "Provided are methods and apparatuses that improve the related art end to end recognition system by automatically and independently balancing the importance of two loss functions."); and

wherein the speech recognition model comprises an encoder, and the first loss value is obtained from an output of the encoder, and wherein the speech recognition model further comprises a decoder, and the second loss value is obtained from an output of the decoder (column 5, lines 39-46: "According to an embodiment, the encoder 111 may encode the input data into a sequence of hidden states hu. According to an embodiment, the encoder 111 may take all acoustic features and transform them into the sequence of hidden states hu. According to an embodiment, the attention based decoder 113 may decode the sequence of hidden states to generate target labels by independently performing a CTC model training and an attention model training.").

Cui does not explicitly teach "a first learning speech sequence comprising an end-of-sentence (EOS) label," or "a second learning speech sequence that does not include the EOS label," and thus, Faizakof is introduced. Faizakof teaches:

a first learning speech sequence comprising an end-of-sentence (EOS) label (paragraph [0014]: "In some embodiments the second text corpus is preprocessed, before said re-training, by including end-of-sentence (EOS) embeddings."); and

a second learning speech sequence that does not include the EOS label (paragraph [0019]: "In some embodiments the joint training comprises training said capitalization prediction network and said punctuation prediction network jointly, at an initial training stage, on a first training set comprising: (i) a first text corpus comprising punctuated and capitalized text; and (ii) labels indicating a punctuation and a capitalization associated with each of said words in said first text corpus." See also Fig. 2A, wherein only one of the training datasets includes EOS preprocessing.).
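For orientation, the dual-loss training that the rejection attributes to Cui can be sketched in a few lines: a CTC loss computed on the encoder output (the claimed "first loss value") and a cross-entropy loss computed on the attention decoder output (the "second loss value"), summed into a single training objective. This is a minimal illustrative sketch in PyTorch; every name, shape, and padding convention here is an assumption for illustration, not Cui's implementation or the claimed method.

```python
import torch
import torch.nn as nn

ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)  # encoder-side loss
ce_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)    # decoder-side loss

def joint_loss(encoder_log_probs, decoder_logits, targets,
               input_lengths, target_lengths, padded_targets):
    """Combine the two loss values into one training objective.

    encoder_log_probs: (T, N, C) log-softmax scores from the encoder
    decoder_logits:    (N, S, C) per-step scores from the attention decoder
    targets:           (N, S) label ids for CTC (true lengths in target_lengths)
    padded_targets:    (N, S) label ids for CE, padded with -100 where unused
    """
    # First loss value: alignment-free CTC loss on the encoder output.
    l_ctc = ctc_loss_fn(encoder_log_probs, targets, input_lengths, target_lengths)
    # Second loss value: token-level cross-entropy on the decoder output.
    l_ce = ce_loss_fn(decoder_logits.reshape(-1, decoder_logits.size(-1)),
                      padded_targets.reshape(-1))
    # Final loss per the equation recited in claims 6, 15 and 19: L = L_CTC + L_CE.
    return l_ctc + l_ce
```

Note that Cui's stated point of novelty is balancing the two losses automatically and independently rather than via fixed interpolation weights; the unweighted sum above is only the simplest combination, matching the claimed equation.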
Cui and Faizakof are considered analogous because they are each concerned with speech recognition. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Cui with the teachings of Faizakof for the purpose of improving speech recognition model performance. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.

The combination of Cui and Faizakof does not teach a method, system or computer readable medium further comprising "obtaining a second speech sequence by changing the EOS label to a preset first symbol; obtaining a text sequence by inputting the second speech sequence into the trained speech recognition model; detecting the EOS label in the obtained text sequence; outputting the obtained text sequence by recognizing the detected EOS label based on identifying a token comprising a preset second symbol output during a threshold time; and ignoring the detected EOS label based on a token comprising a text symbol being output during the threshold time," and thus, Vaidya is introduced. Vaidya teaches:

obtaining a second speech sequence by changing the EOS label to a preset first symbol (paragraph [0035]: "In at least one embodiment, the EOS detector 210 can be used to detect a start of speech (SOS) and an EOS segment for output of an automatic speech recognition (ASR) model based on the output of the CTC function 208. In an embodiment, the CTC function 208 utilizes a defined number of characters (e.g., 29 characters) where those characters include 26 characters of the English language (and/or numbers or other characters for other languages) and a set of other characters each indicating a different grammatical and/or functional aspect, such as a space character for word separation, an apostrophe symbol, and a blank symbol used to indicate that no other character was detected for a given audio frame. As described above, a blank symbol can represent silence, and a probability of blank characters can be used for EOS detection.");

obtaining a text sequence by inputting the second speech sequence into the trained speech recognition model (paragraph [0039]: "In one example, an Argmax decoder is applied to each time step, such that an output of the greedy decoder 308 will be a string of n characters, where the selected character has the highest probability as indicated in the probability distribution data. In at least one embodiment, an output 350 of the greedy decoder 308 (e.g., Argmax function) is illustrated in FIG. 3B, where this output includes a string which includes alphanumeric characters and blank symbols represented by underscores in FIGS. 3A and 3B.");

detecting the EOS label in the obtained text sequence (paragraph [0041]: "In at least one embodiment, detecting of EOS and/or SOS is performed using a sliding window 352 on the output 350 of the greedy decoder 308. For example, the sliding window 352 includes a number of time steps X (e.g., 25) where each time step represents an interval of time (e.g., 20 milliseconds). As described above, a determination of EOS can be made based at least in part on a percentage of blank symbols (illustrated as underscores in FIGS. 3A and 3B) that are included within the sliding window 352 for any time step and/or range of time steps.");

outputting the obtained text sequence by recognizing the detected EOS label based on identifying a token comprising a preset second symbol output during a threshold time (paragraph [0040]: "In at least one embodiment, the strings illustrated in FIGS. 3A and 3B (e.g., the output 350) is time aligned with a corresponding audio input based at least in part on the output of CTC function 304. As illustrated in FIG. 3A, in an embodiment, the string is provided as input to the component 310 to determine EOS and/or SOS. For example, if EOS or SOS is determined for the output 350, the EOS detector 306, as described above, flags the time step for a decoder of a speech recognition pipeline indicating that a transcript can be generated."); and

ignoring the detected EOS label based on a token comprising a text symbol being output during the threshold time (paragraph [0040], same passage as quoted above).

In accordance with MPEP § 2111.04, the broadest reasonable interpretation of a method claim having contingent limitations requires only those steps that must be performed, and does not include steps that are not required to be performed because the condition precedent is not met. If the condition for performing a contingent step is not satisfied, the performance recited by the step need not be carried out in order for the claimed method to be performed; for that reason, Vaidya further teaches the limitations of the claim.

Cui, Faizakof and Vaidya are considered analogous because they are each concerned with speech recognition. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Cui and Faizakof with the teachings of Vaidya for the purpose of improving speech recognition model performance. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
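To make the sliding-window mechanic concrete, here is a minimal pure-Python sketch of blank-ratio EOS detection over a greedy-decoded token stream, followed by the claimed recognize-or-ignore gating over a threshold time. The window length echoes Vaidya's example numbers (a 25-step window of 20 ms steps), but all names, the blank ratio, and the threshold are illustrative assumptions, not any party's actual implementation.

```python
BLANK = "_"           # blank symbol (the claimed "preset second symbol")
WINDOW = 25           # sliding-window length in time steps (20 ms each, per Vaidya's example)
BLANK_RATIO = 0.8     # fraction of blanks signalling a candidate EOS (assumed)
THRESHOLD_STEPS = 25  # the claimed "threshold time", in time steps (assumed)

def detect_candidate_eos(tokens):
    """Return the first time step whose trailing window is mostly blanks."""
    for t in range(WINDOW, len(tokens) + 1):
        window = tokens[t - WINDOW:t]
        if window.count(BLANK) / WINDOW >= BLANK_RATIO:
            return t
    return None  # no candidate EOS in this stream

def confirm_or_ignore_eos(tokens, eos_step):
    """Recognize the EOS only if nothing but blanks arrives during the
    threshold time; a text symbol in that span means the speaker merely
    paused, so the detected EOS is ignored."""
    lookahead = tokens[eos_step:eos_step + THRESHOLD_STEPS]
    if any(tok != BLANK for tok in lookahead):
        return "ignore"     # text token during the threshold: not a true EOS
    return "recognize"      # only blanks during the threshold: output the text
```

For example, `confirm_or_ignore_eos(list("HELLO") + [BLANK] * 50, 25)` returns "recognize" because the stream stays silent past the threshold, while a stream where characters resume within the threshold returns "ignore" — the contingent behavior that the MPEP § 2111.04 discussion above turns on.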
Regarding claims 3, 12 and 17, Cui teaches information on a speech sequence at a time point of T outputted from the encoder and information on a text sequence corresponding to a speech sequence of a time point of T-1 outputted from the decoder are input to the decoder (column 5, lines 51-61: "According to an embodiment, the attention based decoder 113 may perform the attention model training by operating at the target sequence time step and generating, for each step, a query si based on the input to the attention based decoder 113. The attention based decoder 113 may generate the query si based on a previous target label ŷi−1, a previous prediction vi−1, and previous query si−1 (as explained in detail in FIG. 2). According to an embodiment, context information ci, which is a summary of speech signals encoded in hidden layers of the encoder, is also used by the attention based decoder 113 to generate the query si.").

Regarding claims 4 and 13, Cui teaches the first loss value is a connectionist temporal classification (CTC) loss value (column 6, lines 20-23: "According to an embodiment, the CTC model training module 112 may perform the CTC model training independent of the attention model training to minimize the CTC loss.").

Regarding claims 6, 15 and 19, Cui teaches the speech recognition model comprises an attention-based encoder-decoder (AED) model, wherein the second loss value is a cross-entropy (CE) loss value, wherein the training further comprises training the speech recognition model such that a final loss value obtained by the equation L = L_CTC + L_CE is reduced, and wherein L is the final loss value, L_CTC is a CTC loss value, and L_CE is a CE loss value (column 7, lines 14-23: "As compared to the related art speech recognition system using interpolation weights to combine the CTC loss and the original cross entropy loss used by the attention model, which is not only cumbersome, but takes a long time to individually train and test the models with different weights, the Seq2Seq speech recognition system 100 independently optimizes the CTC loss function and the cross entropy loss function. Since the Seq2Seq speech recognition system 100 has no specific interpolation weights in its formulation, the speed and efficiency of training the input data is improved.").

Regarding claim 7, Cui teaches the first learning speech sequence and the second learning speech sequence are obtained by a same learning speech (column 10, lines 56-67, and column 11, line 1: "According to an embodiment, the CTC loss function may be defined as a mean of normalized edit distance between hypothesis H(x) and the corresponding targets, where S=(x, t) is the training set containing all pairs of input x and its corresponding target t. That is, given data set S with input/target utterance pairs (x, t), the CTC loss function is defined as the difference between sequences."). Under the broadest reasonable interpretation, "first learning speech sequence" and "second learning speech sequence" are taken to indicate a first and second portion, or subset, of "a same learning speech". For that reason, the training set comprised of subsets taught by Cui reads on the claim limitation.

Claims 5, 14 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Cui, Faizakof and Vaidya as applied to claim 4 above, and further in view of U.S. Patent Application Publication 2022/0139380 to Meng et al. (hereinafter, "Meng").

Regarding claims 5, 14 and 18, the combination of Cui, Faizakof and Vaidya does not teach a method, system or computer readable medium "wherein the speech recognition model comprises a recurrent neural network-transducer (RNN-T) model, wherein the second loss value is a transducer loss value, wherein the training further comprises training the speech recognition model reduces a final loss value obtained by an equation L = L_CTC + L_RNN-T being reduced, and wherein L is the final loss value, L_CTC is a CTC loss value, and L_RNN-T is a transducer loss value," and thus, Meng is introduced. Meng teaches:

wherein the speech recognition model comprises a recurrent neural network-transducer (RNN-T) model (paragraph [0018]: "To address these issues, FIG. 1 illustrates a computer system 10 that implements an internal LM estimation (ILME) method to integrate external LMs with pre-existing E2E models. Several example pre-existing E2E models include connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder (AED) models, and variants of these E2E models. However, it should be appreciated that the E2E models are not limited to these specific examples.");

wherein the second loss value is a transducer loss value (paragraph [0031]: "The RNN-T loss is computed by marginalizing over all possible blank-augmented token sequences aligned with each reference Y, i.e., A(X, Y), on the training corpus D."); and

wherein the training further comprises training the speech recognition model in a manner that results in a final loss value obtained by an equation L = L_CTC + L_RNN-T being reduced, wherein L is the final loss value, L_CTC is a CTC loss value, and L_RNN-T is a transducer loss value (paragraph [0078]: "From Eqs. (21) and (22), the RNN-T internal LM loss is conditioned only on the parameters of the prediction and joint networks, θpred and θjoint. For RNN-T, the internal LM training loss is constructed as a weighted sum of the RNN-T loss in Eq. (4) and the internal LM loss below…").

[The equation referenced by Meng appears in the record only as an image; per the surrounding text, it is the internal LM training loss constructed as a weighted sum of the RNN-T loss and the internal LM loss.]

Cui, Faizakof, Vaidya and Meng are considered analogous because they are each concerned with speech recognition. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have replaced the L_ILM of Meng with the L_CTC of Cui for the purpose of improving speech recognition model performance, given that the substitution of one known element for another yields predictable results.
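The claims 5, 14 and 18 variant swaps the decoder-side cross-entropy loss for a transducer loss, giving L = L_CTC + L_RNN-T. A minimal sketch, assuming PyTorch with torchaudio's transducer loss as a stand-in (shapes and names are illustrative assumptions; Meng's own formulation is the weighted sum of the RNN-T loss and the internal LM loss described above, not this unweighted sum):

```python
import torch
import torch.nn.functional as F
import torchaudio

def joint_rnnt_loss(encoder_log_probs, joint_logits, targets,
                    input_lengths, target_lengths):
    """encoder_log_probs: (T, N, C) log-softmax scores for CTC;
    joint_logits: (N, T, U+1, C) scores from the RNN-T joint network;
    targets: (N, U) label ids; lengths are per-utterance frame/label counts."""
    # First loss value: CTC on the encoder output, as in claims 4 and 13.
    l_ctc = F.ctc_loss(encoder_log_probs, targets, input_lengths,
                       target_lengths, blank=0, zero_infinity=True)
    # Second loss value: the transducer (RNN-T) loss on the joint network output.
    l_rnnt = torchaudio.functional.rnnt_loss(
        joint_logits, targets.int(), input_lengths.int(),
        target_lengths.int(), blank=0)
    # Final loss per the claimed equation: L = L_CTC + L_RNN-T.
    return l_ctc + l_rnnt
```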
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

Chinese Patent document CN 111400754 to Tao Xiong
Chinese Patent document CN 112002349 to Han et al.
Korean Patent document KR-20200109843 to Oh et al.
U.S. Patent 10,186,254 to Williams et al.
U.S. Patent 11,107,463 to Prabhavalkar et al.
U.S. Patent 12,211,517 to Maas et al.
U.S. Patent 12,243,517 to Mehrabani et al.
U.S. Patent Application Publication 2020/0335091 to Chang et al.
U.S. Patent Application Publication 2021/0358490 to Vaidya et al.
U.S. Patent Application Publication 2022/0068265 to Shao et al.
U.S. Patent Application Publication 2022/0122586 to Yu et al.
U.S. Patent Application Publication 2022/0230627 to Chang et al.
U.S. Patent Application Publication 2022/0270597 to Qui et al.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEAN T SMITH, whose telephone number is (571) 272-6643. The examiner can normally be reached Monday - Friday, 8:00am - 5:00pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, PIERRE-LOUIS DESIR, can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center.
Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SEAN THOMAS SMITH/
Examiner, Art Unit 2659

/PIERRE LOUIS DESIR/
Supervisory Patent Examiner, Art Unit 2659

Prosecution Timeline

Jul 25, 2023: Application Filed
Jun 05, 2025: Non-Final Rejection — §103, §112
Jul 22, 2025: Interview Requested
Aug 12, 2025: Applicant Interview (Telephonic)
Aug 12, 2025: Examiner Interview Summary
Sep 15, 2025: Response Filed
Oct 01, 2025: Final Rejection — §103, §112
Dec 09, 2025: Request for Continued Examination
Jan 07, 2026: Response after Non-Final Action
Feb 17, 2026: Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602540: LEVERAGING A LARGE LANGUAGE MODEL ENCODER TO EVALUATE PREDICTIVE MODELS
Granted Apr 14, 2026 (2y 5m to grant)

Patent 12530534: SYSTEM AND METHOD FOR GENERATING STRUCTURED SEMANTIC ANNOTATIONS FROM UNSTRUCTURED DOCUMENT
Granted Jan 20, 2026 (2y 5m to grant)

Study what changed to get past this examiner. Based on the 2 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 83%
With Interview: 99% (+33.3%)
Median Time to Grant: 2y 8m
PTA Risk: High

Based on 6 resolved cases by this examiner. Grant probability derived from career allow rate.
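As a sanity check on the headline figure, the grant probability appears to be the raw career allow rate from the counts shown above (an inference from the displayed data, not a documented formula):

$$P(\text{grant}) \approx \frac{5\ \text{granted}}{6\ \text{resolved}} = 0.8\overline{3} \approx 83\%$$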
