DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Allowable Subject Matter
Claims 5-12 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4 and 13-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (US 2022/0172707) in view of Meng et al. (CN 117437904 A).
Regarding Claim 1, Wang discloses a method for training a speech recognition model (title, abstract, Figs. 1-12), the method comprising:
performing, by the speech recognition model, feature extraction on a speech sample to obtain a speech sample feature (Wang Figs. 2-3 and 9A-9C);
performing semantic extraction on the speech sample feature to obtain a semantic feature of the speech sample (Wang Fig. 3: target feature extraction model; target speech recognition model; ¶0051 discloses the target speech recognition model includes an encoding model and a decoding model. The encoding model is configured to perform feature encoding to obtain an encoded feature representing a semantic meaning. The decoding model is configured to decode the encoded feature to obtain a speech recognition result);
determining a first loss value based on the semantic feature (Wang Fig. 2: S208 obtain a target model loss value corresponding to the target feature extraction model according to the first model output result and the first reference speech recognition result and Fig. 5: S502 obtain a first model loss value corresponding to the target feature extraction model according to the first model output result and the first reference speech recognition result);
performing, by the speech recognition model, speech recognition on the speech sample based on the speech sample feature to obtain a speech sample recognition result (Wang Fig. 2: S209 update a model parameter of the target feature extraction model in the pre-update target model according to the target model loss value, to obtain a post-update target model, and perform speech recognition on speech data of the target user by using the post-update target model);
determining a second loss value based on the speech sample recognition result and a speech sample label corresponding to the speech sample (Wang Fig. 5: S504 obtain a second model loss value according to the first training speech data and first reference speech data corresponding to the first sample speech data); and
training the speech recognition model based on the first loss value and the second loss value to obtain a trained speech recognition model (Wang ¶0085 discloses the first model loss value is obtained according to the first model output result and the first reference speech recognition result, and therefore may be used for representing the model loss of the target speech recognition model. The second model loss value is obtained according to the first training speech data and the first reference speech data corresponding to the first sample speech data, and therefore may be used for representing the model loss of the target speech extraction model. Therefore, the two loss values are combined to obtain the model loss of the target feature extraction model, so that the target feature extraction model may take losses of both models into consideration to adjust a model parameter, and the model parameter obtained through learning may coordinate a difference between the target speech extraction model and the target speech recognition model, thereby improving the speech recognition effect of the entire target model).
Wang may not explicitly disclose determining a first loss value based on the semantic feature.
However, Meng (title, abstract, Figs. 1-10) teaches determining a first loss value based on the semantic feature (Meng ¶0032 discloses first, a speech sample is acquired and encoded to obtain a semantic feature vector of the speech sample. … finally, based on the text sequence information and the speech recognition result, the training loss of the speech recognition model to be trained is generated, and the speech recognition model is trained based on the training loss).
Wang and Meng are analogous art as they pertain to speech recognition. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the speech recognition model (as taught by Wang) to acquire and encode a speech sample to obtain a semantic feature (as taught by Meng, ¶0032) to achieve both good real-time performance and high recognition accuracy when performing speech recognition (Meng, ¶0032).
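For context only, the combined training scheme recited in Claim 1 (a first loss value from the semantic feature, a second loss value from the recognition result and its label, and training on both) can be sketched as follows. This is an illustrative sketch, not the applicant's or the references' actual implementation: the model shapes, the L2 formulation of the first loss, and the loss weighting are all assumptions, since neither reference fixes those details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights: feature extraction -> semantic extraction -> decoding.
W_feat = rng.standard_normal((80, 16)) * 0.1   # speech sample feature extractor
W_sem = rng.standard_normal((16, 16)) * 0.1    # semantic (encoding) stage
W_dec = rng.standard_normal((16, 8)) * 0.1     # decoder over 8 token classes

def forward(mel):
    feat = np.maximum(mel @ W_feat, 0.0)       # speech sample feature
    semantic = np.tanh(feat @ W_sem)           # semantic feature
    logits = semantic @ W_dec                  # speech sample recognition result
    return feat, semantic, logits

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

mel = rng.standard_normal((50, 80))            # 50 frames of spectral features
labels = rng.integers(0, 8, size=50)           # speech sample label per frame

feat, semantic, logits = forward(mel)
# First loss value: based on the semantic feature (an assumed L2 term).
loss1 = float((semantic ** 2).mean())
# Second loss value: recognition result vs. speech sample label (cross-entropy).
probs = softmax(logits)
loss2 = float(-np.log(probs[np.arange(50), labels] + 1e-12).mean())
# Train the model on the combination of both loss values.
total_loss = loss1 + loss2
```

A parameter update driven by `total_loss` would then let the feature extractor account for both the semantic and recognition objectives, which is the coordination effect Wang ¶0085 attributes to combining the two loss values.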
Regarding Claim 2, Wang in view of Meng discloses the method of claim 1, further comprising:
before performing the feature extraction on the speech sample, obtaining speech data in a target period (Wang ¶0046 discloses speech of the target user can be recognized in time by using the model after update, thereby improving the accuracy of a speech recognition task, to better recognize speech of the user);
performing speech segment recognition on the speech data to obtain a target speech segment of the speech data, the target speech segment having no audio or having noise (Wang ¶0048 discloses the speech extraction model is configured to extract speech to obtain valid speech data. For example, the speech extraction model may have at least one of a Speech Enhancement [SE] function or a Speech Separation [SS] function. The SS is to extract the speech data of the target user from speech data of a plurality of users. The SE is to extract a useful speech signal from various noise to suppress noise); and
deleting the target speech segment from the speech data to obtain the speech sample (Wang ¶0049 discloses the target feature extraction model is configured to extract features of the speech data of the target user. After speech extraction of the speech extraction model, valid speech data of the target user is obtained. However, because there may be partial distortion, features may be extracted using a feature extraction model, thereby reducing the impact of speech distortion on the recognition effect of the speech recognition model).
Regarding Claim 3, Wang in view of Meng discloses the method of claim 2,
wherein the speech data comprises a plurality of speech segments respectively corresponding to a plurality of target moments within the target period (Wang ¶0048 discloses the SS is to extract the speech data of the target user from speech data of a plurality of users. The SE is to extract a useful speech signal from various noise to suppress noise); and
the performing of the speech segment recognition on the speech data to obtain the target speech segment comprises, for each of the speech segments, one of:
in response to determining that the each of the speech segments has no audio, determining the each of the speech segments as the target speech segment; and
in response to determining that the each of the speech segments has audio (Wang ¶0048 discloses according to a quantity of microphones, the speech extraction model includes a single-channel speech extraction model and a multi-channel speech extraction model. The single-channel speech extraction model is configured to extract speech data acquired by a single microphone. The multi-channel speech extraction model is configured to extract speech data acquired by a plurality of microphones. The speech extraction is a user-oriented speech processing task with the objective of obtaining valid speech data of the target user), performing noise recognition on the each of the speech segments to obtain a noise recognition result (Wang ¶0048 discloses the speech extraction model is configured to extract speech to obtain valid speech data. For example, the speech extraction model may have at least one of a Speech Enhancement [SE] function or a Speech Separation [SS] function. The SS is to extract the speech data of the target user from speech data of a plurality of users. The SE is to extract a useful speech signal from various noise to suppress noise) and, in response to determining that the noise recognition result indicates that the each of the speech segments has noise, determining the each of the speech segments as the target speech segment (Wang ¶0049 discloses the target feature extraction model is configured to extract features of the speech data of the target user. After speech extraction of the speech extraction model, valid speech data of the target user is obtained. However, because there may be partial distortion, features may be extracted using a feature extraction model, thereby reducing the impact of speech distortion on the recognition effect of the speech recognition model).
Regarding Claim 4, Wang in view of Meng discloses the method of claim 1, wherein the performing of the feature extraction on the speech sample to obtain the speech sample feature comprises:
performing speech signal processing on the speech sample to obtain frequency domain information of the speech sample (Wang ¶0065 discloses the DANet is configured to embed an attractor in a high-dimensional space to combine time-frequency information in the speech signal for training. One piece of speech data may include a plurality of audio frames. Therefore, the speech extraction model may obtain a plurality of spectra); and
performing, by the speech recognition model, the feature extraction on the speech sample based on the frequency domain information to obtain the speech sample feature (Wang ¶0065 discloses step S404: Input the first training speech data into the target feature extraction model to perform feature extraction, to obtain a first training feature. ¶0072 discloses step S406: Input the first training feature into the target speech recognition model to perform speech recognition, to obtain the first model output result).
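For context only, the two-step pipeline recited in Claim 4 (speech signal processing to obtain frequency domain information, then feature extraction from that information) can be sketched as a framed short-time transform followed by a simple compression. This is an illustrative sketch under assumed parameters (frame length, hop, Hann window, log compression); the references use learned models rather than this fixed pipeline.

```python
import numpy as np

def stft_features(speech, frame_len=256, hop=128):
    """Window the waveform into frames, take magnitude spectra
    (frequency domain information), then log-compress them into features."""
    n_frames = 1 + (len(speech) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([speech[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    freq_info = np.abs(np.fft.rfft(frames, axis=1))  # frequency domain information
    features = np.log1p(freq_info)                   # speech sample feature
    return freq_info, features

rng = np.random.default_rng(0)
speech_sample = rng.standard_normal(16000)           # 1 s at 16 kHz (synthetic)
freq_info, feats = stft_features(speech_sample)
# feats has one row per frame and one column per frequency bin
```

With a 256-sample frame and 128-sample hop over 16,000 samples, this yields 124 frames of 129 frequency bins each, which would then be passed to the recognition stage.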
Claims 13-20 are rejected for the same reasons as set forth above with respect to Claims 1-4.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YOGESHKUMAR G PATEL whose telephone number is (571)272-3957. The examiner can normally be reached 7:30 AM-4 PM PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Duc Nguyen, can be reached at (571) 272-7503. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/YOGESHKUMAR PATEL/Primary Examiner, Art Unit 2691