Prosecution Insights
Last updated: April 19, 2026
Application No. 18/814,675

TRAINING A SPEECH RECOGNITION MODEL, AND SPEECH RECOGNITION

Status: Non-Final Office Action (§103)
Filed: Aug 26, 2024
Examiner: PATEL, YOGESHKUMAR G
Art Unit: 2691
Tech Center: 2600 — Communications
Assignee: Mashang Consumer Finance Co., Ltd.
OA Round: 1 (Non-Final)

At a glance: 83% grant probability (Favorable); 1-2 expected OA rounds; 2y 4m to grant; 86% grant probability with interview

Examiner Intelligence

- Career Allow Rate: 83% (538 granted / 650 resolved); +20.8% vs Tech Center average (above average)
- Interview Lift: +3.4% (minimal), from resolved cases with vs. without an interview
- Typical Timeline: 2y 4m average prosecution; 17 applications currently pending
- Career History: 667 total applications across all art units
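The headline figures above can be reproduced from the raw counts, assuming the allow rate is simply grants divided by resolved cases, the "with interview" number adds the stated lift, and values are rounded for display (these formulas are assumptions, not documented by the report):

```python
# Reproduce the headline examiner figures from the raw counts.
# Assumption: allow rate = grants / resolved; "with interview" adds the
# stated lift; displayed values are rounded.

granted = 538        # from "538 granted / 650 resolved"
resolved = 650
interview_lift = 3.4 # percentage points, from "+3.4% Interview Lift"

allow_rate = 100.0 * granted / resolved          # ~82.8, displayed as 83%
with_interview = allow_rate + interview_lift     # ~86.2, displayed as 86%

print(f"Career allow rate: {allow_rate:.1f}%")
print(f"With interview:    {with_interview:.1f}%")
```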

Statute-Specific Performance

- §101: 4.7% (-35.3% vs TC avg)
- §103: 61.9% (+21.9% vs TC avg)
- §102: 14.4% (-25.6% vs TC avg)
- §112: 14.2% (-25.8% vs TC avg)

Tech Center averages are estimates. Based on career data from 650 resolved cases.

Office Action

Basis: §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Allowable Subject Matter

Claims 5-12 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Claim Rejections - 35 U.S.C. § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4 and 13-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (US 2022/0172707) in view of Meng et al. (CN 117437904 A).

Regarding Claim 1, Wang discloses a method for training a speech recognition model (title, abstract, Figs. 1-12), the method comprising:

- performing, by the speech recognition model, feature extraction on a speech sample to obtain a speech sample feature (Wang Figs. 2-3 and 9A-9C);
- performing semantic extraction on the speech sample feature to obtain a semantic feature of the speech sample (Wang Fig. 3: target feature extraction model; target speech recognition model; ¶0051 discloses the target speech recognition model includes an encoding model and a decoding model, where the encoding model is configured to perform feature encoding to obtain an encoded feature representing a semantic meaning, and the decoding model is configured to decode the encoded feature to obtain a speech recognition result);
- determining a first loss value based on the semantic feature (Wang Fig. 2: S208, obtain a target model loss value corresponding to the target feature extraction model according to the first model output result and the first reference speech recognition result; Wang Fig. 5: S502, obtain a first model loss value corresponding to the target feature extraction model according to the first model output result and the first reference speech recognition result);
- performing, by the speech recognition model, speech recognition on the speech sample based on the speech sample feature to obtain a speech sample recognition result (Wang Fig. 2: S209, update a model parameter of the target feature extraction model in the pre-update target model according to the target model loss value, to obtain a post-update target model, and perform speech recognition on speech data of the target user by using the post-update target model);
- determining a second loss value based on the speech sample recognition result and a speech sample label corresponding to the speech sample (Wang Fig. 5: S504, obtain a second model loss value according to the first training speech data and first reference speech data corresponding to the first sample speech data); and
- training the speech recognition model based on the first loss value and the second loss value to obtain a trained speech recognition model (Wang ¶0085 discloses the first model loss value is obtained according to the first model output result and the first reference speech recognition result, and therefore may be used for representing the model loss of the target speech recognition model; the second model loss value is obtained according to the first training speech data and the first reference speech data corresponding to the first sample speech data, and therefore may be used for representing the model loss of the target speech extraction model; the two loss values are therefore combined to obtain the model loss of the target feature extraction model, so that the target feature extraction model may take the losses of both models into consideration when adjusting a model parameter, and the model parameter obtained through learning may coordinate a difference between the target speech extraction model and the target speech recognition model, thereby improving the speech recognition effect of the entire target model).

Wang may not explicitly disclose determining a first loss value based on the semantic feature. However, Meng (title, abstract, Figs. 1-10) teaches determining a first loss value based on the semantic feature (Meng ¶0032 discloses first, a speech sample is acquired and encoded to obtain a semantic feature vector of the speech sample … finally, based on the text sequence information and the speech recognition result, the training loss of the speech recognition model to be trained is generated, and the speech recognition model is trained based on the training loss). Wang and Meng are analogous art, as both pertain to speech recognition.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the speech recognition model (as taught by Wang) to acquire and encode a speech sample to obtain a semantic feature (as taught by Meng, ¶0032) in order to achieve both good real-time performance and high recognition accuracy when performing speech recognition (Meng, ¶0032).

Regarding Claim 2, Wang in view of Meng discloses the method of claim 1, further comprising:

- before performing the feature extraction on the speech sample, obtaining speech data in a target period (Wang ¶0046 discloses speech of the target user can be recognized in time by using the model after update, thereby improving the accuracy of a speech recognition task, to better recognize speech of the user);
- performing speech segment recognition on the speech data to obtain a target speech segment of the speech data, the target speech segment having no audio or having noise (Wang ¶0048 discloses the speech extraction model is configured to extract speech to obtain valid speech data; for example, the speech extraction model may have at least one of a Speech Enhancement (SE) function or a Speech Separation (SS) function, where SS extracts the speech data of the target user from speech data of a plurality of users, and SE extracts a useful speech signal from various noise to suppress noise); and
- deleting the target speech segment from the speech data to obtain the speech sample (Wang ¶0049 discloses the target feature extraction model is configured to extract features of the speech data of the target user; after speech extraction by the speech extraction model, valid speech data of the target user is obtained, but because there may be partial distortion, features may be extracted using a feature extraction model, thereby reducing the impact of speech distortion on the recognition effect of the speech recognition model).
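The claim-2 preprocessing described above (delete segments that have no audio or that have noise, then train on what remains) can be sketched as a simple filter. The energy floor and the zero-crossing-rate noise proxy are illustrative assumptions, not values or techniques from the application or the cited references:

```python
import numpy as np

def filter_segments(segments, energy_floor=1e-3, zcr_max=0.3):
    """Drop "target" segments per claim 2: no-audio segments (energy below
    a floor) and noisy segments (zero-crossing rate above a threshold, a
    crude noise proxy). Both thresholds are illustrative assumptions."""
    kept = []
    for seg in segments:
        energy = float(np.mean(seg ** 2))
        if energy < energy_floor:        # segment has no audio: delete it
            continue
        zcr = float(np.mean(np.abs(np.diff(np.sign(seg)))) / 2)
        if zcr > zcr_max:                # segment looks like noise: delete it
            continue
        kept.append(seg)                 # keep as part of the speech sample
    return kept

t = np.linspace(0.0, 1.0, 1000, endpoint=False)
silence = np.zeros(1000)                             # no audio
tone = np.sin(2 * np.pi * 5.0 * t)                   # clean voiced-like signal
noise = np.random.default_rng(1).normal(size=1000)   # white noise
kept = filter_segments([silence, tone, noise])       # only the tone survives
```

A production system would use a real voice-activity detector and noise classifier in place of these heuristics; the point here is only the keep/delete structure of the claim.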
Regarding Claim 3, Wang in view of Meng discloses the method of claim 2, wherein the speech data comprises a plurality of speech segments respectively corresponding to a plurality of target moments within the target period (Wang ¶0048: SS extracts the speech data of the target user from speech data of a plurality of users; SE extracts a useful speech signal from various noise to suppress noise); and the performing of the speech segment recognition on the speech data to obtain the target speech segment comprises, for each of the speech segments, one of:

- in response to determining that the speech segment has no audio, determining the speech segment as the target speech segment; and
- in response to determining that the speech segment has audio (Wang ¶0048 discloses that, according to a quantity of microphones, the speech extraction model includes a single-channel speech extraction model, configured to extract speech data acquired by a single microphone, and a multi-channel speech extraction model, configured to extract speech data acquired by a plurality of microphones; speech extraction is a user-oriented speech processing task with the objective of obtaining valid speech data of the target user), performing noise recognition on the speech segment to obtain a noise recognition result (Wang ¶0048 discloses the speech extraction model is configured to extract speech to obtain valid speech data, for example via the SE or SS functions described above) and, in response to determining that the noise recognition result indicates that the speech segment has noise, determining the speech segment as the target speech segment (Wang ¶0049 discloses the target feature extraction model is configured to extract features of the speech data of the target user; after speech extraction by the speech extraction model, valid speech data of the target user is obtained, but because there may be partial distortion, features may be extracted using a feature extraction model, thereby reducing the impact of speech distortion on the recognition effect of the speech recognition model).

Regarding Claim 4, Wang in view of Meng discloses the method of claim 1, wherein the performing of the feature extraction on the speech sample to obtain the speech sample feature comprises:

- performing speech signal processing on the speech sample to obtain frequency domain information of the speech sample (Wang ¶0065 discloses the DANet is configured to embed an attractor in a high-dimensional space to combine time-frequency information in the speech signal for training; one piece of speech data may include a plurality of audio frames, so the speech extraction model may obtain a plurality of spectra); and
- performing, by the speech recognition model, the feature extraction on the speech sample based on the frequency domain information to obtain the speech sample feature (Wang ¶0065, step S404: input the first training speech data into the target feature extraction model to perform feature extraction, to obtain a first training feature; ¶0072, step S406: input the first training feature into the target speech recognition model to perform speech recognition, to obtain the first model output result).

Claims 13-20 are rejected for the same reasons as set forth for Claims 1-4.
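The claim-1 training procedure walked through above (extract a feature, derive a semantic feature, compute a semantic-based first loss and a label-based second loss, then train on their combination) can be sketched as a toy training step. The weights, names, and mixing coefficient `alpha` are all illustrative assumptions; neither the application nor Wang/Meng discloses this code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the components named in claim 1. A real system would
# use learned encoder/decoder networks; these are fixed random projections.
W_feat = rng.normal(size=(40, 16))  # feature extraction
W_sem = rng.normal(size=(16, 8))    # semantic extraction
W_rec = rng.normal(size=(16, 8))    # speech recognition head

def training_step(speech_sample, semantic_target, label, alpha=0.5):
    """One claim-1-style step: two losses combined into a single objective."""
    feature = speech_sample @ W_feat         # speech sample feature
    semantic = feature @ W_sem               # semantic feature of the sample
    recognition = feature @ W_rec            # speech sample recognition result

    first_loss = np.mean((semantic - semantic_target) ** 2)    # from semantic feature
    second_loss = np.mean((recognition - label) ** 2)          # from result vs. label
    combined = alpha * first_loss + (1 - alpha) * second_loss  # train on this
    return first_loss, second_loss, combined

sample = rng.normal(size=40)
l1, l2, total = training_step(sample, np.zeros(8), np.zeros(8))
```

The weighted sum echoes the combination Wang ¶0085 describes only in spirit; the references do not specify any particular weighting or loss functional form.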
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to YOGESHKUMAR G PATEL, whose telephone number is (571) 272-3957. The examiner can normally be reached 7:30 AM-4 PM PST.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Duc Nguyen, can be reached at (571) 272-7503. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/YOGESHKUMAR PATEL/
Primary Examiner, Art Unit 2691

Prosecution Timeline

- Aug 26, 2024: Application Filed
- Mar 05, 2026: Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

- Patent 12598426: CHANGE OF A MODE FOR CAPTURING IMMERSIVE AUDIO (granted Apr 07, 2026; 2y 5m to grant)
- Patent 12596525: METHOD TO DETERMINE INTENDED DIRECTION OF A VOCAL COMMAND AND TARGET FOR VOCAL INTERACTION (granted Apr 07, 2026; 2y 5m to grant)
- Patent 12592675: AUDIO DEVICE WITH MICROPHONE AND MEDIA MIXING (granted Mar 31, 2026; 2y 5m to grant)
- Patent 12593010: COMMUNICATION ASSEMBLY (granted Mar 31, 2026; 2y 5m to grant)
- Patent 12587448: AI-BASED NETWORK TROUBLESHOOTING WITH EXPERT FEEDBACK (granted Mar 24, 2026; 2y 5m to grant)

Based on the 5 most recent grants; study what changed to get these cases past this examiner.


Prosecution Projections

- Expected OA Rounds: 1-2
- Grant Probability: 83% (86% with interview, +3.4%)
- Median Time to Grant: 2y 4m
- PTA Risk: Low

Based on 650 resolved cases by this examiner. Grant probability derived from career allow rate.
