Prosecution Insights
Last updated: May 29, 2026
Application No. 18/020,198

SPEECH SYNTHESIS METHOD, APPARATUS, READABLE MEDIUM AND ELECTRONIC DEVICE

Final OA §103
Filed
Feb 07, 2023
Priority
Nov 20, 2020 — CN 202011315115.1 +1 more
Examiner
SHAIKH, ZEESHAN MAHMOOD
Art Unit
2658
Tech Center
2600 — Communications
Assignee
BEIJING YOUZHUJU NETWORK TECHNOLOGY CO., LTD.
OA Round
2 (Final)
53%
Grant Probability
Moderate
2-3
OA Rounds
0m
Est. Remaining
99%
With Interview

Examiner Intelligence

Grants 53% of resolved cases
53%
Career Allowance Rate
18 granted / 34 resolved
-9.1% vs TC avg
Strong +53% interview lift
Without
With
+52.8%
Interview Lift
resolved cases with interview
Typical timeline
3y 1m
Avg Prosecution
21 currently pending
Career history
63
Total Applications
across all art units

Statute-Specific Performance

§101
6.9%
-33.1% vs TC avg
§103
88.4%
+48.4% vs TC avg
§102
4.8%
-35.2% vs TC avg
Black line = Tech Center average estimate • Based on career data from 34 resolved cases
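The "vs TC avg" deltas above can be sanity-checked with a little arithmetic. The sketch below assumes each delta is a plain percentage-point difference between the examiner's figure and the Tech Center average estimate, which the page does not state explicitly. The dictionary and variable names are illustrative only.

```python
# Examiner figure and "vs TC avg" delta for each statute, as listed above.
# Assumption: delta = examiner figure - Tech Center average (percentage points).
statute_stats = {
    "§101": (6.9, -33.1),
    "§103": (88.4, 48.4),
    "§102": (4.8, -35.2),
}

# Back out the Tech Center average implied by each row.
implied_tc_avg = {
    statute: round(rate - delta, 1)
    for statute, (rate, delta) in statute_stats.items()
}

for statute, avg in implied_tc_avg.items():
    print(f"{statute}: implied TC average estimate = {avg}%")
```

Under that assumption every row implies the same 40.0% baseline, which suggests a single Tech Center estimate is being applied across statutes rather than a statute-specific one.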

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

This communication is responsive to the applicant’s response dated 10/17/2025. The applicant amended claims 1, 2-3, 7, 10-11, 13, 17, 18, 21, and 23. Additionally, the applicant cancelled claim 9.

Response to Arguments

Applicant’s arguments, see Remarks (pg. 12, line 8 – pg. 12, line 13), filed 10/17/2025, with respect to the abstract objection have been fully considered and are persuasive. The objection to the abstract has been withdrawn.

Applicant’s arguments, see Remarks (pg. 12, line 14 – pg. 12, line 20), filed 10/17/2025, with respect to the specification/title objection have been fully considered and are persuasive. The objection to the specification/title has been withdrawn.

Applicant’s arguments with respect to 35 U.S.C. 112, see Remarks (pg. 12, line 21 – pg. 13, line 6), filed 10/17/2025, with respect to claim 1 have been fully considered and are persuasive. The rejection of claim 1 has been withdrawn.

Applicant’s arguments with respect to 35 U.S.C. 101, see Remarks (pg. 13, line 7 – pg. 17, line 18), filed 10/17/2025, with respect to claims 1-13, 17-18, and 20-24 have been fully considered and are persuasive. The rejection of claims 1-13, 17-18, and 20-24 has been withdrawn.

Applicant's arguments with respect to 35 U.S.C. 102, see Remarks (pg. 17, line 19 – pg. 21, line 9) filed 10/17/2025 have been fully considered but they are not persuasive. Given the amendments to the claims, a new ground of rejection is provided below.

Applicant's arguments with respect to 35 U.S.C. 103, see Remarks (pg. 21, line 10 – pg. 22, line 6) filed 10/17/2025 have been fully considered but they are not persuasive. Given the amendments to the claims, a new ground of rejection is provided below.
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 7, 17-18, 20-21, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Yang et al. US 20200035215 A1 (hereinafter Yang) in view of Shechtman et al. US 20190172443 A1 (hereinafter Shechtman) in view of Luan et al. US 20160078859 A1 (hereinafter Luan) in view of Monge Alvarez et al. US 20210225358 A1 (hereinafter Monge Alvarez).

Regarding independent claims 1, 17, and 18, Yang teaches a method for improving synthesis of speech based on emotion types, comprising; a non-transitory computer readable medium on which a computer program is stored, wherein the program, when executed by a processing device, implements operations comprising: an electronic device (FIG. 4, 11-13), comprising: a storage device on which a computer program is stored (FIG. 7, 150); a processing device configured to execute the computer program in the storage device to implement operations comprising: (FIG. 7, 140): acquiring a text to be synthesized (FIG. 5, [0141] “the audio book may receive emotion information related to situation explanation information from the network system through a wireless communication unit in order to generate emotion information”, examiner interprets the script to include both the text and emotion type).

Yang fails to teach training a speech synthesis model based on extracting real acoustic features from training audio corresponding to training texts that do not have specified emotion types, inputting the real acoustic features and the training texts into the speech synthesis model, and modifying parameters of the speech synthesis model based on a difference between the training audio and output of the speech synthesis model; generating a plurality of association relationships, wherein each of the plurality of association relationships indicates acoustic features associated with one of a plurality of emotion types; user input indicating a specified emotion type of the plurality of emotion types; determining specified acoustic features corresponding to the specified emotion type based on an association relationship of the plurality of association relationships that corresponds to the specified emotion type; synthesizing target audio with the specified emotion type using the trained speech synthesis model, wherein the trained speech synthesis model is configured to generate the synthesized target audio with the specified emotion type based on receiving the text and the specified acoustic features as input.
However, Shechtman teaches training a speech synthesis model based on extracting real acoustic features from training audio corresponding to training texts that do not have specified emotion types ([0040] “normalized prosody vector sequences are used when training the expressive prosody model, to reduce prosody prediction errors and speed up training”, examiner interprets prosody vectors as acoustic features).

Yang in view of Shechtman is considered to be analogous to the claimed invention because both are in the same field of expressive speech synthesis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the techniques of outputting speech having emotional contents of Yang with the technique of training a speech synthesis model with acoustic features taught by Shechtman in order to improve a system for speech synthesis from text (see Shechtman [0001]).

Yang in view of Shechtman fails to teach inputting the real acoustic features and the training texts into the speech synthesis model, and modifying parameters of the speech synthesis model based on a difference between the training audio and output of the speech synthesis model; generating a plurality of association relationships, wherein each of the plurality of association relationships indicates acoustic features associated with one of a plurality of emotion types; user input indicating a specified emotion type of the plurality of emotion types; determining specified acoustic features corresponding to the specified emotion type based on an association relationship of the plurality of association relationships that corresponds to the specified emotion type; synthesizing target audio with the specified emotion type using the trained speech synthesis model, wherein the trained speech synthesis model is configured to generate the synthesized target audio with the specified emotion type based on receiving the text and the specified acoustic features as input.

However, Luan teaches inputting the real acoustic features and the training texts into the speech synthesis model (FIG. 8A, 8B), and modifying parameters of the speech synthesis model based on a difference between the training audio and output of the speech synthesis model ([0039] “emotion-specific adjustments 334a are provided as adjustments to the output parameters 332a of neutral voice model 332, rather than as emotion-specific statistical or acoustic parameters independently sufficient to produce an acoustic trajectory for each emotion type”); generating a plurality of association relationships, wherein each of the plurality of association relationships indicates acoustic features associated with one of a plurality of emotion types (FIG. 6, 620, [0056] “a decision tree may be independently built for each emotion type to cluster emotion-specific adjustments. It will be appreciated that providing independent emotion-specific decision trees in this manner may more accurately model the specific prosody characteristics associated with a target emotion type, as the questions used to cluster emotion-specific states may be specifically chosen and optimized for each emotion type”); determining specified acoustic features corresponding to the specified emotion type based on an association relationship of the plurality of association relationships that corresponds to the specified emotion type (FIG. 6, 620).

Yang in view of Shechtman in view of Luan is considered to be analogous to the claimed invention because all are in the same field of expressive speech synthesis.
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the techniques of outputting speech having emotional contents of Yang in view of Shechtman with the technique of generating relationships between acoustic features and emotion types taught by Luan in order to improve techniques for text-to-speech conversion with emotional content from text (see Luan [0002]).

Yang in view of Shechtman in view of Luan fails to teach user input indicating a specified emotion type of the plurality of emotion types; synthesizing target audio with the specified emotion type using the trained speech synthesis model, wherein the trained speech synthesis model is configured to generate the synthesized target audio with the specified emotion type based on receiving the text and the specified acoustic features as input.

However, Monge Alvarez teaches user input indicating a specified emotion type of the plurality of emotion types ([0042] “The step of receiving the at least one of the plurality of generated expression vectors may include receiving a user-selected expression vector. That is, a user may specify that they wish the synthesised speech to have a particular style (e.g. “happy”) or a particular speech characteristic (e.g. “slow speaking rate”)”); synthesizing target audio with the specified emotion type using the trained speech synthesis model, wherein the trained speech synthesis model is configured to generate the synthesized target audio with the specified emotion type based on receiving the text and the specified acoustic features as input (FIG. 3, S110).

Yang in view of Shechtman in view of Luan in view of Monge Alvarez is considered to be analogous to the claimed invention because all are in the same field of expressive speech synthesis.
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the techniques of outputting speech having emotional contents of Yang in view of Shechtman in view of Luan with the technique of allowing a user to input an emotion type taught by Monge Alvarez in order to improve a system which enables expressive speech to be synthesised from input text (see Monge Alvarez [0002]).

Regarding claim 2, Yang in view of Shechtman in view of Luan in view of Monge Alvarez teaches all of the limitations of claim 1, upon which claim 2 depends. Additionally, Yang teaches wherein acoustic features of the target audio match with the specified acoustic features ([0141] “the audio book 14 may synthesize speech corresponding to the script on the basis of the emotion related information received from the network system 16 and output the speech.”; [0198] “The TTS module 170 may include an acoustic model that can transform symbolic linguistic representation into a synthetic acoustic waveform…The rules may be used to calculate a score indicating a probability that a specific audio output parameter (frequency, volume, etc.) may correspond to input symbolic linguistic representation”).

Regarding claim 3, Yang in view of Shechtman in view of Luan in view of Monge Alvarez teaches all of the limitations of claim 1, upon which claim 3 depends. Additionally, Yang teaches wherein the specified acoustic features include at least one of fundamental frequency, volume and speech speed ([0196] “synthesis parameters such as frequency, volume, and noise can be varied by a parameter synthesis engine 175”).

Regarding claims 4 and 20, Yang in view of Shechtman in view of Luan in view of Monge Alvarez teaches all of the limitations of claims 1 and 18, upon which claims 4 and 20 depend.
Additionally, Yang teaches wherein the speech synthesis model is used to: obtain text features corresponding to the text to be synthesized (FIG. 9, S122, FIG. 14-17 [0018] “calculating a first emotion vector on the basis of an emotion element included in the data from which an emotion can be inferred through semantic analysis of the data; calculating a second emotion vector on the basis of the entire context of the data through context analysis of the data”, examiner interprets the vectors as text features); and predicted acoustic features corresponding to the text to be synthesized from the text to be synthesized ([0196] “synthesis parameters such as frequency, volume, and noise can be varied by a parameter synthesis engine 175, a digital signal processor, or a different audio generating device in order to generate artificial speech waveforms.”, examiner interprets synthesis parameters as predicted acoustic features); obtain the target audio with the specified emotion type based on the specified acoustic features, predicted acoustic features and text features (FIG. 9, [0240] “The speech synthesis engine can perform speech synthesis on the basis of at least one of the first emotion information and the second emotion information (S130)”).

Regarding claims 7, 21, and 23, Yang in view of Shechtman in view of Luan in view of Monge Alvarez teaches all of the limitations of claims 1, 17, and 18, upon which claims 7, 21, and 23 depend. Additionally, Monge Alvarez teaches wherein the speech synthesis model includes a first encoder, a second encoder and a synthesizer (FIG. 4A, 130, 124, 126); and wherein synthesizing the target audio with the specified emotion type using the trained speech synthesis model comprises ([0069] “FIG. 4A is a block diagram of an expressive acoustic model 106 and an expressivity characterisation module 104 of the system of FIG. 2”): extracting text features corresponding to the text to be synthesized by the first encoder (FIG. 4A, 130, [0045] “The expressive acoustic model may include a text encoder sub-module for generating the keys and values for a guided attention module based on the linguistic features of the input text”, examiner interprets the text encoder as the first encoder); extracting the predicted acoustic features corresponding to the text to be synthesized by the second encoder (FIG. 4A, 124, [0040] “The expressive acoustic model may include an audio encoder sub-module for receiving pre-recorded or pre-synthesised speech features, and generating a vector corresponding to the received speech.”, examiner interprets audio encoder as the second encoder); and generating the target audio by the synthesizer based on the specified acoustic features, the predicted acoustic features and the text features (FIG. 4A, 126, [0047] “The audio decoder sub-module may receive the generated expression vector used by the audio encoder, and generate acoustic features corresponding to an output of the guided attention sub-module, conditioned by the received expression vector. This enables the audio decoder sub-module to take into account the expressivity represented by the received expression vector, so that the output of the expressive acoustic model includes expressivity information before it is sent to the vocoder to produce synthesised speech.”, examiner interprets audio decoder as the synthesizer).

Claims 5-6, 8, 22, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Shechtman in view of Luan in view of Monge Alvarez, as shown in claim 1 above, in further view of Peng et al. CN 110379409 A (hereinafter Peng).

Regarding claim 5, Yang in view of Shechtman in view of Luan in view of Monge Alvarez teaches all of the limitations of claim 4, upon which claim 5 depends.
Yang in view of Shechtman in view of Luan in view of Monge Alvarez fails to teach wherein the specified acoustic features and the predicted acoustic features are superimposed to obtain an acoustic feature vector, and then the target audio is generated based on the acoustic feature vector and the text vector.

However, Peng teaches wherein the specified acoustic features and the predicted acoustic features are superimposed to obtain an acoustic feature vector, and then the target audio is generated based on the acoustic feature vector and the text vector ([page 4, paragraph 4] “emotion tag of the present invention wants to target speech synthesis to be expressed by confirming, and generating emotion tag vector according to the emotion tag, then combining the text vector and the emotion tag vector generating spectrum”).

Yang in view of Shechtman in view of Luan in view of Monge Alvarez in view of Peng is considered to be analogous to the claimed invention because all are in the same field of speech synthesis having emotional content. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the techniques of synthesizing speech with emotional information of Yang in view of Shechtman in view of Luan in view of Monge Alvarez with the technique of combining feature and text vectors taught by Peng in order to improve speech synthesis techniques (see Peng [page 1, paragraph 6]).

Regarding claim 6, Yang teaches all of the limitations of claim 4, upon which claim 6 depends. Yang in view of Shechtman in view of Luan in view of Monge Alvarez fails to teach wherein the specified acoustic feature, the predicted acoustic feature and the text vector are superimposed to obtain a combined vector, and then the target audio is generated based on the combined vector.
However, Peng teaches wherein the specified acoustic feature, the predicted acoustic feature and the text vector are superimposed to obtain a combined vector, and then the target audio is generated based on the combined vector ([page 4, paragraph 4] “combining the text vector and the emotion tag vector generating spectrum; and generating the target voice according to the spectrum”).

Yang in view of Shechtman in view of Luan in view of Monge Alvarez in view of Peng is considered to be analogous to the claimed invention because all are in the same field of speech synthesis having emotional content. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the techniques of synthesizing speech with emotional information of Yang in view of Shechtman in view of Luan in view of Monge Alvarez with the technique of combining feature and text vectors taught by Peng in order to improve speech synthesis techniques (see Peng [page 1, paragraph 6]).

Regarding claims 8, 22, and 24, Yang in view of Shechtman in view of Luan in view of Monge Alvarez teaches all of the limitations of claims 7, 21, and 23, upon which claims 8, 22, and 24 depend.
Yang in view of Shechtman in view of Luan in view of Monge Alvarez fails to teach wherein the text features include a plurality of text elements, and the generating the target audio by the synthesizer based on the specified acoustic features, the predicted acoustic features and the text features, comprises: determining Mel spectrum features at the current moment through the synthesizer based on current text elements, historical Mel spectrum features, the specified acoustic features and the predicted acoustic features, wherein the current text elements are text elements in the text features input to the synthesizer at the current moment, and the historical Mel spectrum features are Mel spectrum features at the previous moment determined by the synthesizer; and generating the target audio through the synthesizer based on the Mel spectrum features at each moment.

However, Peng teaches wherein the text features include a plurality of text elements, and the generating the target audio by the synthesizer based on the specified acoustic features, the predicted acoustic features and the text features, comprises: determining Mel spectrum features at the current moment through the synthesizer based on current text elements, historical Mel spectrum features, the specified acoustic features and the predicted acoustic features, wherein the current text elements are text elements in the text features input to the synthesizer at the current moment, and the historical Mel spectrum features are Mel spectrum features at the previous moment determined by the synthesizer ([page 5, paragraph 6] “combining the text vector and the emotion tag vector generating spectrum in the process of the text vector as the local condition, the mood label vector as the global condition, and to sequence model through a sequence of pre-training (seq2seq) mapping, and then generates the spectrum (also called the Mel frequency spectrum graph)”); and generating the target audio through the synthesizer based on the Mel spectrum features at each moment ([page 5, paragraph 5] S108. The spectrum to generate the target voice.).

Yang in view of Shechtman in view of Luan in view of Monge Alvarez in view of Peng are considered to be analogous to the claimed invention because all are in the same field of speech synthesis having emotional content. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the techniques of synthesizing speech with emotional information of Yang in view of Shechtman in view of Luan in view of Monge Alvarez with the technique of determining Mel spectrum features taught by Peng in order to improve speech synthesis techniques (see Peng [page 1, paragraph 6]).

Allowable Subject Matter

Claims 10-13 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Stanton et al. (US 20210035551 A1) teaches a system for generating an output audio signal includes a context encoder, a text-prediction network, and a text-to-speech (TTS) model. The context encoder is configured to receive one or more context features associated with current input text and process the one or more context features to generate a context embedding associated with the current input text. The text-prediction network is configured to process the current input text and the context embedding to predict, as output, a style embedding for the current input text. The style embedding specifies a specific prosody and/or style for synthesizing the current input text into expressive speech. The TTS model is configured to process the current input text and the style embedding to generate an output audio signal of expressive speech of the current input text.
The output audio signal has the specific prosody and/or style specified by the style embedding.

Raikar et al. (US 20210142820 A1) teaches a method for speech emotion recognition for enriching speech to text communications between users in speech chat sessions including: implementing a speech emotion recognition model to enable converting observed emotions in speech samples to enrich text with visual emotion content by: generating a data set of speech samples with labels of a plurality of emotion classes; extracting a set of acoustic features from each of the emotion classes; generating a machine learning (ML) model based on the acoustic features and data set; training the ML model from acoustic features from speech samples during speech chat sessions; predicting emotion content based on a trained ML model in the observed speech; generating enriched text based on predicted emotion content of the trained ML model; and presenting the enriched text in speech to text communications between users in the chat session for visual notice of an observed emotion in the speech sample.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ZEESHAN SHAIKH whose telephone number is (703)756-1730. The examiner can normally be reached Monday-Friday 7:30AM-5:00PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ZEESHAN MAHMOOD SHAIKH/
Examiner, Art Unit 2658

/RICHEMOND DORVIL/
Supervisory Patent Examiner, Art Unit 2658
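To make the claim language in the rejection above more concrete, here is a toy sketch of two ideas it recites: association relationships that map each emotion type to acoustic features (claim 1), and superimposing specified and predicted acoustic features (claim 5). It is not the applicant's implementation or the method of any cited reference. The feature order (fundamental frequency, volume, speech speed) follows claim 3, but the emotion names, the numeric values, the function names, the treatment of features as offsets, and the reading of "superimposed" as element-wise addition are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# (fundamental frequency offset in Hz, volume offset in dB, speech speed offset)
Features = Tuple[float, ...]

# Hypothetical association relationships: emotion type -> acoustic features.
# Every value here is invented for illustration.
ASSOCIATIONS: Dict[str, Features] = {
    "neutral": (0.0, 0.0, 0.0),
    "happy": (30.0, 3.0, 0.25),
    "sad": (-20.0, -4.0, -0.25),
}


def specified_features(emotion_type: str) -> Features:
    """Look up the acoustic features associated with a user-specified emotion."""
    if emotion_type not in ASSOCIATIONS:
        raise ValueError(f"unknown emotion type: {emotion_type!r}")
    return ASSOCIATIONS[emotion_type]


def superimpose(specified: Features, predicted: Features) -> Features:
    """Combine two feature sets; element-wise addition is an assumption."""
    return tuple(s + p for s, p in zip(specified, predicted))


# Hypothetical features a model might predict from the text alone.
predicted = (200.0, 60.0, 1.0)
combined = superimpose(specified_features("happy"), predicted)
print(combined)
```

A real system would operate on learned, frame-level feature vectors rather than three hand-picked scalars; the point is only to show the lookup-then-combine shape the claims describe.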

Prosecution Timeline

Feb 07, 2023
Application Filed
Jul 23, 2025
Non-Final Rejection mailed — §103
Oct 17, 2025
Response Filed
Jan 26, 2026
Final Rejection mailed — §103
Mar 05, 2026
Response after Non-Final Action
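The reply periods stated in the office action can be turned into calendar dates from the Jan 26, 2026 mailing date in the timeline above. This is a rough sketch: `add_months` is a hypothetical helper written for this example, and it yields nominal dates only. It does not model weekend or holiday rollover, extensions of time, or the advisory-action exception described in the action.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Return the same day-of-month `months` later, clamped to the month's end."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


mailing_date = date(2026, 1, 26)   # Final Rejection mailed (from the timeline)
response_date = date(2026, 3, 5)   # response listed in the timeline

two_month_mark = add_months(mailing_date, 2)      # first-reply window in the action
shortened_period = add_months(mailing_date, 3)    # THREE MONTHS shortened period
statutory_limit = add_months(mailing_date, 6)     # SIX MONTHS outer limit

print("Two-month mark:      ", two_month_mark)
print("Shortened period end:", shortened_period)
print("Six-month limit:     ", statutory_limit)
print("Response within two months:", response_date <= two_month_mark)
```

On these nominal dates, the Mar 05, 2026 response in the timeline falls inside the two-month window the action mentions.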

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12633299
LINEAR PREDICTION CODING PARAMETER CODING METHOD AND CODING APPARATUS
3y 6m to grant Granted May 19, 2026
Patent 12579373
SYSTEM AND METHOD FOR SYNTHETIC TEXT GENERATION TO SOLVE CLASS IMBALANCE IN COMPLAINT IDENTIFICATION
3y 3m to grant Granted Mar 17, 2026
Patent 12555575
Wakeup Indicator Monitoring Method, Apparatus and Electronic Device
3y 4m to grant Granted Feb 17, 2026
Patent 12518090
LOGICAL ROLE DETERMINATION OF CLAUSES IN CONDITIONAL CONSTRUCTIONS OF NATURAL LANGUAGE
3y 10m to grant Granted Jan 06, 2026
Patent 12511318
MULTI-SYSTEM-BASED INTELLIGENT QUESTION ANSWERING METHOD AND APPARATUS, AND DEVICE
3y 4m to grant Granted Dec 30, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

2-3
Expected OA Rounds
53%
Grant Probability
99%
With Interview (+52.8%)
3y 1m (~0m remaining)
Median Time to Grant
Moderate
PTA Risk
Based on 34 resolved cases by this examiner. Grant probability derived from career allowance rate.
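As a check on the projection above, the headline probability can be reproduced from the counts reported earlier on this page (18 granted out of 34 resolved). The snippet is only a sketch with made-up variable names, and the Tech Center figure is inferred on the assumption that "-9.1% vs TC avg" is a percentage-point difference.

```python
granted = 18    # granted cases reported for this examiner
resolved = 34   # resolved cases reported for this examiner

# Career allowance rate as a percentage.
allowance_rate = 100 * granted / resolved

# The page reports this rounded to a whole percent.
grant_probability = round(allowance_rate)

# Assumption: "-9.1% vs TC avg" means examiner rate = TC average - 9.1 points.
implied_tc_allowance = round(allowance_rate + 9.1, 1)

print(f"Allowance rate: {allowance_rate:.1f}% (shown as {grant_probability}%)")
print(f"Implied Tech Center average: {implied_tc_allowance}%")
```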
