Prosecution Insights
Last updated: April 19, 2026
Application No. 17/987,034

METHOD, ELECTRONIC DEVICE, AND COMPUTER PROGRAM PRODUCT FOR SPEECH SYNTHESIS

Status: Final Rejection (§101 / §103)
Filed: Nov 15, 2022
Examiner: SHAIKH, ZEESHAN MAHMOOD
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: DELL PRODUCTS, L.P.
OA Round: 4 (Final)

Grant Probability: 52% (Moderate)
OA Rounds: 5-6
To Grant: 3y 2m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 52% — grants 52% of resolved cases (16 granted / 31 resolved; -10.4% vs TC avg)
Interview Lift: +55.0% for resolved cases with interview
Typical Timeline: 3y 2m avg prosecution; 32 currently pending
Career History: 63 total applications across all art units

Statute-Specific Performance

§101: 25.7% (-14.3% vs TC avg)
§103: 45.8% (+5.8% vs TC avg)
§102: 17.3% (-22.7% vs TC avg)
§112: 5.8% (-34.2% vs TC avg)

Tech Center averages are estimates; based on career data from 31 resolved cases.
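The headline numbers above are simple ratios of the career counts. A quick sketch (assuming the "vs TC avg" deltas are plain percentage-point differences, which is an assumption about how the panel computes them) shows how they reconcile:

```python
granted, resolved = 16, 31  # examiner's career counts from the panel above

allow_rate = granted / resolved * 100  # career allow rate, in percent
tc_delta = -10.4                       # reported gap vs the Tech Center average
tc_average = allow_rate - tc_delta     # implied TC 2600 average allow rate

print(f"allow rate: {allow_rate:.1f}%")  # ~51.6%, displayed rounded as 52%
print(f"TC average: {tc_average:.1f}%")  # implied ~62.0%
```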

Office Action (§101 / §103)
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

This communication is responsive to the applicant's amendment dated 1/15/2026. The applicant amended independent claims 1, 10, and 19. Additionally, the applicant amended dependent claims 3 and 12.

Response to Arguments

Applicant's arguments with respect to 35 U.S.C. 101 (see Remarks, pg. 8, line 29 – pg. 10, line 15) filed 1/15/2026 have been fully considered but they are not persuasive. The applicant has amended the independent claims to include the limitation that the speech synthesis model is trained by performing potential energy minimization. While this is a step in the right direction toward overcoming the 35 U.S.C. 101 rejection, the applicant is encouraged to incorporate what the model is being trained to do. Merely stating how the model is trained is not enough; the claims should also recite how the trained model is used. The examiner has withdrawn the 35 U.S.C. 101 rejection of dependent claims 6-7, which the examiner believes discuss the usage of the trained model. The examiner believes the incorporation of these claims into the independent claims will help move prosecution forward.

Applicant's arguments with respect to 35 U.S.C. 103 filed 1/15/2026 have been fully considered but they are not persuasive. The applicant first argues that Scodary fails to teach potential energy measures in a loss function for speech synthesis model generation. In particular, the applicant argues that Scodary teaches signal energy, not potential energy measures between transformed pairs of voice feature vectors. The examiner uses Wieman to teach the potential energy portion of the limitation via its teaching of the use of a centroid in a cost function. The examiner in particular pointed to paragraph [0061] to explain how potential energy was being interpreted to relate to center of mass.
The applicant merely states a belief that Wieman does not teach this limitation, without providing any further supporting arguments. Therefore, the examiner is not convinced, and the 35 U.S.C. 103 rejection is maintained.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-5 and 8-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Independent claims 1, 10, and 19 recite “extracting a plurality of voice feature vectors of a plurality of speakers from a plurality of audios corresponding to the plurality of speakers”; “calculating a first loss function based on distances between the plurality of voice feature vectors of the plurality of speakers, wherein the first loss function includes a combination of a plurality of potential energy measures computed between respective transformed pairs of the voice feature vectors in accordance with respective ones of the distances, each of the potential energy measures comprising a difference between a first function of a corresponding one of the distances and a second function of the corresponding one of the distances, the second function being different than the first function, the first function and the second function collectively defining, for each of the transformed pairs of the voice feature vectors, a corresponding one of the potential energy measures for that transformed pair based on the distance between the voice feature vectors of that pair”; “calculating a second loss function according to a plurality of texts and a plurality of corresponding real audios”; and “training a speech synthesis model based on the first loss function and the second loss function, wherein training the speech synthesis model comprises performing potential energy minimization utilizing at least the first loss function”.

The claims recite a method of generating a speech synthesis model through the extraction of vectors and through the calculation of first and second loss functions. The examiner views the majority of the limitations as a mathematical concept. In particular, the limitations of calculating the first and second loss functions recite mathematical calculations. The equation for the loss function is found in the specification as equation (3).

The judicial exception is not integrated into a practical application. In particular, the claims only recite the additional elements “at least one processor”, “memory coupled to the at least one processor”, and “a computer program product”. These elements are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using generic computer components. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.

The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional elements of using “at least one processor”, “memory coupled to the at least one processor”, and “a computer program product” to perform the extracting, calculating, and generating steps amount to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claims are not patent eligible.

Dependent claims 2-5, 8-9, 11-18, and 20 are also rejected for the same reasons provided for independent claims 1, 10, and 19 above.
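For readers unfamiliar with the claimed "first loss function", the limitation describes a pairwise potential energy built from two different functions of the same inter-vector distance, summed over pairs of (transformed) voice feature vectors. The specification's actual equation (3) is not reproduced in the action, so the sketch below is purely illustrative: the Lennard-Jones-style repulsive and attractive terms, and their exponents, are hypothetical stand-ins chosen only to match the claimed shape (a difference of two distinct functions of the distance).

```python
import math
from itertools import combinations

def potential_energy(d, a=1.0, b=1.0):
    # Difference between two distinct functions of the same distance d:
    # a repulsive term minus an attractive term (Lennard-Jones-like;
    # hypothetical forms, not the specification's equation (3)).
    return (a / d) ** 12 - (b / d) ** 6

def first_loss(vectors):
    # Combine the pairwise potential energy measures over all pairs
    # of voice feature vectors, each computed from the pair's distance.
    total = 0.0
    for u, v in combinations(vectors, 2):
        d = math.dist(u, v)  # Euclidean distance between the pair
        total += potential_energy(d)
    return total
```

Minimizing such a loss pushes very close vectors apart (the repulsive term dominates at small distances) while mildly attracting distant ones, which is one way pairwise "potential energy minimization" can shape an embedding space.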
The dependent claims, including the further recited limitations, do not integrate the abstract idea into a practical application, and the additional elements, taken individually and in combination, do not contribute to an inventive concept. In other words, the dependent claims are directed to an abstract idea without significantly more.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 10-11, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Oplustil Gallegos et al., US 20240087558 A1 (hereinafter Oplustil Gallegos), in view of Scodary et al., US 10861436 B1 (hereinafter Scodary), further in view of Wieman et al., US 20210174783 A1 (hereinafter Wieman).
Regarding independent claims 1, 10, and 19, Oplustil Gallegos teaches a method for speech synthesis; an electronic device for speech synthesis; and a computer program product tangibly stored in a non-transitory computer-readable medium and comprising machine-executable instructions, wherein the machine-executable instructions, when executed by a machine, cause the machine to perform a method for speech synthesis, the method comprising:

at least one processor ([0066] “a system for modifying a speech signal generated by a text-to-speech synthesiser is provided, the system comprising a processor”); and a memory coupled to the processor and having instructions stored therein, wherein the instructions, when executed by the processor, cause the electronic device to perform actions comprising: (FIG. 9, [0251] “The computer program 5 stored in the non-volatile memory can be accessed by the processor 3 so that the processor 3 executes the computer program 5. The processor 3 may comprise logic circuitry that responds to and processes the computer program instructions”)

extracting a plurality of voice feature vectors of a plurality of speakers from a plurality of audios corresponding to the plurality of speakers ([0173] “FIG. 5 shows a flow chart illustrating how a control feature vector is derived in S105 of FIG. 4. In S105-1, the speech signal obtained in S103 of FIG. 4 is analyzed. The analysis in S105-1 relates to the extraction of one or more properties of the obtained speech signal”; [0170] “the training set 41 comprises audio samples from different speakers. When the audio samples are from different speakers, the prediction network 21 comprises a speaker ID input (e.g. an integer or learned embedding), where the speaker ID inputs correspond to the audio samples from the different speakers”; [0267] “For the prominence model a dataset of text audio pairs is obtained for a single speaker (or multiple speakers if training a multi-speaker model)”);

calculating a first loss function based on distances between the plurality of voice feature vectors of the plurality of speakers ([0242] “To train the duration prediction network, the duration of all phonemes in a text audio pair dataset is obtained. This can be done using a standard TTS model with attention. From the standard TTS model trained on the audio text pair dataset, the ground truth aligned attentions may be taken… The final output vector of the duration predictor is compared with the duration vector obtained above and a mean squared error loss is computed. The weights of the duration prediction network are then updated via back propagation”, where the examiner interprets the durations of phonemes as voice features and the mean squared error loss as the first loss function, which is a loss calculated based on the difference between feature vectors (the output vector and the duration vector) from different speakers; [0170] “the training set 41 comprises audio samples from different speakers. When the audio samples are from different speakers, the prediction network 21 comprises a speaker ID input (e.g. an integer or learned embedding), where the speaker ID inputs correspond to the audio samples from the different speakers”, where the examiner interprets the audio data used to train the prediction network as coming from samples from multiple speakers);

calculating a second loss function according to a plurality of texts and a plurality of corresponding real audios (FIG. 17(b), [0388] “The target audio 1720 is also analyzed in 1722 to obtain one or more target acoustic features. The difference between the target acoustic feature resulting from the analysis 1722 and the predicted acoustic feature vector from the acoustic prediction network 1608 is then obtained using an L1 or L2 loss and a second loss is obtained”, where the examiner interprets the second loss as based on 1700 (text) and 1720 (audio)); and

training a speech synthesis model based on the first loss function and the second loss function ([0388] “The obtained first and second losses are added 1730 and the total loss is then backward propagated to update the weights of the acoustic prediction network”; [0065] “The method of training may use a training loss function such as a mean squared error. The training loss may be computed by comparing the speech output by the controllable model with the training speech signal generated by the pre-trained first model”; [0272] “Once the prominence vector is obtained from the training data, the model is trained as usual, feeding in the text and the prominence vector and learning to produce the Mel spectrogram of the audio by back-propagating the mean squared error loss between the synthesised Mel-spectrogram and the real Mel-spectrogram”).

Oplustil Gallegos fails to teach wherein the first loss function includes a combination of a plurality of potential energy measures computed between respective transformed pairs of the voice feature vectors in accordance with respective ones of the distances, each of the potential energy measures comprising a difference between a first function of a corresponding one of the distances and a second function of the corresponding one of the distances, the second function being different than the first function, the first function and the second function collectively defining, for each of the transformed pairs of the voice feature vectors, a corresponding one of the potential energy measures for that transformed pair based on the distance between the voice feature vectors of that pair.
Oplustil Gallegos further fails to teach wherein training the speech synthesis model comprises performing potential energy minimization utilizing at least the first loss function.

However, Scodary teaches wherein the first loss function includes a combination of a plurality of potential energy measures computed between respective transformed pairs of the voice feature vectors in accordance with respective ones of the distances, each of the potential energy measures comprising a difference between a first function of a corresponding one of the distances and a second function of the corresponding one of the distances, the second function being different than the first function, the first function and the second function collectively defining, for each of the transformed pairs of the voice feature vectors, a corresponding one of the potential energy measures for that transformed pair based on the distance between the voice feature vectors of that pair ([Column 6, lines 23-32] “Call similarity may be performed by embedding the sequence of words into a sequence of vectors, with several signal features (i.e., energy, variance, spectral coefficients) appended to the word embedding. The distance function between two similarity matrices may minimize the distance between paired word/signal vectors”, where the examiner interprets the distance function as the loss function, the energy as including potential energy, and the conversion of words into a sequence of vectors as the transformation).

Oplustil Gallegos in view of Scodary is considered to be analogous to the claimed invention because both are in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the techniques of modifying speech generated by a text-to-speech synthesiser of Oplustil Gallegos with the technique of calculating loss through distances between vectors taught by Scodary in order to improve how audio content is analyzed for speech semantic content and speech vocal content to determine audio content metrics (see Scodary [Abstract]).

Oplustil Gallegos in view of Scodary fails to teach potential energy, and wherein training the speech synthesis model comprises performing potential energy minimization utilizing at least the first loss function.

However, Wieman teaches wherein the first loss function includes a combination of a plurality of potential energy measures… ([0149] “analyze the generated speech audio to find its distance from the centroid [0150] 4. if the generated audio is beyond the natural range, discard the parameters set [0151] 5. else, compute a vector from the centroid to the feature values of the generated speech audio and save the parameter set and its vector value [0152] 6. apply a cost function to choosing a next parameter set that favors a great distance from the vectors of saved parameter sets but being within the natural range from the centroid”, where the examiner interprets the centroid as relating to center of mass, and by extension to potential energy as defined in [0061] of the published specification, and the cost function as the loss function), and wherein training the speech synthesis model comprises performing potential energy minimization utilizing at least the first loss function ([0155] “9. for each variable value that the variable recognizer needs to recognize, for each saved parameter set, generate speech audio as an initial training set for training the variable recognizer.”).

Oplustil Gallegos in view of Scodary, further in view of Wieman, is considered to be analogous to the claimed invention because all are in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech processing techniques of Oplustil Gallegos in view of Scodary with the technique of computing vectors from the centroid taught by Wieman in order to improve training of a speech recognizer, such as for recognizing variables in a neural speech-to-meaning system, by computing, within an embedding space, a range of vectors of features of natural speech (see Wieman [Abstract]).

Regarding claims 2, 11, and 20, Oplustil Gallegos in view of Scodary, further in view of Wieman, teaches all of the limitations of claims 1, 10, and 19, upon which claims 2, 11, and 20 depend. Additionally, Oplustil Gallegos teaches wherein the calculating a second loss function according to a plurality of texts and a plurality of corresponding real audios comprises: obtaining the second loss function based on a difference between a synthesized audio for the plurality of texts and the plurality of real audios corresponding to the plurality of texts (FIG. 17(b), [0388] “The difference between the target mel spectrogram 1721 and the predicted mel spectrogram 1704 is then obtained using an L1 (based on the absolute difference) or L2 (based on the squared differences) loss function and a first loss is obtained. The target audio 1720 is also analyzed in 1722 to obtain one or more target acoustic features. The difference between the target acoustic feature resulting from the analysis 1722 and the predicted acoustic feature vector from the acoustic prediction network 1608 is then obtained using an L1 or L2 loss and a second loss is obtained”, where the examiner interprets taking the difference between the target and the predicted as analogous to taking the difference between the synthesized and real audio).

Claims 3 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Oplustil Gallegos in view of Scodary, further in view of Wieman, as shown above in claim 1, and further in view of Song et al., WO 2024058573 A1, which has a foreign priority application date of 9/16/2022 (hereinafter Song).

Regarding claims 3 and 12, Oplustil Gallegos in view of Scodary, further in view of Wieman, teaches all of the limitations of claims 1 and 10, upon which claims 3 and 12 depend. Additionally, Oplustil Gallegos teaches wherein the training a speech synthesis model based on the first loss function and the second loss function comprises: summarizing the first loss function and the second loss function to obtain a third loss function (FIG. 17(b), [0388] “The obtained first and second losses are added 1730 and the total loss is then backward propagated to update the weights of the acoustic prediction network”, where the examiner interprets the sum of the first and second loss functions to be the third loss function).

Oplustil Gallegos in view of Scodary, further in view of Wieman, fails to teach generating the speech synthesis model based on minimizing the third loss function. However, Song teaches generating the speech synthesis model based on minimizing the third loss function (p. 10, 2nd paragraph: “The total loss may be calculated based on at least one of the above-described first to fourth losses. As a specific example, the total loss may be calculated as a weighted sum of the first to fourth losses. A speech synthesis model can be trained to minimize total loss”).
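Song's teaching is that a total loss is formed as a weighted sum of individual losses and the model is trained to minimize it. The toy sketch below illustrates that idea with gradient descent; the equal weights, the quadratic stand-in losses, and the learning rate are all hypothetical values for illustration, not anything from the references.

```python
def total_loss(losses, weights):
    # Weighted sum of individual losses, as in Song's "weighted sum
    # of the first to fourth losses".
    return sum(w * l for w, l in zip(weights, losses))

# Toy minimization: two quadratic stand-in losses over one parameter theta,
# combined with equal weights of 0.5 each.
theta = 5.0
for _ in range(200):
    # d/d(theta) of 0.5*(theta-1)^2 + 0.5*(theta-3)^2
    grad = 0.5 * 2 * (theta - 1.0) + 0.5 * 2 * (theta - 3.0)
    theta -= 0.05 * grad  # gradient step toward the total-loss minimum

# With equal weights, the weighted-sum minimum sits at theta = 2.0.
```

Training "to minimize total loss" is exactly this loop at scale: backpropagation supplies the gradient of the weighted sum, and the parameter update drives the combined loss downward.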
Oplustil Gallegos in view of Scodary, further in view of Wieman and Song, is considered to be analogous to the claimed invention because all are in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the techniques of modifying speech of Oplustil Gallegos in view of Scodary and Wieman with the technique of generating a speech synthesis model based on minimized loss functions taught by Song in order to generate a self-supervised representation containing linguistic information from a text representation, and to generate acoustic features based on the self-supervised representation, thereby preventing linguistic information from being lost (see Song [page 2, 1st paragraph]).

Claims 4-7 and 13-16 are rejected under 35 U.S.C. 103 as being unpatentable over Oplustil Gallegos in view of Scodary, further in view of Wieman, and further in view of Zhou et al., US 20200410976 A1 (hereinafter Zhou).

Regarding claims 4 and 13, Oplustil Gallegos in view of Scodary, further in view of Wieman, teaches all of the limitations of claims 1 and 10, upon which claims 4 and 13 depend. Oplustil Gallegos in view of Scodary, further in view of Wieman, fails to teach inputting a first text and voice features of a first speaker into the speech synthesis model; and outputting a first audio corresponding to the first text.

However, Zhou teaches inputting a first text and voice features of a first speaker into the speech synthesis model (FIG. 1, [0038] “FIG. 1, the training process involves providing audio data corresponding to the speech of one person (speaker A) being fed into a content extraction block. Speaker A may, in some disclosed examples, be referred to as a ‘target speaker’”; FIG. 5, 525, [0068] “system 500 is configured for providing input audio data corresponding to speech of a target speaker”; [0069] “the phoneme sequence alignment estimator block 510 receives the input audio data and text corresponding to the input audio data… the received text may include interview transcript text, script text, transcription text, etc”; [0074] “the voice modeling neural network 525 is configured to generate a predicted audio signal 530”, where the examiner interprets 525 as the main component of the speech synthesis model); and outputting a first audio corresponding to the first text (FIG. 5, 530, [0080] “the voice modeling neural network 525 is configured to generate a predicted audio signal 530 (a/k/a ‘synthesized audio data’) that includes synthesized audio data corresponding to the first neural network output and the first identification data. In such instances, the synthesized audio data corresponds to words uttered by the source speaker according to speech characteristics of the target speaker”).

Oplustil Gallegos in view of Scodary, further in view of Wieman and Zhou, is considered to be analogous to the claimed invention because all are in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the techniques of modifying speech of Oplustil Gallegos in view of Scodary and Wieman with the technique of outputting audio through a speech synthesis model with text and voice features taught by Zhou in order to improve processing audio signals for speech style transfer implementations (see Zhou [0002]).

Regarding claims 5 and 14, Oplustil Gallegos in view of Scodary, further in view of Wieman and Zhou, teaches all of the limitations of claims 4 and 13, upon which claims 5 and 14 depend.
Additionally, Oplustil Gallegos teaches wherein the plurality of speakers corresponding to training of the speech synthesis model do not comprise the first speaker ([0170] “According to an example, the prediction network 21 is trained from a first training dataset 41 of text data 41a and audio data 41b pairs as shown in FIG. 3 (c). The Audio data 41b comprises one or more audio samples. In this example, the training dataset 41 comprises audio samples from a single speaker. In an alternative example, the training set 41 comprises audio samples from different speakers”; [0143] “FIG. 1 shows a schematic illustration of a method of modifying a speech signal. In S01, a user provides input text which is provided to a text-to-speech (TTS) system 1”, where the examiner interprets the user providing input as the first speaker who is not part of the training data).

Regarding claims 6 and 15, Oplustil Gallegos in view of Scodary, further in view of Wieman and Zhou, teaches all of the limitations of claims 4 and 13, upon which claims 6 and 15 depend. Additionally, Zhou teaches determining whether the first audio has the voice features of the first speaker (FIG. 5, 530, 540, [0074] “the loss function determining block 535 is configured for comparing the predicted audio signals to test data 540 and for determining a loss function value for the predicted audio signals. According to this example, the test data 540 is audio data corresponding to speech of the target speaker”, where the examiner interprets 530 as the first audio data and 540 as containing the voice features of the first speaker); and synthesizing, if the first audio has the voice features of the first speaker, a second audio corresponding to a second text using the speech synthesis model, the second audio having the voice features of the first speaker ([0075] “training the voice modeling neural network 525 (and in some instances, training the conditioning neural network 520) may continue until the loss function is relatively ‘flat,’ such that the difference between a current loss function value and a prior loss function value is at or below a threshold value”, where the examiner interprets the speech synthesis model as trained once the loss function is “flat”; FIG. 2, [0042-0043] “the vocal model neural network that has been trained for the voice of speaker A is also provided with identification data (which may be a simple ‘ID’ or more complex identification data) corresponding to speaker A, or corresponding to the speech of speaker A. Therefore, according to this example, the vocal model neural network outputs the words of speaker B in the voice of speaker A with speaker B's speech tempo and intonation. Put another way, in this example the vocal model neural network outputs synthesized audio data includes the words uttered by speaker B according to speech characteristics of speaker A that have been learned by the vocal model neural network”; FIG. 6, [0080] “the synthesized audio data corresponds to words uttered by the source speaker according to speech characteristics of the target speaker”, where the examiner interprets this synthesized audio as the second audio).

Regarding claims 7 and 16, Oplustil Gallegos in view of Scodary, further in view of Wieman and Zhou, teaches all of the limitations of claims 6 and 15, upon which claims 7 and 16 depend.
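Zhou's stopping criterion, as quoted above, is to continue training until the loss curve is relatively "flat", i.e. the change between a prior and a current loss value is at or below a threshold. That rule can be sketched as follows; the loss sequence and threshold in the usage example are made up for illustration.

```python
def train_until_flat(loss_values, threshold=0.01):
    # Stop at the first step where the change between the prior and
    # current loss value is at or below the threshold, i.e. the loss
    # curve has gone "flat" (Zhou [0075]).
    prev = None
    for step, loss in enumerate(loss_values):
        if prev is not None and abs(prev - loss) <= threshold:
            return step, loss
        prev = loss
    return step, loss  # never flattened within the given values

# Example: improvements of 0.5, 0.2, then 0.005 per step; training
# stops at the first step whose improvement is within the threshold.
stop_step, stop_loss = train_until_flat([1.0, 0.5, 0.3, 0.295, 0.294])
```

In a real training loop the loss values would arrive one per epoch; the same comparison against the previous value decides whether to take another optimization step.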
Additionally, Zhou teaches wherein the speech synthesis model is a first speech synthesis model, and the method further comprises: generating a second speech synthesis model based on the voice features of the first speaker if the first audio does not have the voice features of the first speaker (FIG. 4, 445, [0065] “training the second neural network (and in some instances, training the first neural network) may continue until the loss function is relatively ‘flat,’ such that the difference between a current loss function value and a prior loss function value (such as the previous loss function value) is at or below a threshold value. In the example shown in FIG. 4, block 445 involves repeating at least some of blocks 405 through 440 until a difference between a current loss function value for the first predicted audio signals and a prior loss function value for the first predicted audio signals is less than or equal to a predetermined value, for example 1.90, 1.92, 1.94, 1.96, 1.98, 2.00, etc. As described below, repeating some blocks (such as repeating block 420 and/or repeating block 430) may involve changing a physical state of at least one tangible storage medium location corresponding with at least one weight of the second neural network”, where the examiner interprets a second neural network with a loss function that is not ‘flat’ as the second speech synthesis model, which shows that the voice features in the first audio (530) do not exactly reflect the voice features of the first speaker (540)); and synthesizing a third audio corresponding to a third text using the second speech synthesis model, the third audio having the voice features of the first speaker (FIG. 6, [0080] “the synthesized audio data corresponds to words uttered by the source speaker according to speech characteristics of the target speaker”, where the examiner interprets the third synthesized audio as produced when the voice modeling network (525) has a loss function that is not flat, producing a third audio with some degree of the features of the first speaker).

Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Oplustil Gallegos in view of Scodary, further in view of Wieman and Zhou, as shown above in claim 4, and further in view of Yang et al., US 20210074261 A1 (hereinafter Yang).

Regarding claims 8 and 17, Oplustil Gallegos in view of Scodary, further in view of Wieman and Zhou, teaches all of the limitations of claims 4 and 13, upon which claims 8 and 17 depend. Oplustil Gallegos in view of Scodary, further in view of Wieman and Zhou, fails to teach wherein the speech synthesis model is generated by training at a cloud, and the first audio, corresponding to the first text, for the first speaker is locally generated.

However, Yang teaches wherein the speech synthesis model is generated by training at a cloud, and the first audio, corresponding to the first text, for the first speaker is locally generated (FIG. 6, [0163] “The context fusion and learning module 91 may learn a user's intent based on at least one data. The at least one data may further include at least one sensing data acquired by a client device or a cloud environment”; [0124] “FIG. 6 shows an example in which, while a speech can be received in a device 50, a procedure of processing the received speech and thereby synthesize the speech, that is, overall operations of speech synthesis, is performed in a cloud environment 60”).

Oplustil Gallegos in view of Scodary, further in view of Wieman, Zhou, and Yang, is considered to be analogous to the claimed invention because all are in the same field of speech processing.
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the techniques of modifying speech of Oplustil Gallegos in view of Scodary in view of Wieman in view of Zhou with the technique of training a speech synthesis model at the cloud and generating speech locally on a client device taught by Yang in order to correcting emotion information of synthesized speech by using a deep-learning model (see Yang [0002]). Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Oplustil Gallegos in view of Scodary in view of Wieman, as shown in claim 1 above, in further view of Chang et al. US 20230076239 A1 (hereinafter Chang). Regarding claims 9 and 18, Oplustil Gallegos in view of Scodary in view of Wieman teaches all of the limitations of claim 1 and 10, upon which claims 9 and 18 depend. Oplustil Gallegos in view of Scodary in view of Wieman fails to teach wherein the plurality of speakers comprise a second speaker and a third speaker; and a first distance between a first voice feature vector and a second voice feature vector for the second speaker is less than a second distance between the first voice feature vector for the second speaker and a third voice feature vector for the third speaker. However, Chang teaches wherein the plurality of speakers comprise a second speaker and a third speaker; and a first distance between a first voice feature vector and a second voice feature vector for the second speaker is less than a second distance between the first voice feature vector for the second speaker and a third voice feature vector for the third speaker ([0117] “a similarity score between two speakers and a 11-distance between speaker vectors of two speakers according to the calculated similarity score are calculated. 
Here, as the similarity score increases, the l1-distance decreases, and as a decreased deviation increases, a method is a better scoring method. FIG. 7 shows results of measuring an l1-distance between two speakers for four similarity scores”; [0134] “Subsequently, the device for synthesizing a multi-speaker speech may determine a third speaker vector having the highest correlation with the first speaker vector (the highest similarity score) among the plurality of second speaker vectors (S905).”). Oplustil Gallegos in view of Scodary in view of Wieman in view of Chang is considered analogous to the claimed invention because all are in the same field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the techniques of modifying speech of Oplustil Gallegos in view of Scodary in view of Wieman with the technique of comparing feature vectors taught by Chang in order to generate a speech of a new speaker, quickly and accurately generating a speech learning model of the new speaker using some of a plurality of speech vectors of a plurality of speakers (see Chang [0002]).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Gao et al.
(US 20220238116 A1) teaches a computer-implemented method of sequence-to-sequence data processing, comprising: inputting a first input comprising a first input data sequence into a model, the model outputting a first output data sequence, a first part of the model generating an intermediate state comprising information relating to an alignment relationship between the first input data sequence and the first output data sequence, the intermediate state being used in the model to generate the first output data sequence; storing the intermediate state; modifying the model to replace the first part with the stored intermediate state; inputting a second input comprising a second input data sequence into the modified model, the modified model outputting a second output data sequence using the intermediate state.

Fanelli et al. (US 20240160849 A1) teaches speaker diarization supporting episodical content. In an embodiment, a method comprises: receiving media data including one or more utterances; dividing the media data into a plurality of blocks; identifying segments of each block of the plurality of blocks associated with a single speaker; extracting embeddings for the identified segments in accordance with a machine learning model, wherein extracting embeddings for identified segments further comprises statistically combining extracted embeddings for identified segments that correspond to a respective continuous utterance associated with a single speaker; clustering the embeddings for the identified segments into clusters; and assigning a speaker label to each of the embeddings for the identified segments in accordance with a result of the clustering. In some embodiments, a voiceprint is used to identify a speaker and the speaker identity for a speaker label.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a).
Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ZEESHAN SHAIKH whose telephone number is (703) 756-1730. The examiner can normally be reached Monday-Friday, 7:30 AM-5:00 PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ZEESHAN MAHMOOD SHAIKH/
Examiner, Art Unit 2658

/RICHEMOND DORVIL/
Supervisory Patent Examiner, Art Unit 2658
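As an editorial illustration only (not part of the Office action record): the claims 9/18 limitation addressed by Chang turns on a geometric condition among voice feature vectors, namely that two vectors for the same (second) speaker lie closer together than a vector of the second speaker lies to a vector of the third speaker. The sketch below checks that condition using the l1 (Manhattan) distance referenced in Chang's cited paragraph [0117]; all vector values are hypothetical.

```python
def l1_distance(u, v):
    """Sum of absolute coordinate differences (l1 / Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Two hypothetical voice feature vectors for the second speaker
# (e.g. extracted from two different utterances)...
second_speaker_v1 = [0.10, 0.40, 0.25]
second_speaker_v2 = [0.12, 0.38, 0.27]
# ...and one hypothetical vector for the third speaker.
third_speaker_v = [0.70, 0.05, 0.60]

first_distance = l1_distance(second_speaker_v1, second_speaker_v2)   # ~0.06
second_distance = l1_distance(second_speaker_v1, third_speaker_v)    # ~1.30

# The claimed condition: the intra-speaker (first) distance is less
# than the inter-speaker (second) distance.
assert first_distance < second_distance
```

This mirrors the intuition in Chang [0117] that a higher similarity score between two speaker vectors corresponds to a smaller l1-distance between them.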

Prosecution Timeline

Nov 15, 2022
Application Filed
Feb 08, 2025
Non-Final Rejection — §101, §103
May 14, 2025
Response Filed
Jul 08, 2025
Final Rejection — §101, §103
Sep 15, 2025
Response after Non-Final Action
Sep 15, 2025
Examiner Interview Summary
Sep 15, 2025
Applicant Interview (Telephonic)
Sep 24, 2025
Request for Continued Examination
Oct 01, 2025
Response after Non-Final Action
Oct 10, 2025
Non-Final Rejection — §101, §103
Jan 15, 2026
Response Filed
Jan 15, 2026
Examiner Interview Summary
Jan 15, 2026
Applicant Interview (Telephonic)
Mar 18, 2026
Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12579373
SYSTEM AND METHOD FOR SYNTHETIC TEXT GENERATION TO SOLVE CLASS IMBALANCE IN COMPLAINT IDENTIFICATION
2y 5m to grant Granted Mar 17, 2026
Patent 12555575
Wakeup Indicator Monitoring Method, Apparatus and Electronic Device
2y 5m to grant Granted Feb 17, 2026
Patent 12518090
LOGICAL ROLE DETERMINATION OF CLAUSES IN CONDITIONAL CONSTRUCTIONS OF NATURAL LANGUAGE
2y 5m to grant Granted Jan 06, 2026
Patent 12511318
MULTI-SYSTEM-BASED INTELLIGENT QUESTION ANSWERING METHOD AND APPARATUS, AND DEVICE
2y 5m to grant Granted Dec 30, 2025
Patent 12512088
METHOD AND SYSTEM FOR USER-INTERFACE ADAPTATION OF TEXT-TO-SPEECH SYNTHESIS
2y 5m to grant Granted Dec 30, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

5-6
Expected OA Rounds
52%
Grant Probability
99%
With Interview (+55.0%)
3y 2m
Median Time to Grant
High
PTA Risk
Based on 31 resolved cases by this examiner. Grant probability derived from career allow rate.
