Prosecution Insights
Last updated: April 19, 2026
Application No. 18/404,568

METHOD AND APPARATUS FOR SYNTHESIZING UNIFIED VOICE WAVE BASED ON SELF-SUPERVISED LEARNING

Status: Non-Final OA (§103)
Filed: Jan 04, 2024
Examiner: PATEL, SHREYANS A
Art Unit: 2659
Tech Center: 2600 (Communications)
Assignee: Supertone, Inc.
OA Round: 3 (Non-Final)
Grant Probability: 89% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 3m
With Interview: 96%

Examiner Intelligence

Career Allow Rate: 89%, above average (359 granted / 403 resolved; +27.1% vs TC average)
Interview Lift: +7.4% (moderate), among resolved cases with an interview
Typical Timeline: 2y 3m average prosecution; 46 applications currently pending
Career History: 449 total applications across all art units

Statute-Specific Performance

§101: 21.3% (-18.7% vs TC average)
§102: 22.6% (-17.4% vs TC average)
§103: 36.0% (-4.0% vs TC average)
§112: 8.8% (-31.2% vs TC average)

Tech Center averages are estimates. Based on career data from 403 resolved cases.

Office Action

Current rejection: Non-Final, §103 (mailed Feb 25, 2026)
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

In the previous rejection, the inventor's own prior art was applied against the application. That rejection has been withdrawn, and a new Non-Final Office Action is being issued. See the detailed §103 rejection of claims 6-11 below.

Claim Rejections - 35 U.S.C. § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 6-11 are rejected under 35 U.S.C. 103 as being unpatentable over Wu et al. ("A Unified Model for Zero-Shot Singing Voice Conversion and Synthesis"; Dec. 4, 2022) in view of Soo et al. (TW 202309875).

Claim 6: Wu teaches a self-supervised learning-based singing voice synthesis method performed by a voice synthesis apparatus ([Abstract] the model incorporates self-supervised joint training of the phonetic encoder and the acoustic encoder), including a voice analysis module configured to be trained to output voice features for training voice signals by using the training voice signals representing training voices, and to output voice features for the training voices ([2. Method] it contains an acoustic encoder (Ea) which encodes the spectrogram and a phonetic encoder (Ep) which encodes the phoneme sequence), and a voice synthesis module configured to be trained to synthesize voice signals from the voice features for the training voices by using the output voice features, and to synthesize synthesized voice signals, representing synthesized voices, from the output voice features ([2. Method] the decoder (D) fuses it into the content information and generates the final output). The method comprises: obtaining a singing voice synthesis request including a synthesis target song and a synthesis target singer ([Abstract] a model that generates the singing voice of any target singer from any source singing content in either text or audio format); obtaining a voice signal associated with the synthesis target singer based on the singing voice synthesis request ([2.2 Target Encoder] the target encoder Et takes the log-scale mel-spectrogram xt of the audio segments from the target singer's corpus as input); generating, in a singing voice synthesis (SVS) module, singing voice features based on the synthesis target song and the voice signal associated with the synthesis target singer ([Abstract] [2.1 Source Encoder] we extract F0 contours; phonetic encoder (Ep) which encodes the phoneme sequence (linguistic)); generating, in the voice analysis module, timbre features of the synthesis target singer based on the voice signal associated with the synthesis target singer ([2.2 Target Encoder] outputs timbre features with different granularity levels from each Conv Block [of the target encoder]); and synthesizing, in the voice synthesis module, a singing voice signal, representing a voice in which the synthesis target song is sung using a voice of the synthesis target singer, based on the singing voice features and the timbre features ([2. Method] the decoder (D) fuses it (target singer information) into the content information and generates the final output).

The difference between the prior art and the claimed invention is that Wu does not explicitly teach generating, in the SVS module, singing voice features including a fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n]. Soo teaches this limitation ([Fig. 3 step S1] using a vocoder 41 (e.g., WORLD vocoder) to perform acoustic modeling on the audio data; analyzing the audio signal 34 according to the fundamental frequency 42, the harmonic spectrum envelope (periodic), and the aperiodic envelope). Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the unified model for zero-shot singing voice conversion and synthesis as taught by Wu to include generating, in the SVS module, singing voice features including a fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n], as taught by Soo, for the benefit of customizing a fixed-timbre virtual singer to generate multiple personalized virtual singers ([Abstract] Soo).
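For context on the limitation at issue: the F0 / periodic / aperiodic split that the Soo citation attributes to a WORLD-style vocoder can be reproduced with the open-source pyworld bindings. A minimal sketch, with a hypothetical file name; WORLD's spectral envelope and aperiodicity are used here as stand-ins for the claimed periodic amplitude Ap[n] and aperiodic amplitude Aap[n]:

```python
import numpy as np
import pyworld as pw   # pip install pyworld
import soundfile as sf

# Hypothetical input file; pyworld expects a mono float64 waveform.
x, fs = sf.read("target_singer.wav")
if x.ndim > 1:
    x = x.mean(axis=1)                       # mix down to mono
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.harvest(x, fs)         # fundamental frequency (F0) contour
sp = pw.cheaptrick(x, f0, t, fs)  # harmonic spectral envelope (periodic part)
ap = pw.d4c(x, f0, t, fs)         # aperiodicity (aperiodic part)

y = pw.synthesize(f0, sp, ap, fs)  # resynthesize from the three feature streams
```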
Claim 7: Wu further teaches the self-supervised learning-based singing voice synthesis method of claim 6, wherein the SVS module is an artificial neural network that is pre-trained to output singing voice features for an input synthesis target song ([Abstract] deep learning facilitates the implementation of zero-shot singing voice synthesis (SVS) and singing voice conversion (SVC) tasks and also provides the opportunity to unify these two tasks into one generalized model; the two-phase training process [7] to train the whole system; generates the singing voice of any target singer from any source singing content in either text or audio) and synthesis target singer by using a training dataset including training songs, training singer voices, and training singing voice features ([3.1 Datasets] First, the MPOP600 [32] dataset contains 600 Mandarin pop songs sung by two male and two female vocalists. Second, the NUS-48E [33] dataset consists of 48 English popular songs performed by 12 singers. Third, the VCTK corpus [34] is a multi-speaker speech dataset for TTS and voice conversion tasks and has been widely used in many singing voice generation systems; the log-scale mel-spectrograms with 80 bins are computed by short-time Fourier transformation (STFT) using an FFT size, window size, and hop size of 2,048, 1,200, and 300, respectively).

Claim 8: Wu teaches a self-supervised learning-based modified voice synthesis method performed by a voice synthesis apparatus, including a voice analysis module configured to be trained to output voice features for training voice signals by using the training voice signals representing training voices, and to output voice features for the training voices ([Abstract] [2. Method] the model incorporates self-supervised joint training of the phonetic encoder and the acoustic encoder; Figure 1 illustrates the whole diagram of the proposed system; the left side of the diagram (blue part) is the source encoder, which encodes the content information of the desired output from either audio or text input; it contains an acoustic encoder (Ea) which encodes the spectrogram and a phonetic encoder (Ep) which encodes the phoneme sequence), and a voice synthesis module configured to be trained to synthesize voice signals from the voice features for the training voices by using the output voice features, and to synthesize synthesized voice signals, representing synthesized voices, from the output voice features ([2. Method] [2.3 Decoder] the decoder (D) fuses it into the content information and generates the final output; the decoder is trained to minimize the average mel-spectrogram l2 reconstruction loss (see eq. 5)). The method comprises: obtaining a pre-conversion voice that is a voice conversion target ([Abstract] a model that generates the singing voice of any target singer from any source singing content in either text or audio format); outputting, in the voice analysis module, pre-conversion voice features based on the obtained pre-conversion voice ([Abstract] [2.1 Source Encoder] we extract F0 contours; phonetic encoder (Ep) which encodes the phoneme sequence (linguistic)); obtaining voice attributes for a converted voice ([2.2 Target Encoder] the target encoder Et encodes the spectrograms of the target singer's audios into singer information); outputting, in a voice design (VOD) module, converted voice features including a fundamental frequency F0 and timbre features for the converted voice based on the voice attributes for the converted voice ([2.2 Target Encoder] the target encoder Et outputs timbre features; see eq. 11, which calculates the converted F0); and synthesizing, in the voice synthesis module, the converted voice based on the pre-conversion voice features and the converted voice features ([2. Method] the decoder (D) fuses it into the content information and generates the final output; fs is then converted to a sequence of pitch embedding and concatenated with z as the decoder input).

The difference between the prior art and the claimed invention is that Wu does not explicitly teach outputting, in the voice analysis module, pre-conversion voice features including a fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n]. Soo teaches this limitation ([Fig. 3 step S1] using a vocoder 41 (e.g., WORLD vocoder) to perform acoustic modeling on the audio data; analyzing the audio signal 34 according to the fundamental frequency 42, the harmonic spectrum envelope (periodic), and the aperiodic envelope). Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the unified model for zero-shot singing voice conversion and synthesis as taught by Wu to include outputting, in the voice analysis module, pre-conversion voice features including a fundamental frequency F0, periodic amplitude Ap[n], aperiodic amplitude Aap[n], and linguistic features for the pre-conversion voice based on the obtained pre-conversion voice, as taught by Soo, for the benefit of customizing a fixed-timbre virtual singer to generate multiple personalized virtual singers ([Abstract] Soo).
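The mel-spectrogram front end quoted in the claim 7 mapping above (80 mel bins; FFT, window, and hop sizes of 2,048, 1,200, and 300) is straightforward to reproduce. A minimal sketch using librosa; the file name is hypothetical, and Wu's exact sample rate and log scaling are not specified here:

```python
import librosa
import numpy as np

y, sr = librosa.load("training_song.wav", sr=None)  # hypothetical file, native rate
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=2048,       # FFT size per Wu §3.1
    win_length=1200,  # window size
    hop_length=300,   # hop size
    n_mels=80,        # 80 mel bins
)
log_mel = np.log(mel + 1e-6)  # log scale; epsilon guards against log(0)
```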
Claim 9: Soo further teaches the self-supervised learning-based modified voice synthesis method of claim 8, wherein the VOD module is an artificial neural network that is pre-trained ([Fig. 4] [end of pg. 5] two-dimensional convolutional neural network; the network can be trained with an Adam optimizer for 200k steps) to output a fundamental frequency F0 and timbre features of the converted voice ([under claims] [pg. 8] vocoder to decompose the waveform of the audio data into fundamental frequencies, spectral envelopes, and aperiodic envelopes) based on input voice attributes, by using a training dataset including training voice attributes, training fundamental frequencies F0, and training timbre features ([Abstract] [Fig. 4] output features conditioned on target attributes and input acoustic feature x and the target domain attribute).
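The claim 9 citations describe the VOD module only at a high level: a two-dimensional CNN, conditioned on target attributes and an input acoustic feature, trained with Adam for roughly 200k steps. A minimal sketch of a network with that shape; every layer size and name below is invented for illustration and is not taken from Soo:

```python
import torch
import torch.nn as nn

class AttributeConditionedCNN(nn.Module):
    """Hypothetical 2-D CNN conditioned on a target-attribute vector."""
    def __init__(self, n_attrs: int, feat_ch: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + n_attrs, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, feat_ch, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor, attrs: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, freq, time) acoustic feature; attrs: (B, n_attrs)
        a = attrs[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        return self.net(torch.cat([x, a], dim=1))

model = AttributeConditionedCNN(n_attrs=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, per the citation
# ...a training loop would then run on the order of 200k steps
```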
Claim 10: Wu teaches a self-supervised learning-based text-to-speech (TTS) synthesis method performed by a voice synthesis apparatus ([Abstract] the model incorporates self-supervised joint training of the phonetic encoder and the acoustic encoder to generate the singing voice of any target singer from any source singing content in either text or audio format; it utilizes advanced zero-shot voice conversion (VC) and text-to-speech (TTS) approaches), including a voice analysis module configured to be trained to output voice features for training voice signals by using the training voice signals representing training voices, and to output voice features for the training voices ([2. Method] it contains an acoustic encoder (Ea) which encodes the spectrogram and a phonetic encoder (Ep) which encodes the phoneme sequence), and a voice synthesis module configured to be trained to synthesize voice signals from the voice features for the training voices by using the output voice features, and to synthesize synthesized voice signals, representing synthesized voices, from the output voice features ([2. Method] [2.3 Decoder] the decoder (D) fuses it into the content information and generates the final output). The method comprises: obtaining a synthesis target text and a synthesis target voice subject for which TTS synthesis is desired ([2. Method] [2.1 Source Encoder] the text input ps, which corresponds to xs, is a phoneme index sequence with the length of tp; the target encoder (Et) encodes the spectrograms of the target singer's audios into singer information); obtaining a voice associated with the synthesis target voice subject based on the synthesis target voice subject ([2.2 Target Encoder] Et takes the log-scale mel-spectrogram xt of the audio segments from the target singer's corpus as input); outputting, in the voice analysis module, voice features of the synthesis target voice subject, including timbre features of the synthesis target voice subject, based on the voice associated with the synthesis target voice subject ([2.2 Target Encoder] the target encoder Et outputs timbre features with different granularity levels from each Conv Block); and synthesizing the text voice ([2. Method] the decoder (D) fuses it (timbre features) into the content information and generates the final output).

The difference between the prior art and the claimed invention is that Wu does not explicitly teach synthesizing the text voice based on the fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n] for the text voice. Soo teaches synthesizing the text voice based on the fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n] for the text voice ([Fig. 3 step S1] analyzing the audio signal 34 according to the fundamental frequency 42, the harmonic spectrum envelope 43, and the aperiodic envelope 44), and outputting, in a TTS module, voice features of a text voice, including a fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n] for the text voice in which the synthesis target text is read using a voice of the synthesis target voice subject, based on the synthesis target text and the voice associated with the synthesis target voice subject ([Fig. 3 step S1] decomposing the waveform of the audio material into a fundamental frequency, a spectral envelope (periodic), and an aperiodic envelope using the vocoder). Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the unified model for zero-shot singing voice conversion and synthesis as taught by Wu to include outputting those voice features in a TTS module and synthesizing the text voice based on them, as taught by Soo, for the benefit of customizing a fixed-timbre virtual singer to generate multiple personalized virtual singers ([Abstract] Soo).

Claim 11: Soo further teaches the self-supervised learning-based TTS synthesis method of claim 10, wherein the TTS module is an artificial neural network ([Fig. 4] [end of pg. 5] two-dimensional convolutional neural network; the network can be trained with an Adam optimizer for 200k steps) that is pre-trained to output the fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n] of the text voice based on an input text ([under claims] [pg. 8] vocoder to decompose the waveform of the audio data into fundamental frequencies, spectral envelopes, and aperiodic envelopes) and voice, by using a training dataset including training synthesized texts, training voices, and training voice features ([Abstract] [Fig. 4] output features conditioned on target attributes and input acoustic feature x and the target domain attribute).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. KR 102639322 discloses a speech synthesis system and method that can quickly and accurately synthesize speech according to various prosody and speaker timbre styles using deep learning-based end-to-end speech synthesis with global style tokens, and that can replicate tones and prosody styles in real time. The speech synthesis method includes: extracting the fundamental frequency from the input reference audio and passing it to the reference encoder; encoding the fundamental frequency with the reference encoder to generate a prosody embedding, and generating a first style embedding from the prosody embedding; converting the reference audio to a reference mel spectrogram by Fourier transform, encoding the reference mel spectrogram to generate a speaker embedding, and generating a second style embedding from the speaker embedding; inputting the integrated style embedding that combines the first and second style embeddings into the attention of the speech synthesis (TTS) model together with the output of the TTS model's encoder; and generating, by the TTS model, synthesized speech audio in which timbre and prosody are combined for the input text.
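For orientation, the embedding flow that the KR 102639322 summary describes (an F0-derived prosody embedding and a mel-derived speaker embedding combined into one integrated style embedding that conditions the TTS model) can be sketched roughly as follows. The GRU encoders and dimensions are invented for illustration; the reference's actual architecture may differ:

```python
import torch
import torch.nn as nn

class StyleCombiner(nn.Module):
    """Hypothetical sketch: combine prosody and speaker embeddings."""
    def __init__(self, dim: int = 128, n_mels: int = 80):
        super().__init__()
        self.prosody_enc = nn.GRU(1, dim, batch_first=True)       # encodes the F0 contour
        self.speaker_enc = nn.GRU(n_mels, dim, batch_first=True)  # encodes the mel spectrogram
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, f0: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        # f0: (B, T, 1); mel: (B, T, n_mels)
        _, h_p = self.prosody_enc(f0)   # prosody embedding -> first style embedding
        _, h_s = self.speaker_enc(mel)  # speaker embedding -> second style embedding
        style = torch.cat([h_p[-1], h_s[-1]], dim=-1)
        return self.proj(style)  # integrated style embedding fed to the TTS attention
```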
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL, whose telephone number is (571) 270-0689. The examiner can normally be reached Monday-Friday, 8am-5pm PST.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Primary Examiner, Art Unit 2653

/SHREYANS A PATEL/
Examiner, Art Unit 2659

Prosecution Timeline

Jan 04, 2024: Application Filed
Sep 06, 2025: Non-Final Rejection (§103)
Nov 17, 2025: Response Filed
Nov 25, 2025: Non-Final Rejection (§103)
Jan 29, 2026: Examiner Interview (Telephonic)
Jan 29, 2026: Examiner Interview Summary
Feb 02, 2026: Response Filed
Feb 25, 2026: Non-Final Rejection (§103, current)

Precedent Cases

Applications with similar technology granted by this same examiner

Patent 12586597: ENHANCED AUDIO FILE GENERATOR (granted Mar 24, 2026; 2y 5m to grant)
Patent 12586561: TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM, AND A METHOD OF CALCULATING AN EXPRESSIVITY SCORE (granted Mar 24, 2026; 2y 5m to grant)
Patent 12548549: ON-DEVICE PERSONALIZATION OF SPEECH SYNTHESIS FOR TRAINING OF SPEECH RECOGNITION MODEL(S) (granted Feb 10, 2026; 2y 5m to grant)
Patent 12548583: ACOUSTIC CONTROL APPARATUS, STORAGE MEDIUM AND ACCOUSTIC CONTROL METHOD (granted Feb 10, 2026; 2y 5m to grant)
Patent 12536988: SPEECH SYNTHESIS METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM (granted Jan 27, 2026; 2y 5m to grant)

Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 89%
With Interview: 96% (+7.4%)
Median Time to Grant: 2y 3m
PTA Risk: High

Based on 403 resolved cases by this examiner; grant probability is derived from the career allow rate.
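A quick sanity check on how these figures relate, assuming (per the note above) that grant probability tracks the examiner's career allow rate and that the interview lift is additive in percentage points:

```python
granted, resolved = 359, 403
base = granted / resolved      # ~0.891 -> shown as the 89% grant probability
with_interview = base + 0.074  # +7.4 pp interview lift -> ~0.965, shown as 96%
print(f"base {base:.1%}, with interview {with_interview:.1%}")
```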
