Prosecution Insights
Last updated: April 19, 2026
Application No. 18/404,568

METHOD AND APPARATUS FOR SYNTHESIZING UNIFIED VOICE WAVE BASED ON SELF-SUPERVISED LEARNING

Status: Non-Final OA (§103)
Filed: Jan 04, 2024
Examiner: PATEL, SHREYANS A
Art Unit: 2659
Tech Center: 2600 (Communications)
Assignee: Supertone, Inc.
OA Round: 3 (Non-Final)
Grant Probability: 89% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 3m
With Interview: 96%

Examiner Intelligence

Career Allow Rate: 89%, above average (359 granted / 403 resolved; +27.1% vs TC average)
Interview Lift: +7.4% (moderate), among resolved cases with an interview
Typical Timeline: 2y 3m average prosecution; 46 applications currently pending
Career History: 449 total applications across all art units

Statute-Specific Performance

§101: 21.3% (-18.7% vs TC average)
§102: 22.6% (-17.4% vs TC average)
§103: 36.0% (-4.0% vs TC average)
§112: 8.8% (-31.2% vs TC average)

Tech Center averages are estimates. Based on career data from 403 resolved cases.

Office Action

Current rejection: Non-Final, §103 (mailed Feb 25, 2026)
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

In the previous rejection, the inventor's own prior art was applied against the application. That rejection has been withdrawn, and a new Non-Final Office Action is being issued. See the detailed §103 rejection of claims 6-11 below.

Claim Rejections - 35 U.S.C. § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 6-11 are rejected under 35 U.S.C. 103 as being unpatentable over Wu et al. ("A Unified Model for Zero-Shot Singing Voice Conversion and Synthesis"; Dec. 4, 2022) in view of Soo et al. (TW 202309875).

Claim 6: Wu teaches a self-supervised learning-based singing voice synthesis method performed by a voice synthesis apparatus ([Abstract] the model incorporates self-supervised joint training of the phonetic encoder and the acoustic encoder), including a voice analysis module configured to be trained to output voice features for training voice signals by using the training voice signals representing training voices, and to output voice features for the training voices ([2. Method] it contains an acoustic encoder (Ea) which encodes the spectrogram and a phonetic encoder (Ep) which encodes the phoneme sequence), and a voice synthesis module configured to be trained to synthesize voice signals from the voice features for the training voices by using the output voice features, and to synthesize synthesized voice signals, representing synthesized voices, from the output voice features ([2. Method] the decoder (D) fuses it into the content information and generates the final output). The method comprises: obtaining a singing voice synthesis request including a synthesis target song and a synthesis target singer ([Abstract] a model that generates the singing voice of any target singer from any source singing content in either text or audio format); obtaining a voice signal associated with the synthesis target singer based on the singing voice synthesis request ([2.2 Target Encoder] the target encoder Et takes the log-scale mel-spectrogram xt of the audio segments from the target singer's corpus as input); generating, in a singing voice synthesis (SVS) module, singing voice features based on the synthesis target song and the voice signal associated with the synthesis target singer ([Abstract] [2.1 Source Encoder] we extract F0 contours; phonetic encoder (Ep) which encodes the phoneme sequence (linguistic)); generating, in the voice analysis module, timbre features of the synthesis target singer based on the voice signal associated with the synthesis target singer ([2.2 Target Encoder] outputs timbre features with different granularity levels from each Conv Block [of the target encoder]); and synthesizing, in the voice synthesis module, a singing voice signal, representing a voice in which the synthesis target song is sung using a voice of the synthesis target singer, based on the singing voice features and the timbre features ([2. Method] the decoder (D) fuses it (target singer information) into the content information and generates the final output).

The difference between the prior art and the claimed invention is that Wu does not explicitly teach generating, in the SVS module, singing voice features including a fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n]. Soo teaches this limitation ([Fig. 3 step S1] using a vocoder 41 (e.g., WORLD vocoder) to perform acoustic modeling on the audio data; analyzing the audio signal 34 according to the fundamental frequency 42, the harmonic spectrum envelope (periodic), and the aperiodic envelope). Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the unified model for zero-shot singing voice conversion and synthesis as taught by Wu to include generating, in the SVS module, singing voice features including a fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n], as taught by Soo, for the benefit of customizing a fixed-timbre virtual singer to generate multiple personalized virtual singers ([Abstract] Soo).
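For context on the limitation at issue: the F0 / periodic / aperiodic split that the Soo citation attributes to a WORLD-style vocoder can be reproduced with the open-source pyworld bindings. A minimal sketch, with a hypothetical file name; WORLD's spectral envelope and aperiodicity are used here as stand-ins for the claimed periodic amplitude Ap[n] and aperiodic amplitude Aap[n]:

```python
import numpy as np
import pyworld as pw   # pip install pyworld
import soundfile as sf

# Hypothetical input file; pyworld expects a mono float64 waveform.
x, fs = sf.read("target_singer.wav")
if x.ndim > 1:
    x = x.mean(axis=1)                       # mix down to mono
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.harvest(x, fs)         # fundamental frequency (F0) contour
sp = pw.cheaptrick(x, f0, t, fs)  # harmonic spectral envelope (periodic part)
ap = pw.d4c(x, f0, t, fs)         # aperiodicity (aperiodic part)

y = pw.synthesize(f0, sp, ap, fs)  # resynthesize from the three feature streams
```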
Claim 7: Wu further teaches the self-supervised learning-based singing voice synthesis method of claim 6, wherein the SVS module is an artificial neural network that is pre-trained to output singing voice features for an input synthesis target song ([Abstract] deep learning facilitates the implementation of zero-shot singing voice synthesis (SVS) and singing voice conversion (SVC) tasks and also provides the opportunity to unify these two tasks into one generalized model; the two-phase training process [7] to train the whole system; generates the singing voice of any target singer from any source singing content in either text or audio) and synthesis target singer by using a training dataset including training songs, training singer voices, and training singing voice features ([3.1 Datasets] First, the MPOP600 [32] dataset contains 600 Mandarin pop songs sung by two male and two female vocalists. Second, the NUS-48E [33] dataset consists of 48 English popular songs performed by 12 singers. Third, the VCTK corpus [34] is a multi-speaker speech dataset for TTS and voice conversion tasks and has been widely used in many singing voice generation systems; the log-scale mel-spectrograms with 80 bins are computed by short-time Fourier transformation (STFT) using an FFT size, window size, and hop size of 2,048, 1,200, and 300, respectively).

Claim 8: Wu teaches a self-supervised learning-based modified voice synthesis method performed by a voice synthesis apparatus, including a voice analysis module configured to be trained to output voice features for training voice signals by using the training voice signals representing training voices, and to output voice features for the training voices ([Abstract] [2. Method] the model incorporates self-supervised joint training of the phonetic encoder and the acoustic encoder; Figure 1 illustrates the whole diagram of the proposed system; the left side of the diagram (blue part) is the source encoder, which encodes the content information of the desired output from either audio or text input; it contains an acoustic encoder (Ea) which encodes the spectrogram and a phonetic encoder (Ep) which encodes the phoneme sequence), and a voice synthesis module configured to be trained to synthesize voice signals from the voice features for the training voices by using the output voice features, and to synthesize synthesized voice signals, representing synthesized voices, from the output voice features ([2. Method] [2.3 Decoder] the decoder (D) fuses it into the content information and generates the final output; the decoder is trained to minimize the average mel-spectrogram l2 reconstruction loss (see eq. 5)). The method comprises: obtaining a pre-conversion voice that is a voice conversion target ([Abstract] a model that generates the singing voice of any target singer from any source singing content in either text or audio format); outputting, in the voice analysis module, pre-conversion voice features based on the obtained pre-conversion voice ([Abstract] [2.1 Source Encoder] we extract F0 contours; phonetic encoder (Ep) which encodes the phoneme sequence (linguistic)); obtaining voice attributes for a converted voice ([2.2 Target Encoder] the target encoder Et encodes the spectrograms of the target singer's audios into singer information); outputting, in a voice design (VOD) module, converted voice features including a fundamental frequency F0 and timbre features for the converted voice based on the voice attributes for the converted voice ([2.2 Target Encoder] the target encoder Et outputs timbre features; see eq. 11, which calculates the converted F0); and synthesizing, in the voice synthesis module, the converted voice based on the pre-conversion voice features and the converted voice features ([2. Method] the decoder (D) fuses it into the content information and generates the final output; fs is then converted to a sequence of pitch embedding and concatenated with z as the decoder input).

The difference between the prior art and the claimed invention is that Wu does not explicitly teach outputting, in the voice analysis module, pre-conversion voice features including a fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n]. Soo teaches this limitation ([Fig. 3 step S1] using a vocoder 41 (e.g., WORLD vocoder) to perform acoustic modeling on the audio data; analyzing the audio signal 34 according to the fundamental frequency 42, the harmonic spectrum envelope (periodic), and the aperiodic envelope). Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the unified model for zero-shot singing voice conversion and synthesis as taught by Wu to include outputting, in the voice analysis module, pre-conversion voice features including a fundamental frequency F0, periodic amplitude Ap[n], aperiodic amplitude Aap[n], and linguistic features for the pre-conversion voice based on the obtained pre-conversion voice, as taught by Soo, for the benefit of customizing a fixed-timbre virtual singer to generate multiple personalized virtual singers ([Abstract] Soo).
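The mel-spectrogram front end quoted in the claim 7 mapping above (80 mel bins; FFT, window, and hop sizes of 2,048, 1,200, and 300) is straightforward to reproduce. A minimal sketch using librosa; the file name is hypothetical, and Wu's exact sample rate and log scaling are not specified here:

```python
import librosa
import numpy as np

y, sr = librosa.load("training_song.wav", sr=None)  # hypothetical file, native rate
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=2048,       # FFT size per Wu §3.1
    win_length=1200,  # window size
    hop_length=300,   # hop size
    n_mels=80,        # 80 mel bins
)
log_mel = np.log(mel + 1e-6)  # log scale; epsilon guards against log(0)
```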
Claim 9: Soo further teaches the self-supervised learning-based modified voice synthesis method of claim 8, wherein the VOD module is an artificial neural network that is pre-trained ([Fig. 4] [end of pg. 5] two-dimensional convolutional neural network; the network can be trained with an Adam optimizer for 200k steps) to output a fundamental frequency F0 and timbre features of the converted voice ([under claims] [pg. 8] vocoder to decompose the waveform of the audio data into fundamental frequencies, spectral envelopes, and aperiodic envelopes) based on input voice attributes, by using a training dataset including training voice attributes, training fundamental frequencies F0, and training timbre features ([Abstract] [Fig. 4] output features conditioned on target attributes and input acoustic feature x and the target domain attribute).
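The claim 9 citations describe the VOD module only at a high level: a two-dimensional CNN, conditioned on target attributes and an input acoustic feature, trained with Adam for roughly 200k steps. A minimal sketch of a network with that shape; every layer size and name below is invented for illustration and is not taken from Soo:

```python
import torch
import torch.nn as nn

class AttributeConditionedCNN(nn.Module):
    """Hypothetical 2-D CNN conditioned on a target-attribute vector."""
    def __init__(self, n_attrs: int, feat_ch: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + n_attrs, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, feat_ch, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor, attrs: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, freq, time) acoustic feature; attrs: (B, n_attrs)
        a = attrs[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        return self.net(torch.cat([x, a], dim=1))

model = AttributeConditionedCNN(n_attrs=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, per the citation
# ...a training loop would then run on the order of 200k steps
```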
Claim 10: Wu teaches a self-supervised learning-based text-to-speech (TTS) synthesis method performed by a voice synthesis apparatus ([Abstract] the model incorporates self-supervised joint training of the phonetic encoder and the acoustic encoder to generate the singing voice of any target singer from any source singing content in either text or audio format; it utilizes advanced zero-shot voice conversion (VC) and text-to-speech (TTS) approaches), including a voice analysis module configured to be trained to output voice features for training voice signals by using the training voice signals representing training voices, and to output voice features for the training voices ([2. Method] it contains an acoustic encoder (Ea) which encodes the spectrogram and a phonetic encoder (Ep) which encodes the phoneme sequence), and a voice synthesis module configured to be trained to synthesize voice signals from the voice features for the training voices by using the output voice features, and to synthesize synthesized voice signals, representing synthesized voices, from the output voice features ([2. Method] [2.3 Decoder] the decoder (D) fuses it into the content information and generates the final output). The method comprises: obtaining a synthesis target text and a synthesis target voice subject for which TTS synthesis is desired ([2. Method] [2.1 Source Encoder] the text input ps, which corresponds to xs, is a phoneme index sequence with the length of tp; the target encoder (Et) encodes the spectrograms of the target singer's audios into singer information); obtaining a voice associated with the synthesis target voice subject based on the synthesis target voice subject ([2.2 Target Encoder] Et takes the log-scale mel-spectrogram xt of the audio segments from the target singer's corpus as input); outputting, in the voice analysis module, voice features of the synthesis target voice subject, including timbre features of the synthesis target voice subject, based on the voice associated with the synthesis target voice subject ([2.2 Target Encoder] the target encoder Et outputs timbre features with different granularity levels from each Conv Block); and synthesizing the text voice ([2. Method] the decoder (D) fuses it (timbre features) into the content information and generates the final output).

The difference between the prior art and the claimed invention is that Wu does not explicitly teach synthesizing the text voice based on the fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n] for the text voice. Soo teaches synthesizing the text voice based on the fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n] for the text voice ([Fig. 3 step S1] analyzing the audio signal 34 according to the fundamental frequency 42, the harmonic spectrum envelope 43, and the aperiodic envelope 44), and outputting, in a TTS module, voice features of a text voice, including a fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n] for the text voice in which the synthesis target text is read using a voice of the synthesis target voice subject, based on the synthesis target text and the voice associated with the synthesis target voice subject ([Fig. 3 step S1] decomposing the waveform of the audio material into a fundamental frequency, a spectral envelope (periodic), and an aperiodic envelope using the vocoder). Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the unified model for zero-shot singing voice conversion and synthesis as taught by Wu to include outputting those voice features in a TTS module and synthesizing the text voice based on them, as taught by Soo, for the benefit of customizing a fixed-timbre virtual singer to generate multiple personalized virtual singers ([Abstract] Soo).

Claim 11: Soo further teaches the self-supervised learning-based TTS synthesis method of claim 10, wherein the TTS module is an artificial neural network ([Fig. 4] [end of pg. 5] two-dimensional convolutional neural network; the network can be trained with an Adam optimizer for 200k steps) that is pre-trained to output the fundamental frequency F0, periodic amplitude Ap[n], and aperiodic amplitude Aap[n] of the text voice based on an input text ([under claims] [pg. 8] vocoder to decompose the waveform of the audio data into fundamental frequencies, spectral envelopes, and aperiodic envelopes) and voice, by using a training dataset including training synthesized texts, training voices, and training voice features ([Abstract] [Fig. 4] output features conditioned on target attributes and input acoustic feature x and the target domain attribute).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. KR 102639322 discloses a speech synthesis system and method that can quickly and accurately synthesize speech according to various prosody and speaker timbre styles using deep learning-based end-to-end speech synthesis with global style tokens, and that can replicate tones and prosody styles in real time. The speech synthesis method includes: extracting the fundamental frequency from the input reference audio and passing it to the reference encoder; encoding the fundamental frequency with the reference encoder to generate a prosody embedding, and generating a first style embedding from the prosody embedding; converting the reference audio to a reference mel spectrogram by Fourier transform, encoding the reference mel spectrogram to generate a speaker embedding, and generating a second style embedding from the speaker embedding; inputting the integrated style embedding that combines the first and second style embeddings into the attention of the speech synthesis (TTS) model together with the output of the TTS model's encoder; and generating, by the TTS model, synthesized speech audio in which timbre and prosody are combined for the input text.
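For orientation, the embedding flow that the KR 102639322 summary describes (an F0-derived prosody embedding and a mel-derived speaker embedding combined into one integrated style embedding that conditions the TTS model) can be sketched roughly as follows. The GRU encoders and dimensions are invented for illustration; the reference's actual architecture may differ:

```python
import torch
import torch.nn as nn

class StyleCombiner(nn.Module):
    """Hypothetical sketch: combine prosody and speaker embeddings."""
    def __init__(self, dim: int = 128, n_mels: int = 80):
        super().__init__()
        self.prosody_enc = nn.GRU(1, dim, batch_first=True)       # encodes the F0 contour
        self.speaker_enc = nn.GRU(n_mels, dim, batch_first=True)  # encodes the mel spectrogram
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, f0: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        # f0: (B, T, 1); mel: (B, T, n_mels)
        _, h_p = self.prosody_enc(f0)   # prosody embedding -> first style embedding
        _, h_s = self.speaker_enc(mel)  # speaker embedding -> second style embedding
        style = torch.cat([h_p[-1], h_s[-1]], dim=-1)
        return self.proj(style)  # integrated style embedding fed to the TTS attention
```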
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL, whose telephone number is (571) 270-0689. The examiner can normally be reached Monday-Friday, 8am-5pm PST.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Primary Examiner, Art Unit 2653

/SHREYANS A PATEL/
Examiner, Art Unit 2659

Prosecution Timeline

Jan 04, 2024: Application Filed
Sep 06, 2025: Non-Final Rejection (§103)
Nov 17, 2025: Response Filed
Nov 25, 2025: Non-Final Rejection (§103)
Jan 29, 2026: Examiner Interview (Telephonic)
Jan 29, 2026: Examiner Interview Summary
Feb 02, 2026: Response Filed
Feb 25, 2026: Non-Final Rejection (§103, current)

Precedent Cases

Applications with similar technology granted by this same examiner

Patent 12586597: ENHANCED AUDIO FILE GENERATOR (granted Mar 24, 2026; 2y 5m to grant)
Patent 12586561: TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM, AND A METHOD OF CALCULATING AN EXPRESSIVITY SCORE (granted Mar 24, 2026; 2y 5m to grant)
Patent 12548549: ON-DEVICE PERSONALIZATION OF SPEECH SYNTHESIS FOR TRAINING OF SPEECH RECOGNITION MODEL(S) (granted Feb 10, 2026; 2y 5m to grant)
Patent 12548583: ACOUSTIC CONTROL APPARATUS, STORAGE MEDIUM AND ACCOUSTIC CONTROL METHOD (granted Feb 10, 2026; 2y 5m to grant)
Patent 12536988: SPEECH SYNTHESIS METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM (granted Jan 27, 2026; 2y 5m to grant)

Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 89%
With Interview: 96% (+7.4%)
Median Time to Grant: 2y 3m
PTA Risk: High

Based on 403 resolved cases by this examiner; grant probability is derived from the career allow rate.
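A quick sanity check on how these figures relate, assuming (per the note above) that grant probability tracks the examiner's career allow rate and that the interview lift is additive in percentage points:

```python
granted, resolved = 359, 403
base = granted / resolved      # ~0.891 -> shown as the 89% grant probability
with_interview = base + 0.074  # +7.4 pp interview lift -> ~0.965, shown as 96%
print(f"base {base:.1%}, with interview {with_interview:.1%}")
```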
