Prosecution Insights
Last updated: April 19, 2026
Application No. 17/694,703

SYSTEM AND METHOD FOR VOICE UNIDENTIFIABLE MORPHING

Status: Non-Final Office Action (§102, §103)
Filed: Mar 15, 2022
Examiner: MAUNG, THOMAS H
Art Unit: 2692
Tech Center: 2600 — Communications
Assignee: SoundHound AI IP, LLC
OA Round: 5 (Non-Final)

Grant Probability: 63% (Moderate)
Expected OA Rounds: 5-6
Median Time to Grant: 2y 11m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 63% (242 granted / 382 resolved; +1.4% vs TC avg)
Interview Lift: +38.2% allow-rate lift on resolved cases with an interview vs. without
Typical Timeline: 2y 11m average prosecution; 24 applications currently pending
Career History: 406 total applications across all art units
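
The headline figures reduce to simple arithmetic. A quick sketch in Python (the clamp at 99% is our inference from the displayed values, not a documented rule of this dashboard):

```python
# Reproducing the dashboard's headline examiner statistics (plain arithmetic).
granted, resolved = 242, 382
allow_rate = granted / resolved                   # 0.6335... -> shown as "63%"
interview_lift = 0.382                            # +38.2 points, per the dashboard

# 63.4% + 38.2 points exceeds 100%, yet the page shows 99%, so the sum is
# presumably clamped -- an assumption, not a documented rule.
with_interview = min(allow_rate + interview_lift, 0.99)

print(f"career allow rate: {allow_rate:.1%}")     # 63.4%
print(f"with interview: ~{with_interview:.0%}")   # 99%
```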

Statute-Specific Performance

§101: 6.4% (-33.6% vs TC avg)
§102: 13.7% (-26.3% vs TC avg)
§103: 54.5% (+14.5% vs TC avg)
§112: 12.9% (-27.1% vs TC avg)

Tech Center averages are estimates. Based on career data from 382 resolved cases.
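
The "vs TC avg" deltas are plain subtractions, and each one implies the same 40.0% Tech Center baseline, consistent with the footnote calling the average an estimate. A minimal check:

```python
# Each statute's rate minus its "vs TC avg" delta recovers the implied baseline.
rates  = {"§101": 6.4, "§102": 13.7, "§103": 54.5, "§112": 12.9}
deltas = {"§101": -33.6, "§102": -26.3, "§103": 14.5, "§112": -27.1}
for s in rates:
    print(f"{s}: implied TC avg = {rates[s] - deltas[s]:.1f}%")  # 40.0% in every case
```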

Office Action

Rejections: §102, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments with respect to claims 1-8 and 13 have been considered but are not persuasive. Regarding the amendment, Examiner respectfully submits that one of ordinary skill in the art would recognize that voice morphing is in itself the process of removing "voice characteristics" of a source user, such that only features representing content of the spoken phrase are processed to morph or regenerate the content of the phrase in the target user's voice, removing any hint of the original user's voice.

In regards to the limitation of morphing using extracted speech content features without using the excluded user voice characteristics, Pan teaches only using prosody features and not the voice timbre of the source user. [0015]: The computing system then generates a spectrogram based on (i) the PPG data, (ii) the extracted one or more additional prosody features and (iii) a voice timbre of the target speaker. Using this technique, the generated spectrogram is characterized by the prosody style of the source speaker and the voice timbre of the target speaker. [0117]: more closely aligned to the prosody style of the source speaker and sound more like the voice timbre of the target speaker. [0004]: In the same manner, a human ear can distinguish between different voices having different timbres. [0107]: The voice conversion model 930 also includes an MFCC to PPG model configured to convert the speech data into PPG data 936 (e.g., PPG data 143). In some embodiments, the MFCC to PPG model 934 is speaker independent, wherein this component 934 is pre-trained using multi-speaker data. Beneficially, this model does not need to be further refined or adapted to the source speaker's audio. [0111]: In some alternate embodiments, it is possible to train a speaker-independent multi-speaker source PPG-to-spectrogram model, wherein the generated spectrograms are generated for an averaged voice.

Examiner further notes that any voice feature not mentioned in the prior art as being used to morph the user's spoken phrase is contemplated to read on the claimed "voice characteristics" excluded from use in morphing, for example, the voice timbres of source speakers ([0056] of Pan). For at least these reasons, Examiner respectfully submits that the prior art of record fully teaches the limitations of the claims.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1 and 7 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Pan et al. (US 2022/0068259).

Claim 1

Pan teaches a method for training an audio morpher (See for example Fig. 8; [0071], In some examples, the training engine 154 is configured to train the voice conversion model 147, or components of the voice conversion model, on speaker data (e.g., target speaker data 141, source speaker data 142, or multi-speaker data).), the method comprising:

receiving a target speaker spoken phrase (141 of Fig. 1; [0006], Training data for TTS models usually comprises audio data obtained by recording the particular target speaker while they speak and a set of text corresponding to the audio data (i.e., the textual representation of what the target speaker is saying to produce the audio data); [0043] In some embodiments, target speaker data 141 comprises electronic content/data obtained from a target speaker…In some instances, the target speaker data 141 and/or source speaker data 142 comprise(s) audio data, text data and/or visual data. [0044] In some embodiments, the target speaker data 141 and/or the source speaker data 142 is raw data (e.g., direct recordings). Additionally, or alternatively, in some embodiments, the target speaker data 141 and/or source speaker data 142 comprise processed data (e.g., waveform format of the speaker data and/or PPG data (e.g., PPG data 143) corresponding to the target and/or source speaker). See also [0054]);

receiving a user spoken phrase (142 of Fig. 1; [0043] the source speaker data 142 comprises electronic content/data from a source speaker. In some instances, the target speaker data 141 and/or source speaker data 142 comprise(s) audio data, text data and/or visual data.);

extracting, from the user spoken phrase, speech content features representative of linguistic content of the user spoken phrase, including phoneme-related information (for example, PPG data [0066]; [0045] In some embodiments, the PPG data 143 comprises phonetic information about the speech data from a particular speaker; [0046] In some embodiments, the frame length for each piece of phonetic information comprises whole phrases of speech, whole words of speech, particular phonemes of speech; [0052] For example, a portion of an email containing one or more sentences (e.g., a particular number of machine-recognizable words) is applied to the neural TTS model, wherein the model is able to recognize words or parts of words (e.g., phonemes) and is trained to produce a corresponding sound to the phonemes or words.);

excluding from use in morphing the user spoken phrase voice characteristics associated with the user, including speaker-specific characteristics used to represent the user's voice (Pan teaches only using prosody features and not the voice timbre of the originating user in morphing the user spoken phrase, as shown at least in 240 and 250 of Fig. 2. [0015]: The computing system then generates a spectrogram based on (i) the PPG data, (ii) the extracted one or more additional prosody features and (iii) a voice timbre of the target speaker. Using this technique, the generated spectrogram is characterized by the prosody style of the source speaker and the voice timbre of the target speaker. [0117]: more closely aligned to the prosody style of the source speaker and sound more like the voice timbre of the target speaker. [0004]: In the same manner, a human ear can distinguish between different voices having different timbres.);

computing a speaker representation of the target speaker ([0056] In some embodiments, a database is provided which stores a plurality of voice timbre profiles (e.g., voice timbres 149) corresponding to a plurality of target speakers; [0067] In some embodiments, computing system 110 stores and/or access a feature extraction engine 153. The feature extraction engine 153 is configured to extract features and/or attributes from the target speaker data 141…In such embodiments, the extracted attributes are included in the training data sets configured to train the machine learning models.);

morphing, using the audio morpher, the user spoken phrase based on the target speaker representation (voice timbre of the target speaker; 240 of Fig. 2; see also 830 of Fig. 8 for example) and the extracted speech content features (for example, PPG data in 220 of Fig. 2), without using the excluded user voice characteristics (Fig. 2 shows that the voice conversion is performed based on speech data characterized by the prosody style of the source speaker and voice timbre of the target speaker, and not the voice characteristics/timbre of the source speaker);

creating a spectrogram of the target speaker spoken phrase ([0070] In some embodiments, based on the attributes extracted by the feature extraction engine 153, the training engine 154 is able to adapt the training processes and methods such that the training process produces a trained model configured to generate specialized training data that reflects the inclusion of particular features and attributes that contribute to the desired prosody style. For example, the inclusion of the pitch attributes will help determine a fundamental frequency at which to generate the spectrogram data);

creating a spectrogram of the morphed user spoken phrase (See 940 of Fig. 9 for example, which shows the target speaker's spectrogram is a result of morphing the source speaker's audio via the voice conversion module 930); and

training the audio morpher with an objective function of minimizing the differences between the spectrogram of the morphed user spoken phrase and the spectrogram of the target speaker spoken phrase ([0073] In some embodiments, the computing system 110 includes a refinement engine 157. In some instances, the refinement engine 157 is in communication with the training engine. The refinement engine 157 is configured to refine the voice conversion model, or a component of the voice conversion model (e.g., the PPG-spectrogram component) by adapting the model component (or sub-model) to a target speaker using target speaker data 141. [0076] In some embodiments, the decoding engine 158 is in communication with a refinement engine 157 that is configured to refine the encoder/decoder network of the neural TTS model 146 by employing a feedback loop between the encoder and decoder. The neural TTS model 146 is then trained and refined by iteratively minimizing the reconstruction loss from transforming the input text into speech data, and speech data back into text data. In some embodiments, the refinement engine 157 is also configured to refine and/or optimize any one or a combination of machine learning engines/models included in the computing system 110 to facilitate an increase in efficiency, efficacy, and accuracy of the engines/models.);

training the morpher to reduce retention of speaker-specific characteristics associated with the user in the morphed user spoken phrase by excluding user speaker representations from the morphing process ([0054] In some instances, the target speaker data 141 is formatted as training data, wherein the neural TTS model 146 is trained (or pre-trained) on the target speaker training data such that the neural TTS model 146 is able to produce speech data in the voice timbre and prosody style of the target speaker based on input text. In some embodiments, the neural TTS model 146 is speaker independent, meaning that it produces arbitrary speech data based on one or a combination of target speaker datasets (e.g., target speaker data 141). In some embodiments, the neural TTS model 146 is a multi-speaker neural network, meaning that the model is configured to produce speech data corresponding to a plurality of discrete speakers/speaker profiles. In some embodiments, the neural TTS model 146 is speaker dependent, meaning that the model is configured to produce speech primarily for a particular target speaker. [0073] In some embodiments, the computing system 110 includes a refinement engine 157. In some instances, the refinement engine 157 is in communication with the training engine. The refinement engine 157 is configured to refine the voice conversion model, or a component of the voice conversion model (e.g., the PPG-spectrogram component) by adapting the model component (or sub-model) to a target speaker using target speaker data 141. Fig. 8 and [0105]: In step two, the data (target speaker data 830) is used to train a neural TTS model (e.g., TTS model 146) (see neural TTS training 840), wherein the neural TTS model is able to generate speech data 850 from text input. The speech data is TTS data in the voice timbre of the target speaker with the style that was transferred from the source speaker. Examiner notes the above citations, for example, clearly show the TTS model is solely based on the target speaker and excludes user/source speaker representations (source speaker timbre), and that one of ordinary skill in the art would recognize voice morphing is in itself the process of excluding "voice characteristics" of a source user so that the result of the voice morphing sounds like it is from a target speaker.).

Claim 7

This claim recites substantially the same limitations as those provided in claim 1 above, and therefore it is rejected for the same reasons.
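
To make the mapped claim language concrete, here is a minimal PyTorch sketch of the training scheme claim 1 recites. Every module name, dimension, and loss choice is a hypothetical illustration, not the application's (or Pan's) implementation: the content encoder stands in for PPG-like linguistic feature extraction, the user's own speaker representation is never computed, and the objective minimizes the difference between the morphed spectrogram and the target speaker's spectrogram.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Extracts phoneme-related content features from a mel spectrogram."""
    def __init__(self, n_mels=80, d=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, d, batch_first=True)
    def forward(self, mel):                       # mel: (batch, frames, n_mels)
        content, _ = self.rnn(mel)
        return content                            # (batch, frames, d)

class Morpher(nn.Module):
    """Regenerates a spectrogram from content features + a target speaker vector."""
    def __init__(self, d=256, d_spk=128, n_mels=80):
        super().__init__()
        self.out = nn.Linear(d + d_spk, n_mels)
    def forward(self, content, target_spk):       # target_spk: (batch, d_spk)
        spk = target_spk.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.out(torch.cat([content, spk], dim=-1))

encoder, morpher = ContentEncoder(), Morpher()
opt = torch.optim.Adam([*encoder.parameters(), *morpher.parameters()], lr=1e-4)

def train_step(user_mel, target_mel, target_spk):
    # The user's own speaker embedding is deliberately never computed or passed
    # in -- only content features plus the *target* speaker vector reach the morpher.
    morphed = morpher(encoder(user_mel), target_spk)
    loss = nn.functional.l1_loss(morphed, target_mel)  # minimize spectrogram difference
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# e.g. user_mel, target_mel: (1, 200, 80); target_spk: (1, 128)
```

The GRU encoder and L1 spectrogram loss are arbitrary stand-ins; the structural point is that the morpher sees content features and a target speaker vector, never a user speaker representation. The sketch also presumes the two phrases are frame-aligned, which is what claims 2 and 3 below address.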
Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2-4, 6, and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Pan et al. (US 2022/0068259) in view of Biadsy et al. (US 2022/0068257).

Claim 2

Pan discloses in Fig. 5 and in [0066], The conversion engine 152 is configured for converting between data types and to configure data raw data into training data usable for training any one of the machine learning models described herein. Pan may not clearly detail wherein the target speaker spoken phrase and the user spoken phrase use the same words. Biadsy teaches wherein the target speaker spoken phrase and the user spoken phrase use the same words (See Fig. 1A of Biadsy, for example). It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate WER comparison as taught by Biadsy with the cross-speaker style training data of Pan, because doing so would have provided a way to filter speech representations to be used for training speech conversion models ([0005] of Biadsy).

Claim 3

The combination teaches the method of claim 2, wherein training of the audio morpher includes aligning the target speaker spoken phrase and the user spoken phrase ([0056] of Biadsy, The reference S2S conversion model 301 can use a sequence-to-sequence to normalize arbitrary speech, potentially including background noise, and generate the same content in the voice of the single predefined target speaker. The source speech can be from any speaker or accent, and may contain complex prosodic patterns, imperfections, atypical speech, and background noise, all of which are removed through the normalization process as the first audio data is converted into clean second audio data with a fixed accent and consistent articulation and prosody. In other words, the system may be used to project away all non-linguistic information, including speaker characteristics, and to retain only what is been said, not who, how, or where it is said.).

Claim 4

Pan discloses in Fig. 5 and in [0066], The conversion engine 152 is configured for converting between data types and to configure data raw data into training data usable for training any one of the machine learning models described herein. Biadsy teaches wherein the target speaker spoken phrase and words in the user spoken phrase are similar ([0056] of Biadsy, The reference S2S conversion model 301 can use a sequence-to-sequence to normalize arbitrary speech, potentially including background noise, and generate the same content in the voice of the single predefined target speaker. The source speech can be from any speaker or accent, and may contain complex prosodic patterns, imperfections, atypical speech, and background noise, all of which are removed through the normalization process as the first audio data is converted into clean second audio data with a fixed accent and consistent articulation and prosody. In other words, the system may be used to project away all non-linguistic information, including speaker characteristics, and to retain only what is been said, not who, how, or where it is said.).

Claim 6

This claim recites substantially the same limitations as those provided in claim 3 above, and therefore it is rejected for the same reasons.
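
Claim 3's "aligning" step can be pictured with standard dynamic time warping over frame-level features. A self-contained NumPy sketch follows; this is one plausible method among several, since the claim does not specify an alignment technique:

```python
import numpy as np

def dtw_align(a, b):
    """Align two feature sequences a:(T1,D), b:(T2,D); return frame index pairs."""
    T1, T2 = len(a), len(b)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])          # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],             # insertion
                                 cost[i, j - 1],             # deletion
                                 cost[i - 1, j - 1])         # match
    path, i, j = [], T1, T2                                  # backtrack the best path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        i, j = (i - 1, j - 1) if k == 0 else (i - 1, j) if k == 1 else (i, j - 1)
    return path[::-1]

# e.g. warp the target speaker's mel frames onto the user phrase's timing
# before computing the spectrogram loss in the training sketch above.
```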
Claim 8

Pan of the combination teaches the system of claim 7, wherein the target speaker voice characteristics are selected at random from an array of target speaker representations ([0070] In some embodiments, based on the attributes extracted by the feature extraction engine 153, the training engine 154 is able to adapt the training processes and methods such that the training process produces a trained model configured to generate specialized training data that reflects the inclusion of particular features and attributes that contribute to the desired prosody style. [0071] For example, in some embodiments, the training engine 154 is configured to train a model (e.g., a neural TTS model 146, see also model 1000 of FIG. 10) with training data (e.g., spectrogram data 144) such that the machine learning model is configured to generate speech from arbitrary text as according to embodiments described herein. In some examples, the training engine 154 is configured to train the voice conversion model 147, or components of the voice conversion model, on speaker data (e.g., target speaker data 141, source speaker data 142, or multi-speaker data).).

Claims 5 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Pan et al. (US 2022/0068259) in view of Biadsy et al. (US 2022/0068257) and Meng et al. (US 2009/0089063).

Claim 5

The combination teaches the method of claim 4, except wherein the percent difference between the number of phonemes in the target speaker spoken phrase and the user spoken phrase is below a predefined percentage. Meng teaches wherein the percent difference between the number of phonemes in the target speaker spoken phrase and the user spoken phrase is below a predefined percentage ([0031] of Meng, Preferably, this target cost can be a weighted sum of spectral distance, prosodic distance and phonetic distance; [0034] of Meng, If two code words belong to the same phoneme but have different neighboring phonemes, their distance is set to a small value. However if two code words belong to different phoneme, their distance will be set to a large value. [0035] Besides the target cost, the transition cost between two candidate code words further needs to be defined. This transition cost can be a weighted sum of spectral distance, prosodic distance and phonetic distance, which is similar to the target cost. [0036] Thus, the set of code words in the target speaker's corpus which match the converted first spectrum and the f₀ contour most can be determined through the selection procedure.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate training data as taught by Meng with the cross-speaker style training data of Pan in view of Biadsy, because doing so would have reduced the difference between the converted speech and the target speaker caused by the detailed spectral difference between speakers' speech ([0012] of Meng).
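
Claim 5 (and the related claim 13, addressed next) gates training pairs by the difference in phoneme counts between the two phrases, one as a percentage and one as an absolute count. A trivial sketch, with hypothetical thresholds standing in for the claims' "predefined" values:

```python
def phrase_pair_ok(target_phonemes, user_phonemes, max_pct=20.0, max_count=3):
    """Accept a (target, user) training pair only if the phoneme counts differ by
    less than max_pct percent (claim 5) and by less than max_count phonemes
    (claim 13). Both thresholds are hypothetical stand-ins."""
    nt, nu = len(target_phonemes), len(user_phonemes)
    pct_diff = abs(nt - nu) / max(nt, nu, 1) * 100.0
    return pct_diff < max_pct and abs(nt - nu) < max_count

# e.g. 5 vs 6 phonemes: a 16.7% and 1-phoneme difference -> accepted
print(phrase_pair_ok(["DH","AH","K","AE","T"], ["DH","AH","K","AE","T","S"]))  # True
```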
Claim 13

The combination further indicates the method of claim 5, wherein the target speaker spoken phrase and the user spoken phrase are different by less than a predefined number of phonemes (This claim similarly claims the limitations of claim 5, essentially indicating the difference in phonemes between the target speaker spoken phrase and the user spoken phrase is below a certain value; [0031] of Meng, Preferably, this target cost can be a weighted sum of spectral distance, prosodic distance and phonetic distance; [0034] of Meng, If two code words belong to the same phoneme but have different neighboring phonemes, their distance is set to a small value. However if two code words belong to different phoneme, their distance will be set to a large value. [0035] Besides the target cost, the transition cost between two candidate code words further needs to be defined. This transition cost can be a weighted sum of spectral distance, prosodic distance and phonetic distance, which is similar to the target cost. [0036] Thus, the set of code words in the target speaker's corpus which match the converted first spectrum and the f₀ contour most can be determined through the selection procedure.).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to THOMAS H MAUNG whose telephone number is (571) 270-5690. The examiner can normally be reached Monday-Friday, 9am-6pm, EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Carolyn R. Edwards, can be reached at (571) 270-7136. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/THOMAS H MAUNG/
Primary Examiner, Art Unit 2692

Prosecution Timeline

Mar 15, 2022: Application Filed
Nov 04, 2023: Non-Final Rejection — §102, §103
Feb 22, 2024: Interview Requested
Feb 28, 2024: Examiner Interview Summary
Feb 28, 2024: Applicant Interview (Telephonic)
May 09, 2024: Response Filed
May 18, 2024: Final Rejection — §102, §103
Oct 23, 2024: Request for Continued Examination
Oct 26, 2024: Response after Non-Final Action
Feb 10, 2025: Non-Final Rejection — §102, §103
Jul 14, 2025: Response Filed
Aug 08, 2025: Final Rejection — §102, §103
Dec 12, 2025: Request for Continued Examination
Jan 13, 2026: Response after Non-Final Action
Jan 22, 2026: Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602446: DATA COMMUNICATION SYSTEM (2y 5m to grant; granted Apr 14, 2026)
Patent 12602196: Audio Playback Adjustment (2y 5m to grant; granted Apr 14, 2026)
Patent 12585653: PARSING IMPLICIT TABLES (2y 5m to grant; granted Mar 24, 2026)
Patent 12586562: ANIMATED SPEECH REFINEMENT USING MACHINE LEARNING (2y 5m to grant; granted Mar 24, 2026)
Patent 12578918: STREAMING AUDIO TO DEVICE CONNECTED TO EXTERNAL DEVICE (2y 5m to grant; granted Mar 17, 2026)
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 63%
With Interview: 99% (+38.2% lift)
Median Time to Grant: 2y 11m
PTA Risk: High

Based on 382 resolved cases by this examiner. Grant probability derived from career allow rate.
