Prosecution Insights
Last updated: April 19, 2026
Application No. 18/676,743

Unsupervised Learning of Disentangled Speech Content and Style Representation

Status: Non-Final OA (§102, §103)
Filed: May 29, 2024
Examiner: SUBRAMANI, NANDINI
Art Unit: 2656
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 1 (Non-Final)

Grant Probability: 63% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 3y 2m
With Interview: 99%

Examiner Intelligence

Grants 63% of resolved cases
63%
Career Allow Rate
55 granted / 87 resolved
+1.2% vs TC avg
Strong +49% interview lift
Without
With
+49.4%
Interview Lift
resolved cases with interview
Typical timeline
3y 2m
Avg Prosecution
24 currently pending
Career history
111
Total Applications
across all art units
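
The headline allow rate is simple arithmetic on the two counts shown above. A one-line check in Python, purely illustrative:

granted, resolved = 55, 87             # counts from the stats above
allow_rate = granted / resolved        # 0.632..., displayed as 63%
print(f"career allow rate: {allow_rate:.1%}")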

Statute-Specific Performance

§101: 15.6% (-24.4% vs TC avg)
§103: 60.4% (+20.4% vs TC avg)
§102: 10.0% (-30.0% vs TC avg)
§112: 11.6% (-28.4% vs TC avg)

TC averages are estimates. Based on career data from 87 resolved cases.

Office Action

Grounds of rejection: §102, §103
DETAILED ACTION

Introduction

Applicant's submission filed on 05/29/2024 has been entered. Claims 1-20 are pending in the application and have been examined.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-4 and 11-14 are rejected under 35 U.S.C. 102(a)(1)/(a)(2) as being anticipated by Garbacea et al., US PgPub 2020/0234725.

Regarding claim 1, Garbacea teaches a computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising: receiving input speech (see Garbacea, [0020], the encoder system receives input audio data 102); processing, using an encoder of an autoencoder model, the input speech to predict a latent representation of linguistic content for the input speech (see Garbacea, [0020], encoding the input audio data 102 to generate a discrete latent representation 122 of the input audio data 102; [0035]), wherein the encoder comprises: a plurality of convolutional layers configured to receive the input speech as input and generate an initial discrete per-timestep latent representation (see Garbacea, [0031], discussing a discrete latent representation of fixed duration (per timestep); [0036], the encoder neural network 110 can be a dilated convolutional neural network that receives the sequence of audio data and generates the sequence of encoded vectors); and a vector quantization (VQ) layer configured to receive each initial discrete per-timestep latent representation and predict the latent representation of linguistic content as a sequence of latent variables representing the linguistic content from the input speech (see Garbacea, [0038], a content latent embedding vector that is nearest to the encoded vector for the latent variable is determined); determining a VQ loss for the encoder based on the latent representation of linguistic content (see Garbacea, [0093], [0112], the commitment loss (VQ loss)); processing, using a decoder of the autoencoder model, the latent representation of linguistic content for the input speech to predict output speech, the output speech comprising a reconstruction of the input speech (see Garbacea, [0032], the decoder system 150 generates an estimate of the input audio data 102 based on the discrete latent representation 122 of the input audio data 102; Fig. 3); determining a reconstruction loss between the input speech and the reconstruction of the input speech (see Garbacea, [0098]-[0099], discussing reconstruction loss); and training the autoencoder model on the VQ loss and the reconstruction loss (see Garbacea, [0098], [0105], [0111], discussing training on the reconstruction error and content latent embedding error).
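
To make the cited VQ-VAE mechanism concrete, here is a minimal sketch of the nearest-neighbor codebook lookup with a VQ (commitment) loss and a reconstruction loss, in the spirit of the Garbacea citations. All shapes, the codebook size, and the stand-in decoder are illustrative assumptions, not details taken from the reference or the claims.

import numpy as np

rng = np.random.default_rng(0)
T, D, K = 50, 64, 128              # timesteps, latent dim, codebook size (assumed)

z_e = rng.normal(size=(T, D))      # encoder output: one latent per timestep
codebook = rng.normal(size=(K, D)) # content codebook

# Nearest-neighbor lookup: pick the closest codebook vector per timestep.
dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
codes = dists.argmin(axis=1)       # discrete latent variables
z_q = codebook[codes]              # quantized per-timestep representation

# VQ / commitment loss: pulls encoder outputs toward their nearest codes.
vq_loss = ((z_e - z_q) ** 2).mean()

# Reconstruction loss between input speech x and decoder output x_hat
# (a real decoder is omitted; a noisy copy stands in for it here).
x = rng.normal(size=(T,))
x_hat = x + 0.1 * rng.normal(size=(T,))
recon_loss = ((x - x_hat) ** 2).mean()

total_loss = recon_loss + vq_loss  # the model trains on both terms

The point of the two terms matches the rejection's mapping: the reconstruction loss supervises the decoder output against the input speech, while the VQ loss keeps encoder outputs close to their nearest codebook entries.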
Regarding claim 2, Garbacea teaches the method of claim 1, and further teaches wherein the decoder comprises a plurality of convolutional layers configured to receive the latent representation of linguistic content for the input speech (see Garbacea, [0056], the decoder neural network 170 is an auto-regressive neural network, e.g., a WaveNet or other auto-regressive convolutional neural network).

Regarding claim 3, Garbacea teaches the method of claim 2, and further teaches wherein a number of the plurality of convolutional layers of the decoder is equal to a number of the plurality of convolutional layers of the encoder (see Garbacea, [0055]-[0056], the decoder neural network 170 is the same type of neural network as the encoder neural network 110, but configured to generate a reconstruction from a decoder input (which is the same size as the encoder output) rather than from input audio data; the decoder neural network 170 is an auto-regressive neural network, e.g., a WaveNet or other auto-regressive convolutional neural network).

Regarding claim 4, Garbacea teaches the method of claim 1, and further teaches wherein the VQ loss encourages the encoder to minimize a distance between an output and a nearest codebook (see Garbacea, [0038], discussing using the content codebook to determine the content latent embedding vector using a nearest neighbor lookup).

Claim 11 is directed to a system corresponding to the method presented in claim 1 and is rejected under the same grounds stated above regarding claim 1. Claim 12 is directed to a system corresponding to the method presented in claim 2 and is rejected under the same grounds stated above regarding claim 2. Claim 13 is directed to a system corresponding to the method presented in claim 3 and is rejected under the same grounds stated above regarding claim 3. Claim 14 is directed to a system corresponding to the method presented in claim 4 and is rejected under the same grounds stated above regarding claim 4.
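
The layer-count symmetry recited in claim 3 reduces to a structural check. The sketch below is a hypothetical configuration, with invented channel counts, showing a decoder stack that mirrors the encoder stack.

from dataclasses import dataclass

@dataclass
class ConvSpec:
    channels: tuple                 # output channels per convolutional layer

encoder = ConvSpec(channels=(64, 128, 256))
decoder = ConvSpec(channels=tuple(reversed(encoder.channels)))

# Claim 3's condition: equal layer counts in the two stacks.
assert len(decoder.channels) == len(encoder.channels)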
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5-7 and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Garbacea et al., US PgPub 2020/0234725, in view of Qian, Kaizhi, et al., "AutoVC: Zero-shot voice style transfer with only autoencoder loss," International Conference on Machine Learning, PMLR, 2019 (cited in IDS).

Regarding claim 5, Garbacea teaches the method of claim 1, and further teaches wherein processing the input speech to predict the latent representation of linguistic content for the input speech comprises processing the input speech to generate the latent representation of linguistic content as a discrete per-timestep latent representation of linguistic content that discards speaking style variations in the input speech (see Garbacea, [0040]-[0041], discussing using the content codebook to determine the latent embedding vectors and using the speaker codebook only when one is provided (when there is no speaker codebook, speaking styles are discarded)). Garbacea teaches a latent representation of linguistic content; to further teach a latent representation of linguistic content as a discrete per-timestep latent representation that discards speaking style variations in the input speech, Qian teaches processing the input speech to generate such a representation (see Qian, sect. 3.3: as shown in Fig. 2(c), the dimension of C1 is chosen such that the dimension reduction is just enough to get rid of all the speaker information (style variations) but no content information is harmed; see Qian, sect. 4: as shown in Fig. 3, AUTOVC consists of three major modules: a speaker encoder, a content encoder, and a decoder, and AUTOVC works on the speech mel-spectrogram of size N-by-T, where N is the number of mel-frequency bins and T is the number of time steps (frames) (a discrete per-timestep latent representation)). Garbacea and Qian are considered to be analogous to the claimed invention because they relate to speech coding using neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Garbacea on reconstruction of audio using a discrete latent representation with Qian's content and speaking style model, which includes a content encoder, a style encoder, and a decoder, to provide a simpler and better voice conversion and general style transfer system (see Qian, sect. 1, pg. 2).
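
The AutoVC idea cited for claim 5 is a dimension bottleneck: shrink the content embedding until speaker/style information no longer fits. A toy sketch follows, with an invented random projection standing in for a trained content encoder; only the N-by-T mel layout is taken from Qian.

import numpy as np

rng = np.random.default_rng(1)
N, T = 80, 128                        # mel bins x time frames (N-by-T, per Qian)
mel = rng.normal(size=(N, T))         # input speech mel-spectrogram

bottleneck_dim = 4                    # assumed: narrow enough to squeeze out style
W = rng.normal(size=(bottleneck_dim, N)) / np.sqrt(N)
content = W @ mel                     # per-timestep content codes (4 x T)
assert content.shape == (bottleneck_dim, T)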
Regarding claim 6, Garbacea teaches the method of claim 1, and further teaches wherein the operations further comprise processing, using a style encoder, the same input speech to generate a latent representation of speaking style for the same input speech (see Garbacea, [0040]-[0041], discussing using the content codebook to determine the latent embedding vectors and using the speaker codebook (speaking style) only when one is provided). Garbacea teaches a latent representation of speaking style; to further teach the use of a style encoder, Qian teaches wherein the operations further comprise processing, using a style encoder, the same input speech to generate a latent representation of speaking style for the same input speech (see Qian, Fig. 1, Es(·): Es is the style/speaker encoder, which receives speech X2 and outputs the speaker embedding/style of the utterance). The same motivation to combine as claim 5 applies here.

Regarding claim 7, Garbacea in view of Qian teaches the method of claim 6. Qian further teaches wherein processing, using the decoder of the autoencoder model, the latent representation of linguistic content for the input speech to predict the output speech comprises processing, using the decoder, the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same input speech to predict the output speech (see Qian, Fig. 1, sect. 3.2: the framework consists of three modules, a content encoder Ec(·) that produces a content embedding from speech, a speaker encoder Es(·) that produces a speaker embedding from speech, and a decoder D(·,·) that produces speech from content and speaker embeddings). The same motivation to combine as claim 5 applies here.

Claim 15 is directed to a system corresponding to the method presented in claim 5 and is rejected under the same grounds stated above regarding claim 5. Claim 16 is directed to a system corresponding to the method presented in claim 6 and is rejected under the same grounds stated above regarding claim 6. Claim 17 is directed to a system corresponding to the method presented in claim 7 and is rejected under the same grounds stated above regarding claim 7.
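
Claims 6-7 combine a per-timestep content stream with a single utterance-level style vector at the decoder. A minimal sketch of that combination, assuming global average pooling over time and plain concatenation as the conditioning scheme (Qian's actual architecture differs; all dimensions are invented):

import numpy as np

rng = np.random.default_rng(2)
T, Dc, Ds = 128, 4, 256
content = rng.normal(size=(T, Dc))    # per-timestep content codes
frames = rng.normal(size=(T, Ds))     # style-encoder frame-level features
style = frames.mean(axis=0)           # global average pool across the time axis

# Broadcast the single style vector over time and hand both streams to
# the decoder (concatenation stands in for real conditioning).
decoder_input = np.concatenate([content, np.tile(style, (T, 1))], axis=1)
assert decoder_input.shape == (T, Dc + Ds)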
Claims 8-10 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Garbacea et al., US PgPub 2020/0234725, in view of Qian, Kaizhi, et al., "AutoVC: Zero-shot voice style transfer with only autoencoder loss," International Conference on Machine Learning, PMLR, 2019 (cited in IDS), further in view of Poole et al., US Patent 10,671,889 (cited in IDS).

Regarding claim 8, Garbacea in view of Qian teaches the method of claim 6, and Qian further teaches one or more convolutional layers configured to receive the input speech as input (see Qian, sect. 4.2: as shown in Fig. 3(b), the speaker encoder consists of a stack of two LSTM layers with cell size 768); and a variational layer with Gaussian posterior configured to summarize an output from the one or more convolutional layers with a global average pooling operation across the time-axis to extract a global latent style variable that corresponds to the latent representation of speaking style (see Qian, sect. 4.2: only the output of the last time step is selected and projected down to dimension 256 with a fully connected layer, the resulting speaker embedding being a 256-by-1 vector; the speaker encoder is pre-trained on the GE2E loss (Wan et al., 2018) (the SoftMax loss version), which maximizes the embedding similarity among different utterances of the same speaker and minimizes the similarity among different speakers; interpreted as the variational layer with Gaussian posterior summarizing with a global average pooling operation). Qian teaches generating the speaker encoding; to further teach a variational layer with Gaussian posterior, Poole is relied upon to teach a variational layer with Gaussian posterior configured to summarize an output from the one or more convolutional layers with a global average pooling operation across the time-axis to extract a global latent style variable that corresponds to the latent representation of speaking style (see Poole, col. 1 lines 19-35; see Poole, col. 8 lines 13-27, teaching that the system is configured to sample values for a set of latent variables from the posterior distribution, which may define values for a latent variable data structure such as a latent variable vector z; the vector z is interpreted as a latent representation of speaking style). Garbacea in view of Qian teaches a content and speaking style model that includes a content encoder, a style encoder, and a decoder. It would have been obvious to one of ordinary skill in the art to use the known technique taught by Poole of training a variational autoencoder neural network system to keep the posterior distribution close to a prior p(z), e.g., a standard Gaussian, to generate an output data item (see Poole, col. 1 lines 19-35), in order to provide the variational layer with Gaussian posterior in the speaking style encoder of Garbacea in view of Qian and to extract a global latent style variable that corresponds to the latent representation of speaking style.

Regarding claim 9, Garbacea in view of Qian further in view of Poole teaches the method of claim 8, and Poole further teaches that, during training, the global style latent variable is sampled from a mean and variance of style latent variables predicted by the style encoder (see Poole, col. 8 lines 13-27: the data items are provided to an encoder neural network 104 which outputs a set of parameters 106 defining a posterior distribution of a set of latent variables, e.g., defining the mean and variance of a multivariate Gaussian distribution; the encoder is interpreted as the style encoder); and, during inference, the global style latent variable is sampled from the mean of the global latent style variables predicted by the style encoder (see Poole, col. 8 lines 13-27 and col. 8 lines 38-48: the system is configured to sample values for a set of latent variables 108 from the posterior distribution; the system is a VAE neural network and the set of latent variables is interpreted as the global style latent variable).

Regarding claim 10, Garbacea in view of Qian further in view of Poole teaches the method of claim 8, and Poole further teaches wherein the style encoder is trained using a style regularization loss based on a mean and variance of style latent variables predicted by the style encoder, the style encoder using the style regularization loss to minimize a Kullback-Leibler (KL) divergence between a Gaussian posterior and a unit Gaussian prior (see Poole, col. 8 lines 38-48 and col. 1 lines 60-66: the variational autoencoder neural network system may be configured for training with an objective function; this may have a first term, such as a cross-entropy term, dependent upon a difference between the input data item and the output data item, and a second term, such as a KL divergence term, dependent upon a difference between the posterior distribution and a second, prior distribution of the set of latent variables).
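
The training/inference asymmetry in claim 9 and the KL regularizer in claim 10 are standard VAE mechanics: sample from the predicted mean and variance during training, use the mean at inference, and penalize the closed-form KL divergence to a unit Gaussian prior. A hedged sketch with illustrative dimensions (nothing here is taken from Poole beyond those mechanics):

import numpy as np

rng = np.random.default_rng(3)
D = 256
mu = 0.1 * rng.normal(size=(D,))       # posterior mean from the style encoder
log_var = 0.1 * rng.normal(size=(D,))  # posterior log-variance

# Training: sample z ~ N(mu, sigma^2) via the reparameterization trick.
eps = rng.normal(size=(D,))
z_train = mu + np.exp(0.5 * log_var) * eps

# Inference: take the mean deterministically.
z_infer = mu

# Style regularization: closed-form KL( N(mu, sigma^2) || N(0, I) ).
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)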
Claim 18 is directed to a system corresponding to the method presented in claim 8 and is rejected under the same grounds stated above regarding claim 8. Claim 19 is directed to a system corresponding to the method presented in claim 9 and is rejected under the same grounds stated above regarding claim 9. Claim 20 is directed to a system corresponding to the method presented in claim 10 and is rejected under the same grounds stated above regarding claim 10.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Trueba et al., US Patent 11,735,156, teaches a speech-processing system that determines vocal characteristics of a second voice and determines output corresponding to first audio data and the vocal characteristics (see Trueba, Fig. 1). Zhang et al., US PgPub 2020/0365166, teaches zero-shot voice conversion with non-parallel data, including receiving source speaker speech data as input data into a content encoder of a style transfer autoencoder system (see Zhang, Fig. 3).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to NANDINI SUBRAMANI, whose telephone number is (571) 272-3916. The examiner can normally be reached Monday - Friday, 12:00 pm - 5:00 pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh M Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/NANDINI SUBRAMANI/
Examiner, Art Unit 2656

/BHAVESH M MEHTA/
Supervisory Patent Examiner, Art Unit 2656

Prosecution Timeline

May 29, 2024: Application Filed
Mar 10, 2026: Non-Final Rejection under §102 and §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12562177: CONFERENCE ROOM SYSTEM AND AUDIO PROCESSING METHOD
Granted Feb 24, 2026 (2y 5m to grant)

Patent 12561629: IDENTIFYING REGULATORY DATA CORRESPONDING TO EXECUTABLE RULES
Granted Feb 24, 2026 (2y 5m to grant)

Patent 12505302: SYSTEMS AND METHODS RELATING TO MINING TOPICS IN CONVERSATIONS
Granted Dec 23, 2025 (2y 5m to grant)

Patent 12468884: Machine Learning-Based Argument Mining and Classification
Granted Nov 11, 2025 (2y 5m to grant)

Patent 12450434: NEURAL NETWORK BASED DETERMINATION OF EVIDENCE RELEVANT FOR ANSWERING NATURAL LANGUAGE QUESTIONS
Granted Oct 21, 2025 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 63%
With Interview: 99% (+49.4%)
Median Time to Grant: 3y 2m
PTA Risk: Low

Based on 87 resolved cases by this examiner. Grant probability derived from career allow rate.
