Last updated: May 29, 2026

Application No. 18/188,524

END-TO-END SPEECH CONVERSION

Non-Final OA §103

Filed

Mar 23, 2023

Priority

Feb 21, 2019 — provisional 62/808,627 +2 more

Examiner

SONIFRANK, RICHA MISHRA

Art Unit

2654

Tech Center

2600 — Communications

Assignee

Google LLC

OA Round

5 (Non-Final)

Interview Optional

This examiner has a 66% grant rate with a +25.0% interview lift. An interview has already been conducted in this application's prosecution history, so a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.

Based on 383 resolved cases, 2023–2026

Examiner Intelligence

SONIFRANK, RICHA MISHRA

Grants 66% — above average

Career Allowance Rate

254 granted / 383 resolved

+4.3% vs TC avg

Strong +25% interview lift

Interview Lift: +25.0% (resolved cases with interview versus without)

Typical timeline

3y 0m

Avg Prosecution

15 currently pending

Career history

411

Total Applications

across all art units

Statute-Specific Performance

§101: 3.4% (-36.6% vs TC avg)
§103: 90.0% (+50.0% vs TC avg)
§102: 2.7% (-37.3% vs TC avg)
§112: 3.4% (-36.6% vs TC avg)

Based on career data from 383 resolved cases

Office Action

§103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 4/13/2026 has been entered.
Priority
This U.S. patent application is a continuation of, and claims priority under 35 U.S.C. §120 from, U.S. Patent Application 17/310,732, filed on August 19, 2021, which is a national phase application of, and claims priority under 35 U.S.C. §371 from, international Application PCT/US2019/063334, filed on November 26, 2019, which claims priority under 35 U.S.C. §119(e) to U.S. 62/808,627, filed on February 21, 2019.
Response to Amendment
Claims 1 and 11 are amended. Claims 2, 8, 12 and 18 are cancelled. Claims 1, 3-7, 9-11, 13-17 and 19-20 are presented for examination. 
Response to Arguments
Applicant's arguments filed 4/13/2026 have been considered. Following is the response:
Claim Rejections Under 35 U.S.C. §103

Applicant argues “Haque generally discloses a method for transforming audio from one style to another, with a focus on speech and music. See Haque at Abstract. Specifically, Haque discloses a fully-differentiable sequence-to-sequence model that processes audio inputs to generate transformed audio outputs. See Id. Haque generally notes that its audio transformations are different from speech recognition and speech synthesis. See Id. at 1. Introduction. While Haque's method is designed to transform audio styles, Applicant respectfully submits that it does not explicitly state that the transformation occurs without performing any speech recognition on a sequence of source audio frames. At best, Haque generally mentions that the model learns to predict output spectrograms conditioned on the speaker or instrument, but Haque does not provide details on whether speech recognition is explicitly avoided during this process, let alone whether any additional processing occurs between its encoder and decoder.”

However, Haque mentions the mapping of the embedding space to the latent feature; this process avoids mapping to speech recognition. Applicant is advised to review MPEP § 2112, which states:

II. INHERENT FEATURE NEED NOT BE RECOGNIZED AT THE RELEVANT TIME -- There is no requirement that a person of ordinary skill in the art would have recognized the inherent disclosure at the relevant time, but only that the subject matter is in fact inherent in the prior art reference. Schering Corp. v. Geneva Pharm. Inc., 339 F.3d 1373, 1377, 67 USPQ2d 1664, 1668 (Fed. Cir. 2003) (rejecting the contention that inherent anticipation requires recognition by a person of ordinary skill in the art before the critical date and allowing expert testimony with respect to post-critical date clinical trials to show inherency).

Additionally, applicant should also review:
MPEP 2112.01 Composition, Product, and Apparatus Claims [R-10.2019]
I. PRODUCT AND APPARATUS CLAIMS — WHEN THE STRUCTURE RECITED IN THE REFERENCE IS SUBSTANTIALLY IDENTICAL TO THAT OF THE CLAIMS, CLAIMED PROPERTIES OR FUNCTIONS ARE PRESUMED TO BE INHERENT-- Where the claimed and prior art products are identical or substantially identical in structure or composition, or are produced by identical or substantially identical processes, a prima facie case of either anticipation or obviousness has been established. In re Best, 562 F.2d 1252, 1255, 195 USPQ 430, 433 (CCPA 1977).
2112.02 Process Claims [R-01.2024]
I. PROCESS CLAIMS — PRIOR ART DEVICE ANTICIPATES A CLAIMED PROCESS IF THE DEVICE CARRIES OUT THE PROCESS DURING NORMAL OPERATION
Under the principles of inherency, if a prior art device, in its normal and usual operation, would necessarily perform the method claimed, then the method claimed will be considered to be anticipated by the prior art device. When the prior art device is the same as a device described in the specification for carrying out the claimed method, it can be assumed the device will inherently perform the claimed process. In re King, 801 F.2d 1324, 231 USPQ 136 (Fed. Cir. 1986).


Applicant further argues “Furthermore, Haque references various techniques that its model is based on, such as Listen-Attend-Spell and TacoTron, which are known to involve speech recognition components in their architectures. See Id. These references suggest that Haque's method may implicitly involve some form of speech recognition, even if not explicitly stated.” However, Listen-Attend-Spell models are used for comparison with the seq2seq model, which does not perform symbolic representation or speech recognition while transferring. Furthermore, Tacotron is mentioned because it can replace the seq2seq decoder if the user chooses to perform symbolic representation by converting raw text into structured numerical or phonemic representations. The cited seq2seq model, however, clearly uses embeddings in the latent space and, without that Tacotron decoder, does not require speech recognition or symbolic representation.

Applicant further contends “Absent a clear statement that no speech recognition is performed, Haque's disclosure cannot reasonably be construed as unambiguously disclosing processing, using an encoder of a voice conversion model trained to convert any input spectrogram directly to another spectrogram without any intermediate symbolic representation, a sequence of source audio frames to generate a sequence of source internal representations characterizing an utterance spoken in a first accent, and processing, using a decoder of the voice conversion model, the sequence of source internal representations to generate a sequence of target audio frames characterizing a synthesized speech representation of the utterance in a second accent different than the first accent without performing any speech recognition on the sequence of source audio frames.” However, see the explanation of MPEP § 2112 provided previously.

Applicant argues “Dirac generally discloses a system for translating accents and outputting translated speech. See Dirac at Paragraph [0036]. Dirac fails, however, to ever disclose or suggest processing, using an encoder of a voice conversion model trained to convert any input spectrogram directly to another spectrogram without any intermediate symbolic representation, a sequence of source audio frames to generate a sequence of source internal representations characterizing an utterance spoken in a first accent, and processing, using a decoder of the voice conversion model, the sequence of source internal representations to generate a sequence of target audio frames characterizing a synthesized speech representation of the utterance in a second accent different than the first accent without performing any speech recognition on the sequence of source audio frames.” However, Dirac was relied upon for receiving and encoding in the first accent and outputting in the second accent as a style transfer process.

Although Haque teaches a system operating 'without speech recognition,' the Examiner relies on the additional reference merely to illustrate that Haque’s seq2seq model inherently functions without symbolic representation or ASR processing. 


Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to final Office action, see 37 CFR 1.113(c). A request for reconsideration while not provided for in 37 CFR 1.113(c) may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1, 3-7, 9-11, 13-17 and 19-20 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-14 of US patent application 12300216. Although the claims at issue are not identical, they are not patentably distinct from each other.
Regarding claims 1, 3-7, 9-11, 13-17 and 19-20, claims 1-14 of US patent application 12300216 recite all the limitations set forth in application claims 1, 3-7, 9-11, 13-17 and 19-20. Although the claims are not exactly the same, the claims of the patent application read on the current application and have additional limitations.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 3-7, 9-11, 13-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Haque (Conditional End-to-End Audio Transforms) and further in view of Dirac (US Pub: 20180174595).

Regarding claim 1, Haque teaches a computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
receiving a sequence of source audio frames characterizing an utterance spoken in a first accent (input spectrogram, Fig 1, Under 2.1 Encoder);
processing, using an encoder of a voice conversion model trained to convert any input spectrogram directly to another spectrogram without any intermediate symbolic representation (clearly there is no symbolic representation; the model learns directly from the latent space: "small latent space to condition text in a sequence-to-sequence decoder". This implies the models bypass symbolic representations such as phonetics by mapping speech directly into embeddings in a latent space), the sequence of source audio frames to generate a sequence of source internal representations characterizing the utterance spoken in the first accent (the encoder generates the representation using the sequence-to-sequence model, Under 2.1 Encoder and Under 3. Experiments; as can be seen, there is no symbolic representation, including textual representation, and latent features are used, Fig 1; in the spectrogram transfer there is no speech recognition, which is apparent from Figs 1-3);
modifying the sequence of source internal representations by adjusting a time duration of the source internal representations to match a target cadence associated with a second accent different than the first accent (the desired output transformation similar to [13]. The context vector $c_i$ is produced by an attention mechanism [11]. Specifically, we define:

[Equation image media_image1.png, not reproduced in this text]

where attention $\alpha_{i,j}$ is defined as the alignment between the current decoder time-slice $i$ and a time-slice $j$ from the encoder input. The score between the output of the encoder (i.e., hidden states), $h_j$, and the previous state of the decoder cell, $s_{i-1}$, is computed with: $e_{i,u} = \langle \varphi(s_i), \phi(h_u) \rangle$ where $\varphi$ and $\phi$ are multi-layer perceptrons: $e_{i,j} = w^T \tanh(W s_{i-1} + V h_j + b)$ with learnable parameters $w$, $W$ and $V$, Under 2.2 Decoder; further, the transformation can be at a word or pitch level ("For all experiments, we focus on transformations on a word- or pitch-level"), where cadence in speech is defined by the rise and fall of pitch, Under 3.1);
processing, using a decoder of the voice conversion model, the modified sequence of source internal representations to generate a sequence of target audio frames characterizing a synthesized speech representation of the utterance in the second accent without performing any speech recognition on the sequence of source audio frames (style transfer using the decoder, Fig 1 and Fig 2); and
providing, for output by a computing device, the synthesized speech representation of the utterance in the second accent (transformed waveform, Fig 2, Under 2 Decoder);
wherein the voice conversion model is trained by: receiving a sequence of input audio frames characterizing a speech utterance in the first accent (data is conditioned on input-output types, Under Introduction; Speech examples from AudioSet were used to pre-train the model as an autoencoder. We condition our model on the source and target style. For the case of speech, style refers to the speaker, Under Experiments; wherein it can be applied to accent transfer, Under Introduction);
obtaining encoder latent representations of the sequence of input audio frames characterizing the speech utterance in the first accent (learning latent space for learning; we note that sequence-to-sequence models are capable of encoding sentence level information in a small latent encoding vector, Under Experiments); obtaining a phoneme transcript characterizing the speech utterance in the second accent (Spectrograms remain the dominant acoustic representation for both phoneme and word level tasks since the high sampling rate and dimensionality of waveforms is difficult to model, Under Introduction); and
training the voice conversion model using the sequence of input audio frames, and the phoneme transcript (Spectrograms remain the dominant acoustic representation for both phoneme and word-level tasks since the high sampling rate and dimensionality of waveforms is difficult to model; spectrogram using input and output, Under Introduction; Our model is able to capture fundamental phonetic properties of each speaker or instrument and apply these properties to novel words and pitches, Under 3.1).
Although Haque mentions accent transfer and the model can be applied in any of these scenarios, it does not explicitly mention receiving and encoding in the first accent; however, in the same field of endeavor, Ryan teaches receiving and encoding in the first accent and outputting in the second accent.
However, in the same field of endeavor, Dirac teaches receiving and encoding in the first accent and outputting in the second accent (receiving an utterance in the first accent and outputting in a second accent, Fig 6; wherein the model is an LSTM model for transformation, hence using internal representation, Para 0034, 0049) and matching the cadence of the second accent from the first accent (accent translation model 321 may include instructions for adjusting first accent pitch 201A to more closely resemble the second accent pitch 201B, adjusting the first accent tone 202A to more closely resemble the second accent tone 202B, adjusting the first accent stress 203A to more closely resemble the second accent stress 203B, adjusting the first accent melody 204A to more closely resemble the second accent melody 204B, and so on, Para 0029).
It would have been obvious, before the effective filing date, having the teachings of the style transfer model of Haque, to use it specifically for accent transfer as described in Dirac so that speakers can understand the conversation in their own accents (Para 0035, Dirac).
Even though Haque's process inherently shows a transfer that clearly does not require speech processing, Haque does not explicitly teach style transfer without performing any speech recognition.
Hirokazu teaches the concept of style transfer without performing any speech recognition (a voice conversion method based on fully convolutional sequence-to-sequence (seq2seq) learning. The present method, which we call “ConvS2S-VC”; it allows the direct conversion of a source speech feature sequence without relying on ASR modules and requires no text annotations for model training, thanks to our newly introduced idea of context preservation loss, Under abstract and Related work, Fig 1).
It would have been obvious for the Haque system not to perform speech recognition, since that is how seq2seq models generally operate, so that the context is preserved (Related work, Hirokazu).
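
For readers following the attention equations quoted from Haque's Section 2.2 in the claim 1 mapping, the sketch below shows one way the additive score e_{i,j} = w^T tanh(W s_{i-1} + V h_j + b) could be turned into alignment weights and a context vector. It is an illustration only, not Haque's implementation or the claimed method. The softmax normalization and the weighted sum are assumptions based on the standard form of attention, since the equation image is not reproduced in this text, and all function names and toy values are hypothetical.

```python
import math

def additive_score(w, W, V, b, s_prev, h_j):
    """Score e_ij = w^T tanh(W s_prev + V h_j + b) for one encoder state."""
    hidden = []
    for r in range(len(b)):
        z = sum(W[r][k] * s_prev[k] for k in range(len(s_prev)))
        z += sum(V[r][k] * h_j[k] for k in range(len(h_j)))
        hidden.append(math.tanh(z + b[r]))
    return sum(w[r] * hidden[r] for r in range(len(hidden)))

def softmax(scores):
    """Normalize scores into weights that sum to 1 (assumed normalization)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [x / total for x in exps]

def context_vector(w, W, V, b, s_prev, encoder_states):
    """Return (alphas, c_i), where c_i is the alpha-weighted sum of encoder states."""
    scores = [additive_score(w, W, V, b, s_prev, h) for h in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, encoder_states))
               for d in range(dim)]
    return alphas, context

# Toy example: three encoder time-slices of dimension 2, decoder state of dimension 1.
encoder_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
alphas, c_i = context_vector(
    w=[0.5, -0.5], W=[[0.1], [0.2]], V=[[0.3, 0.0], [0.0, 0.3]],
    b=[0.0, 0.0], s_prev=[1.0], encoder_states=encoder_states)
print(alphas, c_i)
```

Because each decoder time-slice draws on every encoder time-slice through these weights, the output sequence does not have to be the same length as the input sequence. Whether that amounts to adjusting a time duration to match a target cadence is the legal question at issue, and the sketch takes no position on it.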

Regarding claim 3, Haque as above in claim 1, teaches wherein the sequence of source audio frames comprises a sequence of input spectrograms (spectrogram, Fig. 1).

Regarding claim 4, Haque as above in claim 1, teaches wherein the sequence of target audio frames comprises a sequence of output spectrograms (Figs 1-2).

Regarding claim 5, Haque modified by Dirac as above in claim 1, teaches wherein a cadence of the utterance spoken in the first accent is different than a cadence of the synthesized speech representation of the utterance in the second accent (accent change, wherein it is known in the art that accent changes can influence the cadence; further, Dirac mentions changes in pitch, stress, etc., Para 0028-0029).

Regarding claim 6, Haque as above in claim 1, teaches wherein the encoder comprises a bidirectional long short-term memory (LSTM) layer (LSTM, Under 2 Encoder).

Regarding claim 7, Haque as above in claim 1, teaches wherein the decoder comprises a spectrogram decoder with attention (attention, Under 2 Decoder).

Regarding claim 9, Haque as above in claim 1, teaches wherein the operations further comprise bypassing obtaining a transcription of the utterance (Figs 1-3; as can be seen, there is a spectrogram transfer for the utterance and no transcription was obtained).

Regarding claim 10, Haque as above in claim 1, teaches wherein the speech conversion model is configured to adjust a time period between each term in the utterance spoken in the first accent (the desired output transformation similar to [13]. The context vector $c_i$ is produced by an attention mechanism [11]. Specifically, we define:

[Equation image media_image1.png, not reproduced in this text]

where attention $\alpha_{i,j}$ is defined as the alignment between the current decoder time-slice $i$ and a time-slice $j$ from the encoder input. The score between the output of the encoder (i.e., hidden states), $h_j$, and the previous state of the decoder cell, $s_{i-1}$, is computed with: $e_{i,u} = \langle \varphi(s_i), \phi(h_u) \rangle$ where $\varphi$ and $\phi$ are multi-layer perceptrons: $e_{i,j} = w^T \tanh(W s_{i-1} + V h_j + b)$ with learnable parameters $w$, $W$ and $V$, Under 2.2 Decoder; further, Ryan teaches adjusting time when conditioned for the prosody, where prosody is related to the accents).

Regarding claim 11, arguments analogous to claim 1, are applicable. 
Regarding claim 12, arguments analogous to claim 2, are applicable. 
Regarding claim 13, arguments analogous to claim 3, are applicable. 
Regarding claim 14, arguments analogous to claim 4, are applicable. 
Regarding claim 15, arguments analogous to claim 5, are applicable. 
Regarding claim 16, arguments analogous to claim 6, are applicable. 
Regarding claim 17, arguments analogous to claim 7, are applicable. 
Regarding claim 19, arguments analogous to claim 9, are applicable. 
Regarding claim 20, arguments analogous to claim 10, are applicable.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Policy Driven Generative Adversarial Networks For Accented Speech Generation mentions transfer learning: "We passed batches of 64 samples each from the VCTK dataset into the auto-encoder and then changed the accent being conditioned for. We computed the true L1 loss between the sample generated by accent transfer via both AccentGAN and one ablation of this model, ConditionedGAN."
Rhythm-Flexible Voice Conversion Without Parallel Data Using Cycle-Gan Over Phoneme Posteriorgram Sequences proposes an approach utilizing a sequence-to-sequence model trained with unsupervised Cycle-GAN to perform the transformation between the phoneme posteriorgram sequences for different speakers.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Richa Sonifrank whose telephone number is (571)272-5357. The examiner can normally be reached M-T 7AM - 5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Phan Hai can be reached at (571)272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Richa Sonifrank/Primary Examiner, Art Unit 2654

Prosecution Timeline

Jul 31, 2025

Response after Non-Final Action

Aug 22, 2025

Non-Final Rejection mailed — §103

Nov 21, 2025

Response Filed

Jan 14, 2026

Final Rejection mailed — §103

Feb 25, 2026

Response after Non-Final Action

Apr 13, 2026

Request for Continued Examination

Apr 15, 2026

Response after Non-Final Action

May 20, 2026

Non-Final Rejection mailed — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/194,644

Patent 12602552

Machine-Learning-Based OKR Generation

3y 0m to grant Granted Apr 14, 2026

18/471,491

Patent 12603085

ENTITY LEVEL DATA AUGMENTATION IN CHATBOTS FOR ROBUST NAMED ENTITY RECOGNITION

2y 6m to grant Granted Apr 14, 2026

18/088,593

Patent 12585883

COMPUTER IMPLEMENTED METHOD FOR THE AUTOMATED ANALYSIS OR USE OF DATA

3y 3m to grant Granted Mar 24, 2026

18/317,953

Patent 12585877

GROUPING AND LINKING FACTS FROM TEXT TO REMOVE AMBIGUITY USING KNOWLEDGE GRAPHS

2y 10m to grant Granted Mar 24, 2026

17/876,848

Patent 12579988

METHOD AND APPARATUS FOR CONTROLLING AUDIO FRAME LOSS CONCEALMENT

3y 7m to grant Granted Mar 17, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

5-6

Expected OA Rounds

66%

Grant Probability

91%

With Interview (+25.0%)

3y 0m (~0m remaining)

Median Time to Grant

High

PTA Risk

Based on 383 resolved cases by this examiner. Grant probability derived from career allowance rate.
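
The projection figures above can be checked with simple arithmetic. The sketch below is a minimal illustration using the counts stated on this page (254 granted, 383 resolved). It assumes the interview lift is added as percentage points, which is consistent with the 66% and 91% figures shown but is not stated as the page's actual method. The function names are hypothetical.

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a fraction of resolved cases."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return granted / resolved

def with_interview(base_pct: float, lift_points: float) -> float:
    """Add the interview lift in percentage points, capped at 100 (assumed additive)."""
    return min(100.0, base_pct + lift_points)

# Counts taken from the Examiner Intelligence section of this page.
base_pct = round(100 * allowance_rate(254, 383))  # 254 / 383 is about 0.663
print(base_pct)                        # 66
print(with_interview(base_pct, 25.0))  # 91.0
```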
