Prosecution Insights
Last updated: April 19, 2026
Application No. 18/710,301

Adaptation and training of neural speech synthesis

Non-Final Office Action (§102, §103)
Filed: May 15, 2024
Examiner: YAMAMOTO, JOSEPH JEREMY
Art Unit: 2656
Tech Center: 2600 (Communications)
Assignee: Cerence Operating Company
OA Round: 1 (Non-Final)
Grant Probability: 72% (Favorable)
Expected OA Rounds: 1-2
Estimated Time to Grant: 3y 0m
Grant Probability with Interview: 93%

Examiner Intelligence

Career Allow Rate: 72% (31 granted / 43 resolved; +10.1% vs Tech Center average, above average)
Interview Lift: +21.2% across resolved cases with an interview
Typical Timeline: 3y 0m average prosecution; 17 applications currently pending
Career History: 60 total applications across all art units

Statute-Specific Performance

§101: 23.1% (-16.9% vs TC avg)
§102: 8.2% (-31.8% vs TC avg)
§103: 47.6% (+7.6% vs TC avg)
§112: 19.7% (-20.3% vs TC avg)

Baseline is the Tech Center average estimate; based on career data from 43 resolved cases.
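The headline figures above are simple ratios over the examiner's resolved cases. A quick sketch reproducing the arithmetic (inputs taken from the panel; treating the +10.1% delta as additive is an assumption of this sketch):

```python
# Illustrative check of the dashboard arithmetic (not an official USPTO statistic).
granted, resolved = 31, 43

career_allow_rate = granted / resolved   # career allowance rate
tc_average = career_allow_rate - 0.101   # implied Tech Center average, if additive

print(f"Career allow rate: {career_allow_rate:.1%}")   # 72.1%
print(f"Implied TC 2600 average: {tc_average:.1%}")    # 62.0%
```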

Office Action

Rejections under 35 U.S.C. §102 and §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Claims 1-20 are pending. Claims 1, 20, and 22 are independent. Claims 2-19 and 23-24 depend from Claim 1; Claim 21 depends from Claim 20. This Application was published as U.S. 2025/0006175.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 29 Apr 2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner. The IDS submitted on 15 May 2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, it is being considered by the examiner, except for the one reference that was not submitted.

Drawings

The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: Fig. 3 refers to reference item "350," which is not mentioned in the specification. Corrected drawing sheets in compliance with 37 CFR 1.121(d), or an amendment to the specification adding the reference character(s) to the description in compliance with 37 CFR 1.121(b), are required in reply to this Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediately prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either "Replacement Sheet" or "New Sheet" pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
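The drawings objection above is mechanical to verify before filing corrected sheets. A hypothetical sketch of such a check (the helper name, three-digit pattern, and sample specification text are all invented for illustration):

```python
import re

# Find reference characters that appear in the figures but are never
# mentioned in the specification text (the situation flagged for Fig. 3).
def unmentioned_reference_characters(figure_labels, specification_text):
    mentioned = set(re.findall(r"\b\d{3}\b", specification_text))
    return sorted(set(figure_labels) - mentioned)

spec = "The encoder 310 feeds the vocoder 320, which outputs audio 330."
print(unmentioned_reference_characters(["310", "320", "330", "350"], spec))
# ['350']
```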
Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-2, 11, 16, and 20-21 are rejected under 35 U.S.C. 102(a)(1) and 102(a)(2) as being anticipated by Mandel et al. (US 2022/0358904, hereinafter "Mandel").

With regards to claim 1, Mandel teaches:

A method for speech generation comprising: obtaining a speech sample for a target speaker; [Mandel Fig. 1 teaches input noisy audio, which is a speech sample that "includes a distorted target audio signal" (Par [0006]) used for multiple speakers or a "single-speaker model" (Table 2, Par [0031])]

processing, using a trained encoder, the speech sample for the target speaker to produce a parametric representation of the speech sample for the target speaker; [Mandel Fig. 1 teaches a prediction model where the "encoder outputs three acoustic parameters" (Par [0024]) and the "prediction model is trained with noisy audio features as input and clean acoustic features as output labels" (Par [0022])]

receiving configuration data for a speech synthesis system that accepts as an input the parametric representation; [Mandel Fig. 1 teaches the "prediction model uses the noisy mel-spectrogram, Y(ω, t) as input and the clean mel-spectrogram, X(ω, t) from parallel clean speech as the target acoustic parameters that will be fed into the neural vocoder" (Par [0042])]

adapting the configuration data for the speech synthesis system according to an input comprising the parametric representation for the target speaker, and a time-domain representation for the speech sample for the target speaker, [Mandel teaches using Wave-U-Net as a neural vocoder that "works in the time domain" ... [by] "down-sample the audio signal progressively in multiple layers and then up-sample them to generate speech" (Par [0041]), which shows "this system can produce vocoder-synthesized high-quality and noise-free speech utilizing the prosody (timing, pitch contours, and pronunciation) observed in the real noisy speech" (Par [0018])]

to generate adapted configuration data for the speech synthesis system representing the target speaker; and [Mandel Fig. 1, where the prediction model uses inputs to generate data for the vocoder for speech synthesis of the target speaker]

causing configuration of the speech synthesis system according to the adapted configuration data, wherein the speech synthesis system comprising the adapted configuration data is implemented to generate synthesized speech output data with estimated voice and time-domain speech characteristics approximating actual voice and time-domain speech characteristics for the target speaker. [Mandel Fig. 1 teaches output of synthesized audio]

With regards to claim 2, Mandel teaches:

All the limitations of claim 1, wherein the configuration data comprises weights for neural-network-based implementation of the speech synthesis system.
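For orientation only, the "weights for neural-network-based implementation" recited in claim 2 can be pictured as an array that adaptation nudges toward a target speaker. A minimal sketch with invented shapes and a squared-error objective, not the claimed method or Mandel's system:

```python
import numpy as np

# Hedged sketch: adapt a tiny linear synthesis layer (the "configuration
# data") toward a target speaker by gradient steps. Everything here is an
# illustrative assumption.
rng = np.random.default_rng(0)

W = rng.normal(size=(4, 8))             # received configuration data (weights)
speaker_embedding = rng.normal(size=8)  # parametric representation of the sample
target_features = rng.normal(size=4)    # acoustic features of the target speaker

# Step size chosen so each update halves the residual, guaranteeing convergence.
step = 0.5 / float(speaker_embedding @ speaker_embedding)
for _ in range(60):
    residual = W @ speaker_embedding - target_features
    W -= step * np.outer(residual, speaker_embedding)  # adapted configuration data

print(np.allclose(W @ speaker_embedding, target_features))  # True
```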
[Mandel teaches data to be sent to the neural vocoder (Par [0042]), where WaveNet uses weights (Par [0054])]

With regards to claim 11, Mandel teaches:

All the limitations of claim 1, wherein the received configuration data for the speech synthesis system is derived from training speech samples from multiple training speakers distinct from the target speaker. [Mandel teaches the "model was trained with speech from two speakers and its effectiveness on both speaker datasets was tested" (Par [0030])]

With regards to claim 16, Mandel teaches:

All the limitations of claim 1, wherein processing, using the trained encoder, the speech sample for the target speaker to produce the parametric representation comprises: transforming the speech sample for the target speaker into a spectral-domain vector representation. [Mandel Fig. 1 teaches a prediction model where the "encoder outputs three acoustic parameters" (Par [0024]), the acoustic parameters being "i) spectral envelope, ii) log fundamental frequency (F0) and iii) aperiodic energy of the spectral envelope" (Par [0024])]

With regards to claim 20, Mandel teaches:

A speech generation system comprising: a speech acquisition section to obtain a speech sample for a target speaker; [Mandel Fig. 1 teaches a prediction model that obtains the speech sample]

an encoder, applied to the speech sample for the target speaker, to produce a parametric representation of the speech sample for the target speaker; and [Mandel Fig. 1 teaches a prediction model where the "encoder outputs three acoustic parameters" (Par [0024]) and the "prediction model is trained with noisy audio features as input and clean acoustic features as output labels" (Par [0022])]

a speech synthesis and cloning system comprising: a receiver to receive configuration data for the speech synthesis system, wherein the speech synthesis system is configured to accept as an input the parametric representation; and [Mandel Fig. 1 teaches the "prediction model uses the noisy mel-spectrogram, Y(ω, t) as input and the clean mel-spectrogram, X(ω, t) from parallel clean speech as the target acoustic parameters that will be fed into the neural vocoder" (Par [0042])]

an adaptation module to adapt the configuration data for the speech synthesis system according to an input comprising the parametric representation for the target speaker, and a time-domain representation for the speech sample for the target speaker, [Mandel teaches using Wave-U-Net as a neural vocoder that "works in the time domain" ... [by] "down-sample the audio signal progressively in multiple layers and then up-sample them to generate speech" (Par [0041]), which shows "this system can produce vocoder-synthesized high-quality and noise-free speech utilizing the prosody (timing, pitch contours, and pronunciation) observed in the real noisy speech" (Par [0018])]

to generate adapted configuration data for the speech synthesis system representing the target speaker; [Mandel Fig. 1, where the prediction model uses inputs to generate data for the vocoder for speech synthesis of the target speaker]

wherein the adaptation module causes configuration of the speech synthesis system according to the adapted configuration data, and wherein the speech synthesis system comprising the adapted configuration data is implemented to generate synthesized speech output data with estimated voice and time-domain speech characteristics approximating actual voice and time-domain speech characteristics for the target speaker. [Mandel Fig. 1 teaches output of synthesized audio]

With regards to claim 21, Mandel teaches:

All the limitations of claim 20, wherein the speech acquisition section comprises one or more of: i) an audio collection unit to collect and record the speech sample, ii) a speech validation unit configured to perform audio validation analysis for the speech sample to determine whether the speech sample satisfies one or more audio quality criteria, and/or to apply filtering operations on the speech sample to enhance quality of the speech sample, or iii) an automatic audio transcription unit configured to generate an annotated speech sample from the collected speech sample. [Mandel Fig. 1 teaches a prediction model that collects and records the speech sample for processing synthesized audio]

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 3-4, 6-8, and 22-24 are rejected under 35 U.S.C. 103 as being unpatentable over Mandel et al. (US 2022/0358904) in view of Delfarah et al.
(US 2020/0135209, hereinafter "Delfarah").

With regards to claim 3, Mandel teaches: All the limitations of claim 2.

With regards to claim 3, Mandel fails to teach: wherein adapting the configuration data according to the time-domain representation comprises: matching the speech sample and corresponding linguistic annotation for the speech sample to generate an annotated speech sample identifying phonetic and silent portions, and respective time information, wherein the annotated speech sample represents the time-domain speech attributes data for the target speaker; and adapting the configuration data for the speech synthesis system according to, at least in part, the annotated speech sample representing the time-domain speech attributes data for the target speaker.

With regards to claim 3, Delfarah teaches:

wherein adapting the configuration data according to the time-domain representation comprises: matching the speech sample and [Delfarah Fig. 7B teaches matching speech sample input through matched phonemes (Par [0224]) to an output text string through the STT processing module (730)]

corresponding linguistic annotation for the speech sample to generate an annotated speech sample identifying phonetic and silent portions, and respective time information, wherein the annotated speech sample represents the time-domain speech attributes data for the target speaker; and [Delfarah Fig. 7B teaches outputting text strings, which are an annotated speech sample, and identifying phonemes (Par [0224]) and phonetic alphabets (731) (Par [0222]), where phonetics are "associated with one or more candidate pronunciations" (Par [0222]), where pronunciations are time-domain speech attributes data, and the system is trained on "audio frames or time periods associated with utterances of only one speaker, mixed utterances of multiple speakers, and silence (e.g., no utterance for some audio frames)" (Par [0290]), where training is used to identify silent portions of speech]

adapting the configuration data for the speech synthesis system according to, at least in part, the annotated speech sample representing the time-domain speech attributes data for the target speaker. [Delfarah Figs. 8 and 9C teach updating the weights, or adapting the configuration data, based in part on generating the output text (882), or annotated speech sample. It would have been obvious to one of ordinary skill in the art at the time of applicant's filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel with the generation of text from speech as taught by Delfarah. The motivation to combine the teachings of Mandel with Delfarah is that Delfarah teaches to "allow users to interact with devices or systems using natural language in spoken and/or text forms" (Par [0003]), which increases the capabilities of the invention of Mandel to better represent the speaker's voice]

With regards to claim 4, Mandel in view of Delfarah teaches:

All the limitations of claim 3, wherein the time-domain speech attributes data for the target speaker comprise one or more of: speech pronunciation by the target speaker, accent of the target speaker, speech style for the target speaker, or prosody characteristics for the target speaker. [Delfarah Fig. 8 teaches "Target speaker representation 824 represents speech characteristics (e.g., tone, pitch, accent, etc.) of the target speaker" (Par [0258]), which is obtained by a pre-trained long short-term memory (LSTM) based speaker verification system that can be used as an "application (e.g., one of applications 724) or a sub-system that is part of a digital assistant module (e.g., module 726)" (Fig. 7B, Par [0258])]

With regards to claim 6, Mandel teaches: All the limitations of claim 1.

With regards to claim 6, Mandel fails to teach: further comprising generating the synthesized speech output data, including processing a target linguistic input by applying the speech synthesis system configured with the adapted configuration data to the target linguistic input to synthesize speech with the voice and time-domain speech characteristics approximating the actual voice and time-domain speech characteristics for the target speaker uttering the target linguistic input.

With regards to claim 6, Delfarah teaches: further comprising generating the synthesized speech output data, including processing a target linguistic input by applying the speech synthesis system configured with the adapted configuration data to the target linguistic input to synthesize speech with the voice and time-domain speech characteristics approximating the actual voice and time-domain speech characteristics for the target speaker uttering the target linguistic input. [Delfarah Fig. 8 teaches utterances of the target speaker (801A), which are the target linguistic input for the processing module. It would have been obvious to one of ordinary skill in the art at the time of applicant's filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel with the generation of text from speech as taught by Delfarah. The motivation to combine the teachings of Mandel with Delfarah is that Delfarah teaches to "allow users to interact with devices or systems using natural language in spoken and/or text forms" (Par [0003]), which increases the capabilities of the invention of Mandel to better represent the speaker's voice]

With regards to claim 7, Mandel teaches: All the limitations of claim 1.

With regards to claim 7, Mandel fails to teach: wherein obtaining the speech sample for the target speaker comprises obtaining a speech corresponding to a linguistic representation of spoken content of the speech sample.

With regards to claim 7, Delfarah teaches: wherein obtaining the speech sample for the target speaker comprises obtaining a speech corresponding to a linguistic representation of spoken content of the speech sample. [Delfarah Fig. 7B teaches matching speech sample input through matched phonemes (Par [0224]) to an output text string through the STT processing module (730). It would have been obvious to one of ordinary skill in the art at the time of applicant's filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel with the generation of text from speech as taught by Delfarah. The motivation to combine the teachings of Mandel with Delfarah is that Delfarah teaches to "allow users to interact with devices or systems using natural language in spoken and/or text forms" (Par [0003]), which increases the capabilities of the invention of Mandel to better represent the speaker's voice]

With regards to claim 8, Mandel in view of Delfarah teaches:

All the limitations of claim 7, wherein obtaining the speech sample for the target speaker comprises conducting a scripted data collection session with the target speaker, including prompting the target speaker to utter the spoken content.
[Delfarah teaches the "pre-trained LSTM-based speaker verification system can generate the target speaker vector based on a trigger phrase uttered by the target speaker (e.g., "Hey Assistant"). The trigger phrase is a phrase to invoke a virtual assistant session" (Par [0260])]

With regards to claim 22, Mandel teaches:

obtain a speech sample for a target speaker; [Mandel Fig. 1 teaches input noisy audio, which is a speech sample that "includes a distorted target audio signal" (Par [0006]) used for multiple speakers or a "single-speaker model" (Table 2, Par [0031])]

process, using a trained encoder, the speech sample for the target speaker to produce a parametric representation of the speech sample for the target speaker; [Mandel Fig. 1 teaches a prediction model where the "encoder outputs three acoustic parameters" (Par [0024]) and the "prediction model is trained with noisy audio features as input and clean acoustic features as output labels" (Par [0022])]

receive configuration data for a speech synthesis system that accepts as an input the parametric representation; [Mandel Fig. 1 teaches the "prediction model uses the noisy mel-spectrogram, Y(ω, t) as input and the clean mel-spectrogram, X(ω, t) from parallel clean speech as the target acoustic parameters that will be fed into the neural vocoder" (Par [0042])]

adapt the configuration data for the speech synthesis system according to an input comprising the parametric representation for the target speaker, and a time-domain representation for the speech sample for the target speaker, [Mandel teaches using Wave-U-Net as a neural vocoder that "works in the time domain" ... [by] "down-sample the audio signal progressively in multiple layers and then up-sample them to generate speech" (Par [0041]), which shows "this system can produce vocoder-synthesized high-quality and noise-free speech utilizing the prosody (timing, pitch contours, and pronunciation) observed in the real noisy speech" (Par [0018])]

to generate adapted configuration data for the speech synthesis system representing the target speaker; and [Mandel Fig. 1, where the prediction model uses inputs to generate data for the vocoder for speech synthesis of the target speaker]

cause configuration of the speech synthesis system according to the adapted configuration data, wherein the speech synthesis system comprising the adapted configuration data is implemented to generate synthesized speech output data with estimated voice and time-domain speech characteristics approximating actual voice and time-domain speech characteristics for the target speaker. [Mandel Fig. 1 teaches output of synthesized audio]

With regards to claim 22, Mandel fails to teach: A non-transitory computer readable media storing a set of instructions, executable on at least one programmable device, to:

With regards to claim 22, Delfarah teaches: A non-transitory computer readable media storing a set of instructions, executable on at least one programmable device, to: [Delfarah teaches "Device 200 includes memory 202 (which optionally includes one or more computer-readable storage mediums), memory controller 222, one or more processing units (CPUs) 220" (Par [0052]). It would have been obvious to one of ordinary skill in the art at the time of applicant's filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel with the generation of text from speech as taught by Delfarah. The motivation to combine the teachings of Mandel with Delfarah is that Delfarah teaches to "allow users to interact with devices or systems using natural language in spoken and/or text forms" (Par [0003]), which increases the capabilities of the invention of Mandel to better represent the speaker's voice]

With regards to claim 23, Mandel teaches:

All the limitations of claim 1, and a computing apparatus comprising: a speech acquisition section to obtain a speech sample for a target speaker; and [Mandel Fig. 1 teaches a prediction model that is a speech acquisition section]

With regards to claim 23, Mandel fails to teach: one or more programmable processor-based devices to generate synthesized speech according to the steps of claim 1.

With regards to claim 23, Delfarah teaches: one or more programmable processor-based devices to generate synthesized speech according to the steps of claim 1. [Delfarah teaches "Device 200 includes memory 202 (which optionally includes one or more computer-readable storage mediums), memory controller 222, one or more processing units (CPUs) 220" (Par [0052]). It would have been obvious to one of ordinary skill in the art at the time of applicant's filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel with the generation of text from speech as taught by Delfarah. The motivation to combine the teachings of Mandel with Delfarah is that Delfarah teaches to "allow users to interact with devices or systems using natural language in spoken and/or text forms" (Par [0003]), which increases the capabilities of the invention of Mandel to better represent the speaker's voice]

With regards to claim 24, Mandel teaches: All the limitations of claim 1.

With regards to claim 24, Mandel fails to teach: A non-transitory computer readable media programmed with a set of computer instructions executable on a processor that, when executed, cause the operations comprising the method steps of claim 1.

With regards to claim 24, Delfarah teaches: A non-transitory computer readable media programmed with a set of computer instructions executable on a processor that, when executed, cause the operations comprising the method steps of claim 1. [Delfarah teaches "Device 200 includes memory 202 (which optionally includes one or more computer-readable storage mediums), memory controller 222, one or more processing units (CPUs) 220" (Par [0052]). It would have been obvious to one of ordinary skill in the art at the time of applicant's filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel with the generation of text from speech as taught by Delfarah. The motivation to combine the teachings of Mandel with Delfarah is that Delfarah teaches to "allow users to interact with devices or systems using natural language in spoken and/or text forms" (Par [0003]), which increases the capabilities of the invention of Mandel to better represent the speaker's voice]

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Mandel et al. (US 2022/0358904) and Delfarah et al. (US 2020/0135209) in further view of Gupta et al. (US 2022/0148561, hereinafter "Gupta").

With regards to claim 9, Mandel in view of Delfarah teaches:

All the limitations of claim 8, further comprising: performing audio validation analysis for the speech sample to determine whether the speech sample satisfies one or more audio quality criteria; and [Mandel teaches the "intelligibility and quality of the speech generated by parametric resynthesis (PR) is compared against two speech enhancement systems, ideal-ratio mask and oracle Wiener mask (OWM)" (Par [0026])]

With regards to claim 9, Mandel in view of Delfarah fails to teach: obtaining a new speech sample in response to a determination that the speech sample fails to satisfy the one or more audio quality criteria.
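The limitation at issue for claim 9, re-acquiring a sample that fails a quality criterion, can be pictured with a minimal sketch. The RMS-energy threshold and the recording callback are invented stand-ins, not Gupta's method:

```python
# Hedged sketch: gate a speech sample on a quality criterion and re-acquire
# until one passes, raising if no take is adequate.
def acquire_adequate_sample(record_fn, min_rms=0.1, max_attempts=5):
    for _ in range(max_attempts):
        sample = record_fn()
        rms = (sum(x * x for x in sample) / len(sample)) ** 0.5
        if rms >= min_rms:  # quality criterion satisfied
            return sample
    raise RuntimeError("no sample met the audio quality criteria")

takes = iter([[0.0, 0.0, 0.01], [0.3, -0.4, 0.2]])  # a quiet take, then a good one
print(acquire_adequate_sample(lambda: next(takes)))
# [0.3, -0.4, 0.2]
```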
With regards to claim 9, Gupta teaches: obtaining a new speech sample in response to a determination that the speech sample fails to satisfy the one or more audio quality criteria. [Gupta Fig. 5 teaches that when a quality threshold (550) is not satisfied, the "method loops back to block 510, where a new audio stream is received" (Par [0037]). It would have been obvious to one of ordinary skill in the art at the time of applicant's filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel and Delfarah with the method of quality metrics as taught by Gupta. The motivation to combine the teachings of Mandel and Delfarah with Gupta is that Gupta teaches "Selecting the audio asset synthesizing pipeline based on the features of the available audio streams results in a higher quality of audio assets that are generated by the trained pipeline" (Par [0013]), which increases the capabilities of the invention of Mandel and Delfarah to better represent the speaker's voice]

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Mandel et al. (US 2022/0358904), Delfarah et al. (US 2020/0135209), and Gupta et al. (US 2022/0148561) in further view of Jain (US 9,437,212, hereinafter "Jain").

With regards to claim 10, Mandel in view of Delfarah and Gupta teaches: All the limitations of claim 7.

With regards to claim 10, Mandel in view of Delfarah and Gupta fails to teach: further comprising: applying filtering and speech enhancement operations on the speech sample to enhance quality of the speech sample.

With regards to claim 10, Jain teaches: further comprising: applying filtering and speech enhancement operations on the speech sample to enhance quality of the speech sample. [Jain Figs. 1-2 teach applying a noise suppression filter (106) "to suppress noise in a noisy speech sample 202 to generate a noise-reduced output signal 220" (Col. 3, lines 32-34). It would have been obvious to one of ordinary skill in the art at the time of applicant's filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel, Delfarah, and Gupta with the method of noise suppression as taught by Jain. The motivation to combine the teachings of Mandel, Delfarah, and Gupta with Jain is that Jain teaches "in processing audio samples that include speech, it is desirable to improve the signal noise ratio (SNR) of the speech signal to enhance the intelligibility and/or perceived quality of the speech" (Col. 1, lines 26-29), which increases the capabilities of the invention of Mandel, Delfarah, and Gupta to better represent the speaker's voice]

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Mandel et al. (US 2022/0358904) in view of Wang et al. (US 2020/0142930, hereinafter "Wang").

With regards to claim 12, Mandel teaches: All the limitations of claim 1.

With regards to claim 12, Mandel fails to teach: wherein adapting the configuration data comprises: computing an adaptation stability metric representative of adaptation performance for adapting the configuration data; and aborting the adapting of the configuration data in response to a determination that the computed adaptation stability metric indicates unstable adaptation of the configuration data.

With regards to claim 12, Wang teaches: wherein adapting the configuration data comprises: computing an adaptation stability metric representative of adaptation performance for adapting the configuration data; and [Wang teaches computing a "stability metric" (Par [0118]) to evaluate data] aborting the adapting of the configuration data in response to a determination that the computed adaptation stability metric indicates unstable adaptation of the configuration data. [Wang teaches the system can "remove" (Par [0118]) data based on the metric showing the data is not stable. It would have been obvious to one of ordinary skill in the art at the time of applicant's filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel with the method of evaluating data as taught by Wang. The motivation to combine the teachings of Mandel with Wang is that Wang teaches the "data processing system can apply a weight to metrics stored in the metric data structure. The weight can refer to a value, a significance, or importance of a metric" (Par [0119]), which increases the capabilities of the invention of Mandel to better evaluate data]

Claims 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Mandel et al. (US 2022/0358904) and Wang et al. (US 2020/0142930) in further view of Gupta et al. (US 2022/0148561).

With regards to claim 13, Mandel in view of Wang teaches: All the limitations of claim 12.

With regards to claim 13, Mandel in view of Wang fails to teach: further comprising: re-starting the adapting of the configuration data using the speech sample for the target speaker.

With regards to claim 13, Gupta teaches: further comprising: re-starting the adapting of the configuration data using the speech sample for the target speaker. [Gupta Fig. 5 teaches that when a quality threshold (550) is not satisfied, the "method loops back to block 510, where a new audio stream is received" (Par [0037]). It would have been obvious to one of ordinary skill in the art at the time of applicant's filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel and Wang with the method of quality metrics as taught by Gupta. The motivation to combine the teachings of Mandel and Wang with Gupta is that Gupta teaches "Selecting the audio asset synthesizing pipeline based on the features of the available audio streams results in a higher quality of audio assets that are generated by the trained pipeline" (Par [0013]), which increases the capabilities of the invention of Mandel and Wang to better represent the speaker's voice]

With regards to claim 14, Mandel in view of Wang teaches: All the limitations of claim 12.

With regards to claim 14, Mandel in view of Wang fails to teach: further comprising: obtaining, following the aborting, a new speech sample for the target speaker; and performing the adapting of the configuration data using the new speech sample.

With regards to claim 14, Gupta teaches: further comprising: obtaining, following the aborting, a new speech sample for the target speaker; and performing the adapting of the configuration data using the new speech sample. [Gupta Fig. 5 teaches that when a quality threshold (550) is not satisfied, the "method loops back to block 510, where a new audio stream is received" (Par [0037]). It would have been obvious to one of ordinary skill in the art at the time of applicant's filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel and Wang with the method of quality metrics as taught by Gupta. The motivation to combine the teachings of Mandel and Wang with Gupta is that Gupta teaches "Selecting the audio asset synthesizing pipeline based on the features of the available audio streams results in a higher quality of audio assets that are generated by the trained pipeline" (Par [0013]), which increases the capabilities of the invention of Mandel and Wang to better represent the speaker's voice]

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Mandel et al. (US 2022/0358904) in view of Clark et al.
(US2020/0074985 hereinafter Clark) With regards to claim 17, Mandel teaches: All the limitations of claim 16 wherein transforming the speech sample into the spectral-domain vector representation comprises: transforming the speech sample into a plurality of mel spectrogram frames; and [Mandel teaches “parameters include a log mel spectrogram which includes a log mel spectrum of individual frames of audio” (Par [0042])] With regards to claim 17, Mandel fails to teach: mapping the plurality of mel spectrogram frame into a fixed-dimensional vector. With regards to claim 17, Clark teaches: mapping the plurality of mel spectrogram frame into a fixed-dimensional vector. [Clark Fig 5 teaches “mel spectral embedding 560 may be represented by a fixed-length numerical vector” (Par [0080]) It would be obvious to one of ordinary skill in the art at the time of applicant’s filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel with the variational encoder as taught by Clark. The motivation to combine the teachings of Mandel with Clark is because Clark teaches training so that “data associated with the reference and predicted mel-frequency spectrogram frames 520, 580 substantially match one another. The predicted mel-frequency spectrogram frames 580 may implicitly provide a prosodic representation of the reference audio signal 222” which increases the capabilities of the invention of Mandel to better represent the speaker’s voice] Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Mandel et al.(US2022/0358904) in view of Cardella et al. (US12148417 hereinafter Cardella) With regards to claim 18, Mandel teaches: All the limitations of claim 1 wherein adapting the configuration data comprises adapting the configuration data for the speech synthesis system based further on the parametric style representation. 
[Mandel Fig 1 teaches parametric resynthesis (Par [0022])] With regards to claim 18, Mandel fails to teach: further comprising: generating, using a variational autoencoder, a parametric style representation for the prosodic style associated with the speech sample; With regards to claim 18, Cardella teaches: further comprising: generating, using a variational autoencoder, a parametric style representation for the prosodic style associated with the speech sample; [Cardella teaches natural language processing system (120) includes “TTS component 180 may use a hierarchical variational autoencoder system (or other machine learning architecture) to generate prosody-rich synthesized speech for various applications” (Col 13, lines 26-29) for parametric synthesis. It would be obvious to one of ordinary skill in the art at the time of applicant’s filing to combine the method of parametric resynthesis using a neural vocoder as taught by Mandel with the variational autoencoder taught by Cardella. The motivation to combine the teachings of Mandel with Delfarah is because Cardella teaches “hierarchical autoencoder system(or other machine learning architecture)” (Col 13, lines 26-28) which increases the capabilities of the invention of Mandel to use other machine learning architecture to better produce speech] Allowable Subject Matter Claims 5, 15, and 19 objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to Joseph J Yamamoto whose telephone number is (571)272-4020. The examiner can normally be reached M-F 1000-1800 EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. 
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. JOSEPH J. YAMAMOTO Examiner Art Unit 2656 /BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656
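The claim 17 limitations at issue (transforming a speech sample into a plurality of mel spectrogram frames, then mapping them into a fixed-dimensional vector) describe a common speaker-embedding pattern. Below is a minimal NumPy sketch of that pattern, not any party's actual implementation: the framing parameters, the crude per-band bin averaging standing in for a true mel filterbank, and the mean/std statistics pooling are all illustrative assumptions.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D waveform into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def log_mel_frames(x, n_mels=40):
    """Per-frame log power spectrum pooled into n_mels bands
    (simple bin grouping in place of a real mel filterbank)."""
    frames = frame_signal(x)
    spec = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1)) ** 2
    bands = np.array_split(spec, n_mels, axis=1)       # group FFT bins per band
    mel = np.stack([b.mean(axis=1) for b in bands], axis=1)
    return np.log(mel + 1e-10)                         # (num_frames, n_mels)

def fixed_dim_embedding(mel):
    """Map a variable number of mel frames to one fixed-length vector
    via mean + std statistics pooling over time."""
    return np.concatenate([mel.mean(axis=0), mel.std(axis=0)])  # (2 * n_mels,)

x = np.random.default_rng(0).standard_normal(16000)    # 1 s of noise at 16 kHz
mel = log_mel_frames(x)
emb = fixed_dim_embedding(x if False else mel)
print(mel.shape, emb.shape)                            # (98, 40) (80,)
```

The key point for the claim language: the number of mel frames varies with the input length, but the pooled embedding is always the same dimension, which is what "fixed-dimensional vector" captures.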

Prosecution Timeline

May 15, 2024
Application Filed
Mar 05, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602546
KEY POINTS EXTRACTION FOR UNIFORM RESOURCE LOCATORS
2y 5m to grant · Granted Apr 14, 2026

Patent 12602377
SYSTEMS AND METHODS FOR QUESTION ANSWERING WITH DIVERSE KNOWLEDGE SOURCES
2y 5m to grant · Granted Apr 14, 2026

Patent 12592220
DEEPFAKE DETECTION
2y 5m to grant · Granted Mar 31, 2026

Patent 12585875
DEVICE AND METHOD FOR PROCESSING TEMPORAL EXPRESSIONS FROM UNSTRUCTURED TEXTS FOR FILLING A KNOWLEDGE DATABASE
2y 5m to grant · Granted Mar 24, 2026

Patent 12566888
MULTI-LINGUAL NATURAL LANGUAGE GENERATION
2y 5m to grant · Granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
72%
Grant Probability
93%
With Interview (+21.2%)
3y 0m
Median Time to Grant
Low
PTA Risk
Based on 43 resolved cases by this examiner. Grant probability derived from career allow rate.
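The headline projections above can be reproduced from the stated career data. A minimal sketch, assuming the tool simply divides grants by resolved cases and adds the interview lift as percentage points (the page does not disclose its exact model):

```python
granted, resolved = 31, 43          # examiner's career totals (from the page)
interview_lift = 0.212              # +21.2% lift observed with interviews

allow_rate = granted / resolved                 # career allow rate
with_interview = allow_rate + interview_lift    # assumed additive lift

print(f"{allow_rate:.0%}")          # 72%
print(f"{with_interview:.0%}")      # 93%
```

Under these assumptions, 31/43 rounds to the 72% grant probability shown, and adding the 21.2-point interview lift yields the 93% "With Interview" figure.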
