Prosecution Insights
Last updated: April 19, 2026
Application No. 18/300,871

AUDIO GENERATOR AND METHODS FOR GENERATING AN AUDIO SIGNAL AND TRAINING AN AUDIO GENERATOR

Status: Non-Final OA (§103), Round 3
Filed: Apr 14, 2023
Examiner: LEE, JANGWOEN
Art Unit: 2656
Tech Center: 2600 — Communications
Assignee: Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.

Grant Probability: 82% (Favorable); 99% with interview
Expected OA Rounds: 3-4
Estimated Time to Grant: 2y 11m

Examiner Intelligence

Career Allow Rate: 82% — above average (36 granted / 44 resolved; +19.8% vs TC avg)
Interview Lift: +24.2% — allow rate in resolved cases with vs. without an interview
Typical Timeline: 2y 11m average prosecution; 23 applications currently pending
Career History: 67 total applications across all art units
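
These headline figures are simple ratios over the examiner's resolved cases. A minimal sketch of how such numbers could be computed (field names are hypothetical; the dashboard's actual pipeline is not public):

```python
from dataclasses import dataclass

@dataclass
class ResolvedCase:
    granted: bool
    had_interview: bool  # hypothetical field; the real schema is unknown

def allow_rate(cases: list[ResolvedCase]) -> float:
    """Career allow rate: granted / resolved."""
    return sum(c.granted for c in cases) / len(cases)

def interview_lift(cases: list[ResolvedCase]) -> float:
    """Allow-rate difference between cases with and without an interview."""
    with_iv = [c for c in cases if c.had_interview]
    without_iv = [c for c in cases if not c.had_interview]
    if not with_iv or not without_iv:
        return 0.0  # lift undefined without both groups
    return allow_rate(with_iv) - allow_rate(without_iv)

# Example: 36 grants out of 44 resolved cases -> ~82%
cases = [ResolvedCase(granted=i < 36, had_interview=False) for i in range(44)]
print(f"{allow_rate(cases):.0%}")  # 82%
```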

Statute-Specific Performance

§101: 26.5% (-13.5% vs TC avg)
§103: 54.6% (+14.6% vs TC avg)
§102: 11.0% (-29.0% vs TC avg)
§112: 4.1% (-35.9% vs TC avg)

Tech Center averages are estimates • Based on career data from 44 resolved cases
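
Each "vs TC avg" delta is just the examiner's statute-specific rate minus the Tech Center estimate, so the implied TC averages can be backed out of the table. A sketch, assuming the rates and deltas above are exact:

```python
# Examiner's statute-specific rates and deltas, copied from the table above.
examiner = {"§101": 0.265, "§103": 0.546, "§102": 0.110, "§112": 0.041}
delta    = {"§101": -0.135, "§103": 0.146, "§102": -0.290, "§112": -0.359}

# Back out the implied Tech Center average: tc = examiner - delta.
tc_avg = {k: examiner[k] - delta[k] for k in examiner}
for statute, rate in examiner.items():
    print(f"{statute}: {rate:.1%} ({delta[statute]:+.1%} vs TC avg {tc_avg[statute]:.1%})")
# Note: every statute backs out to the same 40.0% TC estimate.
```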

Office Action (§103)

DETAILED ACTION

Continued Examination Under 37 CFR 1.114

1. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 03/10/2026 has been entered.

This communication is in response to the Application filed on 03/10/2026. Claims 1, 6-7, 10-14, 16-26, 28-30, 33-37, 39-49 and 53 are pending and have been examined. Claims 1, 26 and 53 are independent. Claims 2-5, 8-9, 15, 27, 31-32, 38, 50-52 and 54 are canceled. This Application was published as U.S. Pub. No. 20230282202.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on 10/08/2025, 11/18/2025, 12/30/2025, 02/02/2026 and 02/20/2026 were filed. The submissions are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Response to Arguments

With respect to the interpretation of Claim 1 as a means-plus-function limitation under 35 U.S.C. § 112(f), Applicant appears to present the following position in the Remarks, pp. 12-13, filed on 03/10/2026: "…The terms 'processing block' and 'styling element' do not use the word 'means' and therefore carry a strong presumption against § 112(f) interpretation. Furthermore, amended claims 1 and 26 do not merely recite these terms in a vacuum, the structural composition of each element is explicitly recited in the claims themselves…".

In response, Examiner respectfully notes that the limitations "a first processing block, configured to receive first data derived from the noise and to output first output data", "a second processing block, configured to receive, as second data, the first output data", and "a styling element, configured to apply the conditioning feature parameters to the first data or normalized first data", as recited in claim 1, clearly invoke 35 U.S.C. 112(f) because the claim limitations meet the following 3-prong analysis: the claim limitations use "block" and "element" as a substitute for "means", i.e., a generic placeholder for performing the claimed function; the generic placeholders "block" and "element" are modified by functional language, typically by the transition word "for" (e.g., "means for"), "configured to", or "so that"; and the terms "block" and "element" are not modified by sufficient structure, material, or acts for performing the claimed function. The modifiers "first/second processing" and "styling" preceding "block" and "element" do not provide sufficiently definite meaning as the name for structure. Applicant further presents a "structural composition" for "first processing block" and "block" as a structural term, but these all fail to recite sufficient structure beyond functional composition. For the provided reasons, Examiner respectfully disagrees, and therefore the interpretation of Claim 1 as a means-plus-function limitation under 35 U.S.C. § 112(f) is sustained.
The interpretation of Claim 26 as a means-plus-function limitation under 35 U.S.C. § 112(f) is withdrawn because the claim limitations with respect to "first/second processing block" and "styling element" do not meet the 3-prong analysis.

Applicant's arguments with respect to claims 1, 6-7, 10-14, 16-26, 28-30, 33-37, 39-49 and 53 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Please note Skordilis et al. (US Pub No. 2021/0074308) and Koishida et al. (US Pub No. 2021/0134312). For at least the reasons provided supra, Applicant's arguments have been fully considered but they are not persuasive.

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term "means" or "step" or a term used as a substitute for "means" that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term "means" or "step" or the generic placeholder is modified by functional language, typically, but not always, linked by the transition word "for" (e.g., "means for") or another linking word or phrase, such as "configured to" or "so that"; and (C) the term "means" or "step" or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word "means" (or "step") in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word "means" (or "step") in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.

Claim limitations in this application that use the word "means" (or "step") are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word "means" (or "step") are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

This application includes one or more claim limitations that do not use the word "means," but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: "a first processing block," "a second processing block," and "a styling element" in Claim 1.

Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 6-7, 10-14, 16-26, 28-30, 33-37, 39-49 and 53 are rejected under 35 U.S.C. 103 as being unpatentable over Skordilis et al. (US Pub No. 2021/0074308, hereinafter Skordilis) in view of Binkowski et al. (US Pub. No. 2021/0089909, hereinafter Binkowski), Peng et al. (US Pub. No. 2020/0066253, hereinafter Peng) and Koishida et al. (US Pub No. 2021/0134312, hereinafter Koishida).

Regarding Claim 1, Skordilis discloses an audio generator (Abstract, Fig. 1, par [047], voice decoder 104), configured to generate an audio signal (Fig. 1, reconstructed speech signal 105) from noise (Fig. 3, par [074], "…The residual data generated by the neural network model 334...") and target data, the target data representing the audio signal (Fig. 1, speech signal 101; Figs. 5A/5B, voice signal s(n)), the audio generator comprising: a first processing block (Figs. 4-5, par [078-079, 101-105], neural network model 534 including frame rate network 442 and sample rate network 452 in Fig. 4, part of the neural network model 534 in the voice decoder 504 in Figs. 5A/5B), configured to receive first data derived from the noise and to output first output data (Fig. 5A, par [104], "…The neural network model 534 generates an LTP residual ê(n) by effectively decoding the input features into the LTP residual ê(n)..."), wherein the first output data comprises a plurality of channels (Fig. 5A, par [064], "…the feature 541 includes linear prediction LP coefficients, line spectral pairs LSPs, line spectral frequencies LSFs, pitch lag with integer or fractional accuracy, pitch gain, pitch correlation"; i.e., channels with different feature data), and a second processing block (Fig. 5A, including LTP Engine 522 and Short-term LP Engine 520), configured to receive, as second data (Fig. 5A, the residual signal e(n) to LTP prediction), the first output data (Fig. 5A, para [082, 088], "…input e(n) to adder in LTP Engine 822 and then to short-term LP Engine 820", similar to "+" in Figs. 5A/5B), wherein the first processing block comprises a neural network which applies, for each channel of the first output data (Fig. 4, para [078-079], "…frame rate network 442...the sample rate network 452..."): a conditioning set of learnable layers (Fig. 4, par [078], a first convolutional 1x3 layer 444, the second convolutional layer 446) configured to process the target data to obtain conditioning feature parameters (Fig. 4, par [078, 104], 128-dimensional conditioning vector f processed by convolution layer 444 and convolution layer 446), a styling element (Fig. 4, par [079], concatenation layer 454, GRU 456, 458), configured to apply the conditioning feature parameters to the first data or normalized first data (Figs. 4 and 5A/5B, par [079, 104], applying the 128-dimensional conditioning vector f, through at least 420, and p(n) returned from short-term LP engine 520 in Figs. 5A/5B, and also pitch estimation engine 566 and pitch gain estimation engine 564 from the features 541 in Figs. 5A/5B); and wherein the second processing block is configured to combine the plurality of channels of the second data to generate and render the audio signal (Figs. 5A/5B, par [0], Ŝ(n) as audio signal through adders in 521 and 523).

However, Skordilis does not explicitly disclose the audio generator being configured to obtain the target data by converting an input in form of text or elements of text onto at least one acoustic feature.
Binkowski, in the analogous field of endeavor, discloses the audio generator (Fig. 1, the generative neural network 110) being configured to generate an audio signal from noise (Figs. 1 and 2, par [025], noise input 104/204) and target data, and to obtain the target data by converting an input in form of text or elements of text onto at least one acoustic feature (Figs. 1 and 2, para [021-024], "…the generative neural network 110 to receive a conditioning text input 102 and to process the conditioning text input 102 to generate an audio output 112…the conditioning text input 102 characterizes an input text..."; the conditioning text input includes the input text itself and a character- or word-level embedding of the input text characterizing linguistic and acoustic features such as phoneme and pitch information).

Therefore, it would have been obvious to a person of ordinary skill in the art to use Binkowski's generative neural network (generator block) in Skordilis' voice coding system to generate audio samples from the conditioning text and noise input as specified in Claim 1 (Binkowski, Abstract, para [002-004, 20]).

However, neither Skordilis nor Binkowski explicitly discloses the limitation "the audio generator being configured to obtain the target data by converting an input in form of text or elements of text onto at least one acoustic feature, wherein the target data comprise the at least one acoustic feature among a spectrogram, a log-spectrogram, or an MFCC, and a mel-spectrogram or another type of spectrogram obtained from a text."

Peng, in the analogous field of endeavor, discloses the audio generator being configured to obtain the target data by converting an input in form of text or elements of text onto at least one acoustic feature, wherein the target data comprise the at least one acoustic feature among a spectrogram, a log-spectrogram, or an MFCC, and a mel-spectrogram or another type of spectrogram obtained from a text (Peng, Fig. 2, par [058], "…an encoder 205 to encode text into per-time step key and value vectors…", "…the decoder 230 uses these to predict the mel-scale log magnitude spectrograms 242…").

Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the audio generative neural network taught by Skordilis in view of Binkowski by implementing the audio synthesis module (e.g., acoustic model) taught by Peng to generate the target data as the spectrogram of Peng from input texts, with a reasonable expectation of success, to create an optimal neural network architecture for improved speaker text-to-speech systems (Peng, paras. [004-005]).

However, neither Skordilis nor Binkowski nor Peng explicitly discloses 2-dimensional convolutions which provide feature parameters having a dimension adapted to the dimension of the first data and the audio synthesis process to obtain the waveform.
Koishida, in the analogous field of endeavor, discloses an audio generator (Abstract, Fig. 1, par [015], speech enhancement system 100) wherein a conditioning set of learnable layers (Fig. 1, ResBlock in encoder 116 and upsample/Conv2D of decoder 122) is configured to process the target data (the output from Log-Mel 114 is processed through the encoder 116 and decoder 122) to obtain conditioning feature parameters (mask prediction 124 for multiple noises) through 2-dimensional convolutions which provide feature parameters having a dimension adapted to the dimension of the first data, and an audio synthesis process to obtain the waveform (Fig. 1, para [020-021], through multiple upsample/Conv2D blocks in decoder 122 in Fig. 1 and channel-wise concatenation for various direct mapping and mask approximation). Koishida further discloses wherein the second processing block is configured to combine the plurality of channels of the second data to generate and render the audio signal (Fig. 1, par [0], through the iSTFT, by applying phase 130 and enhanced magnitude 126 in the time-frequency domain as inputs, to render enhanced waveform 128).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied Koishida's teaching, in which the second processing block is configured to combine the plurality of channels of the second data and perform a synthesis to obtain the audio signal, to the second processing block configured to combine the plurality of channels of the second data in the audio signal generator/decoder, as taught by the combination of Skordilis, Binkowski, and Peng, to improve the quality of generated audio sound by improving the perceptual quality of a targeted acoustic signal and by improving performance in reducing the number of model parameters (Koishida, para [010, 032, 040]).

Regarding Claim 6, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, configured to obtain the target data by converting at least one linguistic feature onto the at least one acoustic feature (Peng, Fig. 2, para [064-073], "…a phoneme-only model requires a preprocessing step to convert words to their phoneme representations (e.g., by using an external phoneme dictionary or a separately trained grapheme-to-phoneme model)"; i.e., conversion from text (e.g., word) to linguistic feature (e.g., phoneme); par [058], "…an encoder 205 to encode text into per-time step key and value vectors…", "…the decoder 230 uses these to predict the mel-scale log magnitude spectrograms 242…").

Regarding Claim 7, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, configured to convert the input in form of text onto at least one linguistics feature among a phoneme, words prosody, intonation, phrase breaks, and filled pauses obtained from a text, and to further convert the at least one linguistics feature onto the at least one acoustic feature (Peng, Fig. 2, para [064-073], "…a phoneme-only model requires a preprocessing step to convert words to their phoneme representations (e.g., by using an external phoneme dictionary or a separately trained grapheme-to-phoneme model)"; i.e., conversion from text (e.g., word) to linguistic feature (e.g., phoneme); par [058], "…an encoder 205 to encode text into per-time step key and value vectors…", "…the decoder 230 uses these to predict the mel-scale log magnitude spectrograms 242…").
Regarding Claim 10, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, configured to derive the target data from the text using a statistical model, performing text analysis and/or using an acoustic model (Peng, Fig. 2, para [054, 058], target data from the input text to log-mel spectrograms 135 (i.e., acoustic model) using a causal convolutional decoder, which decodes the encoder representation with an attention mechanism 120 in an autoregressive manner (i.e., statistical model)).

Regarding Claim 11, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, configured to derive the target data from the text using a learnable model performing text analysis and/or using an acoustic model (Peng, Fig. 2, para [054, 058], target data from the input text to log-mel spectrograms 135 (i.e., acoustic model) using a causal convolutional decoder, which decodes the encoder representation with an attention mechanism 120; para [064-073], Text Preprocessing and phoneme-only model, or mixed character-and-phoneme models (i.e., text analysis)).

Regarding Claim 12, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, configured to derive the target data from the text using a rules-based algorithm performing text analysis and/or an acoustic model (Peng, Fig. 2, para [054, 058], target data from the input text to log-mel spectrograms 135 (i.e., acoustic model) using a causal convolutional decoder, which decodes the encoder representation with an attention mechanism 120; para [064-069], Text Preprocessing (i.e., text analysis and rule-based)).

Regarding Claim 13, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, configured to derive the target data through at least one deterministic layer (Peng, Figs. 2 and 3, par [058], residual convolutional layers and fully connected layers in encoder 205/305).

Regarding Claim 14, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, configured to derive the target data through at least one learnable layer (Skordilis, Fig. 4, par [078], a first convolutional 1x3 layer 444, the second convolutional layer 446).

Regarding Claim 16, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, wherein a first convolution layer is configured to convolute the target data or up-sampled target data to obtain first convoluted data using a first activation function (Binkowski, Fig. 2, Table 1, par [053], "…the activation layer 212b can be a ReLU activation layer…the final five generator blocks listed in Table 1, the upsampling layer 212c upsamples the output of the activation layer by the corresponding upsampling factor p…").

Regarding Claim 17, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, wherein the conditioning set of learnable layers and the styling element are part of a weight layer in a residual block of a neural network comprising one or more residual blocks (Peng, Fig. 3, par [061], "…the model 300 uses a deep residual convolutional network (i.e., residual block) to encode text and/or phonemes into per-timestep key 320 and value 322 vectors for an attentional decoder 330…"; "…weight normalization is applied to all convolution filters and fully connected layer weight matrices in the model… (i.e., weight layer)").
Regarding Claim 18, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, wherein the audio generator further comprises a normalizing element, which is configured to normalize the first data (Skordilis, par [091], "…normalized correlation function..."; Koishida, par [19], "…Convolution may be performed over time-frequency dimensions and may be followed by batch normalization and a leaky rectified linear unit (ReLU) activation...").

Regarding Claim 19, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, wherein the audio signal is a voice audio signal (Koishida, title, abstract, speech enhancement system, Fig. 1, Enhanced Waveform 128; Binkowski, par [023], "…the audio output 112 depicts speech corresponding to the input text").

Regarding Claim 20, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, wherein the target data is up-sampled by a factor of at least 2 (Binkowski, Fig. 2, para [029, 49, 53], "…a dimensionality of the layer output of each upsampling layer is larger than a dimensionality of the layer input of the upsampling layer, e.g., 1600 Hz turned to 3200 or 2x frequency...").

Regarding Claim 21, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 20, wherein the target data is up-sampled by non-linear interpolation (Binkowski, par [053], "…the upsampling layer 212c can perform higher-order (i.e., non-linear) interpolation on the elements of the layer input to generate the layer output"; Binkowski teaches linear or higher-order interpolations).

Regarding Claim 22, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 16, wherein the first activation function is a leaky rectified linear unit, leaky ReLU, function (Binkowski, Fig. 2, par [053], "…the activation layer 212b can be a ReLU activation layer…").

Regarding Claim 23, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, wherein convolution operations run with a maximum dilation factor of 2 (Binkowski, Fig. 2, 212d, 216c, 232c, 236c, dilation 2, 4, 8; par [063], "…a dilation value of (1, 2, 4, or 8)…").

Regarding Claim 24, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, comprising eight first processing blocks and one second processing block (Binkowski, Table 1, Example Generator Neural Network Architecture, seven G-Blocks and input/output Conv. layers).

Regarding Claim 25, the combination of Skordilis, Binkowski, Peng, and Koishida further teaches the audio generator of claim 1, wherein the first data comprises a lower dimensionality than the audio signal (Binkowski, par [070], through downsampling).

Claim 26 is a method claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale. The rationale for combination is similar to that provided for Claim 1.
Claim 28 is a method claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale.
Claim 29 is a method claim with limitations similar to the limitations of Claim 6 and is rejected under similar rationale.
Claim 30 is a method claim with limitations similar to the limitations of Claim 7 and is rejected under similar rationale.
Claim 33 is a method claim with limitations similar to the limitations of Claim 10 and is rejected under similar rationale.
Claim 34 is a method claim with limitations similar to the limitations of Claim 11 and is rejected under similar rationale.
Claim 35 is a method claim with limitations similar to the limitations of Claim 12 and is rejected under similar rationale.
Claim 36 is a method claim with limitations similar to the limitations of Claim 13 and is rejected under similar rationale.
Claim 37 is a method claim with limitations similar to the limitations of Claim 14 and is rejected under similar rationale.
Claim 39 is a method claim with limitations similar to the limitations of Claim 16 and is rejected under similar rationale.
Claim 40 is a method claim with limitations similar to the limitations of Claim 17 and is rejected under similar rationale.
Claim 41 is a method claim with limitations similar to the limitations of Claim 18 and is rejected under similar rationale.
Claim 42 is a method claim with limitations similar to the limitations of Claim 19 and is rejected under similar rationale.
Claim 43 is a method claim with limitations similar to the limitations of Claim 20 and is rejected under similar rationale.
Claim 44 is a method claim with limitations similar to the limitations of Claim 21 and is rejected under similar rationale.
Claim 45 is a method claim with limitations similar to the limitations of Claim 22 and is rejected under similar rationale.
Claim 46 is a method claim with limitations similar to the limitations of Claim 23 and is rejected under similar rationale.
Claim 47 is a method claim with limitations similar to the limitations of Claim 24 and is rejected under similar rationale.
Claim 48 is a method claim with limitations similar to the limitations of Claim 25 and is rejected under similar rationale.
Claim 49 is a method claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale.

Claim 53 is a non-transitory digital storage medium claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale. Additionally, Binkowski discloses a non-transitory digital storage medium having a computer program stored thereon to perform the method (Binkowski, par [122], "…one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution") for generating an audio signal by an audio generator from an input signal and target data, the target data representing the audio signal and being derived from a text. The rationale for combination is similar to that provided for Claim 1.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Arik et al. (US Pub No. 2019/0355347) discloses a deep neural network architecture, based on transposed convolutions, that can efficiently synthesize waveforms from spectrograms without any autoregressive computation using a generative adversarial network (GAN) framework.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JANGWOEN LEE, whose telephone number is (703) 756-5597. The examiner can normally be reached Monday-Friday, 8:00 am - 5:00 pm ET. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, BHAVESH MEHTA, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JANGWOEN LEE/
Examiner, Art Unit 2656

/Paras D Shah/
Supervisory Patent Examiner, Art Unit 2653

03/23/2026
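
Technical sidebar (editor's illustration): the claim 1 "styling element, configured to apply the conditioning feature parameters to the first data or normalized first data" describes the feature-wise conditioning pattern (often called FiLM-style modulation or conditional normalization) that is common in neural vocoders. Below is a minimal PyTorch sketch of that general pattern under the editor's assumptions; all names, shapes, and the choice of instance normalization are hypothetical, and this is not the applicant's disclosed structure or code from any cited reference.

```python
# FiLM-style "styling element": a hypothetical sketch, not the claimed design.
import torch
import torch.nn as nn

class StylingBlock(nn.Module):
    """Learnable conditioning layers map target data (e.g., mel-spectrogram
    frames) to per-channel scale/shift parameters, which are then applied to
    the normalized first data."""

    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        # "Conditioning set of learnable layers" -> gamma/beta per channel.
        self.to_gamma = nn.Conv1d(cond_dim, channels, kernel_size=1)
        self.to_beta = nn.Conv1d(cond_dim, channels, kernel_size=1)
        self.norm = nn.InstanceNorm1d(channels)  # "normalized first data"

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -- "first data" derived from noise
        # cond: (batch, cond_dim, time) -- target data aligned to x in time
        gamma = self.to_gamma(cond)
        beta = self.to_beta(cond)
        # "Styling element": apply the conditioning feature parameters.
        return gamma * self.norm(x) + beta

block = StylingBlock(channels=64, cond_dim=80)  # 80 mel bins (assumed)
x = torch.randn(1, 64, 256)                     # first data from noise
cond = torch.randn(1, 80, 256)                  # upsampled target data
print(block(x, cond).shape)                     # torch.Size([1, 64, 256])
```

In designs of this kind the conditioning input is first upsampled to the time resolution of the latent signal, which mirrors the up-sampling by a factor of at least 2 discussed for claim 20 above.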

Prosecution Timeline

Apr 14, 2023: Application Filed
Apr 03, 2025: Non-Final Rejection — §103
Jul 31, 2025: Response Filed
Oct 10, 2025: Final Rejection — §103
Mar 10, 2026: Request for Continued Examination
Mar 12, 2026: Response after Non-Final Action
Mar 22, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597432: HUM NOISE DETECTION AND REMOVAL FOR SPEECH AND MUSIC RECORDINGS (granted Apr 07, 2026; 2y 5m to grant)
Patent 12586571: EFFICIENT SPEECH TO SPIKES CONVERSION PIPELINE FOR A SPIKING NEURAL NETWORK (granted Mar 24, 2026; 2y 5m to grant)
Patent 12573381: SPEECH RECOGNITION METHOD AND APPARATUS, STORAGE MEDIUM, AND ELECTRONIC DEVICE (granted Mar 10, 2026; 2y 5m to grant)
Patent 12567430: METHOD AND DEVICE FOR IMPROVING DIALOGUE INTELLIGIBILITY DURING PLAYBACK OF AUDIO DATA (granted Mar 03, 2026; 2y 5m to grant)
Patent 12566930: CONDITIONING OF PRODUCTIVITY APPLICATION FILE CONTENT FOR INGESTION BY AN ARTIFICIAL INTELLIGENCE MODEL (granted Mar 03, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 82% (99% with interview, +24.2%)
Median Time to Grant: 2y 11m
PTA Risk: High
Based on 44 resolved cases by this examiner. Grant probability derived from career allow rate.
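
The 99% with-interview figure appears to be the 82% base grant probability plus the +24.2% interview lift, capped at a 99% display ceiling; the cap is the editor's assumption, since 82% + 24.2% would otherwise exceed 100%. A one-line check:

```python
base, lift = 0.82, 0.242
with_interview = min(base + lift, 0.99)  # 99% ceiling is an assumption
print(f"{with_interview:.0%}")           # -> 99%
```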

Free tier: 3 strategy analyses per month