DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the claim amendment filed on October 16, 2025, in which claims 1-5, 10, 27, 32-33, 38-42, 46, 52, and 55 were amended and claim 56 was newly added.
By virtue of this communication, claims 1-56 are currently pending in this Office Action.
With respect to the objection to the specification for failure to disclose the claimed features recited in claims 1 and 55, as set forth in the previous Office Action, the claim amendment and arguments (see paragraph 3 of page 11 and paragraphs 1-2 of page 12 of the Remarks filed on October 16, 2025) have been fully considered, and the arguments are persuasive. Therefore, the objection to the specification for failure to disclose the claimed features recited in claims 1 and 55, as set forth in the previous Office Action, has been withdrawn.
With respect to the objection to claims 2-5, 10, and 32-33 for informalities, as set forth in the previous Office Action, the claim amendment and arguments (see paragraphs 2-3 of page 12 of the Remarks filed on October 16, 2025) have been fully considered, and the arguments are persuasive. Therefore, the objection to claims 2-5, 10, and 32-33 for informalities, as set forth in the previous Office Action, has been withdrawn.
Claim Interpretation
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action mailed on July 16, 2025.
Claim Objections
Claims 1-56 are objected to because of the following informalities:
Claim 1 recites “Audio decoder, configured to generate an audio signal from a bitstream …”, which should read -- An [[A]]audio decoder, configured to generate an audio signal from a bitstream … -- because “audio decoder” herein is singular and requires an article. Claims 2-54 and 56 are objected to due to their dependency on claim 1.
Claim 55 recites “Method for decoding an audio signal …”, which should read -- A [[M]]method for decoding an audio signal … -- because “method” herein is singular and requires an article.
Claim 56 recites “Audio decoder according to claim 1”, which should read -- The [[A]]audio decoder according to claim 1 --.
Appropriate correction is required.
Claim Rejections - 35 USC § 101
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action mailed on July 16, 2025.
Applicant challenged the claim rejection under 35 U.S.C. § 101 as applied in the previous Office Action by amending claims 1 and 55 to recite “synthesis”, and argued that the application specification discloses an improvement in “audio synthesis” (see paragraph 6 of page of the Remarks). However, the broad recitation of “synthesis” with abstract “channels” is not considered to overcome the rejection under 35 U.S.C. § 101, because the claims recite “perform a synthesis” of nothing, or of abstract “channels”, rather than the argued “audio channels”, and there is no integration into a practical application of the claimed subject matter, which purports to generate an “audio signal” from a “bitstream” representation of the audio signal. The “synthesis” is merely one step of generating the “audio signal”, with no practical integration of the generated “audio signal”; thus, the “audio signal” herein is a mere mathematical symbol produced by mathematical manipulation from one recited “box” to another, such as a “convolution”, which is a mere mathematical formula applied to abstract “data” such as the “target data” output from the box “at least one preconditioning learnable layer” by taking the “bitstream” with no processing. In addition, the claims fail to recite the physical meaning of “signal”, “data”, “sample”, and “channel”, i.e., whether they are audio, acoustic, image, or video data, or mere mathematical symbols passed from claimed box to box with no processing. Therefore, the rejection of claims 1-55 under 35 U.S.C. § 101 is maintained. Similarly, a PQMF synthesis filter applied to nothing to “obtain [the] audio signal”, as recited in newly added claim 56, would be considered the application of a mathematical formula; thus, claim 56 is patent ineligible as well.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(B) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
Claims 1-56 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which applicant regards as the invention.
Claim 1 recites “first data derived from an input signal from an external or internal source or from the bitstream”, which is confusing because it is unclear whether “from the bitstream” modifies “first data derived”, such that the “first data” is derived “from the bitstream”, or modifies “an input signal”, such that the “input signal” is “from the bitstream”; this renders the claim indefinite. Claim 1 further recites “apply the plurality of conditioning feature parameters to the first data derived from the input signal or a normalized first data”, which is further confusing because it is unclear whether the parameters are applied to “the first data” or to “a normalized first data”, or whether “the first data” is “derived from the input signal” or derived from “a normalized first data”; this further renders the claim indefinite. Claim 1 further recites “combine the plurality of channels of the second data and perform a synthesis to obtain the audio signal”, which is further confusing because it is unclear whether “the audio signal” is obtained by “combin[ing] … channels” or by “perform[ing] a synthesis”; this further renders the claim indefinite. Note: it is well known in the art that the meaning of “synthesis” is narrower than the meaning of “combine”, and the amendment appears to place a narrow limitation and a broad limitation simultaneously in the same claim, which causes the confusion. Claims 2-54 and 56 are rejected due to their dependency on claim 1.
Claim 55 has been analyzed and is rejected consistently with claim 1 above because claim 55 recites deficient features similar to those recited in claim 1 above.
Examiner Comment
Claims 1 and 55 recite (1) “a first processing block” that “receive[s] the first data” derived from the “bitstream” and “output[s] a first output data” comprising “channels”, wherein “the first output data”, as “second data”, is further “combined” and “synthesized” “to obtain the audio signal”. Claims 1 and 55 further recite (2) that “the first processing block comprises” “at least one preconditioning learnable layer” configured “to receive the bitstream”, and further comprises a “styling element, configured to apply … parameters to the first data”, with no recited relationship or contribution to “obtain the audio signal”. Claims 1 and 55 further recite (3) that “the first processing block” is configured to “up-sample the first data …”. Therefore, it is reasonable to conclude that claims 1 and 55 recite three distinct subject matters: (1) obtaining “the audio signal” by receiving “first data” derived from the “bitstream”; (2) “applying … parameters” to “the first data” by receiving the “bitstream”; and (3) up-sampling the “first data” derived from the “bitstream”. Accordingly, the broadest reasonable interpretation (BRI) is applied in terms of the application of prior art; see MPEP 2111.
Claim 1 further recites “audio decoder” in the preamble, and claim 55 recites a “method for decoding an audio signal”, but the claim bodies appear to “obtain [the] audio signal”, perform “up-sampling”, and “apply … parameters” derived from a representation of the “audio signal” (the bitstream), which appears to have nothing to do with “decoding” at all; a BRI is therefore also applied in terms of the prior art application.
Claims 1 and 55 are interpreted as Markush-type claims because they recite multiple alternatives joined by “or”, such as “from [an] external or internal source or from the bitstream” and “to the first data derived from the input signal or a normalized first data”, in terms of the application of prior art; see MPEP 2117.
Claim 55 broadly recites “outputting, by the first processing block, a first output data comprising a plurality of channels”, which is predicated on no recited input; the claimed “outputting” feature is likewise given its BRI in terms of the prior art application.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-10, 13-20, 22-52, and 56 are rejected under 35 U.S.C. 103 as being unpatentable over Skordilis et al. (US 20210074308 A1, hereinafter Skordilis) in view of Binkowski et al. (US 20210089909 A1, hereinafter Binkowski) and Koishida et al. (US 20210134312 A1, hereinafter Koishida).
Claim 1: Skordilis teaches an audio decoder (title and abstract, ln 1-10, voice decoder 104 in fig. 1, 204 in fig. 2, 304 in fig. 3, 504 in fig. 5A, 505 in fig. 5B, 604 in fig. 6, 704 in fig. 7, and 804 in fig. 8), configured to generate an audio signal (105 in fig. 1) from a bitstream (bitstream of the compressed speech signal, provided by the audio encoder, para 47), the bitstream representing the audio signal (digitized audio signal such as a speech signal, compressed and coded and then provided by the audio encoder 102, para 43-44), the audio signal being subdivided in a sequence of frames (samples of the input speech signal are divided into blocks of N samples each, with the N samples referred to as a frame, para 49), the audio decoder comprising:
a first data provisioner (422 in fig. 4, and 564, 566, 568, etc. in figs. 5A-5B) configured to provide, for a given frame (samples of the input speech signal are divided into blocks of N samples as the frame, para 49), first data (output from the elements 564, 566, 568, etc. in fig. 5A/5B) derived from an input signal (features 541 in fig. 5A/5B) from an external or internal source or from the bitstream (extracted from the compressed bitstream transmitted from the audio encoder, para 63, 89), wherein the first data has multiple channels (the features 541 include linear prediction LP coefficients, line spectral pairs LSPs, line spectral frequencies LSFs, pitch lag with integer or fractional accuracy, pitch gain, pitch correlation, etc., para 64, i.e., channels with different feature data);
a first processing block (neural network model 534 including the frame rate network 442 and the sample rate network 452 in fig. 4, and part of the neural network model 534 in the voice decoder 504 in fig. 5A/5B), configured, for the given frame (sample n for the current prediction sample ƥr(n) and ƥ(n) as input to the neural network model 534 in fig. 5A, or the frame-by-frame processing in fig. 8), to receive the first data (the features 441 to the element 442 in fig. 4, the features 541 to the neural network model 534 in fig. 5A, and the features 841 in the current frame in fig. 8) and to output first output data in the given frame (residual ē(n) to the LTP prediction in fig. 5A for the current sample, para 104, or the current frame n in fig. 8), wherein the first output data comprises a plurality of channels (the features 541 include linear prediction LP coefficients, line spectral pairs LSPs, line spectral frequencies LSFs, pitch lag with integer or fractional accuracy, pitch gain, pitch correlation, etc., para 64, i.e., channels with different feature data), and
a second processing block (including the LTP Engine 522 and the Short-term LP Engine 520 in fig. 5A), configured, for the given frame (frame n in fig. 8, para 36, and similar processing of sample n in fig. 5A/5B, para 73, 79, 88, etc.), to receive, as second data (the residual signal ē(n) to the LTP prediction in fig. 5A, and n as the entire current frame in fig. 8), the first output data or data derived from the first output data (input ē(n) to the adder in LTP Engine 822 and then to Short-term LP Engine 820, similar to “+” in figs. 5A/5B, para 82, 88),
wherein the first processing block (the elements 442 and 452 in Fig. 4) comprises:
at least one preconditioning learnable layer (a first convolutional 1x3 layer 444 in fig. 4) configured to receive the bitstream (through the features 441 based on the output of the voice encoder in fig. 3) and, for the given frame, output a target data (output from the element 444 in fig. 4) representing the audio signal in the given frame with multiple channels and multiple samples for the given frame (conditioning layers 444 with a filter size of 3, resulting in a receptive field of five frames, para 78);
at least one conditioning convolutional learnable layer (the second convolutional layer 446 in fig. 4, with the same size as element 444, para 78) configured, for the given frame (three filters for different frames, para 78), to process the target data (output from the element 444 and input to the convolutional layer 446 in fig. 4) by convolution to obtain a plurality of conditioning feature parameters (conditioning vector f as a 128-dimensional vector, para 78) for the given frame (output from element 446, spanning five frames, two ahead, two back, and the current frame, para 78); and
a styling element (including the concatenation layer 454, GRUs 456, 458, etc., in fig. 4), configured to apply the plurality of conditioning feature parameters (the conditioning vector f with 128 dimensions) to the first data or normalized first data derived from the input signal (Markush rule applied, MPEP 2117, through at least 420 and ƥ(n) returned from the short-term LP engine 520 in fig. 5A/5B, and also the pitch estimation engine 566 and pitch gain estimation engine 564 from the features 541 in figs. 5A/5B); and wherein the second processing block is configured to combine the plurality of channels of the second data (through the adders in fig. 5A/5B) to obtain the audio signal (Ŝ(n) as the audio signal in figs. 5A/5B, and similar to the frame-by-frame processing in fig. 8, para 36).
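For illustration of the mapping above only, the conditioning-plus-styling mechanism recited in claim 1 (a conditioning convolutional layer producing feature parameters that a styling element applies to the first data) may be sketched as follows in Python/PyTorch. All names, layer sizes, and the scale-and-shift form of the styling step are illustrative assumptions of a common technique, not the disclosure of Skordilis or of the application:

    # Illustrative sketch only: conditioning conv layer + styling element.
    # Channel counts and kernel size are hypothetical.
    import torch
    import torch.nn as nn

    class ConditioningAndStyling(nn.Module):
        def __init__(self, c_target=80, c_first=64):
            super().__init__()
            # conditioning convolutional learnable layer: derives, by
            # convolution over the target data, a per-channel scale and
            # bias serving as the conditioning feature parameters
            self.cond = nn.Conv1d(c_target, 2 * c_first, kernel_size=3, padding=1)

        def forward(self, target, first):
            # target: (batch, c_target, T); first: (batch, c_first, T)
            gamma, beta = self.cond(target).chunk(2, dim=1)
            # styling element: apply the conditioning feature parameters
            # to the first data (or to normalized first data)
            return gamma * first + beta

    out = ConditioningAndStyling()(torch.randn(1, 80, 160), torch.randn(1, 64, 160))
    print(out.shape)  # torch.Size([1, 64, 160])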
However, Skordilis does not explicitly teach that the first processing block is configured to up-sample the first data from a first number of samples for the given frame to a second number of samples for the given frame greater than the first number of samples, and does not explicitly teach performing a synthesis to obtain the audio signal.
Binkowski teaches an analogous field of endeavor by disclosing an audio processing device (title and abstract, ln 1-16, and a system 100 in fig. 1), wherein a first processing block is disclosed (generative neural network 110 in fig. 1) and wherein the first processing block is configured to up-sample the first data (the block input 202 in fig. 2, para 46) from a first number of samples for the given frame to a second number of samples for the given frame greater than the first number of samples (a dimensionality of the layer output of each upsampling layer is larger than a dimensionality of the layer input of the upsampling layer, para 29, e.g., 1600 Hz turned to 3200 Hz, or 2x frequency, para 49, 53) for the benefits of improving the performance of the audio decoding (by matching the dimensionality for multiple inputs, para 58, and improving the accuracy of predictions of the next audio output, para 36, 38).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied a first processing block configured to up-sample the first data from the first number of samples for the given frame to the second number of samples for the given frame greater than the first number of samples, as taught by Binkowski, to the first processing block in the audio decoder, as taught by Skordilis, for the benefits discussed above.
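For illustration only, the up-sampling mapped to Binkowski above (a layer whose output dimensionality exceeds its input dimensionality, para 29) is commonly realized with a transposed convolution; the sketch below, with hypothetical channel counts and frame sizes, doubles the number of samples per frame:

    # Illustrative sketch only: learnable 2x up-sampling of per-frame data.
    import torch
    import torch.nn as nn

    up = nn.ConvTranspose1d(in_channels=64, out_channels=64,
                            kernel_size=4, stride=2, padding=1)
    first_data = torch.randn(1, 64, 160)  # first number of samples: 160
    up_sampled = up(first_data)           # second, greater number: 320
    print(up_sampled.shape)               # torch.Size([1, 64, 320])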
However, the combination of Skordilis and Binkowski does not explicitly teach a synthesis being performed to obtain the audio signal.
Koishida teaches an analogous field of endeavor by disclosing an audio processing device (title and abstract, ln 1-17, and a speech enhancement system 100 in fig. 1), wherein
a first data provisioner is disclosed (STFT 112 to provide the phase 130 and the mixed magnitude 110 in fig. 1, the mix including noise, babble, reverberation, stationary and non-stationary sounds, etc., para 2) to provide, for a given frame (windowing of time samples for the STFT is inherent), first data (mixed magnitude 110 in fig. 1) derived from an input signal from an external or internal source or from the bitstream (from the mixed waveform as the bitstream in fig. 1);
a first processing block is disclosed (including element 114, encoder 116, decoder 122, and mask prediction 124 in fig. 1), for the given frame (in the frequency domain, with time windowing used for the STFT in fig. 1), to receive the first data (mixed magnitude 110 in fig. 1) and to output a first output data in the given frame (enhanced magnitude 126 in the time-frequency domain, para 19), wherein the first output data comprises a plurality of channels (multiple time-frequency tiles, inherent in the time-frequency domain above, para 19, i.e., multiple channels), and
a second processing block, configured, for the given frame, to receive, as a second data, the first output data or data derived from the first output data (the enhanced magnitude 126 in fig. 1), wherein
the first processing block (including element 114, encoder 116, decoder 122, and mask prediction 124 in fig. 1) comprises:
at least one preconditioning learnable layer (Log-Mel 114, transforming the mixed magnitude in the time-frequency domain for an encoder 116) configured to receive the bitstream (through the STFT 112 receiving the mixed waveform 108 in fig. 1) and, for the given frame, output a target data (output from the Log-Mel 114 in fig. 1) representing the audio signal in the given frame (representing the mixed magnitude of the mixed waveform through the Log-Mel in fig. 1) with multiple channels and multiple samples for the given frame (in the time-frequency domain, as discussed above);
at least one conditioning convolutional learnable layer (including the ResBlock and upsample/Conv2D of the decoder 122) configured, for the given frame, to process the target data (the output from the Log-Mel 114 is processed through the encoder 116 and decoder 122) by convolution (through multiple upsample/Conv2D layers in the decoder 122 in fig. 1, and channel-wise concatenation for various direct mapping and mask approximation, para 20-21) to obtain a plurality of conditioning feature parameters (mask prediction 124 for multiple noises) for the given frame (the time-frequency domain, as discussed above); and
a styling element (including the convolution dot taking the mixed magnitude and the mask prediction 124 in fig. 1), configured to apply the plurality of conditioning feature parameters (the mask prediction 124 for the target speech of the speaker among the multiple noises in fig. 1, para 20) to the first data or normalized first data derived from the input signal (Markush rule applied, MPEP 2117, applied to the mixed magnitude 110 in fig. 1); and wherein the second processing block is configured to combine the plurality of channels of the second data and perform a synthesis (through the iSTFT, taking the phase 130 and enhanced magnitude 126 in the time-frequency domain as inputs in fig. 1) to obtain the audio signal (enhanced waveform 128 in fig. 1) for the benefits of improving the quality of the generated audio sound (by improving the perceptual quality of a targeted acoustic signal, para 10, 32, and by improving performance in reducing the number of model parameters, para 40).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied a second processing block configured to combine the plurality of channels of the second data and perform a synthesis to obtain the audio signal, as taught by Koishida, to the second processing block configured to combine the plurality of channels of the second data in the audio decoder, as taught by the combination of Skordilis and Binkowski, for the benefits discussed above.
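For illustration only, the combine-and-synthesize step mapped to Koishida above (an enhanced magnitude recombined with the phase 130 and passed through an iSTFT) follows the general pattern sketched below; the mask, window, and transform parameters are hypothetical, not those of Koishida:

    # Illustrative sketch only: magnitude/phase recombination + iSTFT synthesis.
    import torch

    n_fft, hop = 512, 128
    audio = torch.randn(1, 4096)                 # stand-in mixed waveform
    window = torch.hann_window(n_fft)
    spec = torch.stft(audio, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    magnitude, phase = spec.abs(), spec.angle()
    mask = torch.sigmoid(torch.randn_like(magnitude))  # stand-in predicted mask
    enhanced = (magnitude * mask) * torch.exp(1j * phase)
    out = torch.istft(enhanced, n_fft, hop_length=hop, window=window)
    print(out.shape)                             # torch.Size([1, 4096])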
Claim 55 has been analyzed and rejected according to claim 1 above.
Claim 2: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, the second processing block (the discussion above), except that the second processing block is configured to up-sample the second data obtained from the first processing block from a second number of samples for the given frame to a third number of samples for the given frame greater than the second number of samples.
It has been a recognized problem and need in the art, which may include a design need, to achieve prediction accuracy of the audio signal parameters, and there had been a finite number of identified, predictable potential solutions to the sampling processing:
1. the first processing block performs upsampling,
2. the second processing block performs upsampling,
3. both the first and the second processing blocks perform upsampling.
It would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to have pursued the known potential solutions with a reasonable expectation of success, i.e., obvious to try; see MPEP 2141, III.
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to have applied upsampling in both the first processing block and the second processing block, in order to maximize the prediction accuracy of the audio signal production, under the obvious-to-try rationale, to the second processing block in the audio decoder, as taught by the combination of Skordilis, Binkowski, and Koishida, for the benefits discussed above.
Claim 3: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, the audio decoder further configured to reduce the number of channels of the first data derived from the input signal from a first number of channels to a second number of channels of the first output data which is lower than the first number of channels (Skordilis, downmix, para 90; Binkowski, downsampling, para 98; and Koishida, through the iSTFT from the time-frequency domain to the time domain, i.e., inherently from two dimensions or channels to one dimension or channel).
Claim 4: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the second processing block is configured to reduce the number of channels of the first output data, obtained from the first processing block, from a second number of channels to a third number of channels of the audio signal, wherein the third number of channels is lower than the second number of channels (Skordilis, downmix, para 90; Binkowski, downsampling, para 98; and Koishida, from the mixed magnitude having target speech and noise components such as babble, reverberation, stationary and non-stationary sound magnitudes, para 2, to the enhanced magnitude by masking out the noise components, i.e., an inherent downmix).
Claim 5: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 4 above, wherein the audio signal is a mono audio signal (Skordilis, speech signal as mono, para 90; Binkowski, including speech signal, para 130; and Koishida, the target speech from a target speaker, i.e., a mono audio signal, abstract).
Claim 6: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, configured to obtain the input signal from the bitstream (Skordilis, the discussion in claim 1 above, and Binkowski, the discussion in claim 2 above).
Claim 7: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, configured to obtain the input signal from at least one parameter of the bitstream associated to the given frame (Skordilis, the discussion in claim 1 above, and Binkowski, the discussion in claim 2 above).
Claim 8: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, configured to obtain the input signal from at least a parameter indicating the pitch lag of the audio signal, or other pitch data, in the given frame (Skordilis, the discussion in claim 1 above, and Binkowski, the discussion in claim 1 above).
Claim 9: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 8 above, configured to obtain the input signal by multiplication of the pitch lag by the pitch correlation (Skordilis, pitch lag and pitch correlation, para 69, and Binkowski, pitch information, para 23).
Claim 10: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, configured to obtain the input signal from noise (Skordilis, the discussion in claim 1 above, and Binkowski, the discussion in claim 2 above).
Claim 13: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the at least one preconditioning learnable layer is configured to derive the target data from cepstrum data encoded in the bitstream (Skordilis, cepstrum of the speech, para 64).
Claim 14: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the at least one preconditioning learnable layer is configured to derive the target data from at least filter data encoded in the bitstream associated to the given frame (Skordilis, long and short prediction filters, para 50, and Binkowski, filter is applied over an area larger than the length of the filter, para 63).
Claim 15: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 14 above, wherein the filter data comprise a spectral envelope data encoded in the bitstream associated to the given frame (Skordilis, long and short prediction filters, para 50, and Binkowski, filter is applied over an area larger than the length of the filter, para 63).
Claim 16: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the at least one preconditioning learnable layer is configured to derive the target data from at least one of excitation data, harmonicity data, periodicity data, long-term prediction data encoded in the bitstream (Skordilis, long and short prediction filters, para 50, and Binkowski, filter is applied over an area larger than the length of the filter, para 63).
Claim 17: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the at least one preconditioning learnable layer is configured to derive the target data from at least pitch data encoded in the bitstream (Skordilis, long and short prediction filters, para 50, and Binkowski, filter is applied over an area larger than the length of the filter, para 63).
Claim 18: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 17 above, wherein the at least one preconditioning learnable layer is configured to derive the target data at least by multiplying the pitch lag by the pitch correlation (Skordilis, pitch correlation, para 69, and Binkowski, pitch information, para 23).
Claim 19: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 18 above, wherein the at least one preconditioning learnable layer is configured to derive the target data at least by convoluting the multiplication of the pitch lag by the pitch correlation and spectral envelope data (Skordilis, pitch correlation, para 69, and Binkowski, pitch information, para 23).
Claim 20: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the at least one preconditioning learnable layer is configured to derive the target data by at least convoluting the pitch lag, the pitch correlation, and spectral envelope data (Skordilis, pitch correlation, para 69, and Binkowski, pitch information, para 23).
Claim 22: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the target data is a convolution map, and the at least one preconditioning learnable layer is configured to perform a convolution onto the convolution map (Skordilis, convolutions, para 78, and Binkowski, the group of convolutions, i.e., a convolution map, para 31).
Claim 23: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 22 above, wherein the target data comprises cepstrum data of the audio signal in the given frame (Skordilis, mel cepstrum of the speech, para 64).
Claim 24: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the input signal is obtained from at least correlation data of the audio signal in the given frame (Skordilis, pitch correlation, para 64, and Binkowski, pitch information, para 23).
Claim 25: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the target data is obtained from pitch data of the audio signal in the given frame (Skordilis, convolutions, para 78, and Binkowski, the group of convolutions, i.e., a convolution map, para 31).
Claim 26: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the target data comprises a multiplied value obtained by multiplying pitch data of the audio signal in the given frame and correlation data of the audio signal in the given frame (Skordilis, pitch correlation, para 64, and Binkowski, pitch information, para 23).
Claim 27: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the at least one preconditioning learnable layer is configured to perform at least one convolution on a bitstream model obtained by juxtaposing at least one cepstrum data obtained from the bitstream, or a processed version thereof (Skordilis, cepstrum, para 64).
Claim 28: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the at least one preconditioning learnable layer is configured to perform at least one convolution on a bitstream model obtained by juxtaposing at least one parameter obtained from the bitstream (the discussion in claim 1 above).
Claim 29: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the at least one preconditioning learnable layer is configured to perform at least one convolution on a convolution map obtained from the bitstream, or a processed version thereof (Skordilis, convolutions, para 78, and Binkowski, the group of convolutions, i.e., a convolution map, para 31).
Claim 30: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 29 above, wherein the convolution map is obtained by juxtaposing parameters associated to subsequent frames (the discussion in claim 28 above).
Claim 31: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 28 above, wherein at least one of the convolution(s) performed by the at least one preconditioning learnable layer is activated by a preconditioning activation function (Skordilis, convolutions, para 78, and Binkowski, the group of convolutions, i.e., a convolution map, para 31).
Claim 32: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 31 above, wherein the preconditioning activation function is a rectified linear unit, ReLu, function (the discussion in claim 1 above).
Claim 33: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 32 above, wherein the preconditioning activation function is a leaky rectified linear unit, leaky ReLu, function (the discussion in claim 1 above).
Claim 34: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 28 above, wherein the at least one convolution is a non-conditional convolution.
Claim 35: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 28 above, wherein the at least one convolution is part of a neural network (Skordilis, convolutions, para 78, and Binkowski, the group of convolutions, i.e., a convolution map, para 31).
Claim 36: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, further comprising a queue to store frames to be subsequently processed by the first processing block and/or the second processing block while the first processing block and/or the second processing block processes a previous frame (Skordilis, convolutions, para 78, and Binkowski, the group of convolutions, i.e., a convolution map, para 31).
Claim 37: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the first data provisioner is configured to perform a convolution on a bitstream model obtained by juxtaposing one set of coded parameters obtained from the given frame of the bitstream adjacent to the immediately preceding frame of the bitstream (Skordilis, convolutions, para 78, and Binkowski, the group of convolutions, i.e., a convolution map, para 31).
Claim 38: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the conditioning set of learnable layers comprises one or at least two convolution layers (Skordilis, convolutions, para 78, and Binkowski, the group of convolutions, i.e., a convolution map, para 31).
Claim 39: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein a first convolution layer is configured to convolute the target data or up-sampled target data to obtain first convoluted data using a first activation function.
Claim 40: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the conditioning set of learnable layers and the styling element are part of a weight layer in a residual block of a neural network comprising one or more residual blocks (Skordilis, convolutions, para 78, and residual blocks, para 15, and Binkowski, the group of convolutions, i.e., a convolution map, para 31).
Claim 41: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the audio decoder further comprises a normalizing element, which is configured to normalize the first data (Skordilis, normalization, para 91).
Claim 42: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the audio decoder further comprises a normalizing element, which is configured to normalize the first data in the channel dimension (Skordilis, normalization, para 91).
Claim 43: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the audio signal is a voice audio signal (speech signal and the discussion in claim 1 above).
Claim 44: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the target data is up-sampled by a factor of a power of 2 (the discussion in claim 3 above).
Claim 45: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 44 above, wherein the target data is up-sampled by non-linear interpolation (the discussion in claim 3 above).
Claim 46: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the first processing block further comprises: a further set of learnable layers, configured to process data derived from the first data using a second activation function, wherein the second activation function is a gated activation function (Skordilis, softmax activation of a softmax layer 462, para 81).
Claim 47: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 46 above, wherein the further set of learnable layers comprises one or two or more convolution layers (Skordilis, convolutions, para 78, and Binkowski, the group of convolutions, i.e., a convolution map, para 31).
Claim 48: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the second activation function is a softmax-gated hyperbolic tangent, TanH, function (Skordilis, softmax activation of a softmax layer 462, para 81).
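For illustration of the gated activation recited in claims 46 and 48 only: a gated activation multiplies a tanh branch by a gate branch; in the commonly used form the gate is a sigmoid, and a softmax-gated TanH, as claimed, would instead normalize the gate across channels with a softmax. The sketch below is a generic illustration under these assumptions, not the disclosure of Skordilis:

    # Illustrative sketch only: gated TanH activations.
    import torch

    def gated_tanh(x_filter, x_gate):
        # tanh branch elementwise-scaled by a sigmoid gate
        return torch.tanh(x_filter) * torch.sigmoid(x_gate)

    def softmax_gated_tanh(x_filter, x_gate, dim=1):
        # gate normalized across the channel dimension by a softmax
        return torch.tanh(x_filter) * torch.softmax(x_gate, dim=dim)

    y = softmax_gated_tanh(torch.randn(1, 8, 160), torch.randn(1, 8, 160))
    print(y.shape)  # torch.Size([1, 8, 160])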
Claim 49: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 40 above, wherein the first activation function is a leaky rectified linear unit, leaky ReLu, function (Skordilis, softmax activation of a softmax layer 462, para 81).
Claim 50: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein convolution operations run with a maximum dilation factor of 2 (the discussion in claims 1, 3 above).
Claim 51: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, comprising eight first processing blocks and one second processing block (Binkowski, eight blocks in the generator block 200 in fig. 2).
Claim 52: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the first data has its own dimension which is lower than that of the audio signal (Binkowski, through downsampling, para 70).
Claim 56: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the second processing block is configured to perform the synthesis to obtain the audio signal (Skordilis, by using combining or adding in figs. 5A/5B, and Koishida, by using the iSTFT in fig. 1), except for a Pseudo Quadrature Mirror Filter (PQMF) synthesis.
It has been a recognized problem and need in the art, which may include a design choice, to perform the synthesis for obtaining the audio signal by achieving accurate and energy-preserving processing with less distortion, and there had been a finite number of identified, predictable potential solutions to perform the synthesis:
1. using combining or addition for synthesis (Skordilis, additions in figs. 5A/5B) for simplicity and low cost,
2. using the inverse STFT, or iSTFT (Koishida, fig. 1), for achieving better performance in frequency characteristics,
3. PQMF synthesis for high fidelity of sound generation, etc.
It would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to have pursued the known potential solutions with a reasonable expectation of success, i.e., obvious to try; see MPEP 2141, III.
Therefore, it would have been obvious for one having ordinary skill in the art before the effective filing date of the claimed invention to have applied one of the synthesis approaches, including PQMF synthesis, under the obvious-to-try rationale above, to the synthesis in the audio decoder, as taught by the combination of Skordilis, Binkowski, and Koishida, for the benefits discussed above.
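For illustration of the PQMF synthesis option only: a K-band PQMF synthesis filter bank up-samples each subband by K and filters it with cosine-modulated versions of a prototype lowpass filter before summation. The sketch below uses one common modulation formula with hypothetical parameters (band count, tap count), not the design of any cited reference or of the application:

    # Illustrative sketch only: K-band PQMF analysis and synthesis.
    import numpy as np
    from scipy.signal import firwin, upfirdn

    K, taps = 4, 64
    proto = firwin(taps, 1.0 / (2 * K))   # prototype lowpass, cutoff pi/(2K)
    n = np.arange(taps)
    # cosine modulation with +/- pi/4 phase offsets (one common formulation)
    analysis = np.array([2 * proto * np.cos((2 * k + 1) * np.pi / (2 * K)
                         * (n - (taps - 1) / 2) + (-1) ** k * np.pi / 4)
                         for k in range(K)])
    synthesis = analysis[:, ::-1]         # time-reversed analysis filters

    x = np.random.randn(1024)
    subbands = [upfirdn(analysis[k], x, down=K) for k in range(K)]  # analysis
    y = sum(upfirdn(synthesis[k], s, up=K) for s in subbands)       # synthesis
    print(len(x), len(y))  # output is delayed/padded by roughly the filter length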
Claims 11-12, 21, and 53-54 are rejected under 35 U.S.C. 103 as being unpatentable over Skordilis (above) in view of Binkowski (above), Koishida (above), and Arik et al. (US 20190180732 A1, hereinafter Arik).
Claim 11: the combination of Skordilis, Binkowski, and Koishida further teaches, according to claim 1 above, wherein the at least one preconditioning learnable layer is configured to provide the target data (discussed in claim 1 above), except that the target data is a spectrogram.
Arik teaches an analogous field of endeavor by disclosing an audio signal processing device (title and abstract, ln 1-19, and a device in fig. 19) wherein target data is disclosed to be a spectrogram (spectrogram as input to one channel, having a corresponding ground truth waveform as output from the CNN, para 100) for the benefits of improving audio quality (high quality of the speech signal, para 39-40).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied the target data being a spectrogram, as taught by Arik, to the target data in the audio decoder, as taught by the combination of Skordilis, Binkowski, and Koishida, for the benefits discussed above.
Claim 12: the combination of Skordilis, Binkowski, Koishida, and Arik further teaches, according to claim 1 above, wherein the at least one preconditioning learnable layer is configured to provide the target data as a mel-spectrogram (Arik, mel-spectrogram, para 87).
Claim 21: the combination of Skordilis, Binkowski, Koishida, and Arik further teaches, according to claim 1 above, wherein the at least one preconditioning learnable layer is configured to derive the target data from LPC coefficients, spectrogram-based coefficients and/or cepstrum-based coefficients obtained from the bitstream (the discussion in claim 12 above).
Claim 53: the combination of Skordilis, Binkowski, Koishida, and Arik further teaches, according to claim 1 above, wherein the target data is a spectrogram (the discussion in claim 11 above).
Claim 54: the combination of Skordilis, Binkowski, Koishida, and Arik further teaches, according to claim 1 above, wherein the target data is a mel-spectrogram (the discussion in claim 12 above, and Koishida, log mel in the time-frequency domain to obtain a spectrogram).
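For illustration of the (mel-)spectrogram target data addressed in claims 11-12 and 53-54 only, a log-mel-spectrogram is commonly derived from a waveform as sketched below; the sampling rate, FFT size, hop length, and mel band count are hypothetical, not the parameters of Arik, Koishida, or the application:

    # Illustrative sketch only: deriving a log-mel-spectrogram as target data.
    import numpy as np
    import librosa

    audio = np.random.randn(16000).astype(np.float32)  # 1 s at 16 kHz
    mel = librosa.feature.melspectrogram(y=audio, sr=16000, n_fft=512,
                                         hop_length=128, n_mels=80)
    log_mel = np.log(mel + 1e-6)  # log compression, as in a Log-Mel layer
    print(log_mel.shape)          # (80, 126): 80 mel channels x 126 frames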
Response to Arguments
Applicant's arguments filed on October 16, 2025 have been fully considered but are moot in view of the new ground(s) of rejection necessitated by the applicant's amendment. Although a new ground of rejection has been used to address additional limitations that have been added to at least claims 1 and 55, a response is considered necessary for several of applicant's arguments, since the references Skordilis and Binkowski will continue to be used to meet several claimed limitations.
With respect to the prior art rejection of independent claim 1, and similarly claim 55, under 35 U.S.C. § 103, as set forth in the Office Action, applicant, with respect to the claimed “the first data comprises multiple channels” provided by “a first data provisioner” configured “to provide, for a given frame, first data derived from an input signal …”, argued that “with the invention the audio is generated frame-by-frame, generating one frame and not one single sample of audio for each inference call”, that the invention is “generating one frame of audio for each inference call”, and that “Skordilis's engines 420 and 422 cannot correspond to the claimed first data provisioner 702, since the first data provisioner 702 provides data for a frame with multiple channels”, as asserted in paragraph 1 of page 14 of the Remarks filed on October 16, 2025.
In response to the argument above, the Office respectfully disagrees because the claim broadly recites “first data” output by the “first data provisioner” with no further limitation on how the “deriving” is performed, and there is no recitation of any physical characteristic of the claimed “multiple channels”; thus, a BRI is given to the claimed and argued “multiple channels”, which are mapped to Skordilis's features 541 having multiple features (pitch gain, pitch correlation, fractional accuracy, line spectral pairs LSPs, LP coefficients, etc., para 64) as the broadly claimed “multiple channels”. Applicant is silent on this mapping; thus, the argument is moot. Applicant appears to improperly interpret the broadly recited “multiple channels” as audio channels. If such features are disclosed in the specification, it is recommended to amend the claims to clarify the physical characteristics of the argued “multiple channels”.
Applicant further challenged the claimed feature “a styling element” by arguing that “Skordilis's concatenation layer 454 which is in the sample rate network 452 cannot correspond to the claimed styling element 77” because “Skordilis's concatenation layer 454 is not in the frame rate network 442, allegedly corresponding to the first processing block 40, 50: it contradicts claim 1, requiring the styling element to be part of the first processing block 40”, and thus “Skordilis's concatenation layer 454 which is in the sample rate network 452 cannot correspond to the claimed styling element 77”, as asserted in paragraph 1 of page 14 and paragraph 1 of page 15 of the Remarks filed on October 16, 2025.
In response to the argument above, the Office further disagrees because Skordilis teaches not only sample-by-sample processing but also frame-by-frame processing (fig. 8, para 36, wherein n represents a frame, while n represents a sample in figs. 4, 5A-5B), and applicant is silent regarding Skordilis's frame-by-frame processing embodiment above. Regarding the argued “first processing block” and Skordilis's concatenation layer, as discussed in the Office Action, the claimed “first processing block” is mapped to both of Skordilis's elements 442 and 452, which are included in Skordilis's neural network model box 534 (fig. 5A/5B), contrary to the argument that element 452 is not in the “first processing block”; the argued concatenation layer of Skordilis is within the neural network model box 534 mapped to the claimed “first processing block”, specifically within the element 452, as part of the “first processing block”. Thus, the argument, premised on element 452 not being in the “first processing block”, is also moot.
Applicant further challenged the claimed “upsample the first data derived from the input signal” over the prior art Binkowski, arguing that “Binkowski fails to provide any valuable teaching towards the upsampling of the first data” because “Binkowski presents two inputs, … i) a block input 202 and ii) a noise input 204. Hence: the block input 202 corresponding to the bitstream 3, which is subsequently processed to become the claimed target data 12; and ii) the noise input 204 corresponds to the claimed input signal, e.g., noise, 14, … as shown by fig. 2 of Binkowski, it is the block input 202 … which is upsampled, not the noise input 204 allegedly corresponding to the input signal 14/first data 15”, etc., and that “claim 1 requires to upsample the first data 15, not the target data 12 …”, as asserted in paragraphs 3-4 of page 15 of the Remarks filed on October 16, 2025.
In response to the argument above, the Office further disagrees because, (1) referring to drawing figure 7c of the application, with the claimed elements mapped to the application specification under 35 U.S.C. 112(f) as discussed in the Office Action above, the claimed and disclosed “first data” from the “first data provisioner” may be taken either from a noise signal or from the bitstream through pitch data as a first option (3b); indeed, claim 1 recites a Markush-style feature, “first data derived from an input signal from an external or internal source or from the bitstream”, i.e., the up-sampling does not have to be applied to the “noise” of the “external signal” as the second option (fig. 7c), but can be applied to data from the bitstream (3), which is mapped to Binkowski's block input 202 and up-sampled accordingly (through the upsample 222 applied to h as representative of 202 in fig. 2). Applicant is also silent on this point. Applicant appears to improperly interpret the broadly claimed and disclosed Markush-style feature as non-Markush-style against the prior art; therefore, the argument above is also moot.
Because of the new ground(s) of rejection necessitated by the applicant's amendment, as set forth in the Office Action above, the prior art rejection of claim 1, and similarly claim 55, under 35 U.S.C. 103 is addressed in the Office Action above over Skordilis in view of Binkowski and the newly added prior art Koishida.
In the response to this Office Action, the Office respectfully requests that support be shown for language added to any original claims on amendment and for any new claims. That is, indicate support for newly added claim language by specifically pointing to the page(s) and line numbers in the specification and/or drawing figure(s). This will assist the Office in prosecuting this application.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.