Last updated: May 29, 2026
Application No. 18/256,285
SOUND SIGNAL PROCESSING METHOD AND APPARATUS, AND ELECTRONIC DEVICE

Non-Final OA §103
Filed: Jun 07, 2023
Priority: Dec 08, 2020 (CN 202011462091.2, +1 more)
Examiner: TENGBUMROONG, NATHAN NARA
Art Unit: 2654
Tech Center: 2600 (Communications)
Assignee: BEIJING YOUZHUJU NETWORK TECHNOLOGY CO., LTD.
OA Round: 2 (Non-Final)
This examiner grants 47% of cases after interview (+26.7% interview lift). A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 19 resolved cases, 2023–2026
Examiner Intelligence

TENGBUMROONG, NATHAN NARA
Career Allowance Rate: grants 47% of resolved cases (9 granted / 19 resolved), -14.6% vs TC avg
Interview Lift: +26.7% (resolved cases with interview vs. without)
Typical timeline: 3y 0m avg prosecution, 21 currently pending
Statute-Specific Performance

§103: 98.3% (+58.3% vs TC avg)
§102: 1.7% (-38.3% vs TC avg)
Based on career data from 19 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Claims 1, 9, and 14-15 are amended. Claims 1-12 and 14-21 are presented for examination.

Response to Arguments
Rejection under 35 U.S.C. 103
Applicant’s arguments have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5, 8-9, 11, 14-15, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Tashev et al. (US 20190318755 A1; hereinafter referred to as Tashev) in view of Manning et al. (US 20170140260 A1; hereinafter referred to as Manning).
Regarding claim 1, Tashev teaches: a sound signal processing method, comprising: importing first frequency spectrum data corresponding to first audio data ([0081] signals in the spectrogram may be continuous along the time dimension, and they may also have similar values in adjacent frequency bins. Thus, convolutional neural networks may be applied to efficiently extract local patterns from an input spectrogram) into a pre-trained sound processing model to obtain a processing result ([0009] calculating, for each approximate speech signal estimation, a clean speech estimation and at least one additional target including an ideal ratio mask using a trained neural network model; and calculating, for each frequency bin, a final clean speech estimation using the calculated at least one additional target including the calculated ideal ratio mask and the calculated clean speech estimation);
and generating, based on the processing result, pure audio data corresponding to the first audio data ([0108] Once the neural network model is trained, at step 616, the trained neural network model configured to output the clean speech estimation and the ideal ratio mask of audio data may be outputted).
Tashev does not explicitly, but Manning discloses: wherein the sound processing model comprises at least one preset convolution layer ([0030] The convolutional neural networks 110, 120, and 130 may use any suitable neural network architectures, including, for example, any suitable number of convolutional layers), and operations performed by using the preset convolution layer comprises: performing, based on a first convolution kernel group, a convolution operation on a first sound spectrum feature map ([0039] The audio spectrogram generated by the input converter 105 may be input to a convolutional neural network, such as the convolutional neural network 110. The audio spectrogram may be input to a convolution layer 305 of the convolutional neural network 110. The convolution layer 305 may be implemented in any suitable manner on the computing device 100, and may implement any suitable filter, kernel, or feature detector. The convolution layer 305 may generate a feature map for the audio spectrogram. The first sound spectrum feature map can be a mel spectrogram.) inputted into the preset convolution layer, to obtain a second sound spectrum feature map ([0016] The one-dimensional convolution of the spectrogram may produce a feature map. The convolutional neural network may include any suitable number of convolutional layers, implementing any suitable filters, kernels, or feature detectors, which may be applied to the spectrogram in any suitable order, in any suitable combination of iteratively and consecutively. For example, a first convolutional layer may include two filters which may each produce a feature map);
and combining, based on a second convolution kernel group, the obtained second sound spectrum feature map ([0016] second convolutional layer may include three additional filters which may each produce a feature map from the two processed feature maps produced by the first convolutional layer), to obtain a third sound spectrum feature map corresponding to the second convolution kernel group ([0016] This may result in a total of six feature maps which may be input to additional layers of the convolutional neural network. The convolutional neural network may use any suitable convolutions implemented by any suitable convolutional layer.),
the first convolution kernel groups are in one-to-one correspondence with the first sound spectrum feature maps ([0016] The convolutional neural network may include any suitable number of convolutional layers, implementing any suitable filters, kernels, or feature detectors, which may be applied to the spectrogram in any suitable order, in any suitable combination of iteratively and consecutively. For example, a first convolutional layer may include two filters which may each produce a feature map. Each feature map may be further processed by the convolutional neural network);
the first convolution kernel group comprises at least two first convolution kernels ([0039] the convolution layer 305 may implement more than one filter, kernel, or feature detector, and may generate more than one feature map from the audio spectrogram).
 Tashev and Manning are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Tashev to combine the teachings of Manning because doing so would allow the convolutional neural network to increase the signal to noise ratio, leading to a clean audio signal (Manning [0016] A one-dimensional convolution may be performed along the time axis of a spectrogram, for example by a convolutional layer of the convolutional neural network. The spectrogram may be, for example, an MFC, mel spectrogram, or any other suitable spectrogram, representing any suitable length of audio. For example, the spectrogram may be an MFC representing 30 seconds of a song. This one-dimensional convolution may smooth the spectrogram along the time axis and increase the signal to noise ratio).
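For orientation, the two-stage operation recited in claim 1 (a first convolution kernel group applied to a first feature map, followed by a second kernel group that combines the resulting maps) can be sketched in plain Python. This is only one illustrative reading, not the applicant's implementation or that of either reference. The function names, the kernel values, the time-axis convolution, and the weighted-sum combination are all assumptions made for the example.

```python
# Illustrative sketch only: names, kernel values, and the weighted-sum
# combination step are assumptions, not taken from the cited references.

def conv1d_time(feature_map, kernel):
    """Slide `kernel` along the time axis of every frequency row (valid mode)."""
    k = len(kernel)
    return [
        [sum(row[t + i] * kernel[i] for i in range(k))
         for t in range(len(row) - k + 1)]
        for row in feature_map
    ]


def combine_maps(maps, weights):
    """Combine several feature maps into one by a per-map weighted sum."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, maps)) for c in range(cols)]
        for r in range(rows)
    ]


# First feature map: 2 frequency bins x 4 time frames.
first_map = [[1.0, 2.0, 3.0, 4.0],
             [0.0, 1.0, 0.0, 1.0]]

# First kernel group: two kernels, each yielding one second feature map.
first_group = [[0.5, 0.5], [1.0, -1.0]]
second_maps = [conv1d_time(first_map, k) for k in first_group]

# Second kernel group: each weight vector combines the second maps into
# one third feature map (one output channel per weight vector).
second_group = [[1.0, 0.0], [0.5, 0.5]]
third_maps = [combine_maps(second_maps, w) for w in second_group]
```

In this reading, the number of third feature maps equals the number of weight vectors in the second group, which is how the output channel count is fixed in the sketch.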

	Regarding claim 5, Tashev in view of Manning teaches: the method according to claim 1. Tashev further teaches: wherein a number of convolution kernels in the first convolution kernel group is determined according to a length of a frequency dimension of the first sound spectrum feature map and a first step size ([0084] because of local similarity of a spectrogram in adjacent frequency bins, when convolving with the kernel z, a stride of size b/2 may be used along the frequency dimension).
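As a rough illustration of how a frequency-axis length and a step size can jointly fix a count, the sketch below uses the standard no-padding sliding-window formula with the stride of b/2 quoted from Tashev [0084]. Whether this count is what claim 5 means by "a number of convolution kernels" is an assumption. The function name and the sample sizes (256 bins, b = 8) are hypothetical.

```python
def sliding_positions(freq_len, kernel_width, stride):
    """Number of positions a kernel occupies along the frequency axis
    when slid without padding (standard valid-convolution count)."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    if kernel_width > freq_len:
        raise ValueError("kernel wider than the frequency axis")
    return (freq_len - kernel_width) // stride + 1


# Hypothetical sizes: 256 frequency bins, kernel width b = 8, stride b / 2.
b = 8
count = sliding_positions(freq_len=256, kernel_width=b, stride=b // 2)
```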

	Regarding claim 8, Tashev in view of Manning teaches: the method according to claim 1. Tashev further teaches: wherein the method is applied to a terminal device, and the sound processing model is provided on the terminal device ([0113-0114] FIG. 7 depicts a high-level illustration of an exemplary computing device 700 that may be used in accordance with the systems, methods, modules, and computer-readable media disclosed herein, according to embodiments of the present disclosure. For example, the computing device 700 may be used in a system that processes data, such as audio data, using a deep neural network, according to embodiments of the present disclosure. The computing device 700 may include at least one processor 702 that executes instructions that are stored in a memory 704... The computing device 700 also may include an output interface 712 that interfaces the computing device 700 with one or more external devices. For example, the computing device 700 may display text, images, etc. by way of the output interface 712).

Regarding claim 9, Tashev in view of Manning teaches: the method according to claim 1. Tashev further teaches: wherein the processing result comprises mask data ([0007] a clean speech estimation and at least one additional target including an ideal ratio mask using a trained neural network model), and the generating, based on the processing result, pure audio data corresponding to the first audio data comprises: generating second frequency spectrum data based on the mask data and the first frequency spectrum data ([0008] calculating, for each frequency bin, a final clean speech estimation using the calculated at least one additional target including the calculated ideal ratio mask and the calculated clean speech estimation); 
and converting the second frequency spectrum data into time domain data ([0105] Further, after processing the audio data, if the audio data is represented in a conversion domain, the audio data may be converted into the time domain) to obtain the pure audio data ([0109] the convolutional component may exploit local patterns in a spectrogram in both a spatial domain and a temporal domain, the at least one bidirectional recurrent component may model dynamic correlations between consecutive frames, and the fully-connected layer may predict clean spectrograms).
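The two steps mapped for claim 9 (apply mask data to the first frequency spectrum data, then convert the result to the time domain) can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions, not the method of Tashev or of the application: it uses a naive discrete Fourier transform in place of a real short-time transform, a real-valued per-bin mask, and hypothetical function and variable names.

```python
import cmath

# Illustrative single-frame sketch; a practical system would use an STFT
# with overlapping windows rather than this naive O(n^2) transform.

def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse transform; keeps the real part as the time-domain frame."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]


def apply_mask(spectrum, mask):
    """Scale each frequency bin by its mask value."""
    return [m * s for m, s in zip(mask, spectrum)]


noisy_frame = [0.0, 1.0, 0.0, -1.0]
first_spectrum = dft(noisy_frame)
mask = [1.0, 1.0, 1.0, 1.0]          # all-pass mask, chosen for illustration
second_spectrum = apply_mask(first_spectrum, mask)
clean_frame = idft(second_spectrum)
```

With the all-pass mask the frame is recovered up to rounding error. A mask that is not symmetric across the positive and negative frequency bins would yield a complex signal whose imaginary part this sketch simply discards.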

Regarding claim 11, Tashev in view of Manning teaches: the method according to claim 1. Tashev further teaches: wherein the processing result comprises pure frequency spectrum data ([0035] A convolutional-recurrent neural network may include three components: (i) a convolutional component, which may exploit local patterns in a spectrogram in both spatial and temporal domain; then (ii) a bidirectional recurrent component, which may model dynamic correlations between consecutive frames; and finally (iii) a fully-connected layer that predicts clean spectrograms), and the generating, based on the processing result, pure audio data corresponding to the first audio data comprises: converting the pure frequency spectrum data into time domain data to obtain the pure audio data ([0105] after processing the audio data, if the audio data is represented in a conversion domain, the audio data may be converted into the time domain).

Regarding claim 14, Tashev teaches: an electronic device, comprising: at least one processor; and a storage device configured to store at least one program, wherein the at least one program, when executed by the at least one processor, causes the at least one processor to... ([0118] computing device 802 may include at least one processor and a memory; the at least one processor executes instructions that are stored in the memory). The rest of the claim recites similar limitations as claim 1 and therefore is rejected similarly.

Regarding claim 15, Tashev teaches: a non-transitory computer-readable medium, on which a computer program is stored ([0123] the functions may be stored on and/or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media may include computer-readable storage media. A computer-readable storage media may be any available storage media that may be accessed by a computer). The rest of the claim recites similar limitations as claim 1 and therefore is rejected similarly.

Regarding claim 19, it recites similar limitations as claim 5 and therefore is rejected similarly.

Claims 2 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Tashev in view of Manning and in further view of Tacer et al. (US 11985179 B1; hereinafter referred to as Tacer).
Regarding claim 2, Tashev in view of Manning teaches: the method according to claim 1. Tashev in view of Manning does not explicitly, but Tacer teaches: wherein a number of the first convolution kernel group matches a number of the first sound spectrum feature map inputted into the preset convolution layer ([col 24, lines 25-32] During a first convolution operation 950a, a first kernel 952a (e.g., 1x5 kernel, although the disclosure is not limited thereto) may be applied to the input audio data 945a to generate feature map data 954a. The feature map data 954a may have the same dimensions as the input audio data 945a and may include multiple different feature maps (e.g., channels)), and a number of the second convolution kernel group matches a number of an output channel ([col 25, lines 32-42] During a second convolution operation 960b, a second kernel 962b (e.g., 1x3 kernel, although the disclosure is not limited thereto) may be applied to the feature map data 954b to generate feature map data 964b. During a third convolution operation 970b, a third kernel 972b (e.g., 1x3 kernel, although the disclosure is not limited thereto) may be applied to the feature map data 964b to generate channel data 974b. The channel data 974b may have dimensions equal to the third dimensions (e.g., d=(KxN)), along with four channels (e.g., F=4), resulting in first tensor dimensions (e.g., (Fxd) or (Fx(KxN)))).
Tashev, Manning, and Tacer are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Tashev and Manning to combine the teachings of Tacer because doing so would improve audio quality by performing bandwidth extension using a convolutional neural network (Tacer [col 26, lines 12-19] performing bandwidth extension corresponds to using cascaded subpixel CNNs to improve the quality of audio and extend the bandwidth of a narrowband input signal. Thus, the bandwidth extension component 122 may recreate high-quality audio data from low-quality, down-sampled input audio data that includes only a small fraction of the original samples).

Regarding claim 16, it recites similar limitations as claim 2 and therefore is rejected similarly.

Claims 3-4 and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Tashev in view of Manning and in further view of Quader et al. (US 20220114424 A1; hereinafter referred to as Quader). 
Regarding claim 3, Tashev in view of Manning teaches: the method according to claim 1. Tashev in view of Manning does not explicitly, but Quader teaches: wherein the first convolution kernel group comprises at least two first convolution kernels ([0028] FIG. 1C is a schematic diagram of a conventional convolution layer of convolution block showing the dimensions of an input data array, an output data array, and a set of convolution kernels applied by the convolution block), and the performing, based on a first convolution kernel group, a convolution operation on a first sound spectrum feature map inputted into the preset convolution layer, to obtain a second sound spectrum feature map comprises: performing, according to a first correspondence, the convolution operation on the first sound spectrum feature map by using the first convolution kernels in the first convolution kernel group, to obtain the second sound spectrum feature map ([0051] The convolution block 124 receives an input activation map (e.g., the preprocessed input data) and performs convolution operations, using convolution kernels, to generate an output activation map. As will be discussed further below, the convolution kernels (which may also be referred to as filter kernels or simply filters) each include a set of weights), wherein the first correspondence indicates a correspondence between the first convolution kernel and a frequency of the first sound spectrum feature map ([0017] for each set of convolution kernels, prior to convolving each subset of input channels with its respective set of convolution kernels: learning a set of frequency-based attention multipliers, applying the set of frequency-based attention multipliers to the weights of the set of convolution kernels). 
Tashev, Manning, and Quader are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Tashev and Manning to combine the teachings of Quader because doing so would improve the performance of the convolutional neural network by implementing a multi-bandwidth separated feature extraction layer (Quader [0011] Performance of a CNN that includes a multi-bandwidth separated feature extraction convolution layer in accordance with the present disclosure may be improved, including increasing accuracy of the CNN, decreasing memory use, and/or decreasing computation cost for performing the operations of the CNN).

Regarding claim 4, Tashev in view of Manning teaches: the method according to claim 1. Tashev in view of Manning does not explicitly, but Quader teaches: wherein the second convolution kernel group comprises at least two second convolution kernels, and the combining, based on a second convolution kernel group, the obtained second sound spectrum feature map, to obtain a third sound spectrum feature map corresponding to the second convolution kernel group comprises: combining, according to a second correspondence, the obtained second sound spectrum feature map by using the second convolution kernels in the second convolution kernel group, to obtain the third sound spectrum feature map corresponding to the second convolution kernel group, wherein the second correspondence indicates a correspondence between the second convolution kernel and a frequency of the second sound spectrum feature map ([0054] The convolution block 124 may include several layers, including one or more multi-bandwidth feature extraction convolution layers, conventional convolution layers, and other layers, such as an activation layer, a batch normalization layer, and so on. The classification head 126 may include one or more layers, such as one or more fully connected layers, a SoftMax layer, and so on. The CNN 120 may include more than one convolution block, as well as additional blocks or layers. It will be appreciated that the structure of the CNN 120 shown in FIG. 1B is intended as a simplified representation of a CNN. The process recited in claim 3 can be repeated.).

Regarding claim 17, it recites similar limitations as claim 3 and therefore is rejected similarly.

Regarding claim 18, it recites similar limitations as claim 4 and therefore is rejected similarly.

Claims 6 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Tashev in view of Manning and in further view of Chen et al. (US 20210182077 A1; hereinafter referred to as Chen).
Regarding claim 6, Tashev in view of Manning teaches: the method according to claim 1. Tashev in view of Manning does not explicitly, but Chen teaches: wherein a receptive field of a first convolution kernel is determined based on a candidate sampling position ([2166] a computation process is to use each convolution kernel to slide in the H and W dimensions of the input data, and when the convolution kernel slides to each position, an inner product computation is performed on the convolution kernel and corresponding input data of the position. The input data is extracted and rearranged according to C*Kh*Kw pieces of data corresponding to each position where the convolution kernel slides. It is assumed that there are Kernum sliding positions of convolution kernel, the convolution layer computes a sample of batch processing. In a case where N>1, the same computation is performed on each sample) and a preset position offset parameter ([3653] sets two additional flag bits including a flag bit offset and a flag bit EL for data of the same layer and the same type in the neural network, such as all the weight data of a first convolution layer).
Tashev, Manning, and Chen are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Tashev and Manning to combine the teachings of Chen because doing so would improve the efficiency in training the convolutional neural network using weights (Chen [3332] By fully exploiting the characteristics of weight distribution of the neural network, low-bit quantized weights are obtained, which may greatly improve the processing speed and reduce the weight storage overhead and memory access overhead).
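The sliding-window computation quoted from Chen [2166] (extract and rearrange the input data under each kernel position, then take an inner product with the kernel) resembles what is commonly called im2col. A single-channel sketch follows. It is an assumption-laden illustration, not Chen's hardware method: it ignores the channel (C) and batch (N) dimensions, and it does not model the position offset parameter recited in claim 6. All names are hypothetical.

```python
def extract_patches(image, kh, kw):
    """Flatten the kh x kw window under every sliding position (row-major)."""
    h, w = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(kh) for j in range(kw)]
        for r in range(h - kh + 1)
        for c in range(w - kw + 1)
    ]


def conv_by_inner_product(image, kernel):
    """One inner product between the flattened kernel and each patch."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    return [sum(a * b for a, b in zip(patch, flat_kernel))
            for patch in extract_patches(image, kh, kw)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
outputs = conv_by_inner_product(image, kernel)
```

For this 3x3 input and 2x2 kernel there are four sliding positions, so the result holds four values, one per position.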

Regarding claim 20, it recites similar limitations as claim 6 and therefore is rejected similarly.

Claims 7 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Tashev in view of Manning and in further view of Zhou et al. (US 20220223125 A1; hereinafter referred to as Zhou).
Regarding claim 7, Tashev in view of Manning teaches: the method according to claim 1. Tashev in view of Manning does not explicitly, but Zhou teaches: wherein the sound processing model comprises at least one self-attention layer, and the self-attention layer is arranged subsequent to the at least one preset convolution layer, and an operation performed by using the self-attention layer comprises: for each sound spectrum feature map output by the preset convolution layer, re-evaluate, based on a value of each position in the sound spectrum feature map and values of other positions in the sound spectrum feature map, the value of the position ([0061] a linear/non-linear transformation may be performed on a set of convolutional feature maps x corresponding to the vector representation of the notes in a song, respectively, to obtain, for example, a set of transformed x1, x2, x3. Next, x1 may be transposed and matrix-multiplied with x2, and the multiplication result may be normalized by Softmax to obtain the attention map. Based on the notes or chords, the attention map may be matrix-multiplied with x3 to obtain a set of self-attention feature maps. The convolutional feature maps must be first determined by a convolutional layer.).
Tashev, Manning, and Zhou are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Tashev and Manning to combine the teachings of Zhou because doing so would allow for the addition of a self-attention layer to better understand the dependencies in an input audio (Zhou [0063] The result of the matrix multiplication 430 is a weight matrix that represents the distance between the notes of the song 420 and the semantics of the words of the text 410, which further forms the attention map 440. Matrix multiplication 450 may then be performed on the attention map 440 and the converted song vector 428 to further identify words that are most suitable or relevant for each note in the song).
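The sequence quoted from Zhou [0061] (transpose x1, multiply with x2, normalize with Softmax, then multiply the attention map with x3) can be sketched as below. The matrix layout (features by positions), the row-wise Softmax, and all function names are assumptions chosen so that the shapes line up. The passage itself does not fix them.

```python
import math

# Illustrative sketch; the d x n layout and row-wise softmax are assumptions.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)                      # subtract max for stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x1, x2, x3):
    """x1, x2, x3 are d x n matrices (features x positions)."""
    attention_map = softmax_rows(matmul(transpose(x1), x2))   # n x n
    # Each output position becomes a weighted mix of x3 over all positions.
    return matmul(x3, transpose(attention_map))               # d x n


# Zero queries give uniform attention, so every position becomes the mean.
x = [[1.0, 3.0]]
out = self_attention([[0.0, 0.0]], x, x)
```

The value at each position is recomputed from the values at every position, which is the behavior the claim language describes as re-evaluating a position based on the other positions.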

Regarding claim 21, it recites similar limitations as claim 7 and therefore is rejected similarly.

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Tashev in view of Manning, in further view of Koshinaka et al. (US 20220238119 A1; hereinafter referred to as Koshinaka), and in even further view of Mesgarani et al. (US 20190066713 A1; hereinafter referred to as Mesgarani).
Regarding claim 10, Tashev in view of Manning teaches: the method according to claim 9. Tashev in view of Manning does not explicitly, but Koshinaka teaches: wherein the sound processing model is trained by: obtaining a mixed audio sample ([0040] The mixed signal input unit 30 inputs the signal (that is, the mixed signal) including the target signal to be extracted. In the example illustrated in FIG. 2, the mixed audio X_{f,t}^{ms} corresponds to the mixed signal);
importing the mixed audio sample into an untrained sound processing model to generate candidate mask data ([0042] The reconstruction mask estimation unit 42 applies the input anchor signal and mixed signal to the first network, and estimates the reconstruction mask of the class to which the anchor signal belongs. Specifically, the reconstruction mask estimation unit 42 estimates the output of the first network in the neural network as the reconstruction mask);
generating a first loss value based on a label of the mixed audio sample ([0086] a loss calculation unit 84 (for example, the loss calculation unit 46) that calculates the loss function between the class into which the extracted target signal is classified and the true class (for example, the class to which the input anchor signal belongs)) and the candidate mask data ([0043] The signal classification unit 44 applies the mixed signal to the estimated reconstruction mask to extract the target signal, and applies the extracted target signal to the second network to classify the target signal into the class);
and adjusting, based on the first loss value, a parameter of the untrained sound processing model ([0086] a parameter update unit 85 (for example, the parameter update unit 48) that updates the parameter of the first network and the parameter of the second network in the neural network based on the calculation result of the loss function).
Tashev, Manning, and Koshinaka are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Tashev and Manning to combine the teachings of Koshinaka because doing so would allow for a mixed audio signal to be processed and obtain loss values to update the neural network (Koshinaka [0059] the reconstruction mask estimation unit 42 applies the anchor signal and the mixed signal to the first network to estimate the reconstruction mask of the class to which the anchor signal belongs. The signal classification unit 44 applies the mixed signal to the estimated reconstruction mask to extract the target signal, and applies the extracted target signal to the second network to classify the target signal into the class. The loss calculation unit 46 calculates the loss function between the class into which the extracted target signal is classified and the true class, and the parameter update unit 48 updates the parameter of the first network and the parameter of the second network in the neural network based on the calculation result of the loss function. Thereafter, the output unit 50 outputs the updated first network).
The combination of Tashev, Manning, and Koshinaka does not explicitly, but Mesgarani teaches: wherein the label of the mixed audio sample is generated by performing time-frequency transformation on a pure audio sample ([0017] Selecting the one of the plurality of separated sound signals based on the obtained neural signals for the person may include generating an attended speaker spectrogram based on the neural signals for the person, comparing the attended speaker spectrogram to the derived multiple resultant speaker spectrograms to select one of the multiple resultant speaker spectrograms, and transforming the selected one of the multiple resultant speaker spectrograms into an acoustic signal. The separated sound signal is a pure audio sample.) and the mixed audio sample separately ([0155] the speech separation process starts with calculating the short-time Fourier transform to create a time-frequency (T-F) representation of the mixture sound), generating mask data for training based on data obtained through the transformation ([0134] Using the similarity between the embedded points and each attractor, a mask is estimated for each source in the mixture), and determining the mask data for training as the label ([0023] estimating a plurality of reconstructed sounds signals from the derived plurality of mask matrices corresponding to the different groups of the multiple sound sources).
Tashev, Manning, Koshinaka, and Mesgarani are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Tashev, Manning, and Koshinaka to combine the teachings of Mesgarani because doing so would allow for better performance in processing mixed audio signals by using a mask (Mesgarani [0179] Speaker extraction is achieved by applying a weighting function (mask) to the encoder output. The modified encoder representation is then inverted to the sound waveform using a linear decoder. The masks are found using a temporal convolution network comprising of dilated convolutions, which allow the network to model the long-term dependencies of the speech signal. This end-to-end speech separation approach provides good performance in terms of separating speakers in mixed audio).
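The label-generation and loss steps recited in claim 10 can be illustrated with a short sketch that builds a training mask from clean and mixed magnitude spectra and scores a candidate mask against it. The ratio-style mask (clean magnitude over mixture magnitude, capped at 1) and the mean-squared-error loss are assumptions chosen for the example. None of the cited references is stated to use exactly this formulation, and all names are hypothetical.

```python
# Illustrative sketch; the inputs stand for per-bin magnitudes that would
# come from a time-frequency transform of the clean and mixed samples.

def ratio_mask(clean_mag, mixed_mag, eps=1e-8):
    """Per-bin ratio of clean to mixture magnitude, capped at 1."""
    return [min(1.0, c / (m + eps)) for c, m in zip(clean_mag, mixed_mag)]


def mse(pred, target):
    """Mean-squared error between a candidate mask and the label mask."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


clean_mag = [1.0, 0.0, 2.0]
mixed_mag = [2.0, 1.0, 2.0]
label_mask = ratio_mask(clean_mag, mixed_mag)       # used as the label

candidate_mask = [0.5, 0.5, 0.5]                    # untrained model output
first_loss = mse(candidate_mask, label_mask)
```

The loss value would then drive the parameter adjustment step. The sketch stops before that step because the update rule is not specified in the mapped passages.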

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Tashev in view of Manning and in further view of Koshinaka.
	Regarding claim 12, Tashev in view of Manning teaches: the method according to claim 11. Tashev further teaches: importing the mixed audio sample into an untrained sound processing model to generate candidate pure frequency spectrum data ([0076] the convolutional-recurrent neural network may include three components: (i) a convolutional component, which may exploit local patterns in a spectrogram in both spatial and temporal domain; then (ii) a bidirectional recurrent component, which may model dynamic correlations between consecutive frames; and finally (iii) a fully-connected layer that predicts clean spectrograms);
generating a second loss value based on the pure frequency spectrum sample and the candidate pure frequency spectrum data ([0077] the convolutional-recurrent neural network may be trained end-to-end by defining a loss function between the predicted spectrogram and the clean spectrogram);
and adjusting a parameter of the untrained sound processing model based on the second loss value ([0091] the last step may be to define the mean-squared error between the predicted spectrogram ŷ and the clean one y, and optimize all the model parameters simultaneously).
Tashev in view of Manning does not explicitly, but Koshinaka teaches: wherein the sound processing model is trained by: obtaining a mixed audio sample ([0040] The mixed signal input unit 30 inputs the signal (that is, the mixed signal) including the target signal to be extracted. In the example illustrated in FIG. 2, the mixed audio X_{f,t}^{ms} corresponds to the mixed signal), wherein a label of the mixed audio sample includes a pure frequency spectrum sample corresponding to a pure audio sample ([0043] when the mixed signal is the audio stream indicating the utterance of the speaker, the signal classification unit 44 extracts a spectrogram of the speaker as the target signal, and applies the extracted spectrogram to the second network to classify the speaker).
Tashev, Manning, and Koshinaka are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Tashev and Manning to combine the teachings of Koshinaka because doing so would allow for a mixed audio signal to be processed and obtain loss values to update the neural network (Koshinaka [0059] the reconstruction mask estimation unit 42 applies the anchor signal and the mixed signal to the first network to estimate the reconstruction mask of the class to which the anchor signal belongs. The signal classification unit 44 applies the mixed signal to the estimated reconstruction mask to extract the target signal, and applies the extracted target signal to the second network to classify the target signal into the class. The loss calculation unit 46 calculates the loss function between the class into which the extracted target signal is classified and the true class, and the parameter update unit 48 updates the parameter of the first network and the parameter of the second network in the neural network based on the calculation result of the loss function. Thereafter, the output unit 50 outputs the updated first network).
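For readers mapping the claim 12 limitations to an implementation, the three training steps recited above (generate candidate pure spectrum data from the mixed sample, compute a loss against the pure spectrum label, adjust parameters from that loss) can be sketched roughly as follows. This is a minimal illustration only: a single linear layer stands in for the sound processing model, the data are synthetic, and every name (`train_step`, `train`, `mse_loss`) is hypothetical. It is not the network of Tashev, Manning, or Koshinaka, nor the applicant's model.

```python
import numpy as np

def mse_loss(predicted, target):
    """Mean-squared error between predicted and clean spectra."""
    return float(np.mean((predicted - target) ** 2))

def train_step(weights, mixed_spec, clean_spec, lr):
    """One gradient-descent update of a linear stand-in model (assumption).

    mixed_spec: (frames, bins) spectrum of the mixed audio sample (input)
    clean_spec: (frames, bins) pure frequency spectrum sample (the label)
    """
    predicted = mixed_spec @ weights          # candidate pure spectrum data
    loss = mse_loss(predicted, clean_spec)    # loss value vs. the label
    # Gradient of mean((XW - Y)^2) with respect to W.
    grad = 2.0 * mixed_spec.T @ (predicted - clean_spec) / predicted.size
    return weights - lr * grad, loss          # adjusted parameters

def train(mixed_spec, clean_spec, steps=200, lr=0.1, seed=0):
    """Start from random (untrained) weights and repeat the update."""
    rng = np.random.default_rng(seed)
    bins = mixed_spec.shape[1]
    weights = rng.normal(scale=0.1, size=(bins, bins))
    losses = []
    for _ in range(steps):
        weights, loss = train_step(weights, mixed_spec, clean_spec, lr)
        losses.append(loss)
    return weights, losses

# Synthetic stand-in data: a "clean" magnitude spectrum plus additive noise.
data_rng = np.random.default_rng(1)
clean = np.abs(data_rng.normal(size=(64, 8)))
mixed = clean + 0.3 * np.abs(data_rng.normal(size=(64, 8)))

weights, losses = train(mixed, clean)
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The point of the sketch is the data flow, not the architecture: the label is the clean spectrum itself, so the loss compares spectra directly rather than class labels.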

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
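The reply-period rules in the paragraph above reduce to simple date arithmetic, sketched below under stated assumptions. The helper names (`add_months`, `reply_deadlines`) and the dates in the usage lines are illustrative only; the sketch clamps to the end of shorter months and ignores weekend and holiday carry-over and extension fees, so it is not a substitute for docketing practice.

```python
from datetime import date
import calendar

def add_months(start, months):
    """Same day-of-month `months` later, clamped to the month's last day."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def reply_deadlines(mailing_date, first_reply=None, advisory_mailed=None):
    """Shortened period and six-month outer limit for a final action."""
    two_month = add_months(mailing_date, 2)
    shortened = add_months(mailing_date, 3)
    statutory_max = add_months(mailing_date, 6)
    # First reply within two months and advisory action mailed after the
    # three-month date: the shortened period runs to the advisory mailing.
    if (first_reply is not None and first_reply <= two_month
            and advisory_mailed is not None and advisory_mailed > shortened):
        shortened = advisory_mailed
    # In no event later than six months from the mailing date.
    shortened = min(shortened, statutory_max)
    return {"shortened": shortened, "statutory_max": statutory_max}

# Illustrative dates only.
base = reply_deadlines(date(2025, 9, 30))
late_advisory = reply_deadlines(
    date(2025, 9, 30),
    first_reply=date(2025, 11, 20),
    advisory_mailed=date(2026, 1, 12),
)
print(base["shortened"], late_advisory["shortened"], base["statutory_max"])
```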

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Nathan Tengbumroong whose telephone number is (703)756-1725. The examiner can normally be reached Monday - Friday, 11:30 am - 8:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan, can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/NATHAN TENGBUMROONG/Examiner, Art Unit 2654

/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654
Prosecution Timeline

Jun 07, 2023: Application Filed
May 30, 2025: Non-Final Rejection mailed (§103)
Sep 02, 2025: Response Filed
Sep 30, 2025: Final Rejection mailed (§103)
Dec 01, 2025: Response after Non-Final Action
Precedent Cases

Applications granted by this same examiner with similar technology

18/195,121 (Patent 12640161): METHOD AND APPARATUS FOR PROCESSING AUDIO FOR SCENE CLASSIFICATION. Granted May 26, 2026; 3y 0m to grant.
18/173,495 (Patent 12530536): Mixture-Of-Expert Approach to Reinforcement Learning-Based Dialogue Management. Granted Jan 20, 2026; 2y 11m to grant.
17/876,156 (Patent 12451142): NON-WAKE WORD INVOCATION OF AN AUTOMATED ASSISTANT FROM CERTAIN UTTERANCES RELATED TO DISPLAY CONTENT. Granted Oct 21, 2025; 3y 2m to grant.
17/883,265 (Patent 12412050): MULTI-PLATFORM VOICE ANALYSIS AND TRANSLATION. Granted Sep 09, 2025; 3y 1m to grant.
Study what changed to get past this examiner. Based on 4 most recent grants.
Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 47%
With Interview (+26.7%): 74%
Median Time to Grant: 3y 0m (~0m remaining)
PTA Risk: Moderate
Based on 19 resolved cases by this examiner. Grant probability derived from career allowance rate.