Prosecution Insights
Last updated: April 19, 2026
Application No. 18/766,392

DEEP LEARNING SEGMENTATION OF AUDIO USING MAGNITUDE SPECTROGRAM

Non-Final OA (§DP)

Filed: Jul 08, 2024
Examiner: OPSASNICK, MICHAEL N
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: Audioshake, Inc.
OA Round: 1 (Non-Final)

Grant Probability: 82% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 3m
Grant Probability With Interview: 92%

Examiner Intelligence

Career Allow Rate: 82% (737 granted / 900 resolved; +19.9% vs TC avg; above average)
Interview Lift: +10.5% on resolved cases with interview (moderate lift)
Typical Timeline: 3y 3m average prosecution; 46 applications currently pending
Career History: 946 total applications across all art units

Statute-Specific Performance

§101: 17.7% (-22.3% vs TC avg)
§102: 29.9% (-10.1% vs TC avg)
§103: 33.0% (-7.0% vs TC avg)
§112: 6.3% (-33.7% vs TC avg)

Tech Center averages are estimates. Based on career data from 900 resolved cases.

Office Action

§DP
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification

Content of Specification (b) CROSS-REFERENCES TO RELATED APPLICATIONS: See 37 CFR 1.78 and MPEP § 211 et seq.

The specification is objected to because the Cross-Reference to Related Applications section does not match the continuation information provided on the Application Data Sheet. For example, the parent case of the instant application, 17/622,418, is missing from the chain of continuation information. Furthermore, all three parent cases have issued as US patents; this information is missing as well. Correction is required.

Double Patenting

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA.

A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.

Claims 21-40 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-16 of U.S. Patent No. 11,837,245. Although the claims at issue are not identical, they are not patentably distinct from each other because the claim scope of the '245 patent meets the additional limitations of a binary mask and of K masks extracting the complete information from the original signal.

Claim chart:

18/766,392 (instant application):
21. A method for decomposing an audio signal, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments, wherein the K mask fragments are used to extract individual components from the original audio file; multiplying the K mask fragments with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file associated with a decomposed audio signal.

22. The method of claim 1, wherein the K mask fragments are concatenated together in order to form a complete mask which is the same length as the complex spectrogram.

24. The method of claim 1, wherein transforming the original audio file into the complex spectrogram involves a short-time fourier transform.

25. The method of claim 1, wherein transforming the new complex spectrogram into the new audio file involves an inverse short time fourier transform.

26. The method of claim 1, wherein at least one of the one or more deep neural networks includes a series of downsample layers and a series of upsample layers.

27. The method of claim 1, wherein at least one of the one or more deep neural networks includes an input scale layer and an output scale layer.

28. The method of claim 1, wherein at least one of the one or more deep neural networks includes a bridge layer comprising a first convolution 2D layer and a second convolution 2D layer and an attention layer.

29. The method of claim 1, wherein instead of concatenating the K mask fragments together and multiplying the complete mask with the complex spectrogram, each K mask fragment is multiplied with a corresponding complex spectrogram fragment thereby producing a fragment of the new complex spectrogram.

30. A system for decomposing an audio signal, the system comprising: a processor; and memory storing instructions to cause the processor to execute a method, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments, wherein the K mask fragments are used to extract individual components from the original audio file; multiplying the K mask fragments with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.

32. The system of claim 10, wherein transforming the original audio file into the complex spectrogram involves a short-time fourier transform.

33. The system of claim 10, wherein transforming the new complex spectrogram into the new audio file involves an inverse short time fourier transform.

34. The system of claim 10, wherein at least one of the one or more deep neural networks includes a series of downsample layers and a series of upsample layers.

35. The system of claim 10, wherein at least one of the one or more deep neural networks includes an input scale layer and an output scale layer.

36. The system of claim 10, wherein at least one of the one or more deep neural networks includes a bridge layer comprising a first convolution 2D layer and a second convolution 2D layer and an attention layer.

37. The system of claim 10, wherein instead of concatenating the K mask fragments together and multiplying the complete mask with the complex spectrogram, each K mask fragment is multiplied with a corresponding complex spectrogram fragment thereby producing a fragment of the new complex spectrogram.

38. A non-transitory computer readable medium storing instructions to cause a processor to execute a method, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments, wherein the K mask fragments are used to extract individual components from the original audio file; multiplying the K mask fragments with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.

39. The non-transitory computer readable medium of claim 18, wherein the K mask fragments are concatenated together in order to form a complete mask which is the same length as the complex spectrogram.

40. The non-transitory computer readable medium of claim 18, wherein transforming the original audio file into the complex spectrogram involves a short-time fourier transform.

U.S. Patent No. 11,837,245:

1. A method for decomposing an audio signal, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments; concatenating the K mask fragments together in order to form a complete mask which is the same length as the complex spectrogram; multiplying the complete mask with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.

2. The method of claim 1, wherein transforming the original audio file into the complex spectrogram involves a short-time fourier transform.

3. The method of claim 1, wherein transforming the new complex spectrogram into the new audio file involves an inverse short time fourier transform.

4. The method of claim 1, wherein at least one of the one or more deep neural networks includes a series of downsample layers and a series of upsample layers.

5. The method of claim 1, wherein at least one of the one or more deep neural networks includes an input scale layer and an output scale layer.

6. The method of claim 1, wherein at least one of the one or more deep neural networks includes a bridge layer comprising a first convolution 2D layer and a second convolution 2D layer and an attention layer.

7. The method of claim 1, wherein instead of concatenating the K mask fragments together and multiplying the complete mask with the complex spectrogram, each K mask fragment is multiplied with a corresponding complex spectrogram fragment thereby producing a fragment of the new complex spectrogram.

8. A system for decomposing an audio signal, the system comprising: a processor; and memory storing instructions to cause the processor to execute a method, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments; concatenating the K mask fragments together in order to form a complete mask which is the same length as the complex spectrogram; multiplying the complete mask with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.

9. The system of claim 8, wherein transforming the original audio file into the complex spectrogram involves a short-time fourier transform.

10. The system of claim 8, wherein transforming the new complex spectrogram into the new audio file involves an inverse short time fourier transform.

11. The system of claim 8, wherein at least one of the one or more deep neural networks includes a series of downsample layers and a series of upsample layers.

12. The system of claim 8, wherein at least one of the one or more deep neural networks includes an input scale layer and an output scale layer.

13. The system of claim 8, wherein at least one of the one or more deep neural networks includes a bridge layer comprising a first convolution 2D layer and a second convolution 2D layer and an attention layer.

14. The system of claim 8, wherein instead of concatenating the K mask fragments together and multiplying the complete mask with the complex spectrogram, each K mask fragment is multiplied with a corresponding complex spectrogram fragment thereby producing a fragment of the new complex spectrogram.

15. A non-transitory computer readable medium storing instructions to cause a processor to execute a method, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments; concatenating the K mask fragments together in order to form a complete mask which is the same length as the complex spectrogram; multiplying the complete mask with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.

16. The non-transitory computer readable medium of claim 15, wherein transforming the original audio file into the complex spectrogram involves a short-time fourier transform.

17. The non-transitory computer readable medium of claim 15, wherein transforming the new complex spectrogram into the new audio file involves an inverse short time fourier transform.

18. The non-transitory computer readable medium of claim 15, wherein at least one of the one or more deep neural networks includes a series of downsample layers and a series of upsample layers.

19. The non-transitory computer readable medium of claim 15, wherein at least one of the one or more deep neural networks includes an input scale layer and an output scale layer.

20. The non-transitory computer readable medium of claim 15, wherein at least one of the one or more deep neural networks includes a bridge layer comprising a first convolution 2D layer and a second convolution 2D layer and an attention layer.
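For orientation, the parallel independent claims above all recite the same signal-processing pipeline. A minimal sketch of that pipeline follows, assuming numpy/scipy; the `model` callable is a hypothetical stand-in for the claimed convolutional deep neural network, and this is an illustration of the recited steps, not the applicant's actual implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def separate(audio, sr, model, k=8):
    """STFT -> K time fragments -> per-fragment masks -> complete mask -> iSTFT."""
    # Transform the original audio file into a complex spectrogram.
    _, _, spec = stft(audio, fs=sr, nperseg=1024)
    # Split the complex spectrogram into K small fragments along the time dimension.
    fragments = np.array_split(spec, k, axis=-1)
    # Send each fragment through the (hypothetical) convolutional network,
    # producing a sequence of K mask fragments.
    mask_fragments = [model(frag) for frag in fragments]
    # Concatenate the K mask fragments into a complete mask the same length
    # as the complex spectrogram.
    mask = np.concatenate(mask_fragments, axis=-1)
    # Multiply the complete mask with the complex spectrogram, then transform
    # the new complex spectrogram back into audio.
    _, new_audio = istft(mask * spec, fs=sr, nperseg=1024)
    return new_audio
```

The fragment-wise variant of dependent claims 7/14 (and application claims 29/37) would skip the concatenation and instead multiply each mask fragment with its corresponding spectrogram fragment.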
Claims 21-40 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 3-6, 9, 11, 13-16, 19, and 20 of U.S. Patent No. 11,521,630. Although the claims at issue are not identical, they are not patentably distinct from each other because the claim scope of the '630 patent meets the additional limitations of a binary mask and of K masks extracting the complete information from the original signal; moreover, the extra steps in the '630 patent of decomposing the complex signal into magnitude and phase are not necessary to realize the functionality of the claims in the instant invention.

Claim chart:

18/766,392 (instant application):

21. A method for decomposing an audio signal, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments, wherein the K mask fragments are used to extract individual components from the original audio file; multiplying the K mask fragments with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file associated with a decomposed audio signal.

22. The method of claim 1, wherein the K mask fragments are concatenated together in order to form a complete mask which is the same length as the complex spectrogram.

24. The method of claim 1, wherein transforming the original audio file into the complex spectrogram involves a short-time fourier transform.

25. The method of claim 1, wherein transforming the new complex spectrogram into the new audio file involves an inverse short time fourier transform.

26. The method of claim 1, wherein at least one of the one or more deep neural networks includes a series of downsample layers and a series of upsample layers.

27. The method of claim 1, wherein at least one of the one or more deep neural networks includes an input scale layer and an output scale layer.

28. The method of claim 1, wherein at least one of the one or more deep neural networks includes a bridge layer comprising a first convolution 2D layer and a second convolution 2D layer and an attention layer.

29. The method of claim 1, wherein instead of concatenating the K mask fragments together and multiplying the complete mask with the complex spectrogram, each K mask fragment is multiplied with a corresponding complex spectrogram fragment thereby producing a fragment of the new complex spectrogram.
30. A system for decomposing an audio signal, the system comprising: a processor; and memory storing instructions to cause the processor to execute a method, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments, wherein the K mask fragments are used to extract individual components from the original audio file; multiplying the K mask fragments with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.

32, 40. The system of claim 10, wherein transforming the original audio file into the complex spectrogram involves a short-time fourier transform.

33. The system of claim 10, wherein transforming the new complex spectrogram into the new audio file involves an inverse short time fourier transform.

34. The system of claim 10, wherein at least one of the one or more deep neural networks includes a series of downsample layers and a series of upsample layers.

35. The system of claim 10, wherein at least one of the one or more deep neural networks includes an input scale layer and an output scale layer.

36. The system of claim 10, wherein at least one of the one or more deep neural networks includes a bridge layer comprising a first convolution 2D layer and a second convolution 2D layer and an attention layer.

37. The system of claim 10, wherein instead of concatenating the K mask fragments together and multiplying the complete mask with the complex spectrogram, each K mask fragment is multiplied with a corresponding complex spectrogram fragment thereby producing a fragment of the new complex spectrogram.

38. A non-transitory computer readable medium storing instructions to cause a processor to execute a method, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments, wherein the K mask fragments are used to extract individual components from the original audio file; multiplying the K mask fragments with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.

39. The non-transitory computer readable medium of claim 18, wherein the K mask fragments are concatenated together in order to form a complete mask which is the same length as the complex spectrogram.

U.S. Patent No. 11,521,630:
1. A method for decomposing an audio signal, the method comprising: transforming an original audio file into an original complex spectrogram; decomposing the original complex spectrogram into an original magnitude spectrogram and an original phase spectrogram; splitting the original magnitude spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments; concatenating the K mask fragments together in order to form a complete mask which is the same length as the original magnitude spectrogram; multiplying the complete mask with the original magnitude spectrogram to create a new magnitude spectrogram; combining the new magnitude spectrogram with the original phase spectrogram to produce a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.

2. The method of claim 1, wherein instead of combining the new magnitude spectrogram with the original phase spectrogram, a multi-channel wiener filter is applied to the new magnitude spectrograms, using the original complex spectrogram as an input, in order to produce the new complex spectrograms.

3. The method of claim 1, wherein transforming the original audio signal into the original complex spectrogram involves a short-time fourier transform and transforming the new complex spectrogram into the new audio file involves an inverse short time fourier transform.

4. The method of claim 1, wherein at least one of the one or more deep neural networks includes a series of downsample layers and a series of upsample layers.

5. The method of claim 1, wherein at least one of the one or more deep neural networks includes an input scale layer and an output scale layer.

6. The method of claim 1, wherein at least one of the one or more deep neural networks includes a bridge layer comprising a first convolution 2D layer and a second convolution 2D layer and an attention layer.

7. The method of claim 1, wherein instead of combining the new magnitude spectrogram with the original phase spectrogram, a new phase spectrogram is constructed from a new source using a generative adversarial neural network.

8. The method of claim 1, wherein instead of combining the new magnitude spectrogram with the original phase spectrogram, a new phase spectrogram is constructed from a new source using the Griffin-Lim algorithm.

9. The method of claim 1, wherein instead of concatenating the K mask fragments together and multiplying the complete mask with the original magnitude spectrogram, each K mask fragment is multiplied with a corresponding original magnitude spectrogram fragment thereby producing a fragment of the new magnitude spectrogram.

10. The method of claim 1, wherein the fragments of the new magnitude spectrogram are then appended to complete the magnitude spectrogram.

11. A system for decomposing an audio signal, the system comprising: a processor; and memory storing instructions to cause the processor to execute a method, the method comprising: transforming an original audio file into an original complex spectrogram; decomposing the original complex spectrogram into an original magnitude spectrogram and an original phase spectrogram; splitting the original magnitude spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments; concatenating the K mask fragments together in order to form a complete mask which is the same length as the original magnitude spectrogram; multiplying the complete mask with the original magnitude spectrogram to create a new magnitude spectrogram; combining the new magnitude spectrogram with the original phase spectrogram to produce a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.

12. The system of claim 11, wherein instead of combining the new magnitude spectrogram with the original phase spectrogram, a multi-channel wiener filter is applied to the new magnitude spectrograms, using the original complex spectrogram as an input, in order to produce the new complex spectrograms.

13. The system of claim 11, wherein transforming the original audio signal into the original complex spectrogram involves a short-time fourier transform and transforming the new complex spectrogram into the new audio file involves an inverse short time fourier transform.

14. The system of claim 11, wherein at least one of the one or more deep neural networks includes a series of downsample layers and a series of upsample layers.

15. The system of claim 11, wherein at least one of the one or more deep neural networks includes an input scale layer and an output scale layer.

16. The system of claim 11, wherein at least one of the one or more deep neural networks includes a bridge layer comprising a first convolution 2D layer and a second convolution 2D layer and an attention layer.

17. The system of claim 11, wherein instead of combining the new magnitude spectrogram with the original phase spectrogram, a new phase spectrogram is constructed from a new source using a generative adversarial neural network.

18. The system of claim 11, wherein instead of combining the new magnitude spectrogram with the original phase spectrogram, a new phase spectrogram is constructed from a new source using the Griffin-Lim algorithm.

19. The system of claim 11, wherein instead of concatenating the K mask fragments together and multiplying the complete mask with the original magnitude spectrogram, each K mask fragment is multiplied with a corresponding original magnitude spectrogram fragment thereby producing a fragment of the new magnitude spectrogram.

20. A non-transitory computer readable medium storing instructions to be executed by a processor, the instructions comprising: transforming an original audio file into an original complex spectrogram; decomposing the original complex spectrogram into an original magnitude spectrogram and an original phase spectrogram; splitting the original magnitude spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments; concatenating the K mask fragments together in order to form a complete mask which is the same length as the original magnitude spectrogram; multiplying the complete mask with the original magnitude spectrogram to create a new magnitude spectrogram; combining the new magnitude spectrogram with the original phase spectrogram to produce a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.
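The '630 claims differ from the instant claims chiefly in the magnitude/phase handling: the mask filters only the magnitude spectrogram, and the original phase is reused to rebuild the complex spectrogram. A minimal numpy sketch of that recombination step (an illustration only; the `mask` would come from the claimed network, as in the earlier sketch):

```python
import numpy as np

def mask_magnitude(spec, mask):
    """Filter only the magnitude; reuse the original phase, as in the '630 claims."""
    # Decompose the complex spectrogram into magnitude and phase.
    magnitude, phase = np.abs(spec), np.angle(spec)
    # Multiply the complete mask with the original magnitude spectrogram.
    new_magnitude = mask * magnitude
    # Combine the new magnitude with the original phase to produce a new
    # complex spectrogram (ready for the inverse STFT).
    return new_magnitude * np.exp(1j * phase)
```

The dependent claims swap this recombination for alternatives (multi-channel Wiener filtering, or phase reconstruction via a GAN or the Griffin-Lim algorithm), but the masking of the magnitude is common to all of them.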
Claims 21, 24, 26, 28, 30, 34, and 38 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 3, 5, 6, 11, 13, 15, and 20 of U.S. Patent No. 12,057,131. Although the claims at issue are not identical, they are not patentably distinct from each other because the claim scope of the '131 patent meets the additional limitations of a binary mask and of K masks extracting the complete information from the original signal; moreover, the extra steps in the '131 patent of decomposing the complex signal into magnitude and phase are not necessary to realize the functionality of the claims in the instant invention.

Claim chart:

18/766,392 (instant application):

21. A method for decomposing an audio signal, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments, wherein the K mask fragments are used to extract individual components from the original audio file; multiplying the K mask fragments with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file associated with a decomposed audio signal.

24. The method of claim 1, wherein transforming the original audio file into the complex spectrogram involves a short-time Fourier transform.

26. The method of claim 1, wherein at least one of the one or more deep neural networks includes a series of downsample layers and a series of upsample layers.

28. The method of claim 1, wherein at least one of the one or more deep neural networks includes a bridge layer comprising a first convolution 2D layer and a second convolution 2D layer and an attention layer.
30. A system for decomposing an audio signal, the system comprising: a processor; and memory storing instructions to cause the processor to execute a method, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments, wherein the K mask fragments are used to extract individual components from the original audio file; multiplying the K mask fragments with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.

32. The system of claim 10, wherein transforming the original audio file into the complex spectrogram involves a short-time fourier transform.

34. The system of claim 10, wherein at least one of the one or more deep neural networks includes a series of downsample layers and a series of upsample layers.

38. A non-transitory computer readable medium storing instructions to cause a processor to execute a method, the method comprising: transforming an original audio file into a complex spectrogram; splitting the complex spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through one or more convolutional deep neural networks, the convolutional deep neural networks including one or more convolutional layers, the one or more convolutional layers including a subpixel upsample convolutional layer; producing a sequence of K mask fragments, wherein the K mask fragments are used to extract individual components from the original audio file; multiplying the K mask fragments with the complex spectrogram to create a new complex spectrogram; and transforming the new complex spectrogram into a new audio file.

U.S. Patent No. 12,057,131:

1. A method for isolating a source from an audio signal, the method comprising: transforming an audio file into a complex spectrogram; decomposing the complex spectrogram into a magnitude spectrogram and a phase spectrogram; splitting the magnitude spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through a convolutional deep neural network to produce K small fragment outputs, wherein the convolutional deep neural network includes convolutional layers augmented with an attention layer to help the convolutional deep neural network find the most informative areas and reduce artifacts in each of the K small fragments, wherein the attention layer informs the convolutional deep neural network of important pixels; producing a source mask based on the K small fragment outputs, wherein the source mask corresponds to an audio source targeted to be isolated; multiplying the source mask with the magnitude spectrogram to create a new magnitude spectrogram corresponding to the source.

2. The method of claim 1, further comprising combining the new magnitude spectrograms with the phase spectrogram in order to produce a new complex spectrogram.

3. The method of claim 1, wherein transforming the audio file into the complex spectrogram is done via a short-time fourier transform.

4. The method of claim 1, wherein a second deep neural network is used for isolating a second audio source.

5. The method of claim 1, wherein the deep neural network includes an input scale layer before a series of downsample layers and an output scale layer following a series of upsample layers.

6. The method of claim 1, wherein the deep neural network includes a bridge layer comprising a first convolutional 2D layer and a second convolutional 2D layer.

7. The method of claim 1, further comprising constructing a new phase spectrogram from a new source using a generative adversarial neural network.

8. The method of claim 1, further comprising constructing a new phase spectrogram from a new source using the Griffin-Lim algorithm.

9. The method of claim 1, further comprising applying a multi-channel wiener filter to the new magnitude spectrogram.

10. The method of claim 1, wherein the source mask is produced by concatenating the K small fragment outputs.

11. A system for isolating a source from an audio signal, the system comprising: a processor; and memory storing instructions to cause the processor to execute a method, the method comprising: transforming an audio file into a complex spectrogram; decomposing the complex spectrogram into a magnitude spectrogram and a phase spectrogram; splitting the magnitude spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through a convolutional deep neural network to produce K small fragment outputs, wherein the convolutional deep neural network includes convolutional layers augmented with an attention layer to help the convolutional deep neural network find the most informative areas and reduce artifacts in each of the K small fragments, wherein the attention layer informs the convolutional deep neural network of important pixels; producing a source mask based on the K small fragment outputs, wherein the source mask corresponds to an audio source targeted to be isolated; multiplying the source mask with the magnitude spectrogram to create a new magnitude spectrogram corresponding to the source.

12. The system of claim 11, wherein the method further comprises combining the new magnitude spectrograms with the phase spectrogram in order to produce a new complex spectrogram.

13. The system of claim 11, wherein transforming the audio file into the complex spectrogram is done via a short-time fourier transform.

14. The system of claim 11, wherein a second deep neural network is used for isolating a second audio source.

15. The system of claim 11, wherein the deep neural network includes an input scale layer before a series of downsample layers and an output scale layer following a series of upsample layers.

16. The system of claim 11, wherein the deep neural network includes a bridge layer comprising a first convolutional 2D layer and a second convolutional 2D layer.

17. The system of claim 11, wherein the method further comprises constructing a new phase spectrogram from a new source using a generative adversarial neural network.

18. The system of claim 11, wherein the method further comprises constructing a new phase spectrogram from a new source using the Griffin-Lim algorithm.

19. The system of claim 11, wherein the method further comprises applying a multi-channel wiener filter to the new magnitude spectrogram.

20. A non-transitory computer readable medium storing instructions to be executed by a processor, the instructions comprising: transforming an audio file into a complex spectrogram; decomposing the complex spectrogram into a magnitude spectrogram and a phase spectrogram; splitting the magnitude spectrogram into K small fragments along the time dimension; sending each fragment in the K small fragments through a convolutional deep neural network to produce K small fragment outputs, wherein the convolutional deep neural network includes convolutional layers augmented with an attention layer to help the convolutional deep neural network find the most informative areas and reduce artifacts in each of the K small fragments, wherein the attention layer informs the convolutional deep neural network of important pixels; producing a source mask based on the K small fragment outputs, wherein the source mask corresponds to an audio source targeted to be isolated; multiplying the source mask with the magnitude spectrogram to create a new magnitude spectrogram corresponding to the source.
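Every chart above recites a "subpixel upsample convolutional layer." Such layers are commonly realized as a convolution that expands the channel count by a factor of r*r, followed by a pixel-shuffle rearrangement into a spatially upsampled map (the ESPCN-style construction of Shi et al.). The numpy sketch below shows only the rearrangement step, as a generic illustration rather than the applicant's disclosed layer:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) feature maps into (C, H*r, W*r)."""
    c, h, w = x.shape
    assert c % (r * r) == 0, "channel count must be divisible by r*r"
    x = x.reshape(c // (r * r), r, r, h, w)        # split channels into an r x r grid
    x = x.transpose(0, 3, 1, 4, 2)                 # interleave: (C, H, r, W, r)
    return x.reshape(c // (r * r), h * r, w * r)   # flatten into the upsampled map

# A full subpixel upsample convolutional layer would be: convolve to C*r*r
# channels, then pixel-shuffle. E.g., with r = 2 and 16 conv-output channels:
features = np.zeros((16, 32, 32))
up = pixel_shuffle(features, 2)                    # -> shape (4, 64, 64)
```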
Allowable Subject Matter

Claims 21-40 are allowed over the prior art of record. The following is a statement of reasons for the indication of allowable subject matter.

As per the independent claims, the claim limitations directed toward decomposing an audio signal into an original complex spectrogram and into magnitude/phase; splitting into K small fragments; sending each fragment in the K small fragments through one or more convolutional deep neural networks with subpixel upsample convolutional layers; producing a sequence of K mask fragments; filtering the magnitude spectrogram with the K mask fragments to generate a new magnitude spectrogram; combining the new magnitude spectrogram with the original phase spectrogram to produce a new complex spectrogram; and transforming the new complex spectrogram into a new audio file, are not explicitly taught by the prior art of record.

Le Roux (20190318754) teaches operating on an audio signal (para 0014, 0015); extracting spectrogram information into sections/frames (para 0008); processing the sections with a DNN (para 0013); producing a sequence of masking elements (para 0060, using a sigmoid function to produce a mask for each time-frequency element); creating a modified magnitude with a reconstruction process (para 0059); combining the new magnitude spectrogram with the phase spectrogram to produce a new complex spectrogram (as combining the reconstructed magnitude and reconstructed phase to generate a refined spectrogram, para 0059); and transforming the new complex spectrograms into final waveforms in a new audio file (generating the signal by inverse transforming the modified spectrograms, para 0060).

Li (20200334526) teaches applying the attention layer specifically to "ltLSTM" RNNs, which Li (20200334526) explicitly states are different from standard LSTM RNNs. One cited portion of Li is reproduced below.

[0026] There has been significant progress in Automatic Speech Recognition ("ASR") since the transition from Deep feedforward Neural Networks ("DNNs") to Recurrent Neural Networks ("RNNs") with Long Short-Term Memory ("LSTM") units. LSTMs alleviate the gradient vanishing or exploding issues in standard RNNs by using input, output and forget gates, thus improving the capacity of the network to capture long temporal context information in audio sequences. LSTM-RNNs have been shown to outperform DNNs on a variety of ASR tasks, and considerable efforts have been devoted to improving the structure of LSTM for ASR, such as convolutional LSTM DNN ("CLDNN"), time-frequency LSTM-RNNs, grid LSTMs, residual LSTMs, highway LSTMs, etc.

The cited paragraph 0026 of Li says explicitly that ASR (automatic speech recognition) has transitioned away from DNNs to RNNs with LSTM. The entirety of Li is directed toward RNNs with LSTM to improve ASR models. By contrast, the claims recite a convolutional neural network (CNN) because the method described is for music separation, as opposed to ASR, and is actually performed better using CNN models rather than RNN models. In addition, Li also explicitly applies the attention layer to the LSTM. Other cited portions of Li are reproduced below.

[0037] FIG. 2D is a layer trajectory LSTM architecture 208 with context frames from a depth processing block (D-LSTM 228 instead of the T-LSTM 218 as in FIG. 2C) being added 248 for an input to a D-LSTM 228 in accordance with some embodiments. The process is used to evaluate the (l+1)th layer depth-LSTM output when incorporating τ future frames of hidden vectors from the depth-LSTM at every layer.

[0038] Thus, FIGS. 2C and 2D show the computational steps to update the (l+1)th layer depth (D-LSTM) output g_t^{l+1} when incorporating τ future frames from the T-LSTM and D-LSTM, respectively. When future frames are incorporated from the T-LSTM only (as in FIG. 2C), the evaluation of g_t^{l+1} depends on g_t^l and η_t^{l+1}, which is generated from [h_t^{l+1} … h_{t+τ1}^{l+1}] as in equation (12). When multiple layers are stacked, there is no latency accumulation, so the total number of lookahead frames in this case is still τ. However, when incorporating future frames from the D-LSTM there is latency accumulation when multiple layers are stacked. For an L-layer ltLSTM with τ future context frames, the total number of additional lookahead frames will be Lτ.

[0039] Another way to incorporate future frames is to use the attention mechanism to generate an embedding vector of a context window with input-dependent weights. This method might, for example, improve the accuracy of Connectionist Temporal Classification ("CTC") modeling. Note that the attention modeling may be applied to all the hidden layers (and not just the top layer).

[0040] Note that attention may be used to calculate η_t, which is then used to replace h_t^l in equations (7) through (11) to calculate g_t^l. Now, the context vector can be computed using: η_t^l = Σ_{d=t−τ1}^{t+τ1} (α_{t,d} ⊙ r_d). (17) Hence, the method may comprise a dimension-wise, location-based attention.

As shown above in paragraphs [0037]-[0038], the context for applying the attention layer in paragraph [0039] is only the LSTM models described. There is no mention of applying the attention layers to CNN layers, and Li would actually teach against it because Li specifically focuses on RNNs for ASR.

The following references were found toward an attention mechanism used in multilayer convolutional NNs/DNNs: Arik (20180336880), para 0095, 0107-0110; Bach (20180018553), para 0008, 0076, 0219; Malah (20100042408), in the realm of audio coding, demonstrating the use of neural networks (para 0067) with down/up sampling (para 0110), operating on spectral parameters (para 0066). However, none of the prior art of record explicitly teaches the claim limitations as discussed above.
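If equation (17) is read as reconstructed above, the attention context vector is an elementwise-weighted sum of hidden vectors over a window of ±τ1 frames. A minimal numpy sketch of that reading (variable names and shapes are illustrative assumptions, not Li's):

```python
import numpy as np

def context_vector(r, alpha, t, tau1):
    """eta_t: sum over d in [t - tau1, t + tau1] of alpha[t, d] * r[d], elementwise.

    alpha[t, d] holds a per-dimension weight vector (dimension-wise attention);
    '*' is the elementwise (Hadamard) product from equation (17).
    """
    return sum(alpha[t, d] * r[d] for d in range(t - tau1, t + tau1 + 1))

# Toy usage: T frames, hidden size H, lookahead/lookback window of 1 frame.
T, H, tau1 = 10, 4, 1
rng = np.random.default_rng(0)
r = rng.standard_normal((T, H))   # hidden vectors over time
alpha = rng.random((T, T, H))     # input-dependent, per-dimension weights
eta = context_vector(r, alpha, t=5, tau1=tau1)
```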
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Please see related art listed on the PTO-892 form.

Germain (20190043516): in addition to the downsampling/upsampling feature detailed above, Germain also teaches convolutional processing in the neural network layers (para 0030, etc.).

Wang (20170061978): teaches a DNN for processing speech signals using masking (para 0044), with down/up sampling (para 0049).

Malah (20100042408): in the realm of audio coding, demonstrates the use of neural networks (para 0067), with down/up sampling (para 0110), operating on spectral parameters (para 0066).

In the realm of magnitude/phase filtering and reconstruction, see Tashev (20190318755) and Mesgarani (20190066713).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Opsasnick, telephone number (571) 272-7623, who is available Monday-Friday, 9am-5pm. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Mr. Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/Michael N Opsasnick/
Primary Examiner, Art Unit 2658
02/06/2026

Prosecution Timeline

Jul 08, 2024 — Application Filed
Feb 06, 2026 — Non-Final Rejection, §DP (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602554
SYSTEMS AND METHODS FOR PRODUCING RELIABLE TRANSLATION IN NEAR REAL-TIME
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12592246
SYSTEM AND METHOD FOR EXTRACTING HIDDEN CUES IN INTERACTIVE COMMUNICATIONS
Granted Mar 31, 2026 (2y 5m to grant)
Patent 12586580
System For Recognizing and Responding to Environmental Noises
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12579995
Automatic Speech Recognition Accuracy With Multimodal Embeddings Search
Granted Mar 17, 2026 (2y 5m to grant)
Patent 12567432
VOICE SIGNAL ESTIMATION METHOD AND APPARATUS USING ATTENTION MECHANISM
Granted Mar 03, 2026 (2y 5m to grant)
Study what changed to get these cases past this examiner. Based on the examiner's 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 82%
With Interview: 92% (+10.5%)
Median Time to Grant: 3y 3m
PTA Risk: Low
Based on 900 resolved cases by this examiner. Grant probability derived from career allow rate.
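The headline figures are consistent with simple arithmetic on the examiner's career data, assuming the interview lift is applied as an additive percentage-point adjustment (an assumption; the tool's exact model is not stated):

```python
# Grant probability from the career allow rate, plus the interview lift.
granted, resolved = 737, 900
allow_rate = granted / resolved          # 0.8189 -> shown as 82%
with_interview = allow_rate + 0.105      # +10.5 points -> 0.9239, shown as 92%
print(f"{allow_rate:.0%} {with_interview:.0%}")  # 82% 92%
```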
