DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the non-final Office action dated 12/17/2025, applicant has amended claims 1, 19, and 20. Claims 1-20 are currently pending in the application.
Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1-5, 13, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Mesgarani et al. (U.S. Pub. No. 2019/0066713, hereinafter Mesgarani) in view of Short et al. (U.S. Pub. No. 2015/0287422, hereinafter Short).
Regarding claim 1, Mesgarani teaches an audio processing system (See Mesgarani Fig 1, system 100), comprising: an input interface (See Mesgarani Fig 1, microphone 110) configured to receive an input audio mixture (See Mesgarani Fig 3 & ¶ [0106] lines 3-12, step 310 receiving mixed sound signal via a device) and transform it into a time-frequency representation defined by values of time-frequency bins (See Mesgarani ¶ [0111] lines 3-10, input audio represented as a time-frequency mixture in an embedded space comprised of embedded time-frequency bins); a processor (See Mesgarani Fig 31 & ¶ [0244] lines 1-4, processor 3112) configured to map the values of time-frequency bins into a hyperbolic space by executing an embedding neural network (See Mesgarani ¶ [0111] lines 18-23, time-frequency mixture projected (mapping) into embedding space by a neural network) trained to associate (See Mesgarani ¶ [0117] lines 4-9, pre-trained deep neural networks (DNNs)) each time-frequency bin to a high-dimensional embedding and projecting each high-dimensional embedding into the hyperbolic space (See Mesgarani ¶ [0111] lines 3-10, input audio represented as a time-frequency mixture in an embedded space comprised of embedded time-frequency bins); and an output interface (See Mesgarani Fig 31, controller system 3100) configured to accept a selection of at least a portion of the hyperbolic space and render, via a display (See Mesgarani Fig 31 & ¶ [0244], monitor 3120), selected hyperbolic embeddings falling within the selected portion of the hyperbolic space (See Mesgarani ¶ [0115] lines 1-8, a separated signal may be processed for amplification and output).
Mesgarani does not explicitly teach an output interface that includes a user operated selection tool able to select a portion of an output, the selection tool being movable and resizable allowing the user to customize selection of the output, and the selected portion corresponding to a sound class.
Short teaches an output interface that includes a user operated selection tool able to select a portion of an output, the selection tool being movable and resizable allowing the user to customize selection of the output, and the selected portion corresponding to a sound class (See Short Fig 59 & ¶ [0615], track editor GUI allowing a user to select displayed data by area via a drawing box or lasso. Data can be selected by tracklet, coherent group, or oscillator peak.).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the user selection interface taught by Short with the audio processing system taught by Mesgarani. Doing so provides the user with greater control and customization, allowing for a more detailed and user-friendly experience.
Regarding claim 2, Mesgarani in view of Short teaches the audio processing system of claim 1, wherein to render the selected hyperbolic embeddings, the output interface (See Mesgarani Fig 31, controller system 3100) is further configured to transform the selected hyperbolic embeddings into a separated output audio signal (See Mesgarani ¶ [0115] lines 1-8, a separated signal may be processed for amplification) and send the separated output audio signal to a memory, and the memory stores the separated output signal (See Mesgarani ¶ [0154] lines 1-11, separated speech signals sent to a deep Long Short-Term Memory (LSTM) network for storage).
Regarding claim 3, Mesgarani in view of Short teaches the audio processing system of claim 2, wherein the output interface (See Mesgarani Fig 31, controller system 3100) is further configured to send the separated output signal to a loudspeaker (See Mesgarani ¶ [0245] lines 27-31, speakers).
Regarding claim 4, Mesgarani in view of Short teaches the audio processing system of claim 1, wherein the output interface is further configured to transform the selected hyperbolic embeddings into a separated output audio signal by creating a time-frequency mask based on the selected hyperbolic embeddings and applying the time-frequency mask to the time-frequency representation of the input audio mixture (See Mesgarani ¶ [0155] lines 1-15, separated time-frequency bins by calculating a time-frequency mask for each source).
Regarding claim 5, Mesgarani in view of Short teaches the audio processing system of claim 4, wherein the output interface is further configured to create the time-frequency mask based on a softmax operation (See Mesgarani ¶ [0160] lines 1-13, softmax activation function used for mask generation).
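For clarity of the record, the examiner notes that the claimed mask creation and application, as interpreted in this rejection of claims 4 and 5, may be illustrated as follows (the notation below is illustrative only and is not drawn from Mesgarani or Short):

```latex
M_c(t,f) = \frac{\exp\big(z_c(t,f)\big)}{\sum_{c'} \exp\big(z_{c'}(t,f)\big)},
\qquad
\hat{S}_c(t,f) = M_c(t,f)\, X(t,f)
```

where $X(t,f)$ is the time-frequency representation of the input audio mixture, $z_c(t,f)$ is a per-class score derived from the embedding of bin $(t,f)$, $M_c$ is the softmax-based time-frequency mask for audio class $c$, and $\hat{S}_c$ is the separated output signal obtained by applying the mask to the mixture.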
Regarding claim 13, Mesgarani in view of Short teaches the audio processing system of claim 1, wherein the output interface is operatively connected to a display device (See Mesgarani Fig 31, monitor 3120) configured to display a visual representation of the hyperbolic embeddings mapped to different locations of the hyperbolic space to enable the selection of the portion of the hyperbolic space (See Mesgarani ¶ [0114] lines 1-12, selection of signal).
Regarding claim 18, Mesgarani in view of Short teaches the audio processing system of claim 1, wherein the processor is further configured to: accept a selection of weight on energy of the selected hyperbolic embeddings; and render the selected hyperbolic embeddings based on the weight on the energy of the selected hyperbolic embeddings (See Mesgarani ¶ [0109] lines 3-16, each segment is transformed into a weighted sum with weight coefficients and deriving mask matrices prior to reconstruction).
Regarding claim 19, Mesgarani teaches an audio processing method (See Mesgarani ¶ [0080] lines 1-16, audio processing method), comprising: receiving an input audio mixture (See Mesgarani Fig 3 & ¶ [0106] lines 3-12, step 310 obtaining mixed sound signal via a device) and transforming it into a time-frequency representation defined by values of time-frequency bins (See Mesgarani ¶ [0111] lines 3-10, input audio represented as a time-frequency mixture in an embedded space comprised of embedded time-frequency bins); mapping the values of time-frequency bins into a hyperbolic space by executing an embedding neural network (See Mesgarani ¶ [0111] lines 18-23, time-frequency mixture projected (mapping) into embedding space by a neural network) trained to associate (See Mesgarani ¶ [0117] lines 4-9, pre-trained deep neural networks (DNNs)) each time-frequency bin to a high-dimensional embedding and projecting each high-dimensional embedding into the hyperbolic space (See Mesgarani ¶ [0111] lines 3-10, input audio represented as a time-frequency mixture in an embedded space comprised of embedded time-frequency bins); and accepting a selection of at least a portion of the hyperbolic space and rendering, via a display (See Mesgarani Fig 31 & ¶ [0244], monitor 3120), selected hyperbolic embeddings falling within the selected portion of the hyperbolic space (See Mesgarani ¶ [0115] lines 1-8, a separated signal may be processed for amplification and output).
Mesgarani does not explicitly teach an output interface that includes a user operated selection tool able to select a portion of an output, the selection tool being movable and resizable allowing the user to customize selection of the output, and the selected portion corresponding to a sound class.
Short teaches an output interface that includes a user operated selection tool able to select a portion of an output, the selection tool being movable and resizable allowing the user to customize selection of the output, and the selected portion corresponding to a sound class (See Short Fig 59 & ¶ [0615], track editor GUI allowing a user to select displayed data by area via a drawing box or lasso. Data can be selected by tracklet, coherent group, or oscillator peak.).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the user selection interface taught by Short with the audio processing method taught by Mesgarani. Doing so provides the user with greater control and customization, allowing for a more detailed and user-friendly experience.
Regarding claim 20, Mesgarani teaches a non-transitory computer-readable storage medium (See Mesgarani Fig 31 & ¶ [0246] lines 6-13, storage 3114) having embodied thereon a program executable by a processor for performing a method (See Mesgarani ¶ [0245] lines 9-13, program stored in memory and performed by device processor), comprising: receiving an input audio mixture (See Mesgarani Fig 3 & ¶ [0106] lines 3-12, step 310 obtaining mixed sound signal via a device) and transforming it into a time-frequency representation defined by values of time-frequency bins (See Mesgarani ¶ [0111] lines 3-10, input audio represented as a time-frequency mixture in an embedded space comprised of embedded time-frequency bins); mapping the values of time-frequency bins into a hyperbolic space by executing an embedding neural network (See Mesgarani ¶ [0111] lines 18-23, time-frequency mixture projected (mapping) into embedding space by a neural network) trained to associate (See Mesgarani ¶ [0117] lines 4-9, pre-trained deep neural networks (DNNs)) each time-frequency bin to a high-dimensional embedding and projecting each high-dimensional embedding into the hyperbolic space (See Mesgarani ¶ [0111] lines 3-10, input audio represented as a time-frequency mixture in an embedded space comprised of embedded time-frequency bins); and accepting a selection of at least a portion of the hyperbolic space and rendering, via a display (See Mesgarani Fig 31 & ¶ [0244], monitor 3120), selected hyperbolic embeddings falling within the selected portion of the hyperbolic space (See Mesgarani ¶ [0115] lines 1-8, a separated signal may be processed for amplification and output).
Mesgarani does not explicitly teach an output interface that includes a user operated selection tool able to select a portion of an output, the selection tool being movable and resizable allowing the user to customize selection of the output, and the selected portion corresponding to a sound class.
Short teaches an output interface that includes a user operated selection tool able to select a portion of an output, the selection tool being movable and resizable allowing the user to customize selection of the output, and the selected portion corresponding to a sound class (See Short Fig 59 & ¶ [0615], track editor GUI allowing a user to select displayed data by area via a drawing box or lasso. Data can be selected by tracklet, coherent group, or oscillator peak.).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the user selection interface taught by Short with the non-transitory computer-readable storage medium taught by Mesgarani. Doing so provides the user with greater control and customization, allowing for a more detailed and user-friendly experience.
Claims 6, 7, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Mesgarani et al. (U.S. Pub. No. 2019/0066713, hereinafter Mesgarani) in view of Short et al. (U.S. Pub. No. 2015/0287422, hereinafter Short) as applied to the claims above, and further in view of Raghavan et al. (U.S. Pub. No. 2021/0224633, hereinafter Raghavan).
Regarding claim 6, Mesgarani in view of Short teaches the audio processing system of claim 1 wherein the hyperbolic space is classified according to a hyperbolic geometry that carries a notion of classification hierarchy of audio sources based on locations of the hyperbolic embeddings with respect to an origin (See Mesgarani ¶ [0109] lines 3-16, creates non-overlapping segments within the embedded space and generates mask matrices corresponding to different groups of sound sources).
Mesgarani in view of Short does not explicitly teach the hyperbolic space using a Poincaré disk.
Raghavan teaches a neural network that uses a Poincaré disk (See Raghavan Fig 5C & ¶ [0083] lines 14-16, neural network using Poincaré disks for a self-organizing network).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have used a Poincaré disk as taught by Raghavan with the neural network taught by Mesgarani in view of Short. The Poincaré disk model is well known in the art for representing 2-dimensional hyperbolic geometry. Implementing the Poincaré disk model in the embedded space taught by Mesgarani would allow for a self-organizing network with a hyperbolic distribution.
Regarding claim 7, Mesgarani in view of Short and Raghavan teaches the audio processing system of claim 6, wherein a distance from the origin of the hyperbolic space to each hyperbolic embedding is used (See Mesgarani ¶ [0111] lines 22-36, attractor points used to compute distances from current embedded time-frequency bins to previous locations) to derive a measure of certainty of the processing, and the creating of the time-frequency mask (See Mesgarani ¶ [0111] lines 22-36, masks generated based on updated locations) is based on the measure of certainty of the processing (See Mesgarani ¶ [0085] lines 27-31, normalization calculations using standard deviation, which is a measure of certainty).
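For clarity of the record, the examiner notes that the relationship between an embedding's distance from the origin and a measure of certainty, as interpreted in this rejection, may be illustrated as follows (illustrative notation only, not drawn from the applied references):

```latex
d(0, y) = 2\,\operatorname{artanh}\big(\lVert y \rVert\big), \qquad \lVert y \rVert < 1
```

where $y$ is a hyperbolic embedding in the Poincaré disk and $d(0, y)$ is its hyperbolic distance from the origin; embeddings mapped farther from the origin (i.e., with $\lVert y \rVert$ approaching 1) correspond to greater certainty of assignment to a single class.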
Regarding claim 15, Mesgarani in view of Short teaches the audio processing system of claim 1, wherein the processor is further configured to receive a user input (See Mesgarani ¶ [0245] lines 27-31, user input).
Mesgarani in view of Short does not explicitly teach a user input that selects the size and shape of the hyperbolic space.
Raghavan teaches a user input that selects the size and shape of the hyperbolic space (See Raghavan ¶ [0122] line 1, user-defined growth parameters).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the user-defined parameters taught by Raghavan with the embedded space taught by Mesgarani in view of Short. Doing so allows a user to adjust the hyperbolic space as needed, giving the user greater control and the ability to modify it for specific uses.
Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Mesgarani et al. (U.S. Pub. No. 2019/0066713, hereinafter Mesgarani) in view of Short et al. (U.S. Pub. No. 2015/0287422, hereinafter Short) and Raghavan et al. (U.S. Pub. No. 2021/0224633, hereinafter Raghavan) as applied to the claims above, and further in view of Yu et al. (U.S. Pub. No. 2011/0010168, hereinafter Yu).
Regarding claim 8, Mesgarani in view of Short and Raghavan teaches the audio processing system of claim 6, wherein the processor is further configured to determine, based on a distance of a hyperbolic embedding from the origin of the Poincaré ball or the Poincaré disk, a measure of certainty of the hyperbolic embedding (See Mesgarani ¶ [0085] lines 27-31, normalization calculations using standard deviation, which is a measure of certainty) to belong to only a single specific audio class (See Mesgarani ¶ [0109] lines 3-16, creates non-overlapping segments within the embedded space and generates mask matrices corresponding to different groups of sound sources).
Mesgarani in view of Short and Raghavan does not explicitly teach the use of a classification hierarchy.
Yu teaches the use of a classification hierarchy (See Yu ¶ [0040] lines 12-17, classifying into audio classes based on confidence measure).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the classification hierarchy taught by Yu with the audio processing system taught by Mesgarani in view of Short and Raghavan. Doing so allows for a broader processing range to include both speech and instruments.
Regarding claim 9, Mesgarani in view of Short and Raghavan teaches the audio processing system of claim 6.
Mesgarani in view of Short and Raghavan does not explicitly teach the input audio being classified based on a measured confidence compared to a set threshold.
Yu teaches an input audio being classified based on a measured confidence compared to a set threshold (See Yu ¶ [0040] lines 10-18, input audio classified based on a fixed length with the confidence being compared to a set threshold).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the classification confidence/threshold taught by Yu with the audio processing system taught by Mesgarani in view of Short and Raghavan. Doing so allows for higher confidence levels for incoming audio signals and reduces uncertainty through the use of a threshold comparison.
Claims 12, 14, 16, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Mesgarani et al. (U.S. Pub. No. 2019/0066713, hereinafter Mesgarani) in view of Short et al. (U.S. Pub. No. 2015/0287422, hereinafter Short) as applied to the claims above, and further in view of Yu et al. (U.S. Pub. No. 2011/0010168, hereinafter Yu).
Regarding claim 12, Mesgarani in view of Short teaches the audio processing system of claim 1, wherein the embedding neural network is trained end-to-end with a classifier trained to classify the hyperbolic embeddings according to the classification hierarchy (See Mesgarani ¶ [0151] lines 8-12, a classifier can be implemented with the neural network for real-time processing).
Mesgarani in view of Short does not explicitly teach the use of a classification hierarchy.
Yu teaches the use of a classification hierarchy (See Yu ¶ [0040] lines 12-17, classifying into audio classes based on confidence measure).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the classification hierarchy taught by Yu with the audio processing system taught by Mesgarani in view of Short. Doing so allows for a broader processing range to include both speech and instruments.
Regarding claim 14, Mesgarani in view of Short teaches the audio processing system of claim 1, including training data comprising male and female speech (See Mesgarani ¶ [0230] lines 6-8, training data comprising male and female speech).
Mesgarani in view of Short does not explicitly teach the training data including music.
Yu teaches the training data including music (See Yu ¶ [0068] lines 14-21, training data including music comprising multiple instruments).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the addition of music as training data as taught by Yu with the audio processing system taught by Mesgarani in view of Short. Doing so allows for a broader processing range to include both speech and instruments.
Regarding claim 16, Mesgarani in view of Short and Yu teaches the audio processing system of claim 12, wherein the classifier segments the hyperbolic space using hyperbolic hyperplanes according to the classification hierarchy, and the output interface is further configured to create a T-F mask for each audio class based on the hyperbolic hyperplanes (See Mesgarani ¶ [0109] lines 3-16, creates non-overlapping segments within the embedded space and generates mask matrices corresponding to different groups of sound sources).
Regarding claim 17, Mesgarani in view of Short and Yu teaches the audio processing system of claim 16, wherein the output interface is further configured to generate an output signal for each audio class based on the T-F mask created for a corresponding audio class (See Mesgarani ¶ [0116] lines 27-31, separated input audio sources can be re-synthesized).
Response to Arguments
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TYLER LIEBGOTT whose telephone number is (703)756-1818. The examiner can normally be reached Mon-Fri 10-6:30 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Fan Tsang can be reached at (571)272-7547. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/T.M.L./Examiner, Art Unit 2694
/FAN S TSANG/Supervisory Patent Examiner, Art Unit 2694