Prosecution Insights
Last updated: April 19, 2026
Application No. 18/398,971

PROCESSING AND UTILIZING AUDIO SIGNALS BASED ON SPEECH SEPARATION

Status: Final Rejection (§103)
Filed: Dec 28, 2023
Examiner: WITHEY, THEODORE JOHN
Art Unit: 2655
Tech Center: 2600 — Communications
Assignee: Orcam Technologies Ltd.
OA Round: 2 (Final)
Grant Probability: 44% (Moderate)
Expected OA Rounds: 3-4
Median Time to Grant: 2y 11m
With Interview: 90%

Examiner Intelligence

Career Allow Rate: 44% (10 granted / 23 resolved; -18.5% vs TC avg)
Interview Lift: +46.9% for resolved cases with interview
Avg Prosecution: 2y 11m (39 currently pending)
Total Applications: 62 across all art units

Statute-Specific Performance

§101: 22.0% (-18.0% vs TC avg)
§103: 48.6% (+8.6% vs TC avg)
§102: 17.1% (-22.9% vs TC avg)
§112: 12.0% (-28.0% vs TC avg)
Tech Center averages are estimates. Based on career data from 23 resolved cases.

Office Action (§103)
DETAILED ACTION

This office action is in response to Applicant’s Amendment/Request for Reconsideration, received on 12/31/2025. Claims 1-2, 5-7, 10-12, and 15 have been amended. Claims 1-15 are pending and have been considered.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant’s arguments, see pgs. 6-8, filed 12/31/2025, with respect to the rejection(s) of claim(s) 1-4, 6-9, 11-14 under 35 U.S.C. 102(a)(2) have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Sivaraman et al. (US-20220084509-A1), hereinafter Sivaraman. Sivaraman discloses “a machine-learning architecture system that enhances the speech audio of a user-defined target speaker by suppressing interfering speakers, as well as background noise and reverberations. The machine-learning architecture includes a speech separation engine for separating the speech signal of a target speaker from a mixture of multiple speakers' speech, and a noise suppression engine for suppressing various types of noise in the input audio signal” (abstract), with specific reference to using a predetermined voiceprint for extracting further voiceprints from new audio signals (see Fig. 1B, Fig. 4). See updated rejections below.

Applicant’s arguments with respect to claim(s) 5, 10, 15 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-4, 6-9, 11-14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lindrup et al. (US-20250046330-A1), hereinafter Lindrup, in view of Sivaraman et al. (US-20220084509-A1), hereinafter Sivaraman.
Regarding claim 1, Lindrup discloses: a method performed in an environment of a user ([0024] “audio input signal” will be generally be construed to mean an electrical signal representing a sound from the sound environment, [0057] the audio device system is configured to enable the user to select between different methods for selecting the first sound signal, [Allowing a user to select sound signals, wherein those are gathered from an environment, indicates the user is in the environment (required for them to be able to hear the audio)]), a plurality of people is located in the environment ([0073] it can be assumed that the user of the audio device system is engaged in a conversation with the person represented by said sound source signal, [A conversation requires a plurality of people in a same environment]), the user having a mobile device used for obtaining user input ([0106] the system may additionally comprise at least one of a second audio device and an external device such as a smart phone, [0107] The audio device system 200 comprises an acoustical-electrical input transducer block (typically comprising two microphones) 201 and an analogue-digital converter (not shown for reasons of clarity), which provides an input signal), the user having at least one hearable device used for providing audio output to the user ([0025] the term “audio output signal” will generally be construed to mean an electrical signal representing a sound to be output by an electrical-acoustical output transducer of an audio device of an audio device system, [0116] The methods and systems of the present invention according to the disclosed embodiments may generally be implemented in both hearing aids and in audio devices that are not hearing aids (i.e. they do not comprise means for compensating a hearing loss), but nevertheless comprise both acoustical-electrical input transducers and electro-acoustical output transducers), the method comprising: capturing a noisy audio signal from the environment ([Fig. 1, 101], [0031] a plurality of sound source signals each representing a sound source of the present sound environment are provided, [0037] According to an embodiment the mixed audio signal may also comprise non-speech signals, [Wherein input of a plurality of sound source signals will clearly result in a mixed signal at the receiver, wherein non-speech signals track to noise in the context of improving speech intelligibility of Lindrup (see abstract). Further, see [0010] which discloses mixed audio typically comprising multiple speakers]); extracting from the noisy audio signal a first separate audio signal that represents a first voice ([0049] an own voice signal representing the voice of the audio device system user is detected and in response hereto the own voice signal is selected as the first sound signal), and a second separate audio signal that represents a second voice ([Fig. 1, 103], [0058] In a third step 103 of the method according to the present embodiment the speech content of the first sound signal is compared with the speech content of the provided sound source signals, [Comparing a first sound signal, previously defined to represent a voice, to a second, i.e. sound source signal, wherein sound source signals can each represent individual speakers derived from a mixed signal (see [0010]), indicating at least a second separate audio signal that represents a second voice]), combining the first and second separate audio signals to obtain an enhanced audio signal ([0099] an audio output is provided based on said output signal, wherein the contribution to the audio output from the remaining sound source signals is suppressed compared to the contribution from the output signal, [Suppressing contributions from remaining sound source signals, i.e. second signals, compared to the contribution of an output signal, i.e. that most closely corresponding to the first sound signal (see comparison definition of Step 103, [0062]-[0064]; therefore, representative of the first sound signal), indicates a combination of the two separate audio signals to obtain enhanced audio]); and, outputting to the user, via the at least one hearable device ([0116] The methods and systems of the present invention according to the disclosed embodiments may generally be implemented in both hearing aids and in audio devices that are not hearing aids (i.e. they do not comprise means for compensating a hearing loss), but nevertheless comprise both acoustical-electrical input transducers and electro-acoustical output transducers), the enhanced audio signal ([Fig. 1, 105], [Providing an output with sound suppression performed, i.e. enhanced, on remaining, i.e. not first, sound signals, wherein that output is defined to be sent to hearing aids which process output through transducers, indicating generated sound to the user]).

Lindrup does not disclose: said extracting is performed based on first and second acoustic fingerprints that correspond to the first and second voices, respectively.

Sivaraman discloses: said extracting is performed based on first and second acoustic fingerprints that correspond to the first and second voices, respectively ([0061] the analytics server 102 applies the speaker-embedding engine 126 on the complete target speaker utterance to extract a target voiceprint as the clean reference, [0064] The speech separation engine 122 is a neural network architecture comprising layers that identify or separate the target speaker's speech signal from a mixture of speech signals in the input audio signal. The speech separation engine 122 receives as inputs the input audio signal containing the mixture of speech signals and a target speaker embedding (sometimes called the “target voiceprint”), [Separating target speaker speech from a mixture signal based on a target speaker embedding indicates the extraction of the target speaker speech is performed based on respective acoustic fingerprints, wherein the “first” and “second” fingerprints/voices as claimed could be applied to the architecture of Sivaraman without a change in functionality as Sivaraman discloses mixture signals containing interfering speakers (also see [0064]), indicating multiple voices with multiple associated voiceprints in view of the plurality of voice speakers of Lindrup]).

Lindrup and Sivaraman are considered analogous art within target speech enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Lindrup to incorporate the teachings of Sivaraman, because of the novel way to perform dual tasks of speaker mixture separation and background noise suppression to enhance the perceptual quality of speech audio in a single-channel audio input containing a mixture of speakers based on a target speaker voiceprint (Sivaraman, [0013]).
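For readers less familiar with the technology at issue, the pipeline the rejection maps (capture a noisy mixture, extract the voices whose fingerprints are on file, combine them, output) can be pictured as below. This is a minimal sketch, not code from Lindrup or Sivaraman: separate_sources (the separation network), embed (the speaker-embedding engine), and the 0.7 match threshold are hypothetical stand-ins.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def extract_and_combine(noisy, fingerprints, separate_sources, embed,
                        threshold=0.7):
    """Keep only the separated sources whose voiceprint matches an
    enrolled acoustic fingerprint, then combine them."""
    sources = separate_sources(noisy)   # candidate single-speaker signals
    kept = [s for s in sources
            if max(cosine(embed(s), fp) for fp in fingerprints) >= threshold]
    # Combining the matched first and second voices; everything else
    # (noise, other talkers) is dropped, i.e. suppressed.
    return np.sum(kept, axis=0) if kept else noisy
```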
Regarding claim 2, Lindrup in view of Sivaraman discloses: the method of claim 1. Lindrup further discloses: wherein said extracting comprises using one or more models to extract from the noisy audio signal the first separate audio signal and the second separate audio signal ([0065] a sound source separation encoder-decoder neural network, [A neural network for sound source separation indicates that network to be a model for audio signal extraction, as required for separation]), the one or more models comprise at least one of: a generative model ([The examiner would like to note that due to the disjunctive nature of the claim, this element does not require a mapping]), a discriminative model ([Defining a “sound source separation” neural network indicates that it is discriminatory in nature, i.e. between sources]), or a beamforming based model ([The examiner would like to note that due to the disjunctive construction of the claim, not all elements require a mapping. Further, Lindrup defines using a beamformer for identifying sound source signals, indicating a beamforming model for audio separation, see ([0048])]).

Regarding claim 3, Lindrup in view of Sivaraman discloses: the method of claim 1. Lindrup further discloses: wherein the first and second acoustic fingerprints are retained in a database of pre-generated acoustic fingerprints ([0054] comparing the generated voice print to previously obtained voice prints that each are associated to a specific speaker, [Comparing newly generated voiceprints to previously obtained voiceprints indicates a required retaining of the obtained voiceprints for obtaining/comparing. The obtained voiceprints are “pre-generated” as compared to the newly generated prints. Lindrup defines the system as being able to be performed using a smart phone (see [0006], [0106]), indicating a database, i.e. the phone storage, for storing/retrieving previously generated voiceprints]).

Regarding claim 4, Lindrup in view of Sivaraman discloses: the method of claim 3. Lindrup further discloses: wherein the first and second acoustic fingerprints are pre-generated based on respective first and second audio records of corresponding entities ([0054] One implementation comprises generating a “voice print” of data derived from a given audio signal and comparing the generated voice print to previously obtained voice prints that each are associated to a specific speaker, whereby the speaker of said given audio signal may be identified as the person associated with said previously obtained voice print, [Comparing previously generated voiceprints to new voiceprints indicates the previously generated are “pre-generated” as compared to the newly generated voiceprints. Further, obtaining previously generated voiceprints which correspond to specific speakers indicates those speakers represent entities corresponding to the audio]), the first and second audio records comprising at least one of: past vocal communications with the user ([The ability to have a previously obtained voiceprint compared to a new voiceprint indicates the obtained voiceprint is from a past vocal communication with the user, i.e. an enrollment]); a designated enrollment audio ([The examiner would like to note that due to the disjunctive construction of the claim, these elements do not require mappings]); and a social media platform ([The examiner would like to note that due to the disjunctive construction of the claim, these elements do not require mappings]).
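Claims 3-4 turn on retaining pre-generated fingerprints and matching new audio against them. A toy enrollment-and-lookup flow under the same assumptions as the sketch above (embed and the 0.7 threshold are again hypothetical stand-ins, not from either reference):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class VoiceprintDB:
    """Retains pre-generated acoustic fingerprints keyed by speaker."""
    def __init__(self, embed):
        self.embed = embed
        self.prints = {}

    def enroll(self, speaker_id, audio):
        # Pre-generate a fingerprint from an audio record, e.g. a past
        # call with the user or a designated enrollment recording.
        self.prints[speaker_id] = self.embed(audio)

    def identify(self, audio, threshold=0.7):
        # Compare a freshly generated voiceprint to every retained print.
        emb = self.embed(audio)
        scored = {sid: cosine(fp, emb) for sid, fp in self.prints.items()}
        best = max(scored, key=scored.get, default=None)
        return best if best is not None and scored[best] >= threshold else None
```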
Regarding claim 6, Lindrup discloses: a computer program product comprising a non-transitory computer readable storage medium retaining program instructions ([0106] a second audio device and an external device such as a smart phone, [A smartphone will inherently have non-transitory storage media]), which program instructions when read by a processor ([A smartphone will inherently have a processor]), cause the processor to perform a method in an environment of a user ([0024] “audio input signal” will be generally be construed to mean an electrical signal representing a sound from the sound environment, [0057] the audio device system is configured to enable the user to select between different methods for selecting the first sound signal, [Allowing a user to select sound signals, wherein those are gathered from an environment, indicates the user is in the environment (required for them to be able to hear the audio)]), a plurality of people is located in the environment ([0073] it can be assumed that the user of the audio device system is engaged in a conversation with the person represented by said sound source signal, [A conversation requires a plurality of people in a same environment]), the user having a mobile device used for obtaining user input ([0106] the system may additionally comprise at least one of a second audio device and an external device such as a smart phone, [0107] The audio device system 200 comprises an acoustical-electrical input transducer block (typically comprising two microphones) 201 and an analogue-digital converter (not shown for reasons of clarity), which provides an input signal), the user having at least one hearable device used for providing audio output to the user ([0025] the term “audio output signal” will generally be construed to mean an electrical signal representing a sound to be output by an electrical-acoustical output transducer of an audio device of an audio device system, [0116] The methods and systems of the present invention according to the disclosed embodiments may generally be implemented in both hearing aids and in audio devices that are not hearing aids (i.e. they do not comprise means for compensating a hearing loss), but nevertheless comprise both acoustical-electrical input transducers and electro-acoustical output transducers), the method comprising: capturing a noisy audio signal from the environment ([Fig. 1, 101], [0031] a plurality of sound source signals each representing a sound source of the present sound environment are provided, [0037] According to an embodiment the mixed audio signal may also comprise non-speech signals, [Wherein input of a plurality of sound source signals will clearly result in a mixed signal at the receiver, wherein non-speech signals track to noise in the context of improving speech intelligibility of Lindrup (see abstract). Further, see [0010] which discloses mixed audio typically comprising multiple speakers]); extracting from the noisy audio signal a first separate audio signal that represents a first voice ([0049] an own voice signal representing the voice of the audio device system user is detected and in response hereto the own voice signal is selected as the first sound signal), and a second separate audio signal that represents a second voice ([Fig. 1, 103], [0058] In a third step 103 of the method according to the present embodiment the speech content of the first sound signal is compared with the speech content of the provided sound source signals, [Comparing a first sound signal, previously defined to represent a voice, to a second, i.e. sound source signal, wherein sound source signals can each represent individual speakers derived from a mixed signal (see [0010]), indicating at least a second separate audio signal that represents a second voice]); combining the first and second separate audio signals to obtain an enhanced audio signal ([0099] an audio output is provided based on said output signal, wherein the contribution to the audio output from the remaining sound source signals is suppressed compared to the contribution from the output signal, [Suppressing contributions from remaining sound source signals, i.e. second signals, compared to the contribution of an output signal, i.e. that most closely corresponding to the first sound signal (see comparison definition of Step 103, [0062]-[0064]; therefore, representative of the first sound signal), indicates a combination of the two separate audio signals to obtain enhanced audio]); and, outputting to the user, via the at least one hearable device ([0116] The methods and systems of the present invention according to the disclosed embodiments may generally be implemented in both hearing aids and in audio devices that are not hearing aids (i.e. they do not comprise means for compensating a hearing loss), but nevertheless comprise both acoustical-electrical input transducers and electro-acoustical output transducers), the enhanced audio signal ([Fig. 1, 105], [Providing an output with sound suppression performed, i.e. enhanced, on remaining, i.e. not first, sound signals, wherein that output is defined to be sent to hearing aids which process output through transducers, indicating generated sound to the user]).

Lindrup does not disclose: said extracting is performed based on first and second acoustic fingerprints that correspond to the first and second voices, respectively.

Sivaraman discloses: said extracting is performed based on first and second acoustic fingerprints that correspond to the first and second voices, respectively ([0061] the analytics server 102 applies the speaker-embedding engine 126 on the complete target speaker utterance to extract a target voiceprint as the clean reference, [0064] The speech separation engine 122 is a neural network architecture comprising layers that identify or separate the target speaker's speech signal from a mixture of speech signals in the input audio signal. The speech separation engine 122 receives as inputs the input audio signal containing the mixture of speech signals and a target speaker embedding (sometimes called the “target voiceprint”), [Separating target speaker speech from a mixture signal based on a target speaker embedding indicates the extraction of the target speaker speech is performed based on respective acoustic fingerprints, wherein the “first” and “second” fingerprints/voices as claimed could be applied to the architecture of Sivaraman without a change in functionality as Sivaraman discloses mixture signals containing interfering speakers (also see [0064]), indicating multiple voices with multiple associated voiceprints in view of the plurality of voice speakers of Lindrup]).

Lindrup and Sivaraman are considered analogous art within target speech enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Lindrup to incorporate the teachings of Sivaraman, because of the novel way to perform dual tasks of speaker mixture separation and background noise suppression to enhance the perceptual quality of speech audio in a single-channel audio input containing a mixture of speakers based on a target speaker voiceprint (Sivaraman, [0013]).

Regarding claim 7, Lindrup in view of Sivaraman discloses: the computer program product of claim 6. Lindrup further discloses: wherein said extracting comprises using one or more models to extract from the noisy audio signal the first separate audio signal and the second separate audio signal ([0065] a sound source separation encoder-decoder neural network, [A neural network for sound source separation indicates that network to be a model for audio signal extraction, as required for separation]), the one or more models comprise at least one of: a generative model ([The examiner would like to note that due to the disjunctive nature of the claim, this element does not require a mapping]), a discriminative model ([Defining a “sound source separation” neural network indicates that it is discriminatory in nature, i.e. between sources]), or a beamforming based model ([The examiner would like to note that due to the disjunctive construction of the claim, not all elements require a mapping. Further, Lindrup defines using a beamformer for identifying sound source signals, indicating a beamforming model for audio separation, see ([0048])]).

Regarding claim 8, Lindrup in view of Sivaraman discloses: the computer program product of claim 6. Lindrup further discloses: wherein the first and second acoustic fingerprints are retained in a database of pre-generated acoustic fingerprints ([0054] comparing the generated voice print to previously obtained voice prints that each are associated to a specific speaker, [Comparing newly generated voiceprints to previously obtained voiceprints indicates a required retaining of the obtained voiceprints for obtaining/comparing. The obtained voiceprints are “pre-generated” as compared to the newly generated prints. Lindrup defines the system as being able to be performed using a smart phone (see [0006], [0106]), indicating a database, i.e. the phone storage, for storing/retrieving previously generated voiceprints]).

Regarding claim 9, Lindrup in view of Sivaraman discloses: the computer program product of claim 8. Lindrup further discloses: wherein the first and second acoustic fingerprints are pre-generated based on respective first and second audio records of corresponding entities ([0054] One implementation comprises generating a “voice print” of data derived from a given audio signal and comparing the generated voice print to previously obtained voice prints that each are associated to a specific speaker, whereby the speaker of said given audio signal may be identified as the person associated with said previously obtained voice print, [Comparing previously generated voiceprints to new voiceprints indicates the previously generated are “pre-generated” as compared to the newly generated voiceprints. Further, obtaining previously generated voiceprints which correspond to specific speakers indicates those speakers represent entities corresponding to the audio]), the first and second audio records comprising at least one of: past vocal communications with the user ([The ability to have a previously obtained voiceprint compared to a new voiceprint indicates the obtained voiceprint is from a past vocal communication with the user, i.e. an enrollment]); a designated enrollment audio ([The examiner would like to note that due to the disjunctive construction of the claim, these elements do not require mappings]); and a social media platform ([The examiner would like to note that due to the disjunctive construction of the claim, these elements do not require mappings]).

Regarding claim 11, Lindrup discloses: an apparatus comprising a processor and coupled memory ([0106] a second audio device and an external device such as a smart phone, [Smartphones inherently have a processor coupled to memory]), the processor being adapted to perform in an environment of a user ([0024] “audio input signal” will be generally be construed to mean an electrical signal representing a sound from the sound environment, [0057] the audio device system is configured to enable the user to select between different methods for selecting the first sound signal, [Allowing a user to select sound signals, wherein those are gathered from an environment, indicates the user is in the environment (required for them to be able to hear the audio)]), a plurality of people is located in the environment ([0073] it can be assumed that the user of the audio device system is engaged in a conversation with the person represented by said sound source signal, [A conversation requires a plurality of people in a same environment]), the user having a mobile device used for obtaining user input ([0106] the system may additionally comprise at least one of a second audio device and an external device such as a smart phone, [0107] The audio device system 200 comprises an acoustical-electrical input transducer block (typically comprising two microphones) 201 and an analogue-digital converter (not shown for reasons of clarity), which provides an input signal), the user having at least one hearable device used for providing audio output to the user ([0025] the term “audio output signal” will generally be construed to mean an electrical signal representing a sound to be output by an electrical-acoustical output transducer of an audio device of an audio device system, [0116] The methods and systems of the present invention according to the disclosed embodiments may generally be implemented in both hearing aids and in audio devices that are not hearing aids (i.e. they do not comprise means for compensating a hearing loss), but nevertheless comprise both acoustical-electrical input transducers and electro-acoustical output transducers), the method comprising: capturing a noisy audio signal from the environment ([Fig. 1, 101], [0031] a plurality of sound source signals each representing a sound source of the present sound environment are provided, [0037] According to an embodiment the mixed audio signal may also comprise non-speech signals, [Wherein input of a plurality of sound source signals will clearly result in a mixed signal at the receiver, wherein non-speech signals track to noise in the context of improving speech intelligibility of Lindrup (see abstract). Further, see [0010] which discloses mixed audio typically comprising multiple speakers]); extracting from the noisy audio signal a first separate audio signal that represents a first voice ([0049] an own voice signal representing the voice of the audio device system user is detected and in response hereto the own voice signal is selected as the first sound signal), and a second separate audio signal that represents a second voice ([Fig. 1, 103], [0058] In a third step 103 of the method according to the present embodiment the speech content of the first sound signal is compared with the speech content of the provided sound source signals, [Comparing a first sound signal, previously defined to represent a voice, to a second, i.e. sound source signal, wherein sound source signals can each represent individual speakers derived from a mixed signal (see [0010]), indicating at least a second separate audio signal that represents a second voice]); combining the first and second separate audio signals to obtain an enhanced audio signal ([0099] an audio output is provided based on said output signal, wherein the contribution to the audio output from the remaining sound source signals is suppressed compared to the contribution from the output signal, [Suppressing contributions from remaining sound source signals, i.e. second signals, compared to the contribution of an output signal, i.e. that most closely corresponding to the first sound signal (see comparison definition of Step 103, [0062]-[0064]; therefore, representative of the first sound signal), indicates a combination of the two separate audio signals to obtain enhanced audio]); and, outputting to the user, via the at least one hearable device ([0116] The methods and systems of the present invention according to the disclosed embodiments may generally be implemented in both hearing aids and in audio devices that are not hearing aids (i.e. they do not comprise means for compensating a hearing loss), but nevertheless comprise both acoustical-electrical input transducers and electro-acoustical output transducers), the enhanced audio signal ([Fig. 1, 105], [Providing an output with sound suppression performed, i.e. enhanced, on remaining, i.e. not first, sound signals, wherein that output is defined to be sent to hearing aids which process output through transducers, indicating generated sound to the user]).

Lindrup does not disclose: said extracting is performed based on first and second acoustic fingerprints that correspond to the first and second voices, respectively.

Sivaraman discloses: said extracting is performed based on first and second acoustic fingerprints that correspond to the first and second voices, respectively ([0061] the analytics server 102 applies the speaker-embedding engine 126 on the complete target speaker utterance to extract a target voiceprint as the clean reference, [0064] The speech separation engine 122 is a neural network architecture comprising layers that identify or separate the target speaker's speech signal from a mixture of speech signals in the input audio signal. The speech separation engine 122 receives as inputs the input audio signal containing the mixture of speech signals and a target speaker embedding (sometimes called the “target voiceprint”), [Separating target speaker speech from a mixture signal based on a target speaker embedding indicates the extraction of the target speaker speech is performed based on respective acoustic fingerprints, wherein the “first” and “second” fingerprints/voices as claimed could be applied to the architecture of Sivaraman without a change in functionality as Sivaraman discloses mixture signals containing interfering speakers (also see [0064]), indicating multiple voices with multiple associated voiceprints in view of the plurality of voice speakers of Lindrup]).

Lindrup and Sivaraman are considered analogous art within target speech enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Lindrup to incorporate the teachings of Sivaraman, because of the novel way to perform dual tasks of speaker mixture separation and background noise suppression to enhance the perceptual quality of speech audio in a single-channel audio input containing a mixture of speakers based on a target speaker voiceprint (Sivaraman, [0013]).

Regarding claim 12, Lindrup in view of Sivaraman discloses: the apparatus of claim 11. Lindrup further discloses: wherein said extracting comprises using one or more models to extract from the noisy audio signal the first separate audio signal and the second separate audio signal ([0065] a sound source separation encoder-decoder neural network, [A neural network for sound source separation indicates that network to be a model for audio signal extraction, as required for separation]), the one or more models comprise at least one of: a generative model ([The examiner would like to note that due to the disjunctive nature of the claim, this element does not require a mapping]), a discriminative model ([Defining a “sound source separation” neural network indicates that it is discriminatory in nature, i.e. between sources]), or a beamforming based model ([The examiner would like to note that due to the disjunctive construction of the claim, not all elements require a mapping. Further, Lindrup defines using a beamformer for identifying sound source signals, indicating a beamforming model for audio separation, see ([0048])]).

Regarding claim 13, Lindrup in view of Sivaraman discloses: the apparatus of claim 11. Lindrup further discloses: wherein the first and second acoustic fingerprints are retained in a database of pre-generated acoustic fingerprints ([0054] comparing the generated voice print to previously obtained voice prints that each are associated to a specific speaker, [Comparing newly generated voiceprints to previously obtained voiceprints indicates a required retaining of the obtained voiceprints for obtaining/comparing. The obtained voiceprints are “pre-generated” as compared to the newly generated prints. Lindrup defines the system as being able to be performed using a smart phone (see [0006], [0106]), indicating a database, i.e. the phone storage, for storing/retrieving previously generated voiceprints]).

Regarding claim 14, Lindrup in view of Sivaraman discloses: the apparatus of claim 13. Lindrup further discloses: wherein the first and second acoustic fingerprints are pre-generated based on respective first and second audio records of corresponding entities ([0054] One implementation comprises generating a “voice print” of data derived from a given audio signal and comparing the generated voice print to previously obtained voice prints that each are associated to a specific speaker, whereby the speaker of said given audio signal may be identified as the person associated with said previously obtained voice print, [Comparing previously generated voiceprints to new voiceprints indicates the previously generated are “pre-generated” as compared to the newly generated voiceprints. Further, obtaining previously generated voiceprints which correspond to specific speakers indicates those speakers represent entities corresponding to the audio]), the first and second audio records comprising at least one of: past vocal communications with the user ([The ability to have a previously obtained voiceprint compared to a new voiceprint indicates the obtained voiceprint is from a past vocal communication with the user, i.e. an enrollment]); a designated enrollment audio ([The examiner would like to note that due to the disjunctive construction of the claim, these elements do not require mappings]); and a social media platform ([The examiner would like to note that due to the disjunctive construction of the claim, these elements do not require mappings]).

Claim(s) 5, 10, 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lindrup in view of Sivaraman, further in view of Gong et al. (US-12308035-B2), hereinafter Gong.

Regarding claim 5, Lindrup in view of Sivaraman discloses: the method of claim 1. Lindrup in view of Sivaraman does not disclose: wherein said extracting comprises attenuating the first voice, said attenuating comprises using beamforming model to attenuate a direction of arrival of the first voice, or removing the first separate audio signal from the enhanced audio signal.

Gong discloses: wherein said extracting comprises attenuating the first voice ([Col. 14, Lines 5-10] The one or more fixed beamformers (e.g., fixed beamformers 600) may include a plurality of filters configured to alter the magnitude and/or phase of the plurality of channels for specific signal-based criteria, [Using a beamformer to alter, i.e. reduce (see suppression of below element), the magnitude of a channel/signal, wherein the first voice signal of Lindrup could be used as the channel/signal of Gong without a change in functionality to Gong, indicates the beamformer is performing attenuation on the magnitude of the first voice signal]), said attenuating comprises using a beamforming model to attenuate a direction of arrival of the first voice ([Col. 13, Lines 45-55] In some implementations, the one or more fixed beamformers (e.g., fixed beamformers 600) may be configured to direct beams and/or nulls toward certain locations/directions. This may be beneficial when there is a known speaker or noise source in a certain position relative to the microphone array. It may also be useful when suppressing sounds from a certain position (i.e., receiving speech from one speaker while attenuating speech from another speaker in a different position within a monitored space), [Wherein the suppressed sound from a certain position could be that disclosed as a first voice signal in Lindrup without a change in functionality to the signal with which beamforming is applied of Gong]), or removing the first separate audio signal from the enhanced audio signal ([The examiner would like to note that due to the disjunctive nature of the claim, this element does not require a mapping; however, suppression can reasonably be understood to be a method of audio removal, i.e. if suppressed enough]).

Lindrup, Sivaraman, and Gong are considered analogous art within speech channel mixing/balancing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Lindrup in view of Sivaraman to incorporate the teachings of Gong, because of the novel way to use beamforming to alter the magnitude and/or phase of received signals based on directionality/contribution of signals, improving signal processing performance of multichannel systems (Gong, [Col. 13, Lines 55-67]).
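Gong's null steering can be illustrated with a toy narrowband example: delay-and-sum weights aimed at the desired talker, with the component that responds to the unwanted direction of arrival projected out. A minimal sketch; the array geometry, frequency, and angles are illustrative assumptions, not parameters from Gong.

```python
import numpy as np

def steering_vector(doa_deg, n_mics=4, spacing_m=0.04, freq_hz=4000.0, c=343.0):
    """Response of a uniform linear array to a plane wave from doa_deg."""
    delays = spacing_m * np.arange(n_mics) * np.sin(np.deg2rad(doa_deg)) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def null_steering_weights(look_deg, null_deg, **kw):
    """Delay-and-sum toward look_deg, then place a spatial null at null_deg
    by projecting the null direction's steering vector out of the weights."""
    a_look = steering_vector(look_deg, **kw)
    a_null = steering_vector(null_deg, **kw)
    w = a_look / a_look.size                      # plain delay-and-sum
    w -= (np.vdot(a_null, w) / np.vdot(a_null, a_null)) * a_null
    return w

w = null_steering_weights(look_deg=0.0, null_deg=40.0)
print(abs(np.vdot(w, steering_vector(40.0))))  # ~0: attenuated direction
print(abs(np.vdot(w, steering_vector(0.0))))   # ~1: look direction kept
```

As the rejection notes for the "removing" alternative, driving such a null deep enough is what makes suppression shade into removal of the unwanted signal.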
Regarding claim 10, Lindrup in view of Sivaraman discloses: the computer program product of claim 6. Lindrup in view of Sivaraman does not disclose: wherein said extracting comprises attenuating the first voice, said attenuating comprises using beamforming model to attenuate a direction of arrival of the first voice, or removing the first separate audio signal from the enhanced audio signal.

Gong discloses: wherein said extracting comprises attenuating the first voice ([Col. 14, Lines 5-10] The one or more fixed beamformers (e.g., fixed beamformers 600) may include a plurality of filters configured to alter the magnitude and/or phase of the plurality of channels for specific signal-based criteria, [Using a beamformer to alter, i.e. reduce (see suppression of below element), the magnitude of a channel/signal, wherein the first voice signal of Lindrup could be used as the channel/signal of Gong without a change in functionality to Gong, indicates the beamformer is performing attenuation on the magnitude of the first voice signal]), said attenuating comprises using a beamforming model to attenuate a direction of arrival of the first voice ([Col. 13, Lines 45-55] In some implementations, the one or more fixed beamformers (e.g., fixed beamformers 600) may be configured to direct beams and/or nulls toward certain locations/directions. This may be beneficial when there is a known speaker or noise source in a certain position relative to the microphone array. It may also be useful when suppressing sounds from a certain position (i.e., receiving speech from one speaker while attenuating speech from another speaker in a different position within a monitored space), [Wherein the suppressed sound from a certain position could be that disclosed as a first voice signal in Lindrup without a change in functionality to the signal with which beamforming is applied of Gong]), or removing the first separate audio signal from the enhanced audio signal ([The examiner would like to note that due to the disjunctive nature of the claim, this element does not require a mapping; however, suppression can reasonably be understood to be a method of audio removal, i.e. if suppressed enough]).

Lindrup, Sivaraman, and Gong are considered analogous art within speech channel mixing/balancing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Lindrup in view of Sivaraman to incorporate the teachings of Gong, because of the novel way to use beamforming to alter the magnitude and/or phase of received signals based on directionality/contribution of signals, improving signal processing performance of multichannel systems (Gong, [Col. 13, Lines 55-67]).

Regarding claim 15, Lindrup in view of Sivaraman discloses: the apparatus of claim 11. Lindrup in view of Sivaraman does not disclose: wherein said extracting comprises attenuating the first voice, said attenuating comprises using beamforming model to attenuate a direction of arrival of the first voice, or removing the first separate audio signal from the enhanced audio signal.

Gong discloses: wherein said extracting comprises attenuating the first voice ([Col. 14, Lines 5-10] The one or more fixed beamformers (e.g., fixed beamformers 600) may include a plurality of filters configured to alter the magnitude and/or phase of the plurality of channels for specific signal-based criteria, [Using a beamformer to alter, i.e. reduce (see suppression of below element), the magnitude of a channel/signal, wherein the first voice signal of Lindrup could be used as the channel/signal of Gong without a change in functionality to Gong, indicates the beamformer is performing attenuation on the magnitude of the first voice signal]), said attenuating comprises using a beamforming model to attenuate a direction of arrival of the first voice ([Col. 13, Lines 45-55] In some implementations, the one or more fixed beamformers (e.g., fixed beamformers 600) may be configured to direct beams and/or nulls toward certain locations/directions. This may be beneficial when there is a known speaker or noise source in a certain position relative to the microphone array. It may also be useful when suppressing sounds from a certain position (i.e., receiving speech from one speaker while attenuating speech from another speaker in a different position within a monitored space), [Wherein the suppressed sound from a certain position could be that disclosed as a first voice signal in Lindrup without a change in functionality to the signal with which beamforming is applied of Gong]), or removing the first separate audio signal from the enhanced audio signal ([The examiner would like to note that due to the disjunctive nature of the claim, this element does not require a mapping; however, suppression can reasonably be understood to be a method of audio removal, i.e. if suppressed enough]).

Lindrup, Sivaraman, and Gong are considered analogous art within speech channel mixing/balancing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Lindrup in view of Sivaraman to incorporate the teachings of Gong, because of the novel way to use beamforming to alter the magnitude and/or phase of received signals based on directionality/contribution of signals, improving signal processing performance of multichannel systems (Gong, [Col. 13, Lines 55-67]).

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a).
Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Fukuda et al. (US-20240232312-A1) discloses “An authentication device includes an acquisition unit configured to acquire a voice signal of a speaker; a detection unit configured to detect a first utterance period during which the speaker is speaking; and an authentication unit configured to authenticate the speaker based on a comparison between a voice signal of the first utterance period and a database. The detection unit detects a second utterance period different from the first utterance period when the authentication unit determines that the speaker authentication is impossible, and the authentication unit authenticates the speaker based on a comparison between the voice signal of the first utterance period and a voice signal of the second utterance period, and the database” (abstract). See entire document.

Lesso et al. (US-20180040323-A1) discloses “This application describes methods and apparatus for speaker recognition. An apparatus according to an embodiment has an analyzer (202) for analyzing each frame of a sequence of frames of audio data (A.sub.IN) which correspond to speech sounds uttered by a user to determine at least one characteristic of the speech sound of that frame. An assessment module (203) determines, for each frame of audio data, a contribution indicator of the extent to which the frame of audio data should be used for speaker recognition processing based on the determined characteristic of the speech sound. In this way frames which correspond to speech sounds that are of most use for speaker discrimination may be emphasized and/or frames which correspond to speech sounds that are of least use for speaker discrimination may be de-emphasized” (abstract). See entire document.

Xu et al. (US-20240071392-A1) discloses “This application provides an upgrade method including: An electronic device acquires a first verification voice entered by a user; processes the first verification voice by using a first model stored in the electronic device, to obtain a first voiceprint feature; verifies an identity of the user based on the first voiceprint feature and a first user feature template stored in the electronic device; after the identity of the user is verified, if the electronic device has received a second model, processes the first verification voice by using the second model, to obtain a second voiceprint feature; and updates the first user feature template based on the second voiceprint feature, and updates the first model by using the second model” (abstract). See entire document.

Xu et al. (US-20220270627-A1) discloses “The present disclosure relates to a method and an apparatus for audio processing and a storage medium. The method includes: obtaining an audio mixing feature of a target object, in which the audio mixing feature at least includes: a voiceprint feature and a pitch feature of the target object; and determining a target audio matching with the target object in the mixed audio according to the audio mixing feature” (abstract). See entire document.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to THEODORE JOHN WITHEY whose telephone number is (703)756-1754. The examiner can normally be reached Monday - Friday, 8am-5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/THEODORE WITHEY/
Examiner, Art Unit 2655

/ANDREW C FLANDERS/
Supervisory Patent Examiner, Art Unit 2655

Prosecution Timeline

Dec 28, 2023: Application Filed
Aug 25, 2025: Non-Final Rejection (§103)
Dec 31, 2025: Response Filed
Feb 10, 2026: Final Rejection (§103) (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591744: METHOD FOR TRAINING SEMANTIC REPRESENTATION MODEL, DEVICE AND STORAGE MEDIUM (granted Mar 31, 2026; 2y 5m to grant)
Patent 12536994: APPARATUS FOR CLASSIFYING SOUNDS BASED ON NEURAL CODE IN SPIKING NEURAL NETWORK AND METHOD THEREOF (granted Jan 27, 2026; 2y 5m to grant)
Patent 12475330: METHOD FOR IDENTIFYING NOISE SAMPLES, ELECTRONIC DEVICE, AND STORAGE MEDIUM (granted Nov 18, 2025; 2y 5m to grant)
Patent 12417759: SPEECH RECOGNITION USING CADENCE PATTERNS (granted Sep 16, 2025; 2y 5m to grant)
Patent 12412580: Sound Extraction System and Sound Extraction Method (granted Sep 09, 2025; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 44%
With Interview: 90% (+46.9%)
Median Time to Grant: 2y 11m
PTA Risk: Moderate
Based on 23 resolved cases by this examiner. Grant probability derived from career allow rate.
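The headline numbers can be approximately reproduced from the career data above. A minimal sketch, assuming the interview lift is additive percentage points on top of the career allow rate; the exact rounding and formulas are not published here (10/23 is 43.5%, displayed as 44%), so treat this as an illustration rather than the tool's actual method:

```python
# Assumed derivation of the projections from the examiner's career data.
granted, resolved = 10, 23
baseline = granted / resolved               # 0.4348, displayed as 44%
interview_lift = 0.469                      # +46.9 points with an interview
with_interview = baseline + interview_lift  # 0.9038, displayed as 90%

print(f"baseline: {baseline:.1%}")               # baseline: 43.5%
print(f"with interview: {with_interview:.1%}")   # with interview: 90.4%
```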
