Prosecution Insights
Last updated: April 19, 2026
Application No. 18/069,618

SPEECH FILTER FOR SPEECH PROCESSING

Final Rejection (§103)

Filed: Dec 21, 2022
Examiner: BECKER, TYLER JUSTIN
Art Unit: 2657
Tech Center: 2600 — Communications
Assignee: Qualcomm Incorporated
OA Round: 4 (Final)

Grant Probability: 74% (Favorable)
Projected OA Rounds: 5-6
Projected Time to Grant: 2y 10m
Grant Probability with Interview: 93%

Examiner Intelligence

Career Allow Rate: 74% (above average; 14 granted / 19 resolved; +11.7% vs TC avg)
Interview Lift: +19.0% (allowance rate of resolved cases with an interview vs. without; a strong lift)
Typical Timeline: 2y 10m average prosecution; 22 applications currently pending
Career History: 41 total applications across all art units
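The headline percentages in this panel follow from simple arithmetic on the career counts reported above. A minimal sketch, assuming the usual dashboard definitions (allow rate = grants / resolved cases; interview lift = grant probability with an interview minus the baseline); the function names are our own:

```python
# Illustrative check of the examiner statistics reported above.
# The counts (14 granted of 19 resolved) and the 93% / 74% grant
# probabilities come from this report; everything else is invented
# for exposition.

def allow_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage."""
    return 100.0 * granted / resolved

career = allow_rate(14, 19)
print(f"Career allow rate: {career:.1f}%")    # 73.7%, displayed as 74%

# Interview lift: grant probability with an interview vs. baseline.
lift = 93.0 - 74.0
print(f"Interview lift: +{lift:.1f} points")  # +19.0, matching the panel
```

The rounding explains why the panel shows 74% while the raw ratio is 73.7%.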

Statute-Specific Performance

Performance by statute (vs. Tech Center average estimate):

§101: 23.1% (-16.9% vs TC avg)
§103: 45.4% (+5.4% vs TC avg)
§102: 14.9% (-25.1% vs TC avg)
§112: 16.7% (-23.3% vs TC avg)

Based on career data from 19 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

The amendments filed December 17th, 2025 have been entered. Claims 1, 2, 19, 27, and 28 have been amended. Claims 1-28 are pending and have been examined.

Response to Arguments

Applicant's arguments with respect to claim(s) 1-28 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:

(A) the claim limitation uses the term "means" or "step" or a term used as a substitute for "means" that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;

(B) the term "means" or "step" or the generic placeholder is modified by functional language, typically, but not always, linked by the transition word "for" (e.g., "means for") or another linking word or phrase, such as "configured to" or "so that"; and

(C) the term "means" or "step" or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word "means" (or "step") in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word "means" (or "step") in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material, or acts to entirely perform the recited function.

Claim limitations in this application that use the word "means" (or "step") are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word "means" (or "step") are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

As such, the language "means for obtaining…first speech signature data associated with the first person" and "means for obtaining … second speech signature data associated with at least one second person" in claim 28 is being interpreted under 35 U.S.C. 112(f) as referring to paragraph [0156] of the specification: "For example, the means for obtaining the first speech signature data can correspond to the device 102, the processor(s) 190, the audio analyzer 140, the audio preprocessor 118, the speech input filter(s) 120, the first stage speech processor 124, the speaker detector 128, the integrated circuit 802, the processor 1806, the processor(s) 1810, one or more other circuits or components configured to obtain the speech signature data, or any combination thereof."

Additionally, the language "means for selectively enabling" in claim 28 is being interpreted under 35 U.S.C. 112(f) as referring to paragraph [0157] of the specification: "For example, the means for selectively enabling the speaker-specific speech input filter can correspond to the device 102, the processor(s) 190, the audio analyzer 140, the audio preprocessor 118, the speech input filter(s) 120, the first stage speech processor 124, the speaker detector 128, the integrated circuit 802, the processor 1806, the processor(s) 1810, one or more other circuits or components configured to selectively enable a speaker-specific speech input filter, or any combination thereof."

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C.
102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 7, 14-22, and 27-28 are rejected under 35 U.S.C. 103 as being unpatentable over McLaren et al. (US Patent No. 10,476,872 B2, hereinafter McLaren), in view of Norton et al. (US Pat. Pub. No. 2023/0154460 A1, hereinafter Norton), Miller, Gregory (US Pub. No. 2019/0080698 A1, hereinafter Miller), and Eskimez et al. (US Pat. Pub. No. 2023/0116052 A1, hereinafter Eskimez).

Regarding claim 1, McLaren discloses one or more processors configured to (McLaren, Col. 12, lines 39-42: "The illustrative computing device 610 includes at least one processor 612 (e.g. a microprocessor, microcontroller, digital signal processor, etc.), memory 614, and an input/output (I/O) subsystem 616."): based on detection of a wake word in an utterance from a first person, obtain first speech signature data associated with the first person (McLaren, Col. 2, lines 43-48: "a spoken command analyzer module is configured to, given a sample of human speech, jointly analyze speech information extracted from the sample to determine whether the analyzed speech includes specific content, such as a command, and to verify the identity of the human speaker of the speech.").

However, McLaren fails to expressly recite obtain, based on configuration data that indicates at least one second person that has permissions associated with the first person, second speech signature data associated with the at least one second person, the permissions indicating that the at least one second person is approved to interrupt a voice session associated with the first person; and selectively enable a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data, wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data; a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data; and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person.

Norton teaches obtain, based on configuration data that indicates at least one second person that has permissions associated with the first person, second speech signature data associated with the at least one second person, the permissions indicating that the at least one second person is approved to interrupt a voice session associated with the first person (Norton, Fig. 2, 216; [0061]: "The rankings may be set and altered by a primary user of the DA device 100. Primary users may, by default, be ranked higher than users having lower authority levels.
In addition, unrecognized users may be allocated a lower ranking than all known users present in the hierarchical table."; [0057]: "At step 216, the one or more processors determine whether the source of the voice command, as identified, has authority to override or supersede the pre-existing function. This determination may be based on whether the source of the voice command has priority over a user that is the source of the pre-existing function.").

McLaren and Norton are analogous arts because they belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren to incorporate the teachings of Norton to obtain speech signature data for a second person, including permissions indicating approval to interrupt the first person's voice session. This allows the system to resolve conflicting commands from multiple users (Norton, [0028]). As such, the system will provide a better user experience for the primary user.

However, McLaren, in view of Norton, fails to expressly recite selectively enable a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data.

Miller teaches selectively enable a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data (Miller, [0012]: "The user privileges may allow John Smith to interact with the voice assistant system, such as to issue commands, extract information (e.g., ask questions and obtain responses, etc.), play media, and/or perform other actions with the voice assistant system which may be unavailable to people who are not introduced to the voice assistant system by a trusted user and/or do not have privileges associated with a user profile and associated with the particular voice assistant system, but may have privileges associated with a different voice assistant system.").

McLaren, Norton, and Miller are analogous arts because they each belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton, to incorporate the teachings of Miller to filter the inputs based on speech signature data for both users. This allows the system to filter out commands from unknown or untrusted users (Miller, [0012]). This filtering ensures that the system does not perform actions for people who may be unknown or untrusted by the system's primary user.
However, McLaren, in view of Norton and Miller, fails to expressly recite wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data; a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data; and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person.

Eskimez teaches wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data (Eskimez, Fig. 3, 4001a-4001f; [0042]: "Input audio 112 is provided to an STFT block 302, and then a series of N encoder blocks 4001a-4001f, where N=6 is indicated, encoder block 4001a is the initial encoder block and encoder block 4001f is the final encoder block."); a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data (Eskimez, Fig. 3, 306; [0042]: "The speaker embeddings 308 (d-vector) are extracted from enrollment data for target speaker 102 (and also, in some examples, target speaker 102a) and concatenated with the output from final encoder block 4001f."; [0031]: "A microphone array 200 captures input audio 112, which may include a mixture of speech data 112a of a first target speaker 102, speech data 112b of a second target speaker 102a, speech data 112c of one or more interfering speakers 104 (together, a set of speakers 106), and background noise 112d from a noise source 108."; Here, the speaker embeddings are seen as speech signature data.); and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person (Eskimez, Fig. 3, 114; [0051]: "After predicting the real and imaginary parts of the mask (real mask 310, an imaginary mask 312, respectively), the complex ratio mask is applied to the original real and imaginary parts of the noisy spectrogram (e.g., the output of decoder block 4002a) to produce output data 114."; [0031]: "PSE is performed by a trained PSE model 110, which outputs output data 114, which includes estimated clean speech data of enrolled target speaker(s), such as target speaker 102 and (in some examples) also target speaker 102a.").

McLaren, Norton, Miller, and Eskimez are analogous arts because they belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton and the privilege administration system of Miller, to incorporate the teachings of Eskimez to filter speech with multiple target speakers. This allows background noise to be filtered out effectively even when two speakers are speaking together (Eskimez, [0030]).
This helps ensure that the audio can be cleaned of background noise without accidentally filtering out intended speech.

Regarding claim 2, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. Norton further teaches wherein the one or more processors are further configured to: based on detection of the wake word in an utterance from a particular second person of the at least one second person, obtain second particular speech signature data associated with the particular second person (Norton, Fig. 2, 202 and 206; [0041]: "At step 202, a triggering event occurs that prompts the DA device 100 to transition from a sleep mode to a wake mode. The triggering event may be detection, via the one or more microphones 106, of a designated sound and/or sound pattern that is used as a signal to wake the DA device 100 from a low power sleep mode."; [0044]: "At step 206, the one or more processors identify a source of the voice command. As stated above, reference to one or more processors in the description of this method 200 may represent the one or more processors 118 of the DA device 100 and/or one or more processors of at least one external device communicatively connected to the DA device 100. The source of the voice command is the particular person or entity that spoke the voice command to perform the first function. In an embodiment, the source of the voice command is determined based on voice recognition analysis of the voice command. For example, one or more processors analyze audio characteristics of the voice command to associate the voice command with the voice of a particular user (e.g., person). The audio characteristics may include intonation, speed of speech, frequency of sounds, amplitude of sounds, patterns of sounds, patterns of words, and/or the like."). The same motivation for claim 1 applies equally to claim 2.
However, McLaren, in view of Norton and Miller, fails to expressly recite selectively enable a second speaker-specific speech input filter that is based on the second particular speech signature data responsive to the configuration data not indicating that the first person is approved to interrupt a voice session associated with the particular second person.

Eskimez further teaches selectively enable a second speaker-specific speech input filter that is based on the second particular speech signature data responsive to the configuration data not indicating that the first person is approved to interrupt a voice session associated with the particular second person (Eskimez, Fig. 1; [0031]: "A microphone array 200 captures input audio 112, which may include a mixture of speech data 112a of a first target speaker 102, speech data 112b of a second target speaker 102a, speech data 112c of one or more interfering speakers 104 (together, a set of speakers 106), and background noise 112d from a noise source 108."; "PSE is performed by a trained PSE model 110, which outputs output data 114, which includes estimated clean speech data of enrolled target speaker(s), such as target speaker 102 and (in some examples) also target speaker 102a."). The same motivation for claim 1 applies equally to claim 2.

Regarding claim 3, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren further discloses wherein obtaining the first speech signature data comprises selecting the first speech signature data from a set of speech signature data associated with a plurality of persons based on comparison of features of the utterance to enrollment data (McLaren, Col. 4, lines 10-15: "A front end module 112 of the spoken command analyzer system 110 uses training data (one or more speech samples collected from the user) to create and store one or more joint content and speaker models 126 of the training data. This can be done during an enrollment process or passively during normal use of the user's device, for example.").

Regarding claim 4, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren further discloses wherein the speaker-specific speech input filter is configured to separate speech of the first person from speech of one or more other persons and to provide the speech of the first person to one or more voice assistant applications (McLaren, Col. 4, lines 25-28: "the spoken command analyzer 110 can distinguish between the two speakers by comparing portions of the audio signal that correspond to similar phonetic units, and thereby more accurately authenticate users."; McLaren, Col. 4, lines 35-37: "Based on the comparison of the current model 124 to the stored speech model(s) 126, the back end module 120 outputs a joint command and speaker determination 138.").

Regarding claim 5, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. Eskimez further teaches wherein the speaker-specific speech input filter is configured to remove or attenuate, from audio data, sounds that are not associated with speech from the first person or speech from the second person (Eskimez, Fig. 1; [0031]: "A microphone array 200 captures input audio 112, which may include a mixture of speech data 112a of a first target speaker 102, speech data 112b of a second target speaker 102a, speech data 112c of one or more interfering speakers 104 (together, a set of speakers 106), and background noise 112d from a noise source 108."; "PSE is performed by a trained PSE model 110, which outputs output data 114, which includes estimated clean speech data of enrolled target speaker(s), such as target speaker 102 and (in some examples) also target speaker 102a."). The same motivation for claim 1 applies equally to claim 5.

Regarding claim 7, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. Eskimez further teaches wherein the speaker-specific speech input filter is configured to de-emphasize speech from a third person distinct from the first person and the at least one second person (Eskimez, Fig. 1; [0031]: "A microphone array 200 captures input audio 112, which may include a mixture of speech data 112a of a first target speaker 102, speech data 112b of a second target speaker 102a, speech data 112c of one or more interfering speakers 104 (together, a set of speakers 106), and background noise 112d from a noise source 108."; "PSE is performed by a trained PSE model 110, which outputs output data 114, which includes estimated clean speech data of enrolled target speaker(s), such as target speaker 102 and (in some examples) also target speaker 102a.").
The same motivation for claim 1 applies equally to claim 7.

Regarding claim 14, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren further discloses wherein the one or more processors are integrated into a vehicle (McLaren, Col. 11, lines 35-37: "a self-driving car configured to perform joint command and speaker identification as described herein.").

Regarding claim 15, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren further discloses wherein the one or more processors are integrated into at least one of a smart speaker, a speaker bar, a smart phone, a cellular phone, a laptop computer, a computer, a tablet, a personal digital assistant, a display device, a television, a gaming console, a music player, a radio, a digital video player, a tuner, a camera, a navigation device, a headset, an augmented reality headset, a mixed reality headset, a virtual reality headset, a home automation system, a voice-activated device, a wireless speaker and voice activated device, a portable electronic device, a communication device, an internet-of-things (IoT) device, an extended reality (XR) device, a base station, or a mobile device (McLaren, Col. 11, lines 22-27: "One such application is the voice control of devices in the home or office. Such 'smart' devices may comprise the so-called 'Internet of Things' (IoT) where a user may speak commands to various devices to effect different actions for various devices and without worry of other people being able to effect such actions.").

Regarding claim 16, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren further discloses a microphone configured to capture sound including the utterance from the first person (McLaren, Col. 3, line 66 - Col. 4, line 2: "The computing system 100 receives data such as audio input signals including samples of natural language (NL) speech 130, from time to time, through an audio capture/acquisition device such as a microphone.").

Regarding claim 17, the rejection of claim 1 is incorporated.
McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren further discloses a modem configured to send data associated with the utterance from the first person to a remote voice assistant server (McLaren, Col. 12, lines 20-23: "The computing system 600 includes a user computing device 610, which may be in communication with one or more other computing systems or devices 660 via one or more networks 650.").

Regarding claim 18, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren further discloses one or more audio transducers configured to output sound corresponding to a voice assistant response to the first person (McLaren, Col. 12, lines 63-66: "the user interface subsystem 624, which includes one or more user input devices (e.g., a touchscreen, keyboard, virtual keypad, etc.) and one or more output devices (e.g., speakers, displays, LEDs, etc.).").

Regarding claim 19, McLaren discloses a method comprising: based on detection of a wake word in an utterance from a first person, obtaining first speech signature data associated with the first person (McLaren, Col. 2, lines 43-48: "a spoken command analyzer module is configured to, given a sample of human speech, jointly analyze speech information extracted from the sample to determine whether the analyzed speech includes specific content, such as a command, and to verify the identity of the human speaker of the speech.").
However, McLaren fails to expressly recite obtaining, based on configuration data that indicates at least one second person that has permissions associated with the first person, second speech signature data associated with the at least one second person, the permissions indicating that the at least one second person is approved to interrupt a voice session associated with the first person; and selectively enabling a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data, wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data; a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data; and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person.

Norton teaches obtaining, based on configuration data that indicates at least one second person that has permissions associated with the first person, second speech signature data associated with the at least one second person, the permissions indicating that the at least one second person is approved to interrupt a voice session associated with the first person (Norton, Fig. 2, 216; [0061]: "The rankings may be set and altered by a primary user of the DA device 100. Primary users may, by default, be ranked higher than users having lower authority levels. In addition, unrecognized users may be allocated a lower ranking than all known users present in the hierarchical table."; [0057]: "At step 216, the one or more processors determine whether the source of the voice command, as identified, has authority to override or supersede the pre-existing function. This determination may be based on whether the source of the voice command has priority over a user that is the source of the pre-existing function.").

McLaren and Norton are analogous arts because they belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren to incorporate the teachings of Norton to obtain speech signature data for a second person, including permissions indicating approval to interrupt the first person's voice session. This allows the system to resolve conflicting commands from multiple users (Norton, [0028]). As such, the system will provide a better user experience for the primary user.

However, McLaren, in view of Norton, fails to expressly recite selectively enabling a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data.

Miller teaches selectively enabling a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data (Miller, [0012]: "The user privileges may allow John Smith to interact with the voice assistant system, such as to issue commands, extract information (e.g., ask questions and obtain responses, etc.), play media, and/or perform other actions with the voice assistant system which may be unavailable to people who are not introduced to the voice assistant system by a trusted user and/or do not have privileges associated with a user profile and associated with the particular voice assistant system, but may have privileges associated with a different voice assistant system.").

McLaren, Norton, and Miller are analogous arts because they each belong to the same field of endeavor of speech processing.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton, to incorporate the teachings of Miller to filter the inputs based on speech signature data for both users. This allows the system to filter out commands from unknown or untrusted users (Miller, [0012]). This filtering ensures that the system does not perform actions for people who may be unknown or untrusted by the system’s primary user. However, McLaren, in view of Norton and Miller, fails to expressly recite wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data; a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data; and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person. Eskimez teaches wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data (Eskimez, Fig. 3, 4001a-4001f; [0042]: “Input audio 112 is provided to an STFT block 302, and then a series of N encoder blocks 4001a-4001f, where N=-6 is indicated, encoder block 4001a is the initial encoder block and encoder block 4001a is the final encoder block.”); a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data (Eskimez, fig. 
3, 306; [0042]: “The speaker embeddings 308 (d-vector) are extracted from enrollment data for target speaker 102 (and also, in some examples, target speaker 102a) and concatenated with the output from final encoder block 4001f.”; [0031]: “A microphone array 200 captures input audio 112, which may include a mixture of speech data 112a of a first target speaker 102, speech data 112b of a second target speaker 102a, speech data 112c of one or more interfering speakers 104 (together, a set of speakers 106), and background noise 112d from a noise source 108.”; Here, the speaker embeddings are seen as speech signature data.); and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person (Eskimez, fig. 3, 114; [0051]: “After predicting the real and imaginary parts of the mask (real mask 310, an imaginary mask 312, respectively), the complex ratio mask is applied to the original real and imaginary parts of the noisy spectrogram (e.g., the output of decoder block 4002a) to produce output data 114.”; [0031]: “PSE is performed by a trained PSE model 110, which outputs output data 114, which includes estimated clean speech data of enrolled target speaker(s), such as target speaker 102 and (in some examples) also target speaker 102a.”). McLaren, Norton, Miller, and Eskimez are analogous arts because they belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton and the privilege administration system of Miller, to incorporate the teachings of Eskimez to filter speech with multiple target speakers. This allows background noise to be filtered out effectively even when two speakers are speaking together (Eskimez, [0030]). 
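The pipeline quoted from Eskimez above — STFT features passed through an encoder stack (dimensional reduction), concatenation of the output with speaker embeddings (the combiner), and a decoder (dimensional expansion) that predicts a complex ratio mask — can be sketched minimally as follows. This is an illustrative, untrained toy with random weights; all shapes, layer sizes, and function names are assumptions, not the actual Eskimez model.

```python
# Illustrative (untrained) sketch of the encoder / embedding-concatenation /
# decoder pipeline described for Eskimez; every dimension here is hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def encoder(spec_feats, out_dim=32):
    """Dimensional reduction: project spectrogram frames to a latent space."""
    w = rng.standard_normal((spec_feats.shape[-1], out_dim)) * 0.1
    return np.tanh(spec_feats @ w)

def combiner(latent, d_vector_1, d_vector_2):
    """Concatenate each latent frame with the two speaker embeddings."""
    frames = latent.shape[0]
    tiled = np.tile(np.concatenate([d_vector_1, d_vector_2]), (frames, 1))
    return np.concatenate([latent, tiled], axis=-1)

def decoder(combined, n_bins):
    """Dimensional expansion: predict real and imaginary mask components."""
    w = rng.standard_normal((combined.shape[-1], 2 * n_bins)) * 0.1
    out = combined @ w
    return out[:, :n_bins], out[:, n_bins:]  # real mask, imaginary mask

n_frames, n_bins, emb_dim = 10, 64, 16
noisy_spec = (rng.standard_normal((n_frames, n_bins))
              + 1j * rng.standard_normal((n_frames, n_bins)))
d_vec_a = rng.standard_normal(emb_dim)  # stand-in for first speech signature
d_vec_b = rng.standard_normal(emb_dim)  # stand-in for second speech signature

latent = encoder(np.abs(noisy_spec))
combined = combiner(latent, d_vec_a, d_vec_b)
mask_r, mask_i = decoder(combined, n_bins)

# Apply the complex ratio mask to the noisy spectrogram (complex multiply).
enhanced = (mask_r + 1j * mask_i) * noisy_spec
print(enhanced.shape)  # (10, 64)
```

With trained weights, the masked output would retain the two enrolled speakers and suppress interfering speech and noise; here the mask is random and only the data flow is meaningful.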
This helps ensure that the audio can be cleaned of background noise without accidentally filtering out intended speech. Regarding claim 20, the rejection of claim 19 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren further discloses wherein obtaining the first speech signature data comprises selecting the first speech signature data from a set of speech signature data associated with a plurality of persons based on comparison of features of the utterance to enrollment data (McLaren, Col. 4, lines 10-15: "A front end module 112 of the spoken command analyzer system 110 uses training data (one or more speech samples collected from the user) to create and store one or more joint content and speaker models 126 of the training data. This can be done during an enrollment process or passively during normal use of the user's device, for example."). Regarding claim 21, the rejection of claim 19 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren further discloses separating, by the speaker-specific speech input filter, speech of the first person from speech of one or more other persons; and providing the speech of the first person to one or more voice assistant applications (McLaren, Col. 4, lines 25-28: "the spoken command analyzer 110 can distinguish between the two speakers by comparing portions of the audio signal that correspond to similar phonetic units, and thereby more accurately authenticate users."; McLaren, Col. 4, lines 35-37: "Based on the comparison of the current model 124 to the stored speech model(s) 126, the back end module 120 outputs a joint command and speaker determination 138."). Regarding claim 22, the rejection of claim 19 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. 
Eskimez further teaches removing or attenuating, by the speaker-specific speech input filter, sounds from audio data that are not associated with speech from the first person or speech from the second person (Eskimez, fig. 1; [0031]: “A microphone array 200 captures input audio 112, which may include a mixture of speech data 112a of a first target speaker 102, speech data 112b of a second target speaker 102a, speech data 112c of one or more interfering speakers 104 (together, a set of speakers 106), and background noise 112d from a noise source 108.”; “PSE is performed by a trained PSE model 110, which outputs output data 114, which includes estimated clean speech data of enrolled target speaker(s), such as target speaker 102 and (in some examples) also target speaker 102a.”). The same motivation for claim 19 applies equally to claim 22. Regarding claim 27, McLaren discloses a non-transitory computer-readable medium storing instructions that are executable by one or more processors to cause the one or more processors to (McLaren, Col. 12, lines 39-42: “The illustrative computing device 610 includes at least one processor 612 (e.g. a microprocessor, microcontroller, digital signal processor, etc.), memory 614, and an input/output (I/O) subsystem 616.”): based on detection of a wake word in an utterance from a first person, obtain first speech signature data associated with the first person (McLaren, Col. 2, lines 43-48: "a spoken command analyzer module is configured to, given a sample of human speech, jointly analyze speech information extracted from the sample to determine whether the analyzed speech includes specific content, such as a command, and to verify the identity of the human speaker of the speech."). 
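The control flow that claim 27 recites — wake-word detection triggering signature lookup, a configuration-data permission check for an approved interrupter, and selective enabling of the speaker-specific filter — can be sketched as follows. This is a hypothetical illustration only; the dictionaries, names, and enabling condition are assumptions and are not taken from McLaren or Norton.

```python
# Hypothetical sketch of the claimed control flow: wake-word detection,
# permission lookup from configuration data, then selective filter enabling.
# All data structures and names here are illustrative assumptions.

config = {
    # "bob" is approved to interrupt "alice"'s voice session.
    "alice": {"approved_interrupters": ["bob"]},
}
signatures = {"alice": [0.1, 0.9], "bob": [0.7, 0.3]}  # stand-in signatures

def on_wake_word(first_person):
    """Runs when a wake word is detected in an utterance from first_person."""
    first_sig = signatures[first_person]
    approved = config.get(first_person, {}).get("approved_interrupters", [])
    second_sigs = [signatures[p] for p in approved if p in signatures]
    # Selectively enable the speaker-specific filter only when both the
    # session owner's signature and an approved interrupter's signature exist.
    filter_enabled = bool(first_sig) and bool(second_sigs)
    return filter_enabled, first_sig, second_sigs

enabled, sig1, sigs2 = on_wake_word("alice")
print(enabled)  # True
```

For a speaker with no approved interrupters in the configuration data, the same routine would leave the filter disabled, which is the "selective" aspect of the limitation.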
However, McLaren fails to expressly recite obtain, based on configuration data that indicates at least one second person that has permissions associated with the first person, second speech signature data associated with the at least one second person, the permissions indicating that the at least one second person is approved to interrupt a voice session associated with the first person; and selectively enable a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data, wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data; a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data; and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person. Norton teaches obtain, based on configuration data that indicates at least one second person that has permissions associated with the first person, second speech signature data associated with the at least one second person, the permissions indicating that the at least one second person is approved to interrupt a voice session associated with the first person (Norton, Fig. 2, 216; [0061]: "The rankings may be set and altered by a primary user of the DA device 100. Primary users may, by default, be ranked higher than users having lower authority levels. In addition, unrecognized users may be allocated a lower ranking than all known users present in the hierarchical table."; [0057]: "At step 216, the one or more processors determine whether the source of the voice command, as identified, has authority to override or supersede the pre-existing function. 
This determination may be based on whether the source of the voice command has priority over a user that is the source of the pre-existing function."). McLaren and Norton are analogous arts because they belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren to incorporate the teachings of Norton to obtain speech signature data for a second person, including permissions indicating approval to interrupt the first person’s voice session. This allows the system to resolve conflicting commands from multiple users (Norton, [0028]). As such, the system will provide a better user experience for the primary user. However, McLaren, in view of Norton, fails to expressly recite selectively enable a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data. Miller teaches selectively enable a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data (Miller, [0012]: “The user privileges may allow John Smith to interact with the voice assistant system, such as to issue commands, extract information (e.g., ask questions and obtain responses, etc.), play media, and/or perform other actions with the voice assistant system which may be unavailable to people who are not introduced to the voice assistant system by a trusted user and/or do not have privileges associated with a user profile and associated with the particular voice assistant system, but may have privileges associated with a different voice assistant system.”). McLaren, Norton, and Miller are analogous arts because they each belong to the same field of endeavor of speech processing. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton, to incorporate the teachings of Miller to filter the inputs based on speech signature data for both users. This allows the system to filter out commands from unknown or untrusted users (Miller, [0012]). This filtering ensures that the system does not perform actions for people who may be unknown or untrusted by the system’s primary user. However, McLaren, in view of Norton and Miller, fails to expressly recite wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data; a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data; and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person. Eskimez teaches wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data (Eskimez, Fig. 3, 4001a-4001f; [0042]: “Input audio 112 is provided to an STFT block 302, and then a series of N encoder blocks 4001a-4001f, where N=6 is indicated, encoder block 4001a is the initial encoder block and encoder block 4001f is the final encoder block.”); a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data (Eskimez, fig. 
3, 306; [0042]: “The speaker embeddings 308 (d-vector) are extracted from enrollment data for target speaker 102 (and also, in some examples, target speaker 102a) and concatenated with the output from final encoder block 4001f.”; [0031]: “A microphone array 200 captures input audio 112, which may include a mixture of speech data 112a of a first target speaker 102, speech data 112b of a second target speaker 102a, speech data 112c of one or more interfering speakers 104 (together, a set of speakers 106), and background noise 112d from a noise source 108.”; Here, the speaker embeddings are seen as speech signature data.); and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person (Eskimez, fig. 3, 114; [0051]: “After predicting the real and imaginary parts of the mask (real mask 310, an imaginary mask 312, respectively), the complex ratio mask is applied to the original real and imaginary parts of the noisy spectrogram (e.g., the output of decoder block 4002a) to produce output data 114.”; [0031]: “PSE is performed by a trained PSE model 110, which outputs output data 114, which includes estimated clean speech data of enrolled target speaker(s), such as target speaker 102 and (in some examples) also target speaker 102a.”). McLaren, Norton, Miller, and Eskimez are analogous arts because they belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton and the privilege administration system of Miller, to incorporate the teachings of Eskimez to filter speech with multiple target speakers. This allows background noise to be filtered out effectively even when two speakers are speaking together (Eskimez, [0030]). 
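The mask application quoted from Eskimez [0051] is an ordinary complex multiplication. Writing the predicted mask as $M = M_r + jM_i$ and the noisy spectrogram as $S = S_r + jS_i$ (symbols chosen here purely for illustration; they do not appear in the reference), the enhanced output is:

```latex
\hat{S} = (M_r + jM_i)\,(S_r + jS_i)
        = (M_r S_r - M_i S_i) + j\,(M_r S_i + M_i S_r)
```

That is, the real and imaginary mask parts jointly scale both components of each time-frequency bin, which is why the decoder must predict two mask channels rather than a single magnitude gain.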
This helps ensure that the audio can be cleaned of background noise without accidentally filtering out intended speech. Regarding claim 28, McLaren discloses an apparatus comprising: means for obtaining, based on detection of a wake word in an utterance from a first person, first speech signature data associated with the first person (McLaren, Col. 2, lines 43-48: "a spoken command analyzer module is configured to, given a sample of human speech, jointly analyze speech information extracted from the sample to determine whether the analyzed speech includes specific content, such as a command, and to verify the identity of the human speaker of the speech."). However, McLaren fails to expressly recite means for obtaining, based on configuration data that indicates at least one second person that has permissions associated with the first person, second speech signature data associated with the at least one second person, the permissions indicating that the at least one second person is approved to interrupt a voice session associated with the first person; and means for selectively enabling a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data, wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data; a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data; and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person. 
Norton teaches means for obtaining, based on configuration data that indicates at least one second person that has permissions associated with the first person, second speech signature data associated with the at least one second person, the permissions indicating that the at least one second person is approved to interrupt a voice session associated with the first person (Norton, Fig. 2, 216; [0061]: "The rankings may be set and altered by a primary user of the DA device 100. Primary users may, by default, be ranked higher than users having lower authority levels. In addition, unrecognized users may be allocated a lower ranking than all known users present in the hierarchical table."; [0057]: "At step 216, the one or more processors determine whether the source of the voice command, as identified, has authority to override or supersede the pre-existing function. This determination may be based on whether the source of the voice command has priority over a user that is the source of the pre-existing function."). McLaren and Norton are analogous arts because they belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren to incorporate the teachings of Norton to obtain speech signature data for a second person, including permissions indicating approval to interrupt the first person’s voice session. This allows the system to resolve conflicting commands from multiple users (Norton, [0028]). As such, the system will provide a better user experience for the primary user. However, McLaren, in view of Norton, fails to expressly recite means for selectively enabling a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data. 
Miller teaches means for selectively enabling a speaker-specific speech input filter that is based on the first speech signature data and the second speech signature data (Miller, [0012]: “The user privileges may allow John Smith to interact with the voice assistant system, such as to issue commands, extract information (e.g., ask questions and obtain responses, etc.), play media, and/or perform other actions with the voice assistant system which may be unavailable to people who are not introduced to the voice assistant system by a trusted user and/or do not have privileges associated with a user profile and associated with the particular voice assistant system, but may have privileges associated with a different voice assistant system.”). McLaren, Norton, and Miller are analogous arts because they each belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton, to incorporate the teachings of Miller to filter the inputs based on speech signature data for both users. This allows the system to filter out commands from unknown or untrusted users (Miller, [0012]). This filtering ensures that the system does not perform actions for people who may be unknown or untrusted by the system’s primary user. 
However, McLaren, in view of Norton and Miller, fails to expressly recite wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data; a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data; and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person. Eskimez teaches wherein the speaker-specific speech input filter includes: a dimensional-reduction network configured to generate a latent-space representation based on input audio data (Eskimez, Fig. 3, 4001a-4001f; [0042]: “Input audio 112 is provided to an STFT block 302, and then a series of N encoder blocks 4001a-4001f, where N=6 is indicated, encoder block 4001a is the initial encoder block and encoder block 4001f is the final encoder block.”); a combiner configured to generate a combined vector based on the latent-space representation, the first speech signature data, and the second speech signature data (Eskimez, fig. 
3, 306; [0042]: “The speaker embeddings 308 (d-vector) are extracted from enrollment data for target speaker 102 (and also, in some examples, target speaker 102a) and concatenated with the output from final encoder block 4001f.”; [0031]: “A microphone array 200 captures input audio 112, which may include a mixture of speech data 112a of a first target speaker 102, speech data 112b of a second target speaker 102a, speech data 112c of one or more interfering speakers 104 (together, a set of speakers 106), and background noise 112d from a noise source 108.”; Here, the speaker embeddings are seen as speech signature data.); and a dimensional-expansion network configured to generate output audio data including multi-person speech from the first person and the at least one second person (Eskimez, fig. 3, 114; [0051]: “After predicting the real and imaginary parts of the mask (real mask 310, an imaginary mask 312, respectively), the complex ratio mask is applied to the original real and imaginary parts of the noisy spectrogram (e.g., the output of decoder block 4002a) to produce output data 114.”; [0031]: “PSE is performed by a trained PSE model 110, which outputs output data 114, which includes estimated clean speech data of enrolled target speaker(s), such as target speaker 102 and (in some examples) also target speaker 102a.”). McLaren, Norton, Miller, and Eskimez are analogous arts because they belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton and the privilege administration system of Miller, to incorporate the teachings of Eskimez to filter speech with multiple target speakers. This allows background noise to be filtered out effectively even when two speakers are speaking together (Eskimez, [0030]). 
This helps ensure that the audio can be cleaned of background noise without accidentally filtering out intended speech. Claims 6, 13, 23, and 26 are rejected under 35 U.S.C. 103 as being unpatentable over McLaren, in view of Norton, Miller, and Eskimez, as applied to claims 1-5, 7, 14-21, and 27-28 above, and further in view of Sivaraman et al. (US Pub. No. 2022/0084509, hereinafter Sivaraman). Regarding claim 6, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren, in view of Norton, Miller, and Eskimez, does not disclose wherein the speaker-specific speech input filter is configured to compare input audio data to the first speech signature data to generate output audio data that de-emphasizes portions of the input audio data that do not correspond to speech from the first person or speech from the second person. Sivaraman discloses wherein the speaker-specific speech input filter is configured to compare input audio data to the first speech signature data to generate output audio data that de-emphasizes portions of the input audio data that do not correspond to speech from the first person or speech from the second person (Sivaraman, [0013]: "The target speaker for the speaker-dependent enhancement system is specified and indicated using the speaker's voiceprint. The target speaker for enhancement is uniquely determined using a voiceprint vector extracted by a speaker embedding system. In some cases, the target speaker's voiceprint may be extracted in real-time while the software programming is being used, though some embodiments offer an option for prior enrollment using an enrollment utterance. The output of the machine-learning architecture is an enhanced audio signal improving the voice quality of a target speaker on a single-channel audio input containing a mixture of speaker speech signals and various types of noise."). 
McLaren, Norton, Miller, Eskimez, and Sivaraman are analogous arts because they belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton, the privilege system of Miller, and the speech enhancement system of Eskimez, to incorporate the teachings of Sivaraman to filter noise and additional speakers from input audio. This helps improve the quality of the audio, which reduces the audio complexity for any subsequent speech recognition systems (Sivaraman, [0113]). This helps the key phrase identifier by removing other phrases and noises that may be falsely identified as key phrases. Regarding claim 13, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren, in view of Norton, Miller, and Eskimez, does not disclose wherein the first speech signature data corresponds to a first speaker embedding, and wherein the one or more processors are configured to enable the speaker-specific speech input filter by providing the first speaker embedding as an input to a speech enhancement model. Sivaraman discloses wherein the first speech signature data corresponds to a first speaker embedding, and wherein the one or more processors are configured to enable the speaker-specific speech input filter by providing the first speaker embedding as an input to a speech enhancement model (Sivaraman, [0029]: "The speech separation engine extracts low-level spectral features, such as mel-frequency cepstrum coefficients (MFCCs), and receives a voiceprint for a target speaker (sometimes called an “inbound voiceprint” or “target voiceprint”) generated by the speaker-embedding engine."). 
The same motivation for claim 6 applies equally to claim 13. Regarding claim 23, the rejection of claim 19 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren, in view of Norton, Miller, and Eskimez, does not disclose comparing, by the speaker-specific speech input filter, input audio data to the first speech signature data to generate output audio data that de-emphasizes portions of the input audio data that do not correspond to speech from the first person or speech from the second person. Sivaraman discloses comparing, by the speaker-specific speech input filter, input audio data to the first speech signature data to generate output audio data that de-emphasizes portions of the input audio data that do not correspond to speech from the first person or speech from the second person (Sivaraman, [0013]: "The target speaker for the speaker-dependent enhancement system is specified and indicated using the speaker's voiceprint. The target speaker for enhancement is uniquely determined using a voiceprint vector extracted by a speaker embedding system. In some cases, the target speaker's voiceprint may be extracted in real-time while the software programming is being used, though some embodiments offer an option for prior enrollment using an enrollment utterance. The output of the machine-learning architecture is an enhanced audio signal improving the voice quality of a target speaker on a single-channel audio input containing a mixture of speaker speech signals and various types of noise."). McLaren, Norton, Miller, Eskimez, and Sivaraman are analogous arts because they belong to the same field of endeavor of speech processing. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton, the privilege system of Miller, and the speech enhancement system of Eskimez, to incorporate the teachings of Sivaraman to filter noise and additional speakers from input audio. This helps improve the quality of the audio, which reduces the audio complexity for any subsequent speech recognition systems (Sivaraman, [0113]). This helps the key phrase identifier by removing other phrases and noises that may be falsely identified as key phrases. Regarding claim 26, the rejection of claim 19 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren, in view of Norton, Miller, and Eskimez, does not disclose wherein the first speech signature data corresponds to a first speaker embedding, and wherein enabling the speaker-specific speech input filter comprises providing the first speaker embedding as an input to a speech enhancement model. Sivaraman discloses wherein the first speech signature data corresponds to a first speaker embedding, and wherein enabling the speaker-specific speech input filter comprises providing the first speaker embedding as an input to a speech enhancement model (Sivaraman, [0029]: "The speech separation engine extracts low-level spectral features, such as mel-frequency cepstrum coefficients (MFCCs), and receives a voiceprint for a target speaker (sometimes called an “inbound voiceprint” or “target voiceprint”) generated by the speaker-embedding engine."). The same motivation for claim 23 applies equally to claim 26. Claims 8-11 and 24-25 are rejected under 35 U.S.C. 
103 as being unpatentable over McLaren, in view of Norton, Miller, and Eskimez, as applied to claims 1-5, 7, 14-22, and 27-28 above, and further in view of Sharifi et al. (US Pub. No. 2022/0139388 hereinafter Sharifi). Regarding claim 8, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren, in view of Norton, Miller, and Eskimez, does not disclose wherein the one or more processors are further configured to, after enabling the speaker-specific speech input filter based on the first speech signature data: receive audio data that includes a second utterance from a third person; and determine whether to provide content of the second utterance to a voice assistant application based on whether the content of the second utterance is contextually relevant to a voice assistant request received from the first person. Sharifi discloses wherein the one or more processors are configured to, after enabling the speaker-specific speech input filter based on the first speech signature data: receive audio data that includes a second utterance from a third person (Sharifi, Fig. 1B; [0028]: "the AED 104 leaves the microphone 16 open and receives a second instance of the raw audio data corresponding to an utterance 176 of audible contents 178 for the video call 150 spoken by the user and captured by the AED 104."); and determine whether to provide content of the second utterance to a voice assistant application based on whether the content of the second utterance is contextually relevant to a voice assistant request received from the first person (Sharifi, Fig. 1B; [0028]: "While recognized as an additional sound not spoken by the user 102, the additional utterance 180 is spoken by the other individual 107 who is indicated by the image data 20 as a likely participant of the voice call, and thus contains audible contents intended for the recipient 103 to hear. 
Accordingly, when executing the routine 200 results in the classification model 210 generating a classification result 212 that indicates to activate voice filtering for the voices of the user 102 and the other individual 107, the voice filtering engine 300 will apply voice filtering to generate enhanced audio data 152 that excludes the background noise 179 and isolates the voices of the user 102 and the other individual 107 in the video call 150."). McLaren, Norton, Miller, Eskimez, and Sharifi are analogous arts because they belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton, the privilege system of Miller, and the speech enhancement system of Eskimez, to incorporate the teachings of Sharifi to actively control the speech filter depending on the content of the input audio. The additional control of the filter helps focus the speech recognition engine on the voice of a desired user (Sharifi, [0003]). This improves the effectiveness of the speech recognition system in environments with multiple speakers. Regarding claim 9, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above. McLaren, in view of Norton, Miller, and Eskimez, does not disclose wherein the one or more processors are configured to: when the speaker-specific speech input filter is enabled, provide first audio data to a first speech enhancement model based on the first speech signature data; and when the speaker-specific speech input filter is not enabled, provide second audio data to a second speech enhancement model based on third speech signature data. 
Sharifi discloses wherein the one or more processors are configured to: when the speaker-specific speech input filter is enabled, provide first audio data to a first speech enhancement model based on the first speech signature data; and when the speaker-specific speech input filter is not enabled, provide second audio data to a second speech enhancement model based on third speech signature data (Sharifi, [0005]: "The operations also include executing a voice filtering recognition routine to determine whether to activate voice filtering for at least a voice of the user in the audio-based communication based on the first instance of the raw audio data. When the voice filtering recognition routine determines to activate voice filtering for at least the voice of the user, the operations also include: obtaining a respective speaker embedding of the user that represents voice characteristics for the user; and processing, using the respective speaker embedding of the user, the second instance of the raw audio data to generate enhanced audio data for the audio-based communication that isolates the utterance of the audible contents spoken by the user and excludes at least a portion of the one or more additional sounds that are not spoken by the user."). The same motivation for claim 8 applies equally to claim 9.

Regarding claim 10, the rejection of claim 9 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above.

McLaren, in view of Norton, Miller, and Eskimez, does not disclose wherein the third speech signature data represents speech of multiple persons.
Sharifi discloses wherein the third speech signature data represents speech of multiple persons (Sharifi, [0029]: "The voice filtering engine 300 uses the respective speaker embeddings 318 to process the second instance of the raw audio data to generate the enhanced audio data 152 for the video call 150 that isolates the utterance 176 (spoken by the user 102) and the additional utterance 180 (spoken by the other individual 107) and excludes the background noise 179."). The same motivation for claim 8 applies equally to claim 10.

Regarding claim 11, the rejection of claim 1 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above.

McLaren, in view of Norton, Miller, and Eskimez, does not disclose wherein the one or more processors are configured to, after enabling the speaker-specific speech input filter, disable the speaker-specific speech input filter based on a determination that a voice assistant session associated with the first person has ended.

Sharifi discloses wherein the one or more processors are further configured to, after enabling the speaker-specific speech input filter, disable the speaker-specific speech input filter based on a determination that a voice assistant session associated with the first person has ended (Sharifi, [0022]: "The AED 104 may also include a physical button that may be selected to activate or deactivate voice filtering."). The same motivation for claim 8 applies equally to claim 11.

Regarding claim 24, the rejection of claim 19 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above.
McLaren, in view of Norton, Miller, and Eskimez, does not disclose after enabling the speaker-specific speech input filter based on the first speech signature data: receiving audio data that includes a second utterance from a third person; and determining whether to provide content of the second utterance to a voice assistant application based on whether the content of the second utterance is contextually relevant to a voice assistant request received from the first person.

Sharifi discloses after enabling the speaker-specific speech input filter based on the first speech signature data: receiving audio data that includes a second utterance from a third person (Sharifi, Fig. 1B; [0028]: "the AED 104 leaves the microphone 16 open and receives a second instance of the raw audio data corresponding to an utterance 176 of audible contents 178 for the video call 150 spoken by the user and captured by the AED 104."); and determining whether to provide content of the second utterance to a voice assistant application based on whether the content of the second utterance is contextually relevant to a voice assistant request received from the first person (Sharifi, Fig. 1B; [0028]: "While recognized as an additional sound not spoken by the user 102, the additional utterance 180 is spoken by the other individual 107 who is indicated by the image data 20 as a likely participant of the voice call, and thus contains audible contents intended for the recipient 103 to hear. Accordingly, when executing the routine 200 results in the classification model 210 generating a classification result 212 that indicates to activate voice filtering for the voices of the user 102 and the other individual 107, the voice filtering engine 300 will apply voice filtering to generate enhanced audio data 152 that excludes the background noise 179 and isolates the voices of the user 102 and the other individual 107 in the video call 150.").
McLaren, Norton, Miller, Eskimez, and Sharifi are analogous arts because they all belong to the same field of endeavor of speech processing. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton, the privilege system of Miller, and the speech enhancement system of Eskimez, to incorporate the teachings of Sharifi to actively control the speech filter depending on the content of the input audio. The additional control of the filter helps focus the speech recognition engine on the voice of a desired user (Sharifi, [0003]). This improves the effectiveness of the speech recognition system in environments with multiple speakers.

Regarding claim 25, the rejection of claim 19 is incorporated. McLaren, in view of Norton, Miller, and Eskimez, discloses all of the elements of the current invention as stated above.

McLaren, in view of Norton, Miller, and Eskimez, does not disclose when the speaker-specific speech input filter is enabled, providing first audio data to a first speech enhancement model based on the first speech signature data; and when the speaker-specific speech input filter is not enabled, providing second audio data to a second speech enhancement model based on third speech signature data.
Sharifi discloses when the speaker-specific speech input filter is enabled, providing first audio data to a first speech enhancement model based on the first speech signature data; and when the speaker-specific speech input filter is not enabled, providing second audio data to a second speech enhancement model based on third speech signature data (Sharifi, [0005]: "The operations also include executing a voice filtering recognition routine to determine whether to activate voice filtering for at least a voice of the user in the audio-based communication based on the first instance of the raw audio data. When the voice filtering recognition routine determines to activate voice filtering for at least the voice of the user, the operations also include: obtaining a respective speaker embedding of the user that represents voice characteristics for the user; and processing, using the respective speaker embedding of the user, the second instance of the raw audio data to generate enhanced audio data for the audio-based communication that isolates the utterance of the audible contents spoken by the user and excludes at least a portion of the one or more additional sounds that are not spoken by the user."). The same motivation for claim 24 applies equally to claim 25.

Claim(s) 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over McLaren, in view of Norton, Miller, Eskimez, and Sharifi, as applied to claims 8-11 and 24-25 above, and further in view of Sivaraman.

Regarding claim 12, the rejection of claim 11 is incorporated. McLaren, in view of Norton, Miller, Eskimez, and Sharifi, discloses all of the elements of the current invention as stated above.
McLaren, in view of Norton, Miller, Eskimez, and Sharifi, does not disclose wherein the one or more processors are configured to, during the voice assistant session: receive first audio data representing multi-person speech; generate, based on the speaker-specific speech input filter, second audio data representing single-person speech; and provide the second audio data to a voice assistant application.

Sivaraman discloses wherein the one or more processors are configured to, during the voice assistant session: receive first audio data representing multi-person speech (Sivaraman, [0026]: "Described herein are systems and methods for processing various types of data associated with inbound calls, including audio signals containing a mixture of one or more speaker signals or utterances"); generate, based on the speaker-specific speech input filter, second audio data representing single-person speech (Sivaraman, [0026]: "to generate an enhanced audio signal in which utterances of interfering speakers and noise are suppressed compared to the input audio signal received by the system."); and provide the second audio data to a voice assistant application (Sivaraman, [0043]: "The waveform synthesis engine outputs the enhanced speaker signal in a machine-readable format representing the enhanced speaker signal as a waveform. Any number of downstream operations may employ the enhanced speaker signal with improved accuracy or efficiency as a benefit of the improved quality compared to the original input audio signal, which contained any number of interfering speaker signals and noise.").

McLaren, Norton, Miller, Eskimez, Sharifi, and Sivaraman are analogous arts because they all belong to the same field of endeavor of speech processing.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the key phrase identification system of McLaren, as modified by the command conflict resolution system of Norton, the privilege system of Miller, the voice filtering system of Sharifi, and the speech enhancement system of Eskimez, to incorporate the teachings of Sivaraman to filter multi-person speech into single-person speech. This helps improve the quality of the audio, which reduces the audio complexity for any subsequent speech recognition systems (Sivaraman, [0113]). This helps the key phrase identifier by removing other phrases and noises that may be falsely identified as key phrases.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TYLER J BECKER whose telephone number is (703)756-1271. The examiner can normally be reached M-Th, 7:15am-5:45pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Daniel Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/TYLER BECKER/
Examiner, Art Unit 2657

/DANIEL C WASHBURN/
Supervisory Patent Examiner, Art Unit 2657
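The limitations in dispute all turn on a speaker-specific speech input filter keyed to enrolled "speech signature data" (speaker embeddings): admit utterances that match the enrolled speaker, and allow the filter to be disabled when the voice assistant session ends. As a minimal sketch of that general technique only — not the implementation of the application or of any cited reference, with the toy mean-vector embedding and all names hypothetical — similarity to an enrolled signature can gate what reaches the assistant:

```python
import math

def embed(frames):
    """Toy speaker embedding: mean of per-frame feature vectors.
    (Real systems use a trained d-vector/x-vector network.)"""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SpeakerSpecificFilter:
    """Gate utterances on similarity to an enrolled speaker signature."""

    def __init__(self, signature, threshold=0.9):
        self.signature = signature  # enrolled speaker embedding
        self.threshold = threshold
        self.enabled = True         # cleared when the assistant session ends

    def admit(self, utterance_frames):
        """Return True if the utterance should reach the voice assistant."""
        if not self.enabled:
            return True  # filter disabled: pass all audio through
        return cosine(embed(utterance_frames), self.signature) >= self.threshold
```

With a signature enrolled from one speaker's frames, a similar utterance is admitted, a dissimilar one is suppressed, and clearing `enabled` (e.g., at session end) passes everything through again.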

Prosecution Timeline

Dec 21, 2022: Application Filed
Dec 23, 2024: Non-Final Rejection — §103
Mar 19, 2025: Applicant Interview (Telephonic)
Mar 19, 2025: Examiner Interview Summary
Mar 27, 2025: Response Filed
May 15, 2025: Final Rejection — §103
Jul 24, 2025: Applicant Interview (Telephonic)
Jul 24, 2025: Examiner Interview Summary
Aug 18, 2025: Request for Continued Examination
Aug 27, 2025: Response after Non-Final Action
Sep 29, 2025: Non-Final Rejection — §103
Dec 03, 2025: Applicant Interview (Telephonic)
Dec 03, 2025: Examiner Interview Summary
Dec 17, 2025: Response Filed
Feb 26, 2026: Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597433: SPEECH SIGNAL ENHANCEMENT METHOD AND APPARATUS, AND ELECTRONIC DEVICE (granted Apr 07, 2026; 2y 5m to grant)
Patent 12585893: Full Media Translator (granted Mar 24, 2026; 2y 5m to grant)
Patent 12518777: SYSTEMS AND METHODS FOR AUTHENTICATION USING SOUND-BASED VOCALIZATION ANALYSIS (granted Jan 06, 2026; 2y 5m to grant)
Patent 12499869: SOUND SYNTHESIS METHOD, SOUND SYNTHESIS APPARATUS, AND RECORDING MEDIUM STORING INSTRUCTIONS TO PERFORM SOUND SYNTHESIS METHOD (granted Dec 16, 2025; 2y 5m to grant)
Patent 12499311: Language Model Preprocessing with Weighted N-grams (granted Dec 16, 2025; 2y 5m to grant)
Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 74%
With Interview: 93% (+19.0%)
Median Time to Grant: 2y 10m
PTA Risk: High
Based on 19 resolved cases by this examiner. Grant probability derived from career allow rate.
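The projected figures are consistent with simple additive arithmetic over the examiner's career record: 14 grants out of 19 resolved cases gives the 74% base rate, and adding the +19.0-point interview lift yields 93%. A sketch of that assumed model (the tool's actual formula is not published, and `grant_probability` is a hypothetical name):

```python
def grant_probability(granted: int, resolved: int, interview_lift: float = 0.0) -> float:
    """Career allow rate plus an additive interview lift, capped at 100%.
    Assumed model; the analytics tool's exact formula is not disclosed."""
    base = granted / resolved
    return min(base + interview_lift, 1.0)

# 14 grants / 19 resolved -> ~0.737 (shown as 74%)
base = grant_probability(14, 19)
# adding the +19.0-point interview lift -> ~0.927 (shown as 93%)
with_interview = grant_probability(14, 19, interview_lift=0.19)
```

The cap matters only for examiners whose base rate plus lift would exceed 100%; here the additive model reproduces the displayed 74%/93% pair exactly after rounding.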
