Prosecution Insights
Last updated: April 19, 2026
Application No. 18/088,070

MANUAL-ENROLLMENT-FREE PERSONALIZED DENOISE

Non-Final OA §102 §103 §112
Filed
Dec 23, 2022
Examiner
HUTCHESON, CODY DOUGLAS
Art Unit
2659
Tech Center
2600 — Communications
Assignee
Zoom Video Communications, Inc.
OA Round
3 (Non-Final)
62%
Grant Probability (Moderate)
3-4
OA Rounds
2y 10m
To Grant
99%
With Interview

Examiner Intelligence

Grants 62% of resolved cases
62%
Career Allow Rate
15 granted / 24 resolved
+0.5% vs TC avg
Strong +47% interview lift
+47.1%
Interview Lift
allowance for resolved cases with vs. without an interview
Typical timeline
2y 10m
Avg Prosecution
34 currently pending
Career history
58
Total Applications
across all art units

Statute-Specific Performance

§101
33.9%
-6.1% vs TC avg
§103
40.9%
+0.9% vs TC avg
§102
14.8%
-25.2% vs TC avg
§112
7.5%
-32.5% vs TC avg
Black line = Tech Center average estimate • Based on career data from 24 resolved cases

Office Action

§102 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 09/18/2025 has been entered.

Response to Arguments

1. Regarding the rejection of claims 1-20 under 35 U.S.C. § 101, Applicant’s arguments, see pgs. 7-9 of the Remarks filed 09/18/2025, with respect to claims 1-20 have been fully considered and are persuasive. The rejection of claims 1-20 has been withdrawn.

2. Regarding the rejection of claims 1, 8, and 14 under 35 U.S.C. § 103, Applicant's arguments filed 09/18/2025 have been fully considered but they are not persuasive. Applicant first argues that Wojcicki does not disclose the claimed limitation of “generating personalized denoised voice content of the first user account for the virtual meeting by applying the speaker embedding to the audio data associated with the virtual meeting using a denoise model” (see pg. 10, 1st para. of Remarks). Additionally, Applicant argues that Fanelli does not teach the claimed limitation of “filtering the audio data associated with the virtual meeting to discard multi-speaker voice content segments and one or more non-voice segments” (see pg. 10, 2nd para. of Remarks). The Examiner respectfully disagrees with both arguments.

First, the claim as currently written does not specifically require that the speaker embedding and the audio data both be separate inputs that are input to the denoise model, as argued on pg. 10, 1st para. of the Remarks. Under the broadest reasonable interpretation (BRI) of the claim, the claim requires that denoised voice content is generated by applying a speaker embedding to the audio data using a denoise model. Wojcicki discloses a noise removal model (Fig. 2, 222) which applies an embedding network to audio samples (para. 0049 “In an embodiment, noise removal model 222 may include an embedding network that produces and uses embeddings for transforming audio samples. Basically, each audio sample may be represented by a vector having numeric elements corresponding to a plurality of dimensions defining a space.”) in order to generate denoised voice content (para. 0043 “Noise removal model 222 of background noise removal (BNR) module 220 processes the audio signals to attenuate (e.g., reduce or remove) the noise at operation 425.”). The operation of applying an embedding network to the audio samples using the noise removal model to generate audio signals with attenuated noise reads on the BRI of “generating personalized denoised voice content of the first user account for the virtual meeting by applying the speaker embedding to the audio data associated with the virtual meeting using a denoise model”.

Second, the teachings of Fanelli read on the BRI of “filtering the audio data associated with the virtual meeting to discard multi-speaker voice content segments and one or more non-voice segments”. Fanelli teaches a method of isolating speech segments from non-speech segments (e.g., Fig. 2A, “Music” and “Noise” blocks) using speech detections (VAD, speaker change detection, and overlapped speech detection; para. 0055) to obtain speech segments (Fig. 2B, SEGM(s) 1-7). This operation discards the non-speech segments (Fig. 2B, “Noise” and “Music” blocks are not used for further processing), and reads on the BRI of “filtering the audio data associated with the virtual meeting to discard multi-speaker voice content segments and one or more non-voice segments”. Furthermore, while Fanelli uses this operation in the context of speaker diarization, Wojcicki and Fanelli are both directed to the same field of speech processing, and thus are in the same field of endeavor and are analogous art to the claimed invention; they therefore would have been obvious to combine. Hence, Applicant’s arguments are not persuasive.
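For readers mapping this dispute onto an implementation, a minimal sketch of what "applying a speaker embedding to the audio data using a denoise model" can look like follows. It is a toy NumPy illustration only: the sigmoid-mask scheme, array shapes, weight matrices, and function name are assumptions for illustration, not details taken from Wojcicki or from the claims.

```python
import numpy as np

def denoise_with_speaker_embedding(frames, speaker_emb, w_frame, w_emb):
    """Toy embedding-conditioned denoiser.

    frames:      (T, F) noisy magnitude-spectrogram frames
    speaker_emb: (D,)   speaker embedding (e.g., a d-vector)
    w_frame:     (F, F) frame projection weights (stand-in for a trained model)
    w_emb:       (D, F) embedding projection weights

    The embedding is projected into frequency space and combined with each
    frame to produce a per-bin mask in (0, 1); masking keeps the target
    speaker's energy and attenuates everything else.
    """
    cond = speaker_emb @ w_emb            # (F,) conditioning vector
    logits = frames @ w_frame + cond      # broadcast over the T frames
    mask = 1.0 / (1.0 + np.exp(-logits))  # sigmoid mask per time-frequency bin
    return frames * mask                  # (T, F) "denoised" frames

# Toy usage with random stand-ins for trained weights.
rng = np.random.default_rng(0)
T, F, D = 50, 64, 16
frames = np.abs(rng.normal(size=(T, F)))
speaker_emb = rng.normal(size=D)
w_frame = rng.normal(size=(F, F)) * 0.1
w_emb = rng.normal(size=(D, F)) * 0.1
print(denoise_with_speaker_embedding(frames, speaker_emb, w_frame, w_emb).shape)  # (50, 64)
```

Note how the argued distinction surfaces in the signature: the embedding and the frames both reach the model, whether or not one characterizes them as "separate inputs".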
Claim Objections

3. Claims 1, 8, and 14 are objected to because of the following informalities: in each of claims 1, 8, and 14, “collecting one or more segments of voice content of a first user…” should instead be “collecting one or more segments of voice content of the first user”, as antecedent basis is provided earlier in the claim. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

4. Claims 1-4 and 6-21 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. Independent claims 1, 8, and 14 each recite the limitation "in response to determining the one or more speaker embeddings satisfy a similarity threshold…” in the fourth limitation. There is insufficient antecedent basis for this limitation in each independent claim, as no “determining” step precedes the “in response to determining…” step. Dependent claims 2-4, 6-7, and 21 inherit the rejection from independent claim 1. Dependent claims 9-13 inherit the rejection from independent claim 8. Dependent claims 15-20 inherit the rejection from independent claim 14.
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

5. Claims 1, 4, 6, 8, 12-14, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wojcicki et al. (PGPUB No. 2024/0161765, hereinafter Wojcicki) in view of Fanelli et al. (PGPUB No. 2024/0160849, hereinafter Fanelli), and further in view of Kim et al. (PGPUB No. 2023/0419979, hereinafter Kim).

Regarding claim 1, Wojcicki discloses A computer-implemented method comprising: receiving audio data associated with a first user in a virtual meeting (para. 0024 “Initially, computer device 102 enables a user 205 to join an online meeting 260. In an embodiment, computer device 102 includes a camera or other image capture device 926 (FIG. 9) to capture images (e.g., still images, video, etc.) of the user and a surrounding environment, a microphone or other sound sensing device 924 to capture speech or voice signals 210 of user 205 and produce audio signals 215 representing the speech signals for processing by computer device 102, and a display or monitor 930 to present meeting content to the user.”), wherein one or more other users are also connected to the virtual meeting via one or more client devices (para. 0020 “Environment 100 includes multiple computer devices 102 (collectively referred to as computer devices, participant devices, or platforms) operated by local users/participants, a meeting supervisor or server (also referred to as a “conference controller”) 104 configured to support online (e.g., web-based or over-a-network) collaborative meetings between the computer devices, and a communication network 106 communicatively coupled to the computer devices and the meeting supervisor.”); collecting one or more segments of voice content (Fig. 2, voice content (215 “Audio Signals”) is collected using MIC “924”; para. 0043 “The speech transformation module may partition the audio signals into audio segments of any size for processing (e.g., each segment may correspond to a certain number of second of voice signals…a particular phoneme or word, etc.)”) of a first user connected to the virtual meeting (Figs. 1-2, voice content is collected from a first user (“User” 205) using “Computer Device” 102; para. 0024 “Initially, computer device 102 enables a user 205 to join an online meeting 260”; para. 0040 “…the user may provide the user identifier to join the meeting, and/or the speech transformation module may determine the user identifier from user provided or other information (e.g., location of the user, user account, etc.).”)…; using an audio embedding model to generate one or more speaker embeddings based on the one or more collected segments of voice content of the first user (para. 0049 “…noise removal model 222 may include an embedding network that produces and uses embeddings for transforming audio samples...each audio sample may be represented by a vector…user speech may have similar embeddings or vector representations, thereby residing in a common area of the space defined by the embedding dimensions”); applying the… speaker embedding to the audio data associated with the virtual meeting using a denoise model to generate personalized denoised voice content of the first user (personalized denoised content of the first user is generated using the noise removal model; para. 0049 “The noise removal model may be trained to transform noisy audio signals with an embedding outside the area of user speech to audio signals having an embedding within the area of user speech, thereby effectively reducing or removing noise.”); and transmitting the personalized denoised voice content of the first user to the one or more client devices associated with the one or more other users (para. 0026 “The BNR module employs a noise removal model 222 to produce transformed audio or speech signals 270 that enhance the user speech while attenuating (e.g., reducing or removing) the noise. The noise removal model may include a machine learning model as described below. The transformed audio signals may be provided to online meeting 260 for distribution to meeting participants 280. …The transformed audio signals may be conveyed to the meeting participants via corresponding audio speakers 922 of computer devices 102.”).

Wojcicki discloses receiving audio data from a virtual meeting and collecting segments of voice content of a first user during the virtual meeting. However, Wojcicki does not specifically disclose filtering [the one or more segments] to discard multi-speaker voice content segments and one or more non-voice content segments. Furthermore, Wojcicki does not specifically disclose generating an average embedding based on the one or more speaker embeddings.

Fanelli discloses filtering the one or more segments to discard multi-speaker voice content segments and one or more non-voice segments (speech segments are isolated from non-speech segments (e.g., Fig. 2A, “Music” and “Noise” blocks) using speech detections (VAD, speaker change detection, and overlapped speech detection) to obtain speech segments (Fig. 2B, SEGM(s) 1-7), which reads on the operation of filtering the audio data to discard multi-speaker segments (via overlapped speech detection) and non-voice segments (via voice activity detection); para. 0055 “The blocks of each audio channel are processed by block-based embeddings extraction component 102, which performs feature extraction 109 on the blocks and applies voice activity detection (VAD) 113 to the features. In some embodiments, VAD 113 detects the speech of multiple speakers and performs overlapped speech detection 110. The speech detections are used to isolate label speech segments, i.e., portions of the blocks containing speech. The speech detections are based on a combination of results from VAD 113, speaker change detection 112 and overlapped speech detection 110.”; para. 0056 “The isolated speech segments are input into embedding extraction component 111 together with data for overlapped speech detection 110 and speaker change detection 112. Embedding extraction component 111 computes embeddings for each segment that is identified by speaker change detection 112 and the overlapped speech is discarded so that no embeddings are extracted from overlapped speech.”; see Fig. 2).
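A minimal sketch of the mapped filtering step, keeping only single-speaker voice segments before embeddings are extracted. The segment schema, the energy-based VAD stand-in, and the threshold value are illustrative assumptions; Fanelli's VAD, speaker change detection, and overlapped speech detection are learned detectors, not an energy gate.

```python
import numpy as np

def filter_segments(segments, vad_energy_threshold=0.01):
    """Keep only single-speaker voice segments.

    segments: list of dicts, each with
      'samples'     : 1-D np.ndarray of audio samples
      'num_speakers': int, count from an overlapped-speech detector

    A segment survives when a crude energy VAD calls it speech (mean
    squared amplitude above threshold) and the overlapped-speech
    detector reports exactly one active speaker.
    """
    kept = []
    for seg in segments:
        energy = float(np.mean(seg['samples'] ** 2))
        is_voice = energy > vad_energy_threshold
        single_speaker = seg['num_speakers'] == 1
        if is_voice and single_speaker:
            kept.append(seg)
    return kept
```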
Fanelli further teaches generating an average embedding based on the one or more speaker embeddings (paras. 0060-0061 “Multiple embeddings of speech segment 1 are generated and features are extracted from each embedding of SEGM1, and the embeddings are statistically combined (e.g., by computing an average embedding from the multiple embeddings). An audio block is a uniform segmentation of the audio file (e.g., 1s, 2s, 3s, etc.). The above process is repeated for Block 2 and Block 3”).

Wojcicki and Fanelli are considered to be analogous to the claimed invention as they are both in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the collection of audio segments disclosed in Wojcicki to include a step where multi-speaker segments and non-voice segments are discarded before speaker embeddings are generated, and to further generate an average embedding based on the one or more speaker embeddings. Doing so would be beneficial, as this would reduce noise and improve accuracy in the speaker embeddings (Fanelli, para. 0068).

Wojcicki in view of Fanelli does not specifically disclose generating the average embedding in response to determining the one or more speaker embeddings satisfy a similarity threshold. Kim teaches in response to determining the one or more speaker embeddings satisfy a similarity threshold, generating an average embedding based on the one or more speaker embeddings (para. 0077 “As a particular example of this, an average or other speaker vector may be produced for each speaker based on that speaker's cluster of embedding vectors 306 for a completed local or global clustering operation. During a subsequent local or global clustering operation, cosine similarity values or other similarity values may be determined between centroids of clusters of embedding vectors 306 produced during the subsequent local or global clustering operation and the speaker vectors from at least one prior local or global clustering operation. If a similarity value is larger than a specified threshold (such as 0.65), the cluster identifier for the subsequent local or global clustering operation can be assigned to match the same identifier as used in the prior local or global clustering operation.”).

Wojcicki, Fanelli, and Kim are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Kim in order to generate the average embedding in response to determining the one or more speaker embeddings satisfy a similarity threshold. Doing so would be beneficial, as similarity metrics help to determine whether vectors relate to the same speaker (Kim, para. 0077).
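The Fanelli/Kim combination above amounts to averaging embeddings only after a similarity check. A minimal sketch, assuming cosine similarity and an all-pairs check; the 0.65 default mirrors the example threshold quoted from Kim (para. 0077), and the rest is illustrative:

```python
import numpy as np

def average_embedding_if_similar(embeddings, threshold=0.65):
    """Return the average of the speaker embeddings only when every pair
    of embeddings meets the cosine-similarity threshold; otherwise return
    None (the segments may not all come from the same speaker).
    """
    X = np.asarray(embeddings, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T                  # pairwise cosine similarities
    if np.all(sims >= threshold):
        return X.mean(axis=0)             # the claimed "average embedding"
    return None
```

This ordering (check similarity, then average) is exactly what the §112 rejection above says the claims gesture at without reciting: the "in response to determining" clause presupposes a determining step that the claims do not yet include.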
Regarding claim 4, Wojcicki in view of Fanelli and Kim discloses wherein the audio data comprises: ambient audio content associated with a current physical location of an individual accessing the virtual meeting via the first user (Wojcicki, para. 0026 “…(and audio signals 215) may include user speech or voice and/or noise (e.g. speech of competing individuals near the user, sounds of a surrounding environment, noise from a microphone or other devices, etc.)”).

Regarding claim 6, Wojcicki in view of Fanelli and Kim discloses wherein the voice content of the first user comprises: voice content in audio data captured by a pre-defined audio capture device (Wojcicki, Fig. 9 outlines a pre-defined audio capture device 900, which captures audio using MIC 924; Fig. 2 outlines collection of voice content in audio data; para. 0024 “Initially, computer device 102 enables a user 205 to join an online meeting 260.”; para. 0025 “Speech signals 210 from user 205 are captured by microphone 924.”) currently in use by an individual accessing the virtual meeting via the first user (Wojcicki, voice content is collected after the user joins an online meeting 260, meaning the pre-defined audio capture device is currently in use); and wherein the computer-implemented method further comprises based on verifying current use of the pre-defined audio capture device (Wojcicki, Fig. 4, verification of current use takes place at step 410; para. 0039 “Computer device 102 of a user enables the user to join an online meeting at operation 410”; para. 0040 “…the user may provide the user identifier to join the meeting”), initiating collection of the one or more segments of voice content of the first user (Wojcicki, Fig. 4, collection of voice content takes place after verifying current use at step 415; para. 0040 “Speech transformation module 150 receives audio signals from microphone 924 of a computer device 102 at operation 415. The audio signals preferably include a combination of user speech and noise”).

Regarding claim 8, claim 8 is a non-transitory computer-readable medium claim with limitations similar to those in method claim 1, and thus is rejected under similar rationale. Additionally, Wojcicki discloses A non-transitory computer-readable medium having a computer-readable program code embodied therein (Fig. 9, 904 “Memory Element(s)”; paras. 0095, 0117) to be executed by one or more processors (Fig. 9, 902 “Processor(s)”; paras. 0095, 0117), the program code including instructions for (para. 0095).

Regarding claim 12, Wojcicki in view of Fanelli and Kim discloses wherein the audio data comprises: ambient voice content associated with a current physical location of an individual accessing the virtual meeting via the first user, the ambient voice content different than the voice content of the first user (Wojcicki, Fig. 2, 205 “User” and 215 “Audio Signals”; para. 0024 “computer device 102 enables a user 205 to join an online meeting”; para. 0026 “…(and audio signals 215) may include user speech or voice and/or noise (e.g. speech of competing individuals near the user, sounds of a surrounding environment…)”).

Regarding claim 13, claim 13 is rejected for analogous reasons to claim 6.

Regarding claim 14, claim 14 is a system claim with limitations similar to method claim 1, and is thus rejected under similar rationale. Additionally, Wojcicki discloses A communication system comprising one or more processors configured to perform the operations of: (Fig. 9, 902 “Processor(s)”; para. 0116).

Regarding claim 18, Wojcicki in view of Fanelli and Kim discloses wherein the audio data comprises at least one of: non-voice content and additional voice content different than the voice content of the first user (Wojcicki, Fig. 2, 205 “User” and 215 “Audio Signals”; para. 0024 “computer device 102 enables a user 205 to join an online meeting”; para. 0026 “…(and audio signals 215) may include user speech or voice and/or noise (e.g. speech of competing individuals near the user, sounds of a surrounding environment, noise from a microphone or other devices, etc.)”).

Regarding claim 19, claim 19 is rejected for analogous reasons to claim 4.

Regarding claim 20, claim 20 is rejected for analogous reasons to claim 6.

6. Claims 2-3, 9-11, and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Wojcicki in view of Fanelli and Kim, and further in view of Tan et al. (PGPUB No. 2022/0375477, hereinafter Tan).

Regarding claim 2, Wojcicki in view of Fanelli and Kim teaches collecting one or more segments of voice content of the first user, but does not specifically disclose: filtering respective segments of voice content of the first user according to a segment similarity criteria. Tan teaches collecting one or more segments of voice content (Fig. 6, 605 and 610; para. 0052 “At 605, biometric system 102 … receives audio. The audio may correspond to a call between a user and one or more call agents…”; para. 0053 “At 610, biometric system 102 … divides the audio into segments.”), further comprising filtering the respective segments of voice content of the first user according to a segment similarity criteria (voice content may be further filtered by comparing segments to a signature vector generated from the user's voice (para. 0022 “Referring to FIG. 1, the ML subsystem 114 may generate a signature vector indicative of voice audio of the user.”), and filtering out segments that are not similar to the user's voice; para. 0025 “The voice biometric subsystem 116 may remove segments from the audio, for example, if it does not match the signature vector. The voice biometric subsystem 116 may determine that a segment does not match the signature vector, for example, if the distance (e.g., the similarity score) between the signature vector and the vector representation of the segment does not satisfy a threshold (e.g., a similarity threshold).”).

Wojcicki, Fanelli, Kim, and Tan are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the collection of segments of voice content in Wojcicki to incorporate the teachings of Tan in order to filter the voice segments. Doing so would separate out portions of audio similar to a particular user's voice from other interfering noise, which could improve quality in audio processing applications such as voice biometrics (Tan, Abstract).

Regarding claim 3, Wojcicki in view of Fanelli and Kim, and further in view of Tan, discloses wherein generating personalized denoised voice content of the first user account comprises: sending input to the audio embedding model based on one or more filtered respective similar segments of voice content (Wojcicki, Fig. 2, audio signals 215 are partitioned into audio segments (para. 0043) and are fed into noise removal model 222 containing the audio embedding model (para. 0049); by combining the teachings of Tan, filtered respective similar segments of voice content are sent as input to the audio embedding model).
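Tan's segment-similarity criteria, as quoted above, compares each segment's vector to a signature vector for the enrolled voice and drops poor matches. A minimal sketch, with the threshold value and function name assumed for illustration:

```python
import numpy as np

def filter_by_signature(segment_embs, signature, similarity_threshold=0.7):
    """Toy segment-similarity filter in the spirit of Tan (para. 0025).

    segment_embs: (N, D) embeddings of audio segments
    signature:    (D,)   signature vector for the enrolled user's voice

    Returns the indices of segments whose cosine similarity to the
    signature vector satisfies the threshold; dissimilar segments
    (other talkers, interfering noise) are dropped.
    """
    sig = signature / np.linalg.norm(signature)
    embs = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    sims = embs @ sig
    return np.nonzero(sims >= similarity_threshold)[0]
```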
Regarding claim 9, Wojcicki in view of Fanelli and Kim teaches collecting one or more segments of voice content of the first user account comprises: grouping respective segments of voice content of the first user in a buffer (Fanelli, Fig. 6, “Embeddings Extraction” 603 outputs embeddings (604) for segments of voice content (para. 0056); the embeddings representing these segments of voice content are stored in a buffer 604 (para. 0080 “Audio block segmentation 108 is another step in pipeline 100 (see FIGS. 1, 2 and 6). The embeddings extracted by each speech segment (segment is a time window of speech where a unique speaker is talking), are stored and clustered after all the embeddings have been extracted from each segment over multiple audio blocks.”)). Wojcicki, Fanelli, and Kim are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Fanelli in order to group respective segments of voice content in a buffer. Implementing this would limit the processing and memory usage required to process audio files (Fanelli, para. 0081).

Wojcicki in view of Fanelli and Kim does not specifically disclose: filtering the respective segments of voice content of the first user account according to a segment similarity criteria. Tan teaches collecting one or more segments of voice content (Fig. 6, 605 and 610; para. 0052 “At 605, biometric system 102 … receives audio. The audio may correspond to a call between a user and one or more call agents…”; para. 0053 “At 610, biometric system 102 … divides the audio into segments.”), further comprising filtering the respective segments of voice content of the first user account according to a segment similarity criteria (voice content may be further filtered by comparing segments to a signature vector generated from the user's voice (para. 0022 “Referring to FIG. 1, the ML subsystem 114 may generate a signature vector indicative of voice audio of the user.”), and filtering out segments that are not similar to the user's voice; para. 0025 “The voice biometric subsystem 116 may remove segments from the audio, for example, if it does not match the signature vector. The voice biometric subsystem 116 may determine that a segment does not match the signature vector, for example, if the distance (e.g., the similarity score) between the signature vector and the vector representation of the segment does not satisfy a threshold (e.g., a similarity threshold).”).

Wojcicki, Fanelli, Kim, and Tan are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Tan in order to filter the voice segments. Doing so would separate out portions of audio similar to a particular user's voice from other interfering noise, which could improve quality in audio processing applications such as voice biometrics (Tan, Abstract).
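A minimal sketch of the claim 9 buffering step: segments accumulate in a buffer and are released as a batch for similarity filtering once a threshold count is reached (the threshold-triggered filtering itself is mapped to Tan for claims 10 and 16 below). The three-segment default echoes Tan's example in para. 0027; the class design is an illustrative assumption.

```python
from collections import deque

class SegmentBuffer:
    """Toy buffer that releases segments for filtering once enough
    of them have accumulated."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.segments = deque()

    def add(self, segment):
        """Buffer one segment; return a batch when the threshold is met."""
        self.segments.append(segment)
        if len(self.segments) >= self.threshold:
            batch = list(self.segments)
            self.segments.clear()
            return batch  # caller applies the similarity filter to this batch
        return None

# Toy usage: the third add() releases the first batch.
buf = SegmentBuffer(threshold=3)
for seg in ["s1", "s2", "s3", "s4"]:
    batch = buf.add(seg)
    if batch:
        print(batch)  # ['s1', 's2', 's3']
```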
Regarding claim 10, Wojcicki in view of Fanelli and Kim, and further in view of Tan, discloses filtering the respective segments of voice content upon determining a current amount of buffered segments meets a threshold amount (Tan, para. 0027 “In some embodiments, the system may determine that a biometric should not be generated because the audio (e.g., as a whole) is not suitable for use as a biometric… In this example, the voice biometric subsystem 116 may determine that the audio 211 should not be used to generate a voice biometric because the threshold number of segments that need to remain to use the audio for generating a biometric may be three segments. Additionally or alternatively, the threshold may be a percentage (e.g., 10%, 30%, 65%, etc.). For example, if more than 30% of the segments are removed from the audio, the biometric system 102 may determine that the audio should not be used for generating a biometric for the user.”). Wojcicki, Fanelli, Kim, and Tan are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to specifically filter respective segments of voice content upon determining a threshold amount of buffered segments has been reached. Doing so would be beneficial, as this would provide a metric for determining if sufficient data has been collected and prevent unsuitable data from being used for generating speaker embeddings, improving the accuracy of the model.

Regarding claim 11, Wojcicki in view of Fanelli and Kim, and further in view of Tan, discloses sending input to the audio embedding model based on one or more filtered respective similar segments of voice content (Wojcicki, Fig. 2, audio signals 215 are partitioned into audio segments (para. 0043) and are fed into noise removal model 222 containing the audio embedding model (para. 0049); by combining the teachings of Tan and Fanelli, filtered respective similar segments of voice content are sent as input to the audio embedding model).

Regarding claim 15, Wojcicki in view of Fanelli and Kim teaches collecting one or more segments of voice content of the first user, but does not specifically disclose: filtering respective segments of voice content of the first user according to a segment similarity criteria. Tan teaches collecting one or more segments of voice content (Fig. 6, 605 and 610; para. 0052 “At 605, biometric system 102 … receives audio. The audio may correspond to a call between a user and one or more call agents…”; para. 0053 “At 610, biometric system 102 … divides the audio into segments.”), further comprising filtering respective segments of voice content of the first user according to a segment similarity criteria (voice content may be further filtered by comparing segments to a signature vector generated from the user's voice (para. 0022 “Referring to FIG. 1, the ML subsystem 114 may generate a signature vector indicative of voice audio of the user.”), and filtering out segments that are not similar to the user's voice; para. 0025 “The voice biometric subsystem 116 may remove segments from the audio, for example, if it does not match the signature vector. The voice biometric subsystem 116 may determine that a segment does not match the signature vector, for example, if the distance (e.g., the similarity score) between the signature vector and the vector representation of the segment does not satisfy a threshold (e.g., a similarity threshold).”). Wojcicki, Fanelli, Kim, and Tan are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Tan in order to filter the voice segments. Doing so would separate out portions of audio similar to a particular user's voice from other interfering noise, which could improve quality in audio processing applications such as voice biometrics (Tan, Abstract).

Regarding claim 16, Wojcicki in view of Fanelli and Kim, and further in view of Tan, discloses grouping the respective segments of voice content of the first user account in a buffer (Fanelli, Fig. 6, “Embeddings Extraction” 603 outputs embeddings (604) for segments of voice content (para. 0056); the embeddings representing these segments of voice content are stored in a buffer 604 (para. 0080 “Audio block segmentation 108 is another step in pipeline 100 (see FIGS. 1, 2 and 6). The embeddings extracted by each speech segment (segment is a time window of speech where a unique speaker is talking), are stored and clustered after all the embeddings have been extracted from each segment over multiple audio blocks.”)); and filtering the respective segments of voice content upon determining a current amount of buffered segments meets a threshold amount (Tan, para. 0027 “In some embodiments, the system may determine that a biometric should not be generated because the audio (e.g., as a whole) is not suitable for use as a biometric… In this example, the voice biometric subsystem 116 may determine that the audio 211 should not be used to generate a voice biometric because the threshold number of segments that need to remain to use the audio for generating a biometric may be three segments. Additionally or alternatively, the threshold may be a percentage (e.g., 10%, 30%, 65%, etc.). For example, if more than 30% of the segments are removed from the audio, the biometric system 102 may determine that the audio should not be used for generating a biometric for the user.”). Wojcicki, Fanelli, Kim, and Tan are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have grouped respective segments of voice content in a buffer as taught in Fanelli. Implementing this would limit the processing and memory usage required to process audio files (Fanelli, para. 0081). Additionally, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have filtered the respective segments of voice content upon determining that a threshold amount of buffered segments have been collected. Doing so would be beneficial, as this would provide a metric for determining if sufficient data has been collected and prevent unsuitable data from being used for generating speaker embeddings, improving the accuracy of the model.

Regarding claim 17, Wojcicki in view of Fanelli and Kim, and further in view of Tan, discloses sending input to the audio embedding model based on one or more filtered respective similar segments of voice content (Wojcicki, Fig. 2, audio signals 215 are partitioned into audio segments (para. 0043) and are fed into noise removal model 222 containing the audio embedding model (para. 0049); by combining the teachings of Tan, filtered respective similar segments of voice content are sent as input to the audio embedding model).

7. Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Wojcicki in view of Fanelli and Kim, and further in view of Brown et al. (PGPUB No. 2018/0359294, hereinafter Brown).

Regarding claim 7, Wojcicki in view of Fanelli and Kim discloses the pre-defined audio capture device comprises at least one microphone, but does not specifically disclose the at least one microphone disposed on a headset device. Brown teaches a pre-defined audio capture device (Fig. 1, headset 110) with at least one microphone disposed on a headset device (para. 0012 “The headset 110 may include one or more microphones (e.g., microphones 430 in FIG. 4) that may be integrated within the housing of the headset 110”). Wojcicki, Fanelli, Kim, and Brown are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the pre-defined audio capture device of Wojcicki to incorporate the teachings of Brown in order to have the pre-defined audio capture device be a headset. Doing so would allow for improved speech intelligibility for listeners using the spatial audio headset (Brown, para. 0010).

Allowable Subject Matter

8. Claim 21 would be allowable if rewritten or amended to overcome the rejection under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, inherited from independent claim 1.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to CODY DOUGLAS HUTCHESON, whose telephone number is (703) 756-1601. The examiner can normally be reached M-F 8:00 AM-5:00 PM EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre-Louis Desir, can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/CODY DOUGLAS HUTCHESON/
Examiner, Art Unit 2659

/PIERRE LOUIS DESIR/
Supervisory Patent Examiner, Art Unit 2659

Prosecution Timeline

Dec 23, 2022
Application Filed
Dec 09, 2024
Non-Final Rejection — §102, §103, §112
Apr 10, 2025
Response Filed
May 15, 2025
Final Rejection — §102, §103, §112
Sep 18, 2025
Request for Continued Examination
Sep 22, 2025
Response after Non-Final Action
Dec 09, 2025
Non-Final Rejection — §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603096
VOICE ENHANCEMENT METHODS AND SYSTEMS
2y 5m to grant · Granted Apr 14, 2026
Patent 12591750
GENERATIVE LANGUAGE MODEL UNLEARNING
2y 5m to grant · Granted Mar 31, 2026
Patent 12579447
TECHNIQUES FOR TWO-STAGE ENTITY-AWARE DATA AUGMENTATION
2y 5m to grant · Granted Mar 17, 2026
Patent 12537018
METHOD AND SYSTEM FOR PREDICTING A MENTAL CONDITION OF A SPEAKER
2y 5m to grant · Granted Jan 27, 2026
Patent 12530529
DOMAIN-SPECIFIC NAMED ENTITY RECOGNITION VIA GRAPH NEURAL NETWORKS
2y 5m to grant · Granted Jan 20, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
62%
Grant Probability
99%
With Interview (+47.1%)
2y 10m
Median Time to Grant
High
PTA Risk
Based on 24 resolved cases by this examiner. Grant probability derived from career allow rate.
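As a sanity check on the headline figure, the career allow rate reduces to simple division over this examiner's resolved cases (a sketch of the presumed computation; the interview-adjusted 99% is reported by the dashboard and is not derived here):

```python
granted, resolved = 15, 24   # from the Examiner Intelligence card
allow_rate = granted / resolved
print(f"{allow_rate:.1%}")   # 62.5%, displayed on the card as 62%
```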
