Prosecution Insights
Last updated: April 19, 2026
Application No. 18/550,429

SPEAKER DIARIZATION SUPPORTING EPISODICAL CONTENT

Non-Final OA §103
Filed: Sep 13, 2023
Examiner: ZHU, RICHARD Z
Art Unit: 2654
Tech Center: 2600 — Communications
Assignee: Dolby Laboratories Licensing Corporation
OA Round: 3 (Non-Final)
Grant Probability: 69% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 2m
With Interview: 85%

Examiner Intelligence

Career Allow Rate: 69% (498 granted / 718 resolved), +7.4% vs TC average (above average)
Interview Lift: +15.4% for resolved cases with an interview vs. without (strong)
Typical Timeline: 3y 2m average prosecution; 32 applications currently pending
Career History: 750 total applications across all art units
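Quick arithmetic check on the headline figure: 498 granted out of 718 resolved works out to 498 / 718 ≈ 69.4%, consistent with the 69% career allow rate shown above.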

Statute-Specific Performance

§101: 16.0% (-24.0% vs TC avg)
§103: 54.5% (+14.5% vs TC avg)
§102: 19.7% (-20.3% vs TC avg)
§112: 4.2% (-35.8% vs TC avg)
Deltas are relative to the estimated Tech Center average • Based on career data from 718 resolved cases

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 03/02/2026 has been entered.

Status of the Claims

Claims 1-23 are pending.

Response to Applicant's Arguments

In response to Applicant's argument that "Accordingly, the specification makes clear that the step of 'statistically combining the multiple extracted embeddings into an average embedding for the identified segment' can be performed on d-vectors 'to reduce noise and improve accuracy.' Again, nowhere does Yoshioka disclose averaging d-vectors": in view of such amendments to claims 1 and 19, the anticipation rejection under 35 USC 102 has been withdrawn. Upon further search and consideration, please see the details of a new combination of references set forth below.

Claim Rejections - 35 USC § 103

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 103 that form the basis for the rejections under this section made in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3 and 19-23 are rejected under 35 USC 103(a) as being unpatentable over Yoshioka et al. (US 11023690 B2) in view of Wang et al. ("Speaker Diarization with LSTM").

Regarding Claim 1, Yoshioka discloses a method comprising: receiving, with at least one processor (Col 3, Rows 9-14 and Fig. 20, processing unit 2002), media data including one or more utterances (Col 5, Rows 1-18 and Col 10, Rows 26-30, Col 13, Row 66 – Col 14, Row 2, meeting server receives captured audio signals from audio channels corresponding to streaming audio transmitted from corresponding multiple distributed devices); dividing, with the at least one processor, the media data into a plurality of blocks (Col 11, Rows 40-44, prior to recognizing speech and converting the speech to text, first separate the speech into separate channels; e.g., with an M-channel input, the output is N-channels; Col 14, Rows 3-6, separate speech from different speakers speaking at the same time into separate audio channels); identifying, with the at least one processor, segments of each block of the plurality of blocks associated with a single speaker (Col 12, Rows 50-60 and Col 14, Rows 10-18, perform overlap detection such that for each audio channel with multiple overlapping speakers, each speaker results in a separate audio channel and provide the separated audio channels for speech recognition; Col 14, Rows 47-54, Fig.
10, SR decoders 1040 and 1045 provides speech recognition processing and perform segmentation based on voice activity detection, speaker change detection); for each identified segment (Fig. 10, Col 14, Rows 58-64, speaker diarization modules 1050, 1055 receive output of SR decoder modules 1040, 1045 as an N-best list for each segment): extracting, with the at least one processor, multiple embeddings for the identified segment in accordance with a machine learning model (Fig. 10, Col 14, Rows 58-64, speaker diarization modules 1050, 1055 receive output of SR decoder modules 1040, 1045 as an N-best list for each segment and uses a deep neural network to extract speaker embeddings such as d-vectors), clustering, with the at least one processor, the embeddings for the identified segments into clusters (Col 14, Row 64 – Col 15, Row 1, perform agglomerative clustering using the embedding features to generate speaker homogeneous subsegments); assigning, with the at least one processor, a speaker label to at least one of the embeddings for the identified segments in accordance with a result of the clustering (Col 15, Rows 1-6, a third operation assigns a speaker identifier to each of the subsegments of the speaker embeddings of the subsegment); and outputting, with the at least one processor, speaker diarization information associated with the media data based in part on the speaker labels (Col 15, Rows 6-7, assign a speaker label to each recognized word of the top speech recognition hypothesis). Yoshioka does not disclose statistically combining the multiple embeddings into an average embedding for the identified segment. Wang discloses a speaker diarization system (Abstract) receiving media data including one or more utterances (Fig. 1, receiving audio), dividing the media data into a plurality of blocks (2. Diarization with D-Vectors, “We use a Voice Activity Detector (VAD) to determine speech segments from the audio”), identifying segments of each block of the plurality of blocks associated with a single speaker (2. Diarization with D-Vectors, “We use a Voice Activity Detector (VAD) to determine speech segments from the audio, which are further divided into smaller non-overlapping segments using a maximal segment length limit”), and for each identified segment: extracting multiple embeddings for the identified segment and statistically combining the multiple embeddings into an average embedding for the identified segment (2. Diarization with D-Vectors, “For each segment, the corresponding d-vectors are first L2 normalized, then averaged to form an embedding of the segment.”), clustering the average embeddings for the identified segments into clusters (3. Clustering, “Online clustering: A speaker label is immediately emitted once a segment is available, without seeing future segments”; 3.1. Naïve online clustering, “In this clustering algorithm, each cluster is represented by the centroid of all its corresponding embeddings. When a new segment embedding is available, we compute its similarities to centroids of all existing clusters. If they are all smaller than the threshold, then create a new cluster containing only this embedding; otherwise, add this embedding to the most similar cluster and update the centroid”), and assigning a speaker label to the statistical embeddings for the identified segments in accordance with a result of the clustering (3. Clustering “Online Clustering: a speaker label is immediately emitted once a segment is available, without seeing future segments”). 
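For reference, the Wang steps quoted above (L2-normalizing the d-vectors of a segment, averaging them into a single segment embedding, and naive online clustering against cluster centroids) can be illustrated with a minimal sketch. This is an editor's illustration only, not code from the application or the cited references; the function names and the 0.5 similarity threshold are assumptions.

```python
# Illustrative sketch of the segment-embedding and naive online clustering
# steps quoted from Wang, "Speaker Diarization with LSTM", Secs. 2 and 3.1.
# Names and the 0.5 threshold are assumptions for the example.
import numpy as np

def segment_embedding(d_vectors: np.ndarray) -> np.ndarray:
    """L2-normalize each window-level d-vector, then average them
    into a single embedding for the segment."""
    normed = d_vectors / np.linalg.norm(d_vectors, axis=1, keepdims=True)
    return normed.mean(axis=0)

def naive_online_clustering(segment_embeddings, threshold=0.5):
    """Emit a speaker label for each segment as it arrives; each cluster
    is represented by the centroid of its member embeddings."""
    centroids, members, labels = [], [], []
    for emb in segment_embeddings:
        if centroids:
            sims = [np.dot(emb, c) / (np.linalg.norm(emb) * np.linalg.norm(c))
                    for c in centroids]
            best = int(np.argmax(sims))
        if not centroids or max(sims) < threshold:
            # No sufficiently similar cluster: start a new one.
            centroids.append(emb.copy())
            members.append([emb])
            labels.append(len(centroids) - 1)
        else:
            # Join the most similar cluster and update its centroid.
            members[best].append(emb)
            centroids[best] = np.mean(members[best], axis=0)
            labels.append(best)
    return labels
```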
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to statistically combine the multiple embeddings into an average embedding for the identified segment, clustering the average embeddings for the identified segments into clusters, and assign a speaker label to the statistical embeddings for the identified segments in order to provide a state of the art speaker diarization system using a new d-vector based approach to speaker diarization (Wang, Abstract).

Regarding Claim 19, Yoshioka discloses a method comprising: receiving, with at least one processor (Col 3, Rows 9-14 and Fig. 20, processing unit 2002), media data including one or more utterances (Col 5, Rows 1-18 and Col 10, Rows 26-30, Col 13, Row 66 – Col 14, Row 2, meeting server receives captured audio signals from audio channels corresponding to streaming audio transmitted from corresponding multiple distributed devices); dividing, with the at least one processor, the media data into a plurality of blocks (Col 11, Rows 40-44, prior to recognizing speech and converting the speech to text, first separate the speech into separate channels; e.g., with an M-channel input, the output is N-channels; Col 14, Rows 3-6, separate speech from different speakers speaking at the same time into separate audio channels); identifying, with the at least one processor, segments of each block of the plurality of blocks associated with a single speaker (Col 12, Rows 50-60 and Col 14, Rows 10-18, perform overlap detection such that for each audio channel with multiple overlapping speakers, each speaker results in a separate audio channel and provide the separated audio channels for speech recognition; Col 14, Rows 47-54, Fig. 10, SR decoders 1040 and 1045 provides speech recognition processing and perform segmentation based on voice activity detection, speaker change detection); for each identified segment (Fig. 10, Col 14, Rows 58-64, speaker diarization modules 1050, 1055 receive output of SR decoder modules 1040, 1045 as an N-best list for each segment): extracting, with the at least one processor, multiple embeddings for the identified segment in accordance with a machine learning model (Fig. 10, Col 14, Rows 58-64, speaker diarization modules 1050, 1055 receive output of SR decoder modules 1040, 1045 as an N-best list for each segment and uses a deep neural network to extract speaker embeddings such as d-vectors), clustering, with the at least one processor, the embeddings for the identified segments into clusters (Col 14, Row 64 – Col 15, Row 1, perform agglomerative clustering using the embedding features to generate speaker homogeneous subsegments); assigning, with the at least one processor, a speaker label to at least one of the embeddings for the identified segments in accordance with a result of the clustering (Col 15, Rows 1-6, a third operation assigns a speaker identifier to each of the subsegments of the speaker embeddings of the subsegment); and outputting, with the at least one processor, speaker diarization information associated with the media data based in part on the speaker labels (Col 15, Rows 6-7, assign a speaker label to each recognized word of the top speech recognition hypothesis). Yoshioka does not disclose statistically combining the multiple embeddings into an average embedding for the identified segment and perform spectral clustering of the average embeddings for the identified segments into clusters.
Wang discloses a speaker diarization system (Abstract) receiving media data including one or more utterances (Fig. 1, receiving audio), dividing the media data into a plurality of blocks (2. Diarization with D-Vectors, “We use a Voice Activity Detector (VAD) to determine speech segments from the audio”), identifying segments of each block of the plurality of blocks associated with a single speaker (2. Diarization with D-Vectors, “We use a Voice Activity Detector (VAD) to determine speech segments from the audio, which are further divided into smaller non-overlapping segments using a maximal segment length limit”), and for each identified segment: extracting multiple embeddings for the identified segment and statistically combining the multiple embeddings into an average embedding for the identified segment (2. Diarization with D-Vectors, “For each segment, the corresponding d-vectors are first L2 normalized, then averaged to form an embedding of the segment.”), spectral clustering the average embeddings for the identified segments into clusters (3. Clustering, “Offline clustering: Speaker labels are produced after the embeddings of all segments are available”; 3.4. Spectral offline clustering, construct affinity matrix A comprising cosine similarity between ith segment embedding and jth segment embedding, apply a sequence of refinement operations on the affinity matrix A to smooth and denoise the data based on temporal locality of speech data (“contiguous speech segments should have similar embeddings, and hence similar values in the affinity matrix”), perform eigen-decomposition of the refined affinity matrix to replace ith segment embedding by corresponding dimension in these eigen-vectors, and perform K-means algorithm to cluster these new embeddings), and assigning a speaker label to the statistical embeddings for the identified segments in accordance with a result of the clustering (3.4 Spectral offline clustering, step 4, replace the ith segment embedding by the corresponding dimension in these eigen-vectors, use K-means algorithm to cluster these new embeddings, and produce speaker labels). It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to statistically combine the multiple embeddings into an average embedding for the identified segment, perform spectral clustering of the average embeddings for the identified segments into clusters, and assign a speaker label to the statistical embeddings for the identified segments in order to provide a state of the art speaker diarization system using a new d-vector based approach to speaker diarization (Wang, Abstract). Regarding Claims 2 and 20, Yoshioka discloses before dividing the media data into a plurality of blocks, performing, with the at least one processor, a spatial conversion on the media data (Col 12, Rows 24-31, given M channels, perform blind beamforming to estimate M-dimensional spatial covariance matrices of speech and background noise; the matrices capture spatial statistics of the speech and noise; per Col 11, Rows 40-43, this is done prior to recognize speech and converting the speech to text to output N-channel beamformed signal). 
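For reference, the spectral offline clustering steps from Wang Sec. 3.4 quoted in the Claim 19 analysis above (cosine-similarity affinity matrix, refinement, eigen-decomposition, K-means on the new embeddings) can be sketched as follows. Editor's illustration only; the refinement step is deliberately simplified relative to Wang's full sequence of operations, and the number of speakers is assumed known for brevity.

```python
# Illustrative sketch of spectral offline clustering over average segment
# embeddings, following the Wang Sec. 3.4 steps quoted above. The simplified
# refinement, parameter names, and values are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def spectral_diarization(embeddings, num_speakers, num_eig=None):
    """embeddings: (num_segments, dim) array of average segment embeddings."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = X @ X.T                       # cosine-similarity affinity matrix

    # Simplified refinement: symmetrize and suppress weak similarities,
    # standing in for Wang's blur/threshold/normalization operations.
    A = np.maximum(A, A.T)
    A[A < np.percentile(A, 50)] = 0.0

    # Eigen-decomposition of the refined affinity matrix; the leading
    # eigenvectors become new per-segment embeddings.
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(eigvals)[::-1]
    k = num_eig or num_speakers
    new_embeddings = eigvecs[:, order[:k]]

    # K-means on the spectral embeddings produces the speaker labels.
    return KMeans(n_clusters=num_speakers, n_init=10).fit_predict(new_embeddings)
```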
Regarding Claims 3 and 21, Yoshioka discloses wherein performing the spatial conversion on the media data comprises: converting a first plurality of channels of the media data into a second plurality of channels different than the first plurality of channels (Col 11, Rows 40-44, prior to recognizing speech and converting the speech to text, first separate the speech into separate channels; e.g., with an M-channel input, the output is N-channels); and dividing the media data into a plurality blocks includes independently dividing each of the second plurality of channels into blocks (Col 12, Rows 50-64, using speech separation neural network to separate overlapped speech signals of different users; per Col 13, Rows 1-6, the model outputs two continuous streams of speech such that when there is overlapping speech between two speakers, each speaker will occupy a distinct output stream; per Col 14, Rows 16-20, for each audio channel with multiple overlapping speakers, each speaker results in a separate audio channel). Regarding Claim 22, Yoshioka discloses a non-transitory computer-readable storage medium storing at least one program for execution by at least one processor of an electronic device (Col 25, Rows 58-59), the at least one program including instructions for performing the method of claim 1 (Col 25, Rows 58-60). Regarding Claim 23, Yoshioka discloses a system (Fig. 20) comprising: at least one processor (Col 25, Rows 4-5, processing unit 2002); and a memory (Col 25, Rows 4-5, memory 2003) coupled to the at least one processor storing at least one program for execution by the at least one processor, the at least one program including instructions for performing the method of claim 1 (Col 25, Rows 58-60). Claim 4 is rejected under 35 USC 103(a) as being unpatentable over Yoshioka et al. (US 11023690 B2) in view of Wang et al. (“Speaker Diarization with LSTM”) as applied to claim 1, in further view of Luo et al. (CN112331182B). Regarding Claim 4, Yoshioka does not disclose determine the media data corresponding to a first media type and a second media type. Luo teaches enhancing voice recognition accuracy of the user (p. 7, ¶¶2-3) by training a plurality of machine learning models based on voice data for a plurality of users (p. 7, ¶6, “voice data corresponding to each user in the preset sample library is first determined…simultaneously the voice data of the target users in the preset sample library are determined to be sample voice data of the target users”; p. 7, ¶10 “voice features corresponding to the sample voice data are firstly input into an attention layer, attention points corresponding to the existing voice features are calculated by using the attention layer, when attention points corresponding to the voice features are calculated by the attention layer, an embedded matrix in the trained GPT-2 model can be obtained”; i.e., training a plurality of embedded matrix / model for each target user) further comprising: in accordance with a determination that the media data corresponds to a first media type, the machine learning model is generated from a first set of training data (p. 7, ¶6, determine voice data corresponding to a first user; p. 
7, ¶10, use GPT-2 model to generate or train a first embedding model according to voice features of the first user); and in accordance with a determination the first media data corresponds to a second media type different than the first media type, the machine learning model is generated from a second set of training data different than the first set of training data (p. 7, ¶6, determine voice data corresponding to a second user; p. 7, ¶10, use GPT-2 model to generate or train a first embedding model according to voice features of the second user). It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to generate a plurality of machine learning models / embedding models in accordance to media data’s media type particular to a target user (i.e., voice features of particular users) in order to calculate embedding corresponding to the voice feature of a particular user (Luo, p. 9, ¶5, “, then a query vector, a key vector, and a value vector corresponding to the voice feature can be determined according to the preset embedding matrix”). Claims 5-7 are rejected under 35 USC 103(a) as being unpatentable over Yoshioka et al. (US 11023690 B2) in view of Wang et al. (“Speaker Diarization with LSTM”) as applied to claim 1, in further view of Ma et al. (US 2020/0043471 A1). Regarding Claims 5-7, Yoshioka does not disclose prior to clustering, and in accordance with determining that an optimization criteria is met, further optimizing the extracted embeddings for the identified segments. Ma teaches using historical voice data of users to train user voice models for determining a user identity associated with a current voice data (Abstract) comprising prior to clustering, and in accordance with determining that an optimization criteria is met, further optimizing extracted embeddings for the identified segments (¶73, before clustering is performed on historical voice feature vectors, perform dimensionality reduction on the historical voice feature vectors on which dimensionality reduction is not performed); prior to clustering, and in accordance with determination that an optimization criteria is not met, foregoing further optimizing the extracted embeddings for the identified segments (¶73, before clustering is performed on historical voice feature vectors, perform dimensionality reduction on the historical voice feature vectors on which dimensionality reduction is not performed; i.e., forgo dimensionality reduction on voice feature vectors / embeddings on which dimensionality reduction has been performed); wherein optimizing the extracted embeddings for identified segments includes performing at least one of dimensionality reduction of the extracted embeddings or embedding optimization of the extracted embeddings (¶39, reduce dimensionality of the historical voice feature vector and perform clustering on the historical voice feature vector upon dimensionality reduction). It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to optimize the extracted embeddings for the identified segments in accordance to an optimization criteria prior to clustering in order to reduce dimensionality of the extracted embeddings for the identified segments (Ma, ¶39). Claim 8 is rejected under 35 USC 103(a) as being unpatentable over Yoshioka et al. (US 11023690 B2) in view of Wang et al. (“Speaker Diarization with LSTM”) and Ma et al. (US 2020/0043471 A1) as applied to claim 7, in further view of Luo et al. 
(CN112331182B). Regarding Claim 8, Yoshioka does not disclose wherein embedding optimization includes: training the machine learning model for maximizing separability between the extracted embeddings for identified segments. Luo teaches enhancing voice recognition accuracy of user (p. 7, ¶¶2-3) by training a plurality of machine learning models based on voice data / audio files for a plurality of users to maximize separability between extracted embeddings for identifying voice segments (p. 7, ¶6, “voice data corresponding to each user in the preset sample library is first determined…simultaneously the voice data of the target users in the preset sample library are determined to be sample voice data of the target users”; p. 7, ¶10 “voice features corresponding to the sample voice data are firstly input into an attention layer, attention points corresponding to the existing voice features are calculated by using the attention layer, when attention points corresponding to the voice features are calculated by the attention layer, an embedded matrix in the trained GPT-2 model can be obtained”; i.e., training a distinct embedded matrix / model for each target user) and updating extracted embeddings by applying the machine learning model for maximizing the separability between the extracted embeddings for identified segments to the extracted embeddings identified segments (p. 9, ¶5 “Specifically, the preset embedding matrix is determined by a trained GPT-2 model, that is, the preset embedding matrix can be obtained by training the GPT-2 model, then a query vector, a key vector, and a value vector corresponding to the voice feature can be determined according to the preset embedding matrix”; i.e., apply an embedding model trained according to the voice feature of a target user to process voice data of the target user). It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to update extracted embeddings by applying the machine learning model trained to maximize separability between the extracted embeddings for identified segments to the extracted embeddings identified segments in order to calculate embedding corresponding to the voice feature of a particular user (Luo, p. 9, ¶5, “, then a query vector, a key vector, and a value vector corresponding to the voice feature can be determined according to the preset embedding matrix”). Claims 9-11, 15, and 18 are rejected under 35 USC 103(a) as being unpatentable over Yoshioka et al. (US 11023690 B2) in view of Wang et al. (“Speaker Diarization with LSTM”) as applied to claim 1, in further view of Huang et al. (CN 114155840 A). Regarding Claim 9, Yoshioka does not disclose wherein the clustering comprises: for each identified segment: determining a respective length of the segment. Huang teaches method for distinguishing speech originator (Fig. 1, p. 8, ¶4) comprising identifying segments associated with a single speaker (p. 8, ¶7, Fig. 1, step 110, divide speech signal into a plurality of speech segments by voice recognition processing on each voice segment and dividing each voice segment into each participant), extracting embeddings for the identified segments (p. 8, ¶18, Fig. 1, step 120, extract voiceprint features and spectrum features of each speech segment and fused into feature vectors of each speech segment), and clustering the embeddings for the identified segments into clusters (p. 10, ¶4, Fig. 
1 step 130, clustering each speech segment according to the feature vector to distinguish the speech initiator of each speech segment) wherein the clustering comprises: for each identified segment: determining a respective length of the segment (p. 10, ¶9, determine short speech segments (for example, with a duration of 1s to 2s) and long speech segments); in accordance with a determination that the respective length of the segment is greater than a threshold length (p. 10, ¶6, “a long speech segment is divided from the speech segments according to whether the duration is greater than a second threshold”), assigning the embeddings associated with the respective identified segment according to a first clustering process (p. 10, ¶6, “carrying out hierarchical clustering on each long voice segment”); and in accordance with a determination that the respective length of the segment is not greater than a threshold length (p. 10, ¶5, “speech segments include short speech segments having a duration less than or equal to a second threshold”), assigning the embeddings associated with the respective identified segment according to a second clustering process different from the first clustering process (p. 10, ¶9, “short speech segments (for example, with a duration of 1s to 2s) do not participate in AHC (Hierarchical Clustering) so as to avoid adverse effects on the clustering center. When K-means clustering is performed, the phrase voice segments and the long voice segments are clustered together”; p. 10, ¶11, “therefore, the interference of the short voice segmentation to the clustering center can be prevented, and the clustering accuracy of the short voice segmentation is improved”). It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to for each identified segment: determining a respective length of the segment and perform clustering accordingly in order to eliminate the impact of short speech segments on clustering performance since these short speech segments are easy to be judged wrongly in the clustering process and further have wrong influence on the next clustering (Huang, p. 10, ¶5). Regarding Claim 10, Yoshioka as modified by Huang further comprising: selecting a first clustering process from a plurality of clustering processes based in part on a determination of a quantity of distinct speakers associated with the media data (Huang, p. 6, ¶8, “determines whether to divide any one first speech segment into a plurality of second speech segments again according to an interval duration of each participant in the first speech segment”; p. 10, ¶9, “short speech segments (for example, with a duration of 1s to 2s) do not participate in AHC (Hierarchical Clustering) so as to avoid adverse effects on the clustering center. When K-means clustering is performed, the phrase voice segments and the long voice segments are clustered together”; selecting Agglomerative Hierarchical Clustering or K-means clustering based on whether speech segments are long speech segments or short speech segments where short speech segments are produced based on interval durations of participants in a speech segment). Regarding Claim 11, Yoshioka does not disclose the first clustering process includes spectral clustering. Wang discloses wherein the first clustering process includes spectral clustering (3.4. Spectral offline clustering). 
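For reference, the length-dependent clustering Huang describes in the Claim 9 mapping above (segments longer than a duration threshold are clustered hierarchically, while short segments are kept out of AHC and only folded in when K-means is run over all segments) can be sketched as follows. Editor's illustration only; the scikit-learn calls, the 2.0-second threshold, and seeding K-means with the AHC centers are assumptions.

```python
# Illustrative sketch of length-aware clustering per the Huang passages
# quoted above: only long segments participate in AHC, then all segments
# are assigned via K-means initialized at the AHC cluster centers.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def length_aware_clustering(embeddings, durations, num_speakers,
                            min_long_duration=2.0):
    embeddings = np.asarray(embeddings, dtype=float)
    durations = np.asarray(durations, dtype=float)
    is_long = durations > min_long_duration

    # Hierarchical clustering on long segments only, so that short
    # segments cannot perturb the cluster centers.
    ahc = AgglomerativeClustering(n_clusters=num_speakers)
    long_labels = ahc.fit_predict(embeddings[is_long])
    centers = np.stack([embeddings[is_long][long_labels == k].mean(axis=0)
                        for k in range(num_speakers)])

    # K-means over all segments, initialized at the AHC centers, assigns
    # both long and short segments to speakers.
    km = KMeans(n_clusters=num_speakers, init=centers, n_init=1)
    return km.fit_predict(embeddings)
```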
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to use spectral clustering of embeddings for the identified segments into clusters to accurately identify a speaker with labels (Wang, 3.4. Spectral offline clustering, step 4., cluster new embeddings and produce speaker labels). Regarding Claim 15, Yoshioka does not disclose computing a voiceprint distance metric between a voiceprint embedding and a reference point of each cluster. Huang teaches computing a voiceprint distance metric between a voiceprint embedding and a reference point of each cluster (p. 11, ¶9 “The fusion unit 62 extracts the voiceprint features and the spectrum features of each speech segment and fuses them into feature vectors of each speech segment”; p. 11, ¶10, “The clustering unit 63 performs clustering processing on each voice segment according to the feature vector”; in view of p. 6, ¶10, “determines clustering centers, and performs clustering processing on the speech segments according to the clustering centers”; i.e., clustering respective voice segments into respective clusters based on feature vector (fused voiceprint features and spectrum features) distance metric to respective cluster centers); computing a distance from each reference point to each embedding belonging to that cluster (p. 10, ¶12, “re-determining the attribution relationship of each long voice segment to each candidate clustering center according to the state probability…the state probability is determined according to the distance from the feature vector of the long speech segment to the candidate cluster center”; p. 10, ¶14, “For example, after AHC, candidate cluster centers and the distance of each speech segment to each cluster center are obtained”; i.e., when redetermining the attribution relationship according to the state probability, compute distance from the feature vector of the long speech segment to the candidate cluster center); computing, for each cluster, a probability distribution of the distances of the embeddings from the reference point for that cluster (p. 10, ¶12, “the state probability is determined according to the distance from the feature vector of the long speech segment to the candidate cluster center”; p. 10, ¶14 “determining the state probability of the voice segmentation relative to the clustering center according to the distance from each voice segmentation to the different clustering centers by taking the different clustering centers as different states, and setting the transition probability among the states”; i.e., compute state probability for respective candidate cluster centers requires computing transition probability among the states); for each probability distribution, computing a probability that the voiceprint distance belongs to the probability distribution (p. 10, ¶14, compute the state probability of the voice segmentation relative to the clustering center); ranking the probabilities (p. 10, ¶15, “the Viterbi algorithm is utilized to optimize the attribution relationship of each voice segment to each clustering center”); assigning the voiceprint to one of the clusters based on the ranking (p. 10, ¶15, “the Viterbi algorithm is utilized to optimize the attribution relationship of each voice segment to each clustering center, and then the clustering center is recalculated to be used as the basis of K-means clustering”); and combining a speaker identity associated with the voiceprint with the speaker diarization information (p. 
11, ¶10 “The clustering unit 63 performs clustering processing on each voice segment according to the feature vector, and distinguishes the voice initiator of each voice segment”). It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement clustering to further comprising computing a probability that a voiceprint distance belongs to probability distribution of the distances of the embeddings from reference point / cluster center for respective clusters in order to optimize the attribution relationship of each voice segment to each clustering center (Huang, p. 10, ¶15) when distinguishing the voice initiator of each voice segment (Huang, p. 11, ¶10) and combining speaker identity associated with the voiceprint with the speaker diarization information (Yoshioka, Col 17, Rows 22-25, create a best speaker-attributed word transcript). Regarding Claim 18, Yoshioka discloses generating one or more analytics files or visualizations associated with the media data based in part on the assigned speaker labels (Col 4, Rows 53-57 and Col 10, Rows 15-18, display transcript on a device of the user comprising text and speaker identification). Claims 12-13 are rejected under 35 USC 103(a) as being unpatentable over Yoshioka et al. (US 11023690 B2) in view of Wang et al. (“Speaker Diarization with LSTM”) as applied to claim 1, in further view of Zheng et al. (CN 112562693 A). Regarding Claims 12-13, Yoshioka does not disclose wherein the media data includes a plurality of related files. Zheng teaches determining a speaker based on clustering of media data (p. 8, ¶16, “a method for determining a speaker based on clustering”) wherein the media data includes a plurality of related files (p. 9, ¶2 “the voice information to be recognized is an integral audio file…the voice information to be recognized, that is, the integral audio file, is split and split into a plurality of minimum recognition units, that is, the voice recognition segments” and ¶7, “Here, since the voice recognition section is an audio file”; i.e., voice information to be recognized is an integral audio file comprising a plurality of split audio files / voice recognition units or sections) and selecting a plurality of the related files as the media data, wherein selecting the plurality of related files is based in part on at least one of: a content similarity associated with the plurality of related files; a metadata similarity associated with the plurality of related files; or received data corresponding to a request to process a specific set of files (p. 8, ¶17, “s101, voice information to be recognized is obtained, and a plurality of voice recognition fragments corresponding to the voice information to be recognized are determined”; i.e., determine the plurality of voice recognition fragments means determine a plurality of audio files “since the voice recognition section is an audio file”). It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to process the media data by selecting the plurality of related files as the media data in order to recognize audio from a plurality of mutually independent users, each independent user is provided with a plurality of voice segments (Zheng, p. 9, ¶2). Claim 14 is rejected under 35 USC 103(a) as being unpatentable over Yoshioka et al. (US 11023690 B2) in view of Wang et al. (“Speaker Diarization with LSTM”) and Zheng et al. 
(CN 112562693 A) as applied to claim 12, in further view of Luo et al. (CN112331182B). Regarding Claim 14, Yoshioka does not disclose wherein the machine learning model is selected from a plurality of machine learning models in accordance with one or more properties shared by each of the plurality of related audio files. Luo teaches enhancing voice recognition accuracy of the user (p. 7, ¶¶2-3) by training a plurality of machine learning models based on voice data / audio files for a plurality of users (p. 7, ¶6, “voice data corresponding to each user in the preset sample library is first determined…simultaneously the voice data of the target users in the preset sample library are determined to be sample voice data of the target users”; p. 7, ¶10 “voice features corresponding to the sample voice data are firstly input into an attention layer, attention points corresponding to the existing voice features are calculated by using the attention layer, when attention points corresponding to the voice features are calculated by the attention layer, an embedded matrix in the trained GPT-2 model can be obtained”; i.e., training a plurality of embedded matrix / model for each target user) and selecting a machine learning model in accordance with one or more properties shared by each of a plurality of related audio files (p. 9, ¶5 “Specifically, the preset embedding matrix is determined by a trained GPT-2 model, that is, the preset embedding matrix can be obtained by training the GPT-2 model, then a query vector, a key vector, and a value vector corresponding to the voice feature can be determined according to the preset embedding matrix”; i.e., select an embedding model trained according to the voice feature of a target user to process voice data of the target user). It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to select a machine learning model / embedding model in accordance with one or more properties shared by each of a plurality of related audio files (per Zheng, p. 9, ¶2 and ¶7, an integral audio file with a plurality of voice recognition segments / audio files corresponding to a plurality of users) to perform embedding (Luo, p. 9, ¶5 “then a query vector, a key vector, and a value vector corresponding to the voice feature can be determined according to the preset embedding matrix”) in order to calculate embedding corresponding to the voice feature of a particular user (Luo, p. 9, ¶5, “, then a query vector, a key vector, and a value vector corresponding to the voice feature can be determined according to the preset embedding matrix”). Claims 16-17 are rejected under 35 USC 103(a) as being unpatentable over Yoshioka et al. (US 11023690 B2) in view of Wang et al. (“Speaker Diarization with LSTM”) and Huang et al. (CN 114155840 A) as applied to Claim 15, in further view of Zheng et al. (CN 112562693 A). Regarding Claims 16-17, Yoshioka as modified by Huang does not disclose comparing each probability with a confidence threshold. Zheng teaches determining if a to-be-identified voiceprint signal belongs to a cluster or category of speaker by comparing a corresponding “binding posterior probability to each voiceprint signal to be identified” with a confidence threshold “certain probability” (p. 
13, ¶10, “binding posterior probability to each voiceprint signal to be identified so as to enable the voiceprint signal to belong to a certain cluster in the number of speakers with certain probability”), where posterior probabilities / distributions are modeled as folded Gaussian distributions (p. 13, ¶1, “wherein I represents the ith Gaussian distribution of the Gaussian mixture model”). Note that the Viterbi algorithm optimizes the attribution relationship of each voice segment to each clustering center (Huang, p. 10, ¶15); i.e., finding the highest state probability and corresponding distance from the feature vector of the long voice segment to the candidate clustering center (Huang, p. 10, ¶12). It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to determine if a speaker associated with a probability has spoken based on comparing each probability with a confidence threshold (i.e., for a voice segment and a candidate cluster center, determine if the posterior probability / state probability of the voice segment is within a certain probability of the candidate cluster center) to attribute each voiceprint signal / voice segment to a cluster in a certain posterior probability / state probability (Zheng, p. 13, ¶11).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to examiner Richard Z. Zhu whose telephone number is 571-270-1587 or examiner's supervisor Hai Phan whose telephone number is 571-272-6338. Examiner Richard Zhu can normally be reached on M-Th, 0730-1700. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RICHARD Z ZHU/
Primary Examiner, Art Unit 2654
03/09/2026
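For reference (editor's note appended after the Office Action text), the probabilistic voiceprint-to-cluster assignment discussed in the Claim 15-17 mappings above (distances of member embeddings from each cluster's reference point, a per-cluster distance distribution, a probability that a new voiceprint's distance belongs to each distribution, ranking, and a confidence threshold) can be sketched as follows. A plain Gaussian stands in for the folded-Gaussian modeling Zheng describes, and the 0.1 threshold and function names are assumptions.

```python
# Illustrative sketch of assigning a voiceprint to an existing cluster by
# ranking per-cluster membership probabilities, per the Claim 15-17
# discussion above. Model choice and threshold are assumptions.
import numpy as np
from scipy.stats import norm

def assign_voiceprint(voiceprint, clusters, confidence_threshold=0.1):
    """clusters: list of (num_members, dim) arrays of embeddings per cluster.
    Returns (cluster_index, probability), or (None, best_probability) when
    no cluster clears the confidence threshold."""
    scores = []
    for members in clusters:
        members = np.asarray(members, dtype=float)
        centroid = members.mean(axis=0)                    # reference point
        member_dists = np.linalg.norm(members - centroid, axis=1)
        mu, sigma = member_dists.mean(), member_dists.std() + 1e-6
        d = np.linalg.norm(voiceprint - centroid)          # voiceprint distance
        # Two-sided tail probability that a member of this cluster lies at
        # least this far from the typical member distance.
        scores.append(2 * norm.sf(abs(d - mu), scale=sigma))
    ranked = np.argsort(scores)[::-1]                      # rank probabilities
    best = int(ranked[0])
    if scores[best] < confidence_threshold:
        return None, scores[best]                          # treat as unknown speaker
    return best, scores[best]
```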

Prosecution Timeline

Sep 13, 2023: Application Filed
Jun 21, 2025: Non-Final Rejection — §103
Sep 23, 2025: Response Filed
Dec 04, 2025: Final Rejection — §103
Mar 02, 2026: Examiner Interview Summary
Mar 02, 2026: Request for Continued Examination
Mar 02, 2026: Applicant Interview (Telephonic)
Mar 06, 2026: Response after Non-Final Action
Mar 10, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592228
SPEECH INTERACTION METHOD AND APPARATUS, COMPUTER READABLE STORAGE MEDIUM, AND ELECTRONIC DEVICE
2y 5m to grant • Granted Mar 31, 2026
Patent 12592222
APPARATUSES, COMPUTER PROGRAM PRODUCTS, AND COMPUTER-IMPLEMENTED METHODS FOR ADAPTING SPEECH RECOGNITION CONFIDENCE SCORES BASED ON EXPECTED RESPONSE
2y 5m to grant • Granted Mar 31, 2026
Patent 12586574
ELECTRONIC DEVICE FOR PROCESSING UTTERANCE, OPERATING METHOD THEREOF, AND STORAGE MEDIUM
2y 5m to grant • Granted Mar 24, 2026
Patent 12579978
NETWORKED DEVICES, SYSTEMS, & METHODS FOR INTELLIGENTLY DEACTIVATING WAKE-WORD ENGINES
2y 5m to grant • Granted Mar 17, 2026
Patent 12572739
GENERATING MACHINE INTERPRETABLE DECOMPOSABLE MODELS FROM REQUIREMENTS TEXT
2y 5m to grant • Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 69%
With Interview: 85% (+15.4%)
Median Time to Grant: 3y 2m
PTA Risk: High
Based on 718 resolved cases by this examiner. Grant probability derived from career allow rate.
