Prosecution Insights
Last updated: April 19, 2026
Application No. 18/383,261

Voice Data Generation Method, Voice Data Generation Apparatus And Computer-Readable Recording Medium

Non-Final OA: §101, §102, §103
Filed: Oct 24, 2023
Examiner: CHAVEZ, RODRIGO A
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: Kia Corporation
OA Round: 1 (Non-Final)

Grant Probability: 50% (Moderate)
OA Rounds: 1-2
To Grant: 3y 5m
With Interview: 88%
Examiner Intelligence

Career Allow Rate: 50% (grants 50% of resolved cases; 115 granted / 228 resolved; -11.6% vs TC avg)
Interview Lift: strong, +37.3% on resolved cases with interview
Typical Timeline: 3y 5m avg prosecution; 22 currently pending
Career History: 250 total applications across all art units

Statute-Specific Performance

§101: 16.4% (-23.6% vs TC avg)
§103: 53.1% (+13.1% vs TC avg)
§102: 20.9% (-19.1% vs TC avg)
§112: 5.6% (-34.4% vs TC avg)

Tech Center averages are estimates • Based on career data from 228 resolved cases
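As a sanity check on the figures above, the allow-rate arithmetic can be reproduced directly. This is a sketch under two assumptions not stated by the dashboard: "Career Allow Rate" is simply granted / resolved, and the "vs TC avg" delta is a percentage-point difference.

```python
# Reproduce the dashboard's examiner statistics (assumed formulas).
granted, resolved = 115, 228

allow_rate = granted / resolved        # ~0.504, shown on the dashboard as 50%
implied_tc_avg = allow_rate + 0.116    # dashboard reports -11.6% vs TC avg

print(f"career allow rate:   {allow_rate:.1%}")
print(f"implied TC average:  {implied_tc_avg:.1%}")
```

If these assumptions hold, the Tech Center 2600 average implied by the dashboard is roughly 62%.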

Office Action

Rejections: §101, §102, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 10/24/2023 was filed. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-14 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.

The Supreme Court has long held that “[l]aws of nature, natural phenomena, and abstract ideas are not patentable.” Alice Corp. Pty. Ltd. v. CLS Bank Int’l, 134 S. Ct. 2347, 2354 (2014) (quoting Assoc. for Molecular Pathology v. Myriad Genetics, Inc., 133 S. Ct. 2107, 2116 (2013) (internal quotation marks omitted)). The “abstract ideas” category embodies the longstanding rule that an idea, by itself, is not patentable. Alice Corp., 134 S. Ct. at 2355 (quoting Gottschalk v. Benson, 409 U.S. 63, 67 (1972)). In Alice, the Supreme Court set forth an analytical “framework for distinguishing patents that claim laws of nature, natural phenomena, and abstract ideas [or mental processes] from those that claim patent-eligible applications of those concepts.” Id. at 2355 (citing Mayo Collaborative Servs. v. Prometheus Labs., Inc., 132 S. Ct. 1289, 1296–97 (2012)). The first step in the analysis is to “determine whether the claims at issue are directed to one of those patent-ineligible concepts.” Id.
If the claims are directed to a patent-ineligible concept, the second step in the analysis is to consider the elements of the claims “individually and ‘as an ordered combination’” to determine whether there are additional elements that “‘transform the nature of the claim’ into a patent-eligible application.” Id. (quoting Mayo, 132 S. Ct. at 1298, 1297). In other words, the second step is to “search for an ‘inventive concept’—i.e., an element or combination of elements that is ‘sufficient to ensure that the patent in practice amounts to significantly more than a patent upon the [ineligible concept] itself’”. Id. (brackets in original) (quoting Mayo, 132 S. Ct. at 1294). The prohibition against patenting an abstract idea “‘cannot be circumvented by attempting to limit the use of the formula to a particular technological environment’ or adding ‘insignificant post-solution activity.’” Bilski v. Kappos, 561 U.S. 593, 610–11 (2010) (citation omitted).

Step 1: This part of the eligibility analysis evaluates whether the claim falls within any statutory category. See MPEP 2106.03.

Independent Claim 1 recites a method of determining a number of speakers and a number of voice segments for each speaker for voice data generation, arranging the determined number of voice segments, and generating voice data based on the arranging, and thus is a process (a series of steps or acts). A process is a statutory category of invention. Independent Claim 8 recites an apparatus comprising a processor and memory configured to execute a method similar to Claim 1 and, additionally, training of a learning model based on the generated voice data. An apparatus is a statutory category of invention. Dependent claims 2-7 and 9-14 are dependent on claims 1 and 8, respectively, and therefore recite their respective statutory classes.

Step 2A, Prong One: This part of the eligibility analysis evaluates whether the claim recites a judicial exception.
As explained in MPEP 2106.04, subsection II, a claim “recites” a judicial exception when the judicial exception is “set forth” or “described” in the claim. In applying the framework set out in Alice, the examiner found that Applicant’s claims 1 and 8 are directed to the patent-ineligible abstract concept of arranging voice segments of identified speakers and generating voice data from the arranging. The steps of Applicant’s claims 1-14 are an abstract concept that would fall under the judicial exception of mental processes.

Specifically, the claims recite the step of “determining a number of a plurality of speakers to be used for voice data generation.” Under the broadest reasonable interpretation, the claim recites a determination that may be performed in the human mind by listening to a conversation between multiple people and counting how many distinct voices are heard. The limitation does not positively recite any voice data generation and thus is simply interpreted as intended use. Therefore, this step is directed to a mental process.

Furthermore, the step of “determining a number of voice segments for each of the plurality of speakers” also recites a mental process. Similar to the previous limitation, it recites a determination that may be performed in the human mind by keeping count of how many times each distinct speaker speaks during the conversation. Thus, the limitation is directed to a mental process.

Further, the claim recites “arranging the determined number of voice segments for each of the plurality of speakers, wherein the arranging of the determined number of voice segments comprises, based on an end point of a voice segment of a first speaker of the plurality of speakers, determining a start point of a voice segment of a second speaker”. The claim does not place any limits on how the arranging is performed other than identifying start points and end points of each speaker’s turn in the conversation.
Thus, under the broadest reasonable interpretation, the claim elements are directed to any process of organizing information regarding the start and end points of each speaker’s turn, which may be performed in the mind or with pen and paper. Therefore, the above steps are also directed to mental processes.

Furthermore, the step of “generating, based on the arranging, voice data” also recites a mental process. The recited element does not place any limit on how the voice data is generated or what kind of data is generated. Under the broadest reasonable interpretation, the generated voice data may be, but is not limited to, a human using pen and paper to arrange each voice segment, where the words associated with each voice segment are simply written on paper as the “voice data”, or a human repeating with his or her own voice the voice segments that were identified for each speaker, etc.

Finally, the step of “train, based on the generated voice data, a learning model associated with speaker diarization” does not fall under the mental processes class because using this data to train a learning model cannot be performed in the human mind. Further, this step is not a mathematical algorithm and does not describe a method of organizing human activity. The claims recite limitations that, taken in combination, recite at least a series of mental processes.

Step 2A, Prong Two: This part of the eligibility analysis evaluates whether the claim as a whole integrates the recited judicial exception into a practical application of the exception. This evaluation is performed by (1) identifying whether there are any additional elements recited in the claim beyond the judicial exception, and (2) evaluating those additional elements individually and in combination to determine whether the claim as a whole integrates the exception into a practical application. See MPEP 2106.04(d).
As discussed above, independent claim 8 recites training, based on the generated voice data, a learning model associated with speaker diarization as an additional element beyond the judicial exception. The examiner has found, however, that the training step provides no further detail and is recited at such a high level of generality that this limitation is merely a post-solution step. The claim does not provide any details regarding how the training is performed and/or what kind of learning model is being trained. Therefore, this step is an insignificant extra-solution activity and does not integrate the judicial exception into a practical application. See MPEP 2106.05(g).

Furthermore, independent Claim 8 further recites “at least one processor” and “at least one memory storing instructions” as additional elements beyond the judicial exception. However, these additional elements do not amount to significantly more than the abstract idea because they constitute a generic computer environment. Alice, 134 S. Ct. at 2357. The claims need meaningful limitations that go beyond generally linking the use of an abstract idea to a particular technological environment. Therefore, the steps are all abstract and the claim as a whole is abstract. “[S]imply appending generic computer functionality to lend speed or efficiency to the performance of an otherwise abstract concept does not meaningfully limit claim scope for purposes of patent eligibility.” CLS Bank, 2013 U.S. App. LEXIS 9493, at *29 (citing Bancorp, 687 F.3d at 1278, and Dealertrack, Inc. v. Huber, 674 F.3d 1315, 1333-34 (Fed. Cir. 2012) (finding that the claimed computer-aided clearinghouse process is a patent-ineligible abstract idea)); SiRF Tech., Inc. v. Int'l Trade Comm'n, 601 F.3d 1319, 1333 (Fed. Cir. 2010) (“In order for the addition of a machine to impose a meaningful limit on the scope of a claim, it must play a significant part in permitting the claimed method to be performed, rather than function solely as an obvious mechanism for permitting a solution to be achieved more quickly, i.e., through the utilization of a computer for performing calculations.”).

Additionally, dependent claims 2-7 and 9-14 do not provide any additional elements that integrate the judicial exception into a practical application. The claims simply describe that the arranging is performed based on an arbitrary time interval that is determined based on a probability distribution. Using a probability distribution falls under the mathematical algorithms class and fails to integrate the judicial exception into a practical application. The dependent claims further recite methods such as labeling and indexing, which also do not integrate the judicial exception into a practical application.

Step 2B: This part of the eligibility analysis evaluates whether the claim as a whole amounts to significantly more than the recited exception, i.e., whether any additional element, or combination of additional elements, adds an inventive concept to the claim. See MPEP 2106.05.

At Step 2A, Prong Two, the additional elements of training the learning model and the “processor” and “memory” were found to be insignificant extra-solution activity and a generic computer environment. At Step 2B, the re-evaluation of the insignificant extra-solution activity consideration takes into account whether or not the extra-solution activity is well understood, routine, and conventional in the field. See MPEP 2106.05(g). Here, the step of outputting the post-mask speech signal is mere data manipulation that is recited at a high level of generality. Therefore, this limitation remains insignificant extra-solution activity even upon reconsideration and does not amount to significantly more.
Even when considered in combination, these additional elements represent mere instructions to apply an exception and insignificant extra-solution activity, and therefore do not provide an inventive concept. Additionally, dependent claims 2-7 and 9-14 do not add an inventive concept.

In conclusion, the Examiner notes that none of the recited steps in Applicant's claims 1-14 refer to a specific machine by reciting structural limitations of any apparatus, or to any specific operations that would cause a machine to be the mechanism to perform these steps. Although the claims may be processed by a computing system having a processor, the computing system is merely a general-purpose computing system. Therefore, all of claims 1-14 are abstract.

Claims 15-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. Claims 15-20 are directed to a computer-readable recording medium storing instructions. The broadest reasonable interpretation of a claim drawn to a computer-readable recording medium (also called machine readable medium and other such variations) typically covers forms of non-transitory tangible media and transitory propagating signals per se, in view of the ordinary and customary meaning of computer readable media, particularly when the specification is silent. See MPEP 2111.01. When the broadest reasonable interpretation of a claim covers a signal per se, the claim must be rejected under 35 U.S.C. § 101 as covering non-statutory subject matter. See In re Nuijten, 500 F.3d 1346, 1356-57 (Fed. Cir. 2007) (transitory embodiments are not directed to statutory subject matter) and Interim Examination Instructions for Evaluating Subject Matter Eligibility Under 35 U.S.C. § 101, Aug. 24, 2009, p. 2.

Although the specification of the instant application considers, at pg. 20, lines 15-16, that the computer-readable recording medium “may be” a “non-transitory computer-readable medium”, the specification is silent regarding the exclusion of transitory propagating signals per se. The USPTO recognizes that applicants may have claims directed to computer readable media that cover signals per se, which the USPTO must reject under 35 U.S.C. § 101 as covering both non-statutory subject matter and statutory subject matter. In an effort to assist the patent community in overcoming a rejection or potential rejection under 35 U.S.C. § 101 in this situation, the USPTO suggests the following approach. A claim drawn to such a computer readable medium that covers both transitory and non-transitory embodiments may be amended to narrow the claim to cover only statutory embodiments, and thereby avoid a rejection under 35 U.S.C. § 101, by adding the limitation “non-transitory” to the claim. Cf. Animals - Patentability, 1077 Off. Gaz. Pat. Office 24 (April 21, 1987) (suggesting that applicants add the limitation “non-human” to a claim covering a multi-cellular organism to avoid a rejection under 35 U.S.C. § 101). Such an amendment would typically not raise the issue of new matter, even when the specification is silent, because the broadest reasonable interpretation relies on the ordinary and customary meaning that includes signals per se. The limited situations in which such an amendment could raise issues of new matter occur, for example, when the specification does not support a non-transitory embodiment because a signal per se is the only viable embodiment, such that the amended claim is impermissibly broadened beyond the supporting disclosure. See, e.g., Gentry Gallery, Inc. v. Berkline Corp., 134 F.3d 1473 (Fed. Cir. 1998).

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 2, 4-6, 8, 9, 11-13, 15, 16 and 18-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lyren et al. (US Patent 9,584,946; hereinafter “Lyren”).

As per claims 1 and 15, Lyren discloses: A method performed by a computing device and computer-readable recording medium storing instructions that, when executed, cause:

determining a number of a plurality of speakers to be used for voice data generation (Lyren; Col. 6, lines 15-18 - …the audio diarization system is provided with or determines a number of speakers, such as a known number of speakers in a radio or television archive broadcast…; see also Col. 20, lines 21-33);

determining a number of voice segments for each of the plurality of speakers (Lyren; Col. 6, lines 41-56 - the technique divides the audio input into a number of segments and then iteratively chooses clusters that closely match to repeatedly reduce an overall number of clusters. Clusters can be modeled with GMM in which a distance metric identifies closest clusters. The process repeats until each speaker has one cluster; see also Col. 18, lines 66-67 - Block 710 states determine how many tracks/segments are in the audio input…; see also Col. 19, lines 1-18);

arranging the determined number of voice segments for each of the plurality of speakers (Lyren; Col. 19, lines 53-60 - an example embodiment determines one or more of which tracks and/or segments to internally localize to a user, which tracks and/or segments to externally localize to the user, which tracks to omit from localization processing, and which tracks to omit from output to the user (arranging)...; see also Col. 16, lines 24-43 - …After the music passes the threshold loudness value, the music is moved to originate at a SLP that is remote from Alice. Alternatively, the audio system reduces the loudness of the music segment, or omits the music segment from the audio output to Alice so she does not hear the music (multiple examples of audio data arrangement); see also Col. 10, lines 15-27 – another example of voice segment arranging where the segments are clustered based on who is speaking and whether they are speech or non-speech),

wherein the arranging of the determined number of voice segments comprises, based on an end point of a voice segment of a first speaker of the plurality of speakers, determining a start point of a voice segment of a second speaker (Lyren; Col. 6, lines 57-67 - the audio diarization system 110 executes or processes temporal distance metrics to determine speaker change locations, such as temporal locations when one speaker stops speaking and another begins speaking. For a particular location in the audio input, the system determines a statistical similarity of the audio on each side of this location and then determines segment boundaries based on a distance curve for this statistical similarity (determining start point based on an end point). Consider another example in which the audio diarization system executes or processes heuristic rules to determine the speaker change locations); and

generating, based on the arranging, voice data (Lyren; Col. 10, lines 28-29 - The audio diarization system 200 outputs the audio output 250 (e.g., one or more sound tracks or segments)).
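The mapping above walks through the four recited steps of claims 1 and 15: determine a number of speakers, determine a number of voice segments per speaker, arrange the segments so that a second speaker's start point is determined from a first speaker's end point, and generate voice data from the arrangement. The steps can be pictured with a minimal sketch; this is illustrative only, not the application's actual implementation, and the fixed segment length and round-robin speaker order are assumptions.

```python
def arrange_segments(num_speakers, segments_per_speaker, seg_len=1.0):
    """Return (speaker, start, end) tuples arranged on a timeline, where each
    segment's start point is the end point of the segment before it."""
    timeline, cursor = [], 0.0
    for _ in range(segments_per_speaker):
        for spk in range(num_speakers):
            start = cursor              # start point = previous end point
            end = start + seg_len
            timeline.append((spk, start, end))
            cursor = end
    return timeline

timeline = arrange_segments(num_speakers=2, segments_per_speaker=2)
# -> [(0, 0.0, 1.0), (1, 1.0, 2.0), (0, 2.0, 3.0), (1, 3.0, 4.0)]
```

"Generating voice data" would then amount to concatenating the audio for each tuple in timeline order.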
As per claims 2 and 16, Lyren discloses: The method and computer-readable recording medium of claims 1 and 15, wherein the arranging of the determined number of voice segments comprises arranging the start point of the voice segment of the second speaker at a position spaced apart from the end point of the voice segment of the first speaker by an arbitrary time interval (Lyren; Col. 19, lines 53-60 - an example embodiment determines one or more of which tracks and/or segments to internally localize to a user, which tracks and/or segments to externally localize to the user, which tracks to omit from localization processing, and which tracks to omit from output to the user (arranging)...; see also Col. 16, lines 24-43 - …After the music passes the threshold loudness value, the music is moved to originate at a SLP that is remote from Alice. Alternatively, the audio system reduces the loudness of the music segment, or omits the music segment from the audio output to Alice so she does not hear the music (omitting a music segment from the audio output constitutes spacing of the speech segments by an arbitrary time interval, where the interval is an arbitrary amount of time filling the space where the omitted sound used to be)).

As per claims 4 and 18, Lyren discloses: The method and computer-readable recording medium of claims 2 and 16, wherein the arranging of the determined number of voice segments comprises arranging the start point of the voice segment of the second speaker before or after the end point of the voice segment of the first speaker (Lyren; Col. 6, lines 57-67 - the audio diarization system 110 executes or processes temporal distance metrics to determine speaker change locations, such as temporal locations when one speaker stops speaking and another begins speaking. For a particular location in the audio input, the system determines a statistical similarity of the audio on each side of this location and then determines segment boundaries based on a distance curve for this statistical similarity (determining start point based on an end point). Consider another example in which the audio diarization system executes or processes heuristic rules to determine the speaker change locations).

As per claims 5 and 19, Lyren discloses: The method and computer-readable recording medium of claims 1 and 15, wherein the arranging of the determined number of voice segments comprises arranging first voice segments of the plurality of speakers, wherein the first voice segments are associated with a first index, and after completing the arranging of the first voice segments, arranging second voice segments of the plurality of speakers, wherein the second voice segments are associated with a second index (Lyren; Col. 6, lines 41-56 - the technique divides the audio input into a number of segments and then iteratively chooses clusters that closely match to repeatedly reduce an overall number of clusters. Clusters can be modeled with GMM in which a distance metric identifies closest clusters. The process repeats until each speaker has one cluster… A single cluster for the audio input is iteratively divided until a number of clusters represent the number of speakers; see also Col. 8, lines 59-64 - the system segments the audio input into file chunks for each unique speaker and/or non-speech sound (indexing). The system clusters these file chunks into groups so the audio input is partitioned into homogenous segments according to an identity of the speaker and/or non-speech sound; see also Col. 10, lines 15-27 – speaker segment labeling as a form of indexing).
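Claims 2/16 and 4/18, as characterized above, space the second speaker's start point apart from the first speaker's end point by an arbitrary time interval, and that start point may fall before or after the end point (i.e., a pause or overlapping speech). A one-line sketch with a hypothetical helper, not drawn from the application:

```python
def next_start(prev_end, interval):
    """Second speaker's start point: the first speaker's end point offset by an
    arbitrary interval; a negative interval places the start before the end."""
    return prev_end + interval

gap = next_start(5.0, 0.3)       # start after the end point (a pause)
overlap = next_start(5.0, -0.2)  # start before the end point (overlap)
```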
As per claims 6 and 20, Lyren discloses: The method and computer-readable recording medium of claims 1 and 15, wherein each of the arranged voice segments are labeled with a tag indicating a speaker of the plurality of speakers (Lyren; Col. 10, lines 15-27 - clustering identifies which speaker segments are the same and/or different and can group the clusters (such as providing one cluster for each speaker or each voice). Clustering can also label and/or identify segments of the non-speech and speech. For example, the system labels a non-speech segment as “music 3” and labels a recognized speaker speech segment as “Alice.”).

As per claim 8, Lyren discloses: An apparatus comprising:

at least one processor (Lyren; Fig. 14, item 1424; Col. 34, lines 42-44 - a processor or processing unit 1424 (such as one or more microprocessors and/or microcontrollers)); and at least one memory storing instructions that, when executed by the at least one processor (Lyren; Fig. 14, item 1420; Col. 34, line 41 - computer readable medium (CRM) or memory 1420), cause the apparatus to:

determine a number of a plurality of speakers to be used for voice data generation (Lyren; Col. 6, lines 15-18 - …the audio diarization system is provided with or determines a number of speakers, such as a known number of speakers in a radio or television archive broadcast…; see also Col. 20, lines 21-33);

determine a number of voice segments for each of the plurality of speakers (Lyren; Col. 6, lines 41-56 - the technique divides the audio input into a number of segments and then iteratively chooses clusters that closely match to repeatedly reduce an overall number of clusters. Clusters can be modeled with GMM in which a distance metric identifies closest clusters. The process repeats until each speaker has one cluster; see also Col. 18, lines 66-67 - Block 710 states determine how many tracks/segments are in the audio input…; see also Col. 19, lines 1-18);

arrange the determined number of voice segments for each of the plurality of speakers (Lyren; Col. 19, lines 53-60 - an example embodiment determines one or more of which tracks and/or segments to internally localize to a user, which tracks and/or segments to externally localize to the user, which tracks to omit from localization processing, and which tracks to omit from output to the user (arranging)...; see also Col. 16, lines 24-43 - …After the music passes the threshold loudness value, the music is moved to originate at a SLP that is remote from Alice. Alternatively, the audio system reduces the loudness of the music segment, or omits the music segment from the audio output to Alice so she does not hear the music (multiple examples of audio data arrangement); see also Col. 10, lines 15-27 – another example of voice segment arranging where the segments are clustered based on who is speaking and whether they are speech or non-speech),

wherein arranging the determined number of voice segments comprises, based on an end point of a voice segment of a first speaker of the plurality of speakers, determining a start point of a voice segment of a second speaker (Lyren; Col. 6, lines 57-67 - the audio diarization system 110 executes or processes temporal distance metrics to determine speaker change locations, such as temporal locations when one speaker stops speaking and another begins speaking. For a particular location in the audio input, the system determines a statistical similarity of the audio on each side of this location and then determines segment boundaries based on a distance curve for this statistical similarity (determining start point based on an end point). Consider another example in which the audio diarization system executes or processes heuristic rules to determine the speaker change locations);

generate, based on the arranging, voice data (Lyren; Col. 10, lines 28-29 - The audio diarization system 200 outputs the audio output 250 (e.g., one or more sound tracks or segments)); and

train, based on the generated voice data, a learning model associated with speaker diarization (Lyren; Col. 15, lines 9-22 - an audio diarization system segments each call and each audio source for a duration of a training period, such as several minutes, several hours, several days, or several weeks. The system creates and refines models for sounds played to the user or received by the system, such as calls and sounds that the user does and does not localize. During the training period, due to refinement from multiple calls and playing multiple other sound sources, an increase occurs in the quality of the models built to identify various voices, sound types, sounds, or segments, and an increase in the accuracy of the models in identifying the various sounds. The system can refer to the saved mature models to process future calls for which the user or an electronic device requests segmentation and/or localization).

As per claim 9, the claim recites an apparatus as dependent on claim 8, and recites similar language as the method of claim 2. Thus the claim is rejected similar to claim 2.

As per claim 11, the claim recites an apparatus as dependent on claim 9, and recites similar language as the method of claim 4. Thus the claim is rejected similar to claim 4.

As per claim 12, the claim recites an apparatus as dependent on claim 8, and recites similar language as the method of claim 5. Thus the claim is rejected similar to claim 5.

As per claim 13, the claim recites an apparatus as dependent on claim 8, and recites similar language as the method of claim 6. Thus the claim is rejected similar to claim 6.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 3, 7, 10, 14 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Lyren in view of Reshef (US PG Pub 20220157322).

As per claims 3, 10 and 17, Lyren discloses: The method, apparatus and computer-readable recording medium of claims 2, 9 and 16, upon which claims 3, 10 and 17 depend. Lyren, however, fails to disclose wherein the arbitrary time interval is determined according to a probability distribution selected from a group of probability distributions, wherein the group of probability distributions comprises at least one of: a normal distribution, a continuous uniform distribution, or a Student's t-distribution.

Reshef does teach wherein the arbitrary time interval is determined according to a probability distribution selected from a group of probability distributions, wherein the group of probability distributions comprises at least one of: a normal distribution, a continuous uniform distribution, or a Student's t-distribution (Reshef; Fig. 5, item 110; p. 0057-0058 - Processor 36 uses these new speaker state identifications in refining the segmentation of the audio stream (determine arbitrary time interval), at a segmentation refinement step 110.
To avoid errors at this stage, the processor typically applies a threshold to the speaker state probability values, so that only speaker state identifications having high measures of confidence are used in the resegmentation; see also p. 0049 - refining the segmentation of a conversation, in accordance with an embodiment of the invention… The present method uses a statistical model, such as a Gaussian Mixture Model (GMM), to characterize the speakers in the conversation, together with a state-based model, such as a Hidden Markov Model (HMM), to track transitions between speakers (normal distribution)… the refinement process uses a statistical model such as a GMM, which is a probabilistic model that represents data as a combination of multiple normal (Gaussian) distributions, and an HMM model to select most probable segmentation time points (intervals) for each speaker and refine the segmentation time points in multiple iterations until the most probable segments for each independent speaker are determined).

Therefore, it would have been obvious to one of ordinary skill in the art to modify the method, apparatus and computer-readable recording medium of Lyren to include wherein the arbitrary time interval is determined according to a probability distribution selected from a group of probability distributions, wherein the group of probability distributions comprises at least one of: a normal distribution, a continuous uniform distribution, or a Student's t-distribution, as taught by Reshef, in order to give an accurate rendition of the conversation during a conference (Reshef; p. 0003).

As per claims 7 and 14, Lyren discloses: The method and apparatus of claims 1 and 8, wherein the arranging of the determined number of voice segments comprises:

arranging a plurality of first voice segments of the plurality of speakers, wherein the plurality of first voice segments are associated with a first index (Lyren; Col. 6, lines 41-56 - the technique divides the audio input into a number of segments and then iteratively chooses clusters that closely match to repeatedly reduce an overall number of clusters. Clusters can be modeled with GMM in which a distance metric identifies closest clusters. The process repeats until each speaker has one cluster… A single cluster for the audio input is iteratively divided until a number of clusters represent the number of speakers; see also Col. 8, lines 59-64 - the system segments the audio input into file chunks for each unique speaker and/or non-speech sound (indexing). The system clusters these file chunks into groups so the audio input is partitioned into homogenous segments according to an identity of the speaker and/or non-speech sound; see also Col. 10, lines 15-27 – speaker segment labeling as a form of indexing), and wherein at least two of the plurality of first voice segments are arranged based on: an end point of a preceding one of the at least two of the plurality of first voice segments (Lyren; Col. 6, lines 57-67 - the audio diarization system 110 executes or processes temporal distance metrics to determine speaker change locations, such as temporal locations when one speaker stops speaking and another begins speaking. For a particular location in the audio input, the system determines a statistical similarity of the audio on each side of this location and then determines segment boundaries based on a distance curve for this statistical similarity (determining start point based on an end point). Consider another example in which the audio diarization system executes or processes heuristic rules to determine the speaker change locations); and

arranging a plurality of second voice segments of the plurality of speakers, wherein the plurality of second voice segments are associated with a second index (Lyren; Col. 6, lines 41-56 - the technique divides the audio input into a number of segments and then iteratively chooses clusters that closely match to repeatedly reduce an overall number of clusters. Clusters can be modeled with GMM in which a distance metric identifies closest clusters. The process repeats until each speaker has one cluster… A single cluster for the audio input is iteratively divided until a number of clusters represent the number of speakers; see also Col. 8, lines 59-64 - the system segments the audio input into file chunks for each unique speaker and/or non-speech sound (indexing). The system clusters these file chunks into groups so the audio input is partitioned into homogenous segments according to an identity of the speaker and/or non-speech sound; see also Col. 10, lines 15-27 – speaker segment labeling as a form of indexing), and wherein at least two of the plurality of second voice segments are arranged based on: an end point of a preceding one of the at least two of the plurality of second voice segments (Lyren; Col. 6, lines 57-67 - the audio diarization system 110 executes or processes temporal distance metrics to determine speaker change locations, such as temporal locations when one speaker stops speaking and another begins speaking. For a particular location in the audio input, the system determines a statistical similarity of the audio on each side of this location and then determines segment boundaries based on a distance curve for this statistical similarity (determining start point based on an end point). Consider another example in which the audio diarization system executes or processes heuristic rules to determine the speaker change locations), wherein a voice segment of the plurality of second voice segments that precedes the remaining voice segments of the plurality of second voice segments is arranged based on: an end point of a last voice segment of the plurality of first voice segments (Lyren; Col.
6, lines 57-67 - the audio diarization system 110 executes or processes temporal distance metrics to determine speaker change locations, such as temporal locations when one speaker stops speaking and another begins speaking. For a particular location in the audio input, the system determines a statistical similarity of the audio on each side of this location and then determines segment boundaries based on a distance curve for this statistical similarity (determining start point based on an end point). Consider another example in which the audio diarization system executes or processes heuristic rules to determine the speaker change locations). Lyren, however, fails to disclose wherein at least two of the plurality of first voice segments are arranged based on a first time offset associated with at least one probability distribution; wherein at least two of the plurality of second voice segments are arranged based on: a second time offset associated with the at least one probability distribution; wherein a voice segment of the plurality of second voice segments that precedes the remaining voice segments of plurality of second voice segments is arranged based on: a third time offset associated with the at least one probability distribution. Reshef does teach wherein at least two of the plurality of first voice segments are arranged based on a first time offset associated with at least one probability distribution; wherein at least two of the plurality of second voice segments are arranged based on: a second time offset associated with the at least one probability distribution; wherein a voice segment of the plurality of second voice segments that precedes the remaining voice segments of plurality of second voice segments is arranged based on: a third time offset associated with the at least one probability distribution (Reshef; Fig. 5, item 110; p. 
0057-0058 - Processor 36 uses these new speaker state identifications in refining the segmentation of the audio stream (offset), at a segmentation refinement step 110. To avoid errors at this stage, the processor typically applies a threshold to the speaker state probability values, so that only speaker state identifications having high measures of confidence are used in the resegmentation; see also p. 0049 - refining the segmentation of a conversation, in accordance with an embodiment of the invention… The present method uses a statistical model, such as a Gaussian Mixture Model (GMM), to characterize the speakers in the conversation, together with a state-based model, such as a Hidden Markov Model (HMM), to track transitions between speakers (normal distribution)… the refinement process uses a statistical model such as a GMM, which is a probabilistic model that represents data as a combination of multiple normal (Gaussian) distributions, and an HMM model to select most probable segmentation time points (intervals) for each speaker and refine the segmentation time points in multiple iterations until the most probable segments for each independent speaker are determined) Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and apparatus of Lyren to include wherein at least two of the plurality of first voice segments are arranged based on a first time offset associated with at least one probability distribution; wherein at least two of the plurality of second voice segments are arranged based on: a second time offset associated with the at least one probability distribution; wherein a voice segment of the plurality of second voice segments that precedes the remaining voice segments of plurality of second voice segments is arranged based on: a third time offset associated with the at least one probability distribution, as taught by Reshef, in order to give an accurate rendition of the conversation during a conference 
(Reshef; p. 0003). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art made of record and not relied upon includes: Ghaemmaghami (US PG Pub 20190304470) discloses methods and systems for performing automatic diarization of sound recordings including speech from one more speakers. The automatic diarization has a development or training phase and a utilization or evaluation phase. In the development or training phase background models and hyperparameters are generated from already annotated sound recordings. These models and hyperparameters are applied during the evaluation or utilization phase to diarization new or not previously diarized or annotated recordings (Ghaemmaghami; Abstract). Fanelli (US PG Pub 20240160849) discloses embodiments for speaker diarization supporting episodical content. In an embodiment, a method comprises: receiving media data including one or more utterances; dividing the media data into a plurality of blocks; identifying segments of each block of the plurality of blocks associated with a single speaker; extracting embeddings for the identified segments in accordance with a machine learning model, wherein extracting embeddings for identified segments further comprises statistically combining extracted embeddings for identified segments that correspond to a respective continuous utterance associated with a single speaker; clustering the embeddings for the identified segments into clusters; and assigning a speaker label to each of the embeddings for the identified segments in accordance with a result of the clustering. In some embodiments, a voiceprint is used to identify a speaker and the speaker identity for a speaker label (Fanelli; Abstract). Any inquiry concerning this communication or earlier communications from the examiner should be directed to Rodrigo A Chavez whose telephone number is (571)270-0139. The examiner can normally be reached Monday - Friday 9-6 ET. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached at 5712727602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /RODRIGO A CHAVEZ/Examiner, Art Unit 2658 /RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658
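To make the disputed claim language concrete, the timing limitations at issue in the §103 rejection (an "arbitrary time interval" drawn from a group of probability distributions, and voice segments arranged from the end point of the preceding segment plus a distribution-derived time offset) can be sketched as follows. This is an illustrative reconstruction of the claim wording only, not code from the application or from Lyren or Reshef; all function and parameter names (and the distribution parameters) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical samplers for the claimed "group of probability distributions":
# a normal distribution, a continuous uniform distribution, or a
# Student's t-distribution (claims 3, 10 and 17).
DISTRIBUTIONS = {
    "normal": lambda: rng.normal(loc=0.5, scale=0.1),
    "uniform": lambda: rng.uniform(low=0.1, high=0.9),
    "student_t": lambda: 0.5 + 0.1 * rng.standard_t(df=5),
}

def sample_interval(name: str) -> float:
    """Draw an arbitrary time interval (seconds) from the selected
    distribution, clipped so the offset is never negative."""
    return max(0.0, DISTRIBUTIONS[name]())

def arrange_segments(durations, distribution="normal"):
    """Place each voice segment so its start point is the end point of the
    preceding segment plus a time offset drawn from the chosen distribution
    (the arrangement recited in claims 7 and 14)."""
    placed, cursor = [], 0.0
    for d in durations:
        start = cursor + sample_interval(distribution)
        end = start + d
        placed.append((start, end))
        cursor = end
    return placed

# First-index segments for one speaker, then second-index segments for
# another; the first second-index segment is placed after the end point of
# the last first-index segment plus a further sampled offset.
first = arrange_segments([1.2, 0.8], distribution="normal")
base = first[-1][1]
second = [(s + base, e + base)
          for s, e in arrange_segments([1.0], distribution="uniform")]
```

Under these assumptions, the segments are non-overlapping by construction: each start point is at or after the previous end point, separated by the sampled offset.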
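Lyren's cited technique of determining "a statistical similarity of the audio on each side of this location" and locating segment boundaries on a distance curve can also be sketched generically. The sketch below uses a symmetric KL divergence between Gaussians fitted to each side of a candidate point on toy one-dimensional features; it is a common textbook formulation and is not asserted to be the actual embodiment of Lyren or Reshef.

```python
import numpy as np

def gaussian_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric KL divergence between univariate Gaussians fitted to the
    feature windows a and b; larger means the two sides look less alike.
    A small variance floor avoids division by zero."""
    m1, v1 = a.mean(), a.var() + 1e-8
    m2, v2 = b.mean(), b.var() + 1e-8
    return 0.5 * ((v1 / v2 + v2 / v1)
                  + (m1 - m2) ** 2 * (1 / v1 + 1 / v2)) - 1.0

def distance_curve(features: np.ndarray, win: int) -> np.ndarray:
    """Score every candidate location by comparing the features on each
    side; speaker-change boundaries are where the curve peaks."""
    scores = np.zeros(len(features))
    for t in range(win, len(features) - win):
        scores[t] = gaussian_distance(features[t - win:t],
                                      features[t:t + win])
    return scores

# Toy features: one speaker centered at 0.0, another at 5.0, change at t=100.
rng = np.random.default_rng(1)
feats = np.concatenate([rng.normal(0.0, 1.0, 100),
                        rng.normal(5.0, 1.0, 100)])
curve = distance_curve(feats, win=20)
boundary = int(curve.argmax())
```

On this toy input the curve peaks at the true speaker change, illustrating why a distance curve over statistical similarity yields the "speaker change locations" the Office Action quotes from Lyren.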

Prosecution Timeline

Oct 24, 2023: Application Filed
Mar 18, 2026: Non-Final Rejection under §101, §102 and §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology:

Patent 12597430: MULTI-CHANNEL SIGNAL GENERATOR, AUDIO ENCODER AND RELATED METHODS RELYING ON A MIXING NOISE SIGNAL (granted Apr 07, 2026; 2y 5m to grant)
Patent 12579984: DATA AUGMENTATION SYSTEM AND METHOD FOR MULTI-MICROPHONE SYSTEMS (granted Mar 17, 2026; 2y 5m to grant)
Patent 12541653: ENTERPRISE COGNITIVE SOLUTIONS LOCK-IN AVOIDANCE (granted Feb 03, 2026; 2y 5m to grant)
Patent 12542136: DYNAMICALLY CONFIGURING A WARM WORD BUTTON WITH ASSISTANT COMMANDS (granted Feb 03, 2026; 2y 5m to grant)
Patent 12531077: METHOD AND APPARATUS IN AUDIO PROCESSING (granted Jan 20, 2026; 2y 5m to grant)

Study what changed to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 50%
With Interview: 88% (+37.3%)
Median Time to Grant: 3y 5m
PTA Risk: Low

Based on 228 resolved cases by this examiner. Grant probability derived from career allow rate.
