DETAILED ACTION
1. This action is responsive to Application No. 18/652,648, filed 5/1/2024. All claims have been examined and are currently pending.
Notice of Pre-AIA or AIA Status
2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
3. The information disclosure statement (IDS) submitted is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Specification
4. The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
Allowable Subject Matter
5. Claim 20 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Regarding claim 20, Ziv teaches:
[0017]: The identification and use of identifiable speech segments as the input for the diarization can further facilitate filtering out noise and other non-speech segments that can interfere with the diarization process.
[0020]: Linguistic or speech pattern rules or models used to identify the homogeneous speech segments can be provided in a file 108 to the transcription server 104.
[0022]: Such speech frames are long enough to perform meaningful spectral analysis in relation to the temporal characteristics of the speech signal, yet they are short enough to give fine granularity to the output. The frames may then be grouped into utterances separated by non-speech segments in the audio file. Each utterance is a segment of speech likely attributed to a single speaker. Non-speech segments in the audio file can be identified by an evaluation of the energy envelope of each of the frames to segment the audio data into a plurality of utterances. In an embodiment, the utterances can be identified through Voice Activity Detection (VAD) as explained in further detail herein with respect to FIGS. 5 and 6.
[0052]: As described above, embodiments of the methods as disclosed herein can segment the audio data file based upon signal entropy. FIG. 6 is a flowchart that depicts an exemplary embodiment of a method 600 of voice activity detection (VAD). VAD may exemplarily be used in audio file segmentation in embodiments of diarization as disclosed herein. As disclosed in further detail, energy values over time can be traced according to the method of FIG. 5. The speech-presence probability is estimated for each frame based on these values.
[0076]: After one or more of the features highlighted above are calculated, an activity probability Q for each frame can be calculated at 616 as a combination of the speech probabilities for the Band energy (P_B), Total energy (P_E), Energy Peakiness (P_P), and Residual Energy (P_R) computed as described above for each frame.
However, Ziv does not teach all of the limitations of dependent claim 20.
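For illustration only, the per-frame probability combination and utterance grouping that Ziv describes at [0022] and [0076] might be sketched as follows; the weighted-geometric-mean combination rule, the threshold, the gap length, and all names are assumptions for exposition, not Ziv's disclosure:

```python
import numpy as np

def activity_probability(p_band, p_total, p_peak, p_resid,
                         weights=(0.25, 0.25, 0.25, 0.25)):
    # Combine the per-frame speech probabilities P_B, P_E, P_P, P_R into a
    # single activity probability Q. A weighted geometric mean is one
    # plausible combination; Ziv [0076] does not fix the exact rule here.
    probs = np.stack([p_band, p_total, p_peak, p_resid])  # (4, n_frames)
    w = np.asarray(weights)[:, None]
    return np.exp(np.sum(w * np.log(np.clip(probs, 1e-9, 1.0)), axis=0))

def frames_to_utterances(q, threshold=0.5, min_gap=5):
    # Group consecutive speech frames (Q >= threshold) into utterances,
    # treating runs of non-speech frames shorter than min_gap as pauses
    # (cf. Ziv [0022]: utterances separated by non-speech segments).
    q = np.asarray(q)
    speech = q >= threshold
    utterances, start, gap = [], None, 0
    for i, is_speech in enumerate(speech):
        if is_speech:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                utterances.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        utterances.append((start, len(speech)))
    return utterances
```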
Claim Rejections - 35 USC § 102
6. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
7. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
8. Claims 1, 5, 9-10, 14, and 18-19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Ziv et al. (US 2014/0074467).
Regarding claim 1, Ziv teaches One or more non-transitory computer readable media comprising instructions that, when executed by one or more hardware processors (fig. 2; [0012]: system; [0013]: speaker diarization system and method; [0018]: It is to be understood that embodiments of the methods of diarization as disclosed herein may be performed by a computer processor executing computer readable code that causes the computer processor to carry out the functions and features as described herein), causes performance of operations comprising:
accessing a plurality of audio content segments comprised in a first set of audio content, each of the segments comprising intelligible speech (fig. 4; [0016]: the identification of segments in an audio file, such as an audio stream or recording; [0017]: speech segments);
pruning the plurality of audio content segments based on one or more pruning criteria to determine a pruned plurality of audio content segments (fig. 4, item 406; [0017]: The identification and use of identifiable speech segments as the input for the diarization can further facilitate filtering out noise and other non-speech segments that can interfere with the diarization process.);
analyzing the pruned plurality of audio content segments to select a number of speakers for labeling the audio content (fig. 4; [0021]: The blind diarization is characterized as such as the identities of the speakers (e.g. agent, customer) are not known and therefore the diarization 110 discriminates between a first speaker (speaker 1) and a second speaker (speaker 2); [0027]); and
based at least on the selected number of speakers for labeling the audio content, analyzing one or more embeddings computed for the audio content to label portions of the audio content with corresponding speaker identifiers ([0028]: At 114 a second diarization, an "agent" diarization, is undertaken to identify which of speaker 1 and speaker 2 is the agent and which speaker is the customer; [0016]: segments; [0067]: frame features; figs. 4 and 6). See also Ziv [0017], [0020], [0022], [0052], and [0076], quoted in paragraph 5 above.
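For illustration only, the claimed flow (prune segments, select a speaker count, label portions via embeddings) might be sketched as follows; the k-means clustering, the silhouette-based speaker-count selection, and all names are assumptions for exposition, not the applicant's or Ziv's disclosed implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def diarize(segments, embed_fn, min_speech_ratio=0.5, max_speakers=6):
    # 1) Prune segments that fail a pruning criterion (here, a hypothetical
    #    per-segment speech ratio carried in each segment dict).
    pruned = [s for s in segments if s["speech_ratio"] >= min_speech_ratio]

    # 2) Select the number of speakers by clustering the segment embeddings
    #    at several candidate counts and keeping the best silhouette score.
    X = np.stack([embed_fn(s) for s in pruned])
    best_k, best_score = 2, -1.0
    for k in range(2, min(max_speakers, len(pruned) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score

    # 3) Label portions of the audio content with speaker identifiers.
    labels = KMeans(n_clusters=best_k, n_init=10).fit_predict(X)
    for seg, lab in zip(pruned, labels):
        seg["speaker"] = f"speaker_{lab}"
    return pruned
```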
Regarding claim 5, Ziv teaches The media of claim 1, wherein pruning the plurality of audio content segments based on the one or more pruning criteria comprises:
determining a ratio of speech to silence in a particular audio content segment of the plurality of audio content segments ([0024]: This speech segmentation can also be filtered to remove non-speech segments based upon a basic energy envelope analysis of the audio file, particularly those segments not identified as homogeneous speaker segments. In a non-limiting example, non-speech segments can be removed for which a particular energy percentile in a segment is below a minimum energy threshold or if a predetermined dynamic energy range percentile is below a minimum dynamic energy range threshold. [0042]: filtering the audio file to remove non-speech segments at 406. This can be identified by long silence intervals); and,
in response to the ratio of speech to silence not meeting one or more conditions, pruning the audio content segment from the plurality of audio content segments ([0024]: This speech segmentation can also be filtered to remove non-speech segments; [0042]: filtering the audio file to remove non-speech segments at 406. This can be identified by long silence intervals; [0043]: segmentation may be performed with voice activity detection (VAD) that seeks to identify segments of the audio data that are likely to contain speech apart from segments that are likely to be non-speech. The signal entropy can be used to identify silent or pause intervals in the audio file, which can also serve to segment the audio file. More detailed exemplary embodiments of this segmentation are disclosed herein.).
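For illustration only, a speech-to-silence pruning criterion of the kind mapped above might be sketched as follows; the energy-threshold frame classifier and the ratio condition are assumptions for exposition:

```python
import numpy as np

def speech_silence_ratio(frame_energies, energy_threshold):
    # Classify frames as speech or silence by a simple energy threshold
    # (cf. Ziv [0024]: basic energy envelope analysis) and return the
    # ratio of speech frames to silence frames.
    energies = np.asarray(frame_energies)
    n_speech = int(np.sum(energies >= energy_threshold))
    n_silence = len(energies) - n_speech
    return n_speech / max(n_silence, 1)

def prune_by_speech_ratio(segments, energy_threshold, min_ratio=1.0):
    # Prune any segment whose speech-to-silence ratio fails the condition.
    return [s for s in segments
            if speech_silence_ratio(s["frame_energies"], energy_threshold) >= min_ratio]
```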
Regarding claim 9, Ziv teaches The media of claim 1, wherein pruning the plurality of audio content segments based on the one or more pruning criteria comprises:
determining that an audio content segment is a member of a minority cluster or has a low neighbor density ([0026]: The clusters are then evaluated against a minimum cluster size requirement. The minimum cluster size requirement may exemplarily be 10% of the total words in the audio file as identified by the transcription, or at least 15 words; however, these are merely exemplary and are not intended to be limiting on the cluster size criteria that may be used. All of the clusters meeting this minimum size requirement are then compared to select the two most distinct clusters in terms of BIC score. In an alternative embodiment, the two largest clusters may be selected; [0045]); and,
in response to the audio content segment being a member of the minority cluster or having the low neighbor density, pruning the audio content segment from the plurality of audio content segments ([0026]; [0045]: In still further embodiments, if there are more than two clusters when two speakers are expected, then two representative clusters may be selected by first removing any cluster that does not contain a predetermined number of words. Once the clusters are filtered with a minimum size criteria, then the two most distinct clusters as evaluated in terms of the BIC score are selected as the first speaker cluster and the second speaker cluster.).
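For illustration only, minority-cluster and neighbor-density pruning of the kind mapped above might be sketched as follows; the 10% size floor echoes Ziv [0026], while the k-nearest-neighbor density measure and all names are assumptions:

```python
import numpy as np

def prune_minority_and_sparse(embeddings, labels,
                              min_cluster_frac=0.10, k=5, min_density=0.0):
    # Return a boolean keep-mask: drop segments that fall in a minority
    # cluster (below a fractional size floor, echoing Ziv's 10% criterion)
    # or that have low k-nearest-neighbor density in embedding space.
    embeddings = np.asarray(embeddings)
    labels = np.asarray(labels)
    n = len(labels)

    # Minority-cluster test.
    sizes = {c: int(np.sum(labels == c)) for c in np.unique(labels)}
    keep = np.array([sizes[c] >= min_cluster_frac * n for c in labels])

    # Neighbor-density test: inverse mean distance to the k nearest neighbors.
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    knn_mean = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    density = 1.0 / (knn_mean + 1e-9)
    return keep & (density >= min_density)
```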
Regarding claim 10, Ziv teaches A method of analyzing embeddings, comprising:
accessing a plurality of audio content segments comprised in a first set of audio content, each of the segments comprising intelligible speech;
pruning the plurality of audio content segments based on one or more pruning criteria to determine a pruned plurality of audio content segments;
analyzing the pruned plurality of audio content segments to select a number of speakers for labeling the audio content; and
based at least on the selected number of speakers for labeling the audio content, analyzing one or more embeddings computed for the audio content to label portions of the audio content with corresponding speaker identifiers; and
wherein the method is performed by at least one device including a hardware processor.
Claim 10 recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.
Claim 14 recites limitations similar to claim 5 and is rejected for similar rationale and reasoning.
Claim 18 recites limitations similar to claim 9 and is rejected for similar rationale and reasoning.
Regarding claim 19, Ziv teaches A system comprising:
at least one device including a hardware processor; the system being configured to perform operations comprising:
accessing a plurality of audio content segments comprised in a first set of audio content, each of the segments comprising intelligible speech;
pruning the plurality of audio content segments based on one or more pruning criteria to determine a pruned plurality of audio content segments;
analyzing the pruned plurality of audio content segments to select a number of speakers for labeling the audio content; and
based at least on the selected number of speakers for labeling the audio content, analyzing one or more embeddings computed for the audio content to label portions of the audio content with corresponding speaker identifiers.
Claim 19 recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.
Claim Rejections - 35 USC § 103
9. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
10. Claims 2-3 and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Ziv et al. (US 2014/0074467) in view of Doornenbal et al. (US 2018/0365323).
Regarding claim 2, Ziv teaches The media of claim 1, wherein pruning the plurality of audio content segments based on the one or more pruning criteria to determine the pruned plurality of audio content segments comprises:
pruning the subset of the plurality of audio content segments from the plurality of audio content segments to determine the pruned plurality of audio content segments ([0017]: filtering out noise and other non-speech segments that can interfere with the diarization process),
but does not specifically teach determining the subset based on a token count. Doornenbal (US 2018/0365323) teaches:
determining a subset of the plurality of audio content segments having a corresponding number of tokens that does not meet a minimum token threshold, wherein a token of an audio content segment of the plurality of audio content segments represents a contiguous portion of the audio content segment ([0048]: In some embodiments, it may be preferable that the snippet not be too short or too long. To address this consideration predetermined values for a lower threshold (i.e., a too short length) and an upper threshold (i.e., a too long length) may be determined. The section token count value may then be compared to these predetermined values and if the section token values is less than the lower threshold or greater than the upper threshold, the snippet corresponding to that section token count may be discarded.); and
pruning the subset of the plurality of audio content segments from the plurality of audio content segments to determine the pruned plurality of audio content segments ([0048]: the snippet corresponding to that section token count may be discarded).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Doornenbal in order to further discard irrelevant segments for improved diarization.
Ziv already teaches ([0017]) that the identification and use of identifiable speech segments as the input for the diarization can further facilitate filtering out noise and other non-speech segments that can interfere with the diarization process. One of ordinary skill could look to Doornenbal to incorporate additional criteria to prune/filter the segments for improved diarization, so as to more accurately identify speech segments in an audio file, such as an audio stream or recording, resulting in more accurate diarization and/or speech adaptation (Ziv [0016]), with a reasonable expectation of success.
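For illustration only, the token-count pruning that Doornenbal describes at [0048] might be sketched as follows; the whitespace tokenizer and the threshold values are assumptions:

```python
def prune_by_token_count(segments, tokenize, min_tokens=5, max_tokens=200):
    # Discard segments whose token count falls outside [min_tokens,
    # max_tokens], analogous to Doornenbal's lower/upper snippet thresholds.
    return [s for s in segments
            if min_tokens <= len(tokenize(s["transcript"])) <= max_tokens]

# Usage, with a trivial whitespace tokenizer as the token-assignment step:
segments = [{"transcript": "hello"},
            {"transcript": "thank you for calling support today"}]
print(prune_by_token_count(segments, tokenize=str.split, min_tokens=3))
# -> keeps only the second segment
```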
Regarding claim 3, Doornenbal teaches The media of claim 2, wherein the operations further comprise:
determining the number of tokens in the audio content segment at least by:
partitioning the audio content segment into a plurality of portions ([0032]: text corpus may be annotated with tokens); and
assigning each portion of the plurality of portions to a corresponding token ([0032]: text corpus may be annotated with tokens).
Claim 3 is rejected for similar rationale and reasoning as claim 2.
Claim 11 recites limitations similar to claim 2 and is rejected for similar rationale and reasoning.
Claim 12 recites limitations similar to claim 3 and is rejected for similar rationale and reasoning.
11. Claims 4 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Ziv et al. (US 2014/0074467) in view of Wang et al. (US 2003/0023430).
Regarding claim 4, Ziv teaches The media of claim 1, wherein the operations further comprise:
accessing a second plurality of audio content segments comprised in a second set of audio content (Ziv, fig. 4; [0016]: the identification of segments in an audio file, such as an audio stream or recording; [0017]: speech segments);
pruning the second plurality of audio content segments, to remove any audio content segments that do not meet a specified condition, to generate a second pruned plurality of audio content segments (Ziv, fig. 4, item 406; [0017]: The identification and use of identifiable speech segments as the input for the diarization can further facilitate filtering out noise and other non-speech segments that can interfere with the diarization process.);
pruning the second plurality of audio content segments, to remove any audio content segments that do not meet the updated condition, to generate a third pruned plurality of audio content segments ([0017]);
analyzing the third pruned plurality of audio content segments to select a second number of speakers for labeling the second set of audio content (Ziv, fig. 4; [0021]: The blind diarization is characterized as such as the identities of the speakers (e.g. agent, customer) are not known and therefore the diarization 110 discriminates between a first speaker (speaker 1) and a second speaker (speaker 2); [0027]); and
based at least on the second number of speakers, analyzing one or more embeddings computed for the second set of audio content to label portions of the second set of audio content with corresponding speaker identifiers (Ziv, [0028]: At 114 a second diarization, an "agent" diarization, is undertaken to identify which of speaker 1 and speaker 2 is the agent and which speaker is the customer; [0016]: segments; [0067]: frame features),
but does not specifically teach the following limitations, which Wang et al. (US 2003/0023430) teaches:
determining that a number of the second pruned plurality of audio content segments is below a threshold number of segments (claim 12: components…is not more than the predetermined number);
in response to determining that the number of the second pruned plurality of audio content segments is below a threshold number of segments:
adjusting the specified condition for the pruning criteria to generate an updated condition (claim 12: decreasing the threshold);
{pruning the second plurality of audio content segments, to remove any audio content segments that do not meet the updated condition, to generate a third pruned plurality of audio content segments (taught by Ziv at [0017], as mapped above);}
determining that a number of the third pruned plurality of audio content segments is not below a threshold number of segments (claim 12: components…more than a predetermined number; not more than the predetermined number).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Wang in order to further discard irrelevant segments, while adjusting the condition to ensure enough segments remain, for improved diarization.
Ziv already teaches ([0017]) that the identification and use of identifiable speech segments as the input for the diarization can further facilitate filtering out noise and other non-speech segments that can interfere with the diarization process. One of ordinary skill could look to Wang to monitor the number of remaining segments and adjust the pruning condition accordingly, ensuring that enough segments remain for improved diarization, so as to more accurately identify speech segments in an audio file, such as an audio stream or recording, resulting in more accurate diarization and/or speech adaptation (Ziv [0016]), with a reasonable expectation of success.
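For illustration only, the threshold-adjusting loop suggested by Wang (claim 12) might be sketched as follows; the scoring function, step size, and floor are assumptions:

```python
def prune_with_adaptive_condition(segments, score_fn, threshold,
                                  min_segments, step=0.05, floor=0.0):
    # Prune segments scoring below the threshold; if the pruned set is
    # smaller than min_segments, decrease the threshold (cf. Wang,
    # claim 12) and re-prune, until enough segments remain or the
    # threshold reaches a floor.
    while True:
        pruned = [s for s in segments if score_fn(s) >= threshold]
        if len(pruned) >= min_segments or threshold <= floor:
            return pruned, threshold
        threshold = max(threshold - step, floor)
```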
Claim 13 recites limitations similar to claim 4 and is rejected for similar rationale and reasoning.
12. Claims 6 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Ziv et al. (US 2014/0074467) in view of Ahlenius (US 2004/0122666).
Regarding claim 6, Ziv does not specifically teach, but Ahlenius (US 2004/0122666) teaches, The media of claim 1, wherein pruning the plurality of audio content segments based on the one or more pruning criteria comprises:
determining a confidence score from an ASR module for a particular audio content segment of the plurality of audio content segments ([0022]: terms having a recognition value); and,
in response to the confidence score not meeting one or more conditions, pruning the audio content segment from the plurality of audio content segments ([0022]: filter may also discard all terms having a recognition value below a specific confidence value).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Ahlenius in order to further remove irrelevant segments for improved diarization.
Ziv already teaches ([0017]) that the identification and use of identifiable speech segments as the input for the diarization can further facilitate filtering out noise and other non-speech segments that can interfere with the diarization process. One of ordinary skill could look to Ahlenius to further prune/filter the segments for improved diarization, with a reasonable expectation of success.
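For illustration only, an ASR-confidence pruning criterion of the kind Ahlenius describes at [0022] might be sketched as follows; the per-token averaging and the threshold value are assumptions:

```python
def segment_confidence(asr_tokens):
    # Average the per-token recognition values reported by the ASR module.
    return sum(t["confidence"] for t in asr_tokens) / max(len(asr_tokens), 1)

def prune_by_asr_confidence(segments, min_confidence=0.6):
    # Discard segments whose average ASR confidence falls below a specific
    # confidence value, analogous to Ahlenius's recognition-value filter.
    return [s for s in segments
            if segment_confidence(s["asr_tokens"]) >= min_confidence]
```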
Claim 15 recites limitations similar to claim 6 and is rejected for similar rationale and reasoning.
13. Claims 7 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Ziv et al. (US 2014/0074467) in view of Nishiyama et al. (US 2021/0174790).
Regarding claim 7, Ziv does not specifically teach, but Nishiyama et al. (US 2021/0174790) teaches, The media of claim 1, wherein pruning the plurality of audio content segments based on the one or more pruning criteria comprises:
determining a ratio of overlapped speech to total speech in a particular audio content segment of the plurality of audio content segments ([0037]: overlapping ratio); and,
in response to the ratio of overlapped speech to total speech not meeting one or more conditions, pruning the audio content segment from the plurality of audio content segments ([0037]: extract the conversation voice data in which the overlapping ratio … is not more than a predetermined threshold).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Nishiyama in order to further remove irrelevant segments for improved diarization.
Ziv already teaches ([0017]) that the identification and use of identifiable speech segments as the input for the diarization can further facilitate filtering out noise and other non-speech segments that can interfere with the diarization process, and ([0042]) that, in an embodiment, the audio file is split into a plurality of short overlapping frames, which are segmented into a plurality of homogeneous speech segments by first filtering the audio file to remove non-speech segments at 406. One of ordinary skill could look to Nishiyama to further prune/filter the segments for improved diarization, with a reasonable expectation of success.
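For illustration only, the overlap-ratio pruning that Nishiyama describes at [0037] might be sketched as follows; the per-frame active-speaker counts and the threshold value are assumptions:

```python
def overlap_ratio(frame_speaker_counts):
    # Fraction of speech-bearing frames in which more than one speaker
    # is active (frames with zero active speakers are ignored).
    speech = [c for c in frame_speaker_counts if c >= 1]
    if not speech:
        return 0.0
    return sum(1 for c in speech if c >= 2) / len(speech)

def prune_by_overlap(segments, max_overlap=0.2):
    # Keep segments whose overlapped-speech ratio is not more than a
    # predetermined threshold, cf. Nishiyama [0037].
    return [s for s in segments
            if overlap_ratio(s["frame_speaker_counts"]) <= max_overlap]
```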
Claim 16 recites limitations similar to claim 7 and is rejected for similar rationale and reasoning.
14. Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Ziv et al. (US 2014/0074467) in view of Sundaram et al. (US 2025/0077769).
Regarding claim 8, Ziv does not specifically teach, but Sundaram (US 2025/0077769) teaches, The media of claim 1, wherein pruning the plurality of audio content segments based on the one or more pruning criteria comprises:
determining a language model lexical correlation value corresponding to a set of tokens associated with a particular audio content segment of the plurality of audio content segments ([0013]; [0015]; [0018]; [0038]: tokens … linguistic similarities); and,
in response to the language model lexical correlation value not meeting one or more conditions, pruning the audio content segment from the plurality of audio content segments ([0018]: remove tokens from the initial multilingual vocabulary based on the grouping of languages with linguistic similarities may include to determine one or more languages relevant to a locale handled by the contact center system, and remove tokens from the initial multilingual vocabulary associated with languages not within a group of languages relevant to the locale handled by the contact center system.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Sundaram in order to further remove irrelevant segments for improved diarization.
Ziv already teaches ([0017]) that the identification and use of identifiable speech segments as the input for the diarization can further facilitate filtering out noise and other non-speech segments that can interfere with the diarization process, and ([0020]) that linguistic or speech pattern rules or models used to identify the homogeneous speech segments can be provided in a file 108 to the transcription server 104. One of ordinary skill could look to Sundaram to further prune/filter the segments for improved diarization, with a reasonable expectation of success.
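For illustration only, a lexical-correlation pruning criterion loosely analogous to Sundaram's locale-vocabulary filtering might be sketched as follows; the vocabulary-overlap score is a stand-in assumption, not Sundaram's language-model method:

```python
def lexical_correlation(tokens, locale_vocabulary):
    # Toy score: the fraction of a segment's tokens found in a vocabulary
    # of languages relevant to the locale (a stand-in for a language-model
    # lexical correlation value).
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in locale_vocabulary) / len(tokens)

def prune_by_lexical_correlation(segments, locale_vocabulary, min_score=0.5):
    # Prune segments whose tokens correlate poorly with the locale vocabulary.
    vocab = set(locale_vocabulary)
    return [s for s in segments
            if lexical_correlation(s["tokens"], vocab) >= min_score]
```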
Claim 17 recites limitations similar to claim 8 and is rejected for similar rationale and reasoning.
Conclusion
15. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: See PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541. The examiner can normally be reached Monday-Friday 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached on 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SHAUN ROBERTS/Primary Examiner, Art Unit 2655