DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 17 October 2024 and 24 February 2026 have been considered by the examiner.
Claim Objections
Claims 2, 8, 10, 13, and 19 are objected to because of the following informalities:
Regarding claim 2, and mutatis mutandis claim 13, the phrase “that that” at line 1 should read as “that”.
Regarding claim 8, the phrase “the particular human is a particular human” at line 2 should read as “the particular human is a first particular human”.
Regarding claim 10, and mutatis mutandis claim 19, the phrase “a particular human” at line 14 should read as “[[a]] the particular human”.
Appropriate correction is required.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-6, 11-17, and 20 are rejected under 35 U.S.C. 102(a)(1) and 35 U.S.C. 102(a)(2) as being anticipated by Chen (U.S. Pat. App. Pub. No. 2021/0233541, hereinafter Chen).
Regarding claim 1, Chen discloses A computing system (Systems and methods for “receiving and analyzing” audio signals, such as from “telephone calls” as implemented through a computing device, such as an analytics server 102.; Chen, ¶ [0023], [0031]) comprising: a storage device configured to store a front-end neural network and a back-end model (“The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module which may reside on a computer-readable or processor-readable storage medium,” which is configured to store the front end (e.g., input layers 601 and embedding extractors 606) and the classification model (e.g., FC layer 608 and classifier 612 for receiving LDA transformed embeddings).; Chen, ¶ [0092]-[0093], [0107]); and processing circuitry having access to the storage device and (“The analytics server 102 of the call analytics system 101 may be any computing device comprising one or more processors and software, and capable of performing the various processes and tasks described herein.”; Chen, ¶ [0035]) configured to: receive a test audio data sample (“during the deployment operational phase, the server receives the inbound audio signal for the speaker”; Chen, ¶ [0059]); process, by executing the front-end neural network, the test audio data sample to extract one or more embeddings from the front-end neural network (The server “applies the neural network” to the inbound audio {process, by executing the front-end neural network, the test audio data sample…} “to extract the inbound embeddings, including, for example, an inbound spoofprint and an inbound voiceprint” {…one or more embeddings from the front-end neural network}.; Chen, ¶ [0059]); process, by executing the back-end model, the one or more embeddings to determine a likelihood (“a classifier 612 for performing any number of scoring and classification operations based upon the embeddings”; Chen, ¶ [0090]) that indicates whether the test audio data sample represents speech by a particular human (“The classifier 612 uses the spoof embeddings to determine whether the given input layers 601 is ‘genuine’ or ‘spoofed’.”; Chen, ¶ [0092]); and output an indication as to whether the test audio data sample represents genuine speech by the particular human (“By executing the classifier 612, the server classifies an inbound audio signal as genuine or spoofed based on the neural network architecture’s 600 output(s)” where “the server may authenticate the inbound audio signal according to the results of the classifier’s 612 determination” and where “Following classification of an inbound audio signal (e.g., genuine or spoofed), the server the[n] employs or transmits the outputted determination to one or more downstream operations.”; Chen, ¶ [0029], [0093]).
Regarding claim 2, the rejection of claim 1 is incorporated. Chen discloses all of the elements of the current invention as stated above. Chen further discloses wherein the likelihood that that indicates whether the test audio data sample represents speech by the particular human comprises a likelihood that the speech corresponding to the test audio data sample is genuine speech (“if the similarity score for the voiceprints or the combined embeddings satisfies a corresponding predetermined threshold, then the analytics server 102 determines that the caller and the enrollee are likely the same person or that the inbound call is genuine or spoofed (e.g., synthetic speech).”; Chen, ¶ [0045]).
Regarding claim 3, the rejection of claim 1 is incorporated. Chen discloses all of the elements of the current invention as stated above. Chen further discloses wherein the output further indicates a likelihood that the test audio data represents synthetic audio data that is generated to imitate speech performed by the particular human (“if the similarity score for the voiceprints or the combined embeddings satisfies a corresponding predetermined threshold, then the analytics server 102 determines that the caller and the enrollee are likely the same person or that the inbound call is genuine or spoofed (e.g., synthetic speech),” where a determination regarding an enrollee that an inbound call is synthetic speech, is a determination that the inbound call is synthetic enrollee speech {generated to imitate speech performed by the particular human}; Chen, ¶ [0045]).
Regarding claim 4, the rejection of claim 1 is incorporated. Chen discloses all of the elements of the current invention as stated above. Chen further discloses wherein the storage device is further configured to store general training data comprising a set of genuine audio data samples and a set of synthetic audio data samples (the system “obtains the training audio signals, including clean audio signals and noise samples” where “clean audio signals may include speech originating from any number speakers, where the quality allows the server identify the speech—i.e., the clean audio signal contains little or no degradation” and “the server performs one or more data augmentation operations using the clean training audio samples and/or to generate simulated audio samples...thereby generating a larger set of training audio signals” and which “may be stored in non-transitory storage media accessible to the server or received via a network or other data source.”; Chen, ¶ [0068]-[0070]), and wherein the processing circuitry is configured to: train the front-end neural network based on the set of genuine audio data samples and the set of synthetic audio data samples (“the server uses the training audio signals to train one or more neural network architectures” where the “server feeds each training audio signal to the neural network architecture, which the neural network architecture uses to generate the predicted output by applying the current state of the neural network architecture to the training audio signal.”; Chen, ¶ [0070]).
Regarding claim 5, the rejection of claim 1 is incorporated. Chen discloses all of the elements of the current invention as stated above. Chen further discloses wherein the storage device is further configured to store general training data comprising a set of genuine audio data samples and a set of synthetic audio data samples (the system “obtains the training audio signals, including clean audio signals and noise samples” where “clean audio signals may include speech originating from any number speakers, where the quality allows the server identify the speech—i.e., the clean audio signal contains little or no degradation” and “the server performs one or more data augmentation operations using the clean training audio samples and/or to generate simulated audio samples...thereby generating a larger set of training audio signals” and which “may be stored in non-transitory storage media accessible to the server or received via a network or other data source.”; Chen, ¶ [0068]-[0070]), and wherein the processing circuitry is further configured to: identify, based on the general training data, a set of patterns corresponding to the set of genuine audio data samples (Discloses “the determined genuine and spoof classifications, as indicated by supervised labels or previously generated clusters” where previously generated clusters is an identification of a set of patterns corresponding to the “determined genuine…classifications”; Chen, ¶ [0092]); identify, based on the general training data, a set of patterns corresponding to the set of synthetic audio data samples (As above, “the determined genuine and spoof classifications, as indicated by supervised labels or previously generated clusters” where previously generated clusters is an identification of a set of patterns corresponding to the “determined... 
spoof classifications”; Chen, ¶ [0092]); and train the front-end neural network by configuring the front-end neural network with the set of patterns corresponding to the set of genuine audio data samples and the set of patterns corresponding to the set of synthetic audio data samples (“the result of training the neural network architecture is to minimize the amount of error between a predicted output (e.g., neural network architecture outputted of genuine or spoofed; extracted features; extracted feature vector) and an expected output (e.g., label associated with the training audio signal indicating whether the particular training signal is genuine or spoofed; label indicating expected features or feature vector of the particular training signal).”; Chen, ¶ [0070], [0092]).
Regarding claim 6, the rejection of claim 5 is incorporated. Chen discloses all of the elements of the current invention as stated above. Chen further discloses wherein by executing the front-end neural network, the processing circuitry is configured to: process the test audio data sample to extract the set of embeddings based on one or more patterns present in the test audio data sample (“The server feeds each training audio signal to the neural network architecture, which the neural network architecture uses to generate the predicted output by applying the current state of the neural network architecture to the training audio signal.”; Chen, ¶ [0070]), the set of patterns corresponding to the set of genuine audio data samples (Discloses “the determined genuine and spoof classifications, as indicated by supervised labels or previously generated clusters” which is a set of patterns corresponding to the “determined genuine…classifications”; Chen, ¶ [0092]), and the set of patterns corresponding to the set of synthetic audio data samples (As above, “the determined genuine and spoof classifications, as indicated by supervised labels or previously generated clusters” which is a set of patterns corresponding to the “determined... spoof classifications”; Chen, ¶ [0092]).
Regarding claim 11, the rejection of claim 1 is incorporated. Chen discloses all of the elements of the current invention as stated above. Chen further discloses wherein the front-end neural network comprises a deep neural network (DNN) (Discloses a “ResNet neural network architecture” which is a deep neural network.; Chen, ¶ [0028]).
Regarding claim 12, Chen discloses A method comprising (Systems and methods for “receiving and analyzing” audio signals, such as from “telephone calls” as implemented through a computing device, such as an analytics server 102, having a computer readable or processor readable storage medium; Chen, ¶ [0023], [0031], [0107]): receiving, by processing circuitry having access to a storage device (“The analytics server 102 of the call analytics system 101 may be any computing device comprising one or more processors and software, and capable of performing the various processes and tasks described herein,” as stored on computer readable media.; Chen, ¶ [0035], [0107]), a test audio data sample (“during the deployment operational phase, the server receives the inbound audio signal for the speaker”; Chen, ¶ [0059]), wherein the storage device is configured to store a front-end neural network and a back-end model (“The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module which may reside on a computer-readable or processor-readable storage medium,” which is configured to store the front end (e.g., input layers 601 and embedding extractors 606) and the classification model (e.g., FC layer 608 and classifier 612 for receiving LDA transformed embeddings).; Chen, ¶ [0092]-[0093], [0107]); processing, by executing the front-end neural network by the processing circuitry, the test audio data sample to extract one or more embeddings from the front-end neural network (The server “applies the neural network” to the inbound audio {process, by executing the front-end neural network, the test audio data sample…} “to extract the inbound embeddings, including, for example, an inbound spoofprint and an inbound voiceprint” {…one or more embeddings from the front-end neural network}.; Chen, ¶ [0059]); processing, by executing the back-end model by the processing circuitry (“a classifier 612 for performing any number of scoring and classification operations based upon the embeddings”; Chen, ¶ [0090]), the one or more embeddings to determine a likelihood that indicates whether the test audio data sample represents speech by a particular human (“The classifier 612 uses the spoof embeddings to determine whether the given input layers 601 is ‘genuine’ or ‘spoofed’.”; Chen, ¶ [0092]); and outputting, by the processing circuitry, an indication as to whether the test audio data sample represents genuine speech by the particular human (“By executing the classifier 612, the server classifies an inbound audio signal as genuine or spoofed based on the neural network architecture’s 600 output(s)” where “the server may authenticate the inbound audio signal according to the results of the classifier’s 612 determination” and where “Following classification of an inbound audio signal (e.g., genuine or spoofed), the server the[n] employs or transmits the outputted determination to one or more downstream operations.”; Chen, ¶ [0029], [0093]).
Regarding claim 13, the rejection of claim 12 is incorporated. Claim 13 is substantially the same as claim 2 and is therefore rejected under the same rationale as above.
Regarding claim 14, the rejection of claim 12 is incorporated. Claim 14 is substantially the same as claim 3 and is therefore rejected under the same rationale as above.
Regarding claim 15, the rejection of claim 12 is incorporated. Claim 15 is substantially the same as claim 4 and is therefore rejected under the same rationale as above.
Regarding claim 16, the rejection of claim 12 is incorporated. Claim 16 is substantially the same as claim 5 and is therefore rejected under the same rationale as above.
Regarding claim 17, the rejection of claim 16 is incorporated. Claim 17 is substantially the same as claim 6 and is therefore rejected under the same rationale as above.
Regarding claim 20, Chen discloses A computer-readable medium comprising instructions that, when executed by a processor, cause the processor to (Systems and methods for “receiving and analyzing” audio signals, such as from “telephone calls” as implemented through a computing device, such as an analytics server 102, having a computer readable or processor readable storage medium; Chen, ¶ [0023], [0031], [0107]): receive a test audio data sample (“during the deployment operational phase, the server receives the inbound audio signal for the speaker”; Chen, ¶ [0059]), wherein the processor is in communication with a storage device configured to store a front-end neural network and a back-end model (“The analytics server 102 of the call analytics system 101 may be any computing device comprising one or more processors and software, and capable of performing the various processes and tasks described herein” and “The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module which may reside on a computer-readable or processor-readable storage medium,” which is configured to store the front end (e.g., input layers 601 and embedding extractors 606) and the classification model (e.g., FC layer 608 and classifier 612 for receiving LDA transformed embeddings).; Chen, ¶ [0035], [0092]-[0093], [0107]); process, by executing the front-end neural network, the test audio data sample to extract one or more embeddings from the front-end neural network (The server “applies the neural network” to the inbound audio {process, by executing the front-end neural network, the test audio data sample…} “to extract the inbound embeddings, including, for example, an inbound spoofprint and an inbound voiceprint” {…one or more embeddings from the front-end neural network}.; Chen, ¶ [0059]); process, by executing the back-end model, the one or more embeddings to determine a likelihood (“a classifier 612 for performing any number of scoring and classification operations based upon the embeddings”; Chen, ¶ [0090]) that indicates whether the test audio data sample represents speech by a particular human (“The classifier 612 uses the spoof embeddings to determine whether the given input layers 601 is ‘genuine’ or ‘spoofed’.”; Chen, ¶ [0092]); and output an indication as to whether the test audio data sample represents genuine speech by the particular human (“By executing the classifier 612, the server classifies an inbound audio signal as genuine or spoofed based on the neural network architecture’s 600 output(s)” where “the server may authenticate the inbound audio signal according to the results of the classifier’s 612 determination” and where “Following classification of an inbound audio signal (e.g., genuine or spoofed), the server the[n] employs or transmits the outputted determination to one or more downstream operations.”; Chen, ¶ [0029], [0093]).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 7-9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Chen as applied to claims 1 and 12 above, and further in view of Lopez Espejo (U.S. Pat. App. Pub. No. 2021/0125619, hereinafter Lopez Espejo).
Regarding claim 7, the rejection of claim 1 is incorporated. Chen discloses all of the elements of the current invention as stated above. Chen further discloses wherein the storage device is further configured to store individual speaker data comprising a set of genuine audio data samples corresponding to the particular human (“The server can further train the speaker recognition engine to recognize a particular speaker... using enrollee audio signals having speech segments involving the enrollee. {a set of genuine audio data samples corresponding to the particular human}”; Chen, ¶ [0024]). However, Chen fails to expressly recite wherein the processing circuitry is further configured to: adapt the back-end model based on the set of genuine audio data samples corresponding to the particular human to enroll the particular human.
Lopez Espejo teaches systems and methods for “authenticating a user or speaker.” (Lopez Espejo, ¶ [0002]). Regarding claim 7, Lopez Espejo teaches wherein the processing circuitry is further configured to: adapt the back-end model based on the set of genuine audio data samples corresponding to the particular human to enroll the particular human (“reference voiceprint may be updated taking into account the speech voiceprint of the (successfully) authenticated user,” where “[t]he score x may be determined by e.g., a classifier trained to determine the score x of the speech voiceprint vt (s′), representing the input speech signal of speaker s′ at time t, against the reference voiceprint et−1 (s) of speaker s at past time t−1.” As such, by both generating the reference voiceprint and updating the reference voiceprint, the back-end model, specifically the classifier, is adapted based on the “speech voiceprint of the (successfully) authenticated user,” which corresponds to a set of genuine audio data samples corresponding to the particular human, and where the reference voiceprint is originally generated as part of the enrollment process.; Lopez Espejo, ¶ [0027]-[0028], [0049]-[0050]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the spoof detection and speaker recognition systems of Chen to incorporate the teachings of Lopez Espejo to include wherein the processing circuitry is further configured to: adapt the back-end model based on the set of genuine audio data samples corresponding to the particular human to enroll the particular human. Chen teaches generating enrollee voiceprints and spoofprints during an enrollment phase to act as a baseline for future deployment comparisons. However, a well-known problem in the field of biometric authentication is that systems relying on statistically eroded acoustic embeddings suffer from performance degradation over time, due to the natural drift in a genuine speaker’s voice as well as changes in recording conditions. Lopez Espejo specifically recognizes and solves this problem, where Lopez Espejo discloses a recursive updating scheme for the user’s reference voiceprint using newly acquired genuine speech from successful authentications. By recursively updating the reference voiceprint as fed into the classifier to generate the raw score, the overall calibration pipeline (sigma(x)) is inherently adapted and kept mathematically calibrated to the user’s acoustic reality, providing the established benefit of more reliable recognitions for user authentication and spoof avoidance, as recognized by Lopez Espejo. (Lopez Espejo, ¶ [0011]-[0012], [0049], [0051]).
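For illustration of the recursive updating scheme discussed above, the sketch below folds each newly authenticated speech voiceprint into the stored reference voiceprint. The exponential-smoothing update rule, the embedding dimension, and the weight alpha are illustrative assumptions only; Lopez Espejo describes recursive updating generally rather than this exact formula.

```python
import numpy as np

def update_reference_voiceprint(e_prev, v_t, alpha=0.1):
    """Fold a newly authenticated speech voiceprint v_t into the stored
    reference voiceprint e_{t-1}, yielding e_t.

    The exponential-smoothing weight alpha is a hypothetical choice, not a
    value taken from the reference.
    """
    return (1.0 - alpha) * e_prev + alpha * v_t

# Hypothetical 8-dimensional voiceprints.
rng = np.random.default_rng(1)
e = rng.normal(size=8)                    # enrollment-time reference voiceprint
for _ in range(5):                        # five successful authentications
    v = e + 0.05 * rng.normal(size=8)     # genuine speech drifts slightly
    e = update_reference_voiceprint(e, v)
```

Under this sketch, the reference voiceprint tracks gradual drift in the genuine speaker's voice while remaining anchored to the enrollment baseline.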
Regarding claim 8, the rejection of claim 7 is incorporated. Chen discloses all of the elements of the current invention as stated above. Chen further discloses wherein the set of genuine audio data samples is a first set of genuine audio data samples, (“The server can further train the speaker recognition engine to recognize a particular speaker... using enrollee audio signals having speech segments involving the enrollee. {a set of genuine audio data samples}” where the enrollee audio signals may be specific to a first enrollee; Chen, ¶ [0024]) wherein the particular human is a particular human, (The particular human necessarily remains the same particular human.; Chen, ¶ [0024]) wherein the individual speaker data comprises a second set of genuine audio data samples corresponding to a second particular human (Discloses “identify[ing] and extract[ing] embeddings representing the low-level features” of a plurality of “speakers” and a “particular enrollee”, thus it is understood that the systems as applied to a single enrollee can be applied to any number of enrollees. As such, the enrollee audio signals as described with reference to the first enrollee are also generated for the second or any subsequent enrollee.; Chen, ¶ [0024]). However, Chen fails to expressly recite wherein the processing circuitry is further configured to: adapt the back-end model based on the second set of genuine audio data samples corresponding to the second particular human to enroll the second particular human.
The relevance of Lopez Espejo is described above with relation to claim 7. Regarding claim 8, Lopez Espejo teaches wherein the processing circuitry is further configured to: adapt the back-end model based on the second set of genuine audio data samples corresponding to the second particular human to enroll the second particular human (Teaches a system for recursively authenticating a user that adapts its scoring and calibration pipeline based on the user’s genuine audio data, where the “reference voiceprint may be updated taking into account the speech voiceprint of the (successfully) authenticated user,” where “[t]he score x may be determined by e.g., a classifier trained to determine the score x of the speech voiceprint vt (s′), representing the input speech signal of speaker s′ at time t, against the reference voiceprint et−1 (s) of speaker s at past time t−1,” and where “score calibration σ(x) of the authentication may be also performed” on that score. As such, by both generating the reference voiceprint and updating the reference voiceprint, the back-end model, specifically the calibration model, is adapted based on the “speech voiceprint of the (successfully) authenticated user,” which corresponds to a set of genuine audio data samples corresponding to the particular human, and where the reference voiceprint is both originally generated as part of the enrollment process and iteratively updated based on speech performed by a particular human.; Lopez Espejo, ¶ [0027]-[0028], [0049]-[0051]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the spoof detection and speaker recognition systems of Chen to incorporate the teachings of Lopez Espejo to include wherein the processing circuitry is further configured to: adapt the back-end model based on the second set of genuine audio data samples corresponding to the second particular human to enroll the second particular human. Chen teaches generating enrollee voiceprints and spoofprints during an enrollment phase to act as a baseline for future deployment comparisons. However, a well-known problem in the field of biometric authentication is that systems relying on statistically eroded acoustic embeddings suffer from performance degradation over time, due to the natural drift in a genuine speaker’s voice as well as changes in recording conditions. Lopez Espejo specifically recognizes and solves this problem, where Lopez Espejo discloses a recursive updating scheme for the user’s reference voiceprint using newly acquired genuine speech from successful authentications. By recursively updating the reference voiceprint as fed into the classifier to generate the raw score, the overall calibration pipeline (sigma(x)) is inherently adapted and kept mathematically calibrated to the user’s acoustic reality, providing the established benefit of more reliable recognitions for user authentication and spoof avoidance, as recognized by Lopez Espejo. (Lopez Espejo, ¶ [0011]-[0012], [0049], [0051]).
Regarding claim 9, the rejection of claim 7 is incorporated. Chen discloses all of the elements of the current invention as stated above. However, Chen fails to expressly recite wherein to adapt the back-end model based on the set of genuine audio data samples corresponding to the particular human, the processing circuitry is configured to: adapt a calibration model of the back-end model based on the set of genuine audio data samples corresponding to the particular human such that the calibration model is configured to calibrate an output of the back-end model to indicate the likelihood that the test audio data sample represents speech performed by a particular human.
The relevance of Lopez Espejo is described above with relation to claim 7. Regarding claim 9, Lopez Espejo teaches wherein to adapt the back-end model based on the set of genuine audio data samples corresponding to the particular human, the processing circuitry is configured to: adapt a calibration model of the back-end model based on the set of genuine audio data samples corresponding to the particular human such that the calibration model is configured to calibrate an output of the back-end model to indicate the likelihood that the test audio data sample represents speech performed by a particular human (Teaches a system for recursively authenticating a user that adapts its scoring and calibration pipeline based on the user’s genuine audio data, where the “reference voiceprint may be updated taking into account the speech voiceprint of the (successfully) authenticated user,” where “[t]he score x may be determined by e.g., a classifier trained to determine the score x of the speech voiceprint vt (s′), representing the input speech signal of speaker s′ at time t, against the reference voiceprint et−1 (s) of speaker s at past time t−1,” and where “score calibration σ(x) of the authentication may be also performed” on that score. As such, by both generating the reference voiceprint and updating the reference voiceprint, the back-end model, specifically the calibration model, is adapted based on the “speech voiceprint of the (successfully) authenticated user,” which corresponds to a set of genuine audio data samples corresponding to the particular human, and where the reference voiceprint is both originally generated as part of the enrollment process and iteratively updated based on speech performed by a particular human.; Lopez Espejo, ¶ [0027]-[0028], [0049]-[0051]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the spoof detection and speaker recognition systems of Chen to incorporate the teachings of Lopez Espejo to include wherein to adapt the back-end model based on the set of genuine audio data samples corresponding to the particular human, the processing circuitry is configured to: adapt a calibration model of the back-end model based on the set of genuine audio data samples corresponding to the particular human such that the calibration model is configured to calibrate an output of the back-end model to indicate the likelihood that the test audio data sample represents speech performed by a particular human. Chen teaches generating enrollee voiceprints and spoofprints during an enrollment phase to act as a baseline for future deployment comparisons. However, a well-known problem in the field of biometric authentication is that systems relying on statistically eroded acoustic embeddings suffer from performance degradation over time, due to the natural drift in a genuine speaker’s voice as well as changes in recording conditions. Lopez Espejo specifically recognizes and solves this problem, where Lopez Espejo discloses a recursive updating scheme for the user’s reference voiceprint using newly acquired genuine speech from successful authentications. By recursively updating the reference voiceprint as fed into the classifier to generate the raw score, the overall calibration pipeline (sigma(x)) is inherently adapted and kept mathematically calibrated to the user’s acoustic reality, providing the established benefit of more reliable recognitions for user authentication and spoof avoidance, as recognized by Lopez Espejo. (Lopez Espejo, ¶ [0011]-[0012], [0049], [0051]).
Regarding claim 18, the rejection of claim 12 is incorporated. Claim 18 is substantially the same as claim 7 and is therefore rejected under the same rationale as above.
Claims 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Chen as applied to claims 1 and 12 above, and further in view of Non-Patent Literature to Ferrer (Ferrer, L., McLaren, M., and Brummer, N., 2021. A Speaker Verification Backend with Robust Performance across Conditions. arXiv preprint arXiv:2102.01760, hereinafter Ferrer).
Regarding claim 10, the rejection of claim 1 is incorporated. Chen discloses all of the elements of the current invention as stated above. However, Chen fails to expressly recite wherein the back-end model comprises a linear discriminant analysis (LDA) model, a probabilistic LDA (PLDA) model, and a calibration model, wherein to process the one or more embeddings to determine the likelihood, the processing circuitry is configured to: transform, by executing the LDA model, the one or more embeddings based on an LDA formula; normalize the transformed one or more embeddings based on a mean of the transformed one or more embeddings and a variance of the transformed one or more embeddings; determine, by executing the PLDA model according to a PLDA formula, two or more scores based on the transformed and normalized one or more embeddings; calibrate, by executing the calibration model, the two or more scores; and calculate, based on the calibrated two or more scores, the likelihood that the test audio data sample represents speech performed by a particular human.
Ferrer teaches an improvement to the standard speaker verification backend. (Ferrer, ¶ Abstract). Regarding claim 10, Ferrer teaches wherein the back-end model comprises a linear discriminant analysis (LDA) model, a probabilistic LDA (PLDA) model, and a calibration model (Discloses “linear discriminant analysis (LDA)”, followed by “probabilistic linear discriminant analysis (PLDA)”, followed by “a calibration stage”; Ferrer, ¶ pg. 4, line 29 - pg. 5, line 10), wherein to process the one or more embeddings to determine the likelihood, the processing circuitry is configured to: transform, by executing the LDA model, the one or more embeddings based on an LDA formula (“The speaker embeddings are then typically transformed using linear discriminant analysis (LDA),” which is based on the LDA formula; Ferrer, ¶ pg. 4, line 29 - pg. 5, line 10); normalize the transformed one or more embeddings based on a mean of the transformed one or more embeddings and a variance of the transformed one or more embeddings (The transformed speaker embeddings from the LDA are “then mean-[normalized]” and “variance-normalized”; Ferrer, ¶ pg. 4, line 29 - pg. 5, line 10); determine, by executing the PLDA model according to a PLDA formula, two or more scores based on the transformed and normalized one or more embeddings (“probabilistic linear discriminant analysis (PLDA) is used to obtain scores for each speaker verification trial,” which constitutes two or more scores in light of the transformed and normalized embeddings; Ferrer, ¶ pg. 4, line 29 - pg. 5, line 10); calibrate, by executing the calibration model, the two or more scores (using the scores for each speaker verification trial from the PLDA, “a calibration stage” is then “used to convert the scores produced by PLDA into log-likelihood-ratios (LLRs)”; Ferrer, ¶ pg. 4, line 29 - pg. 5, line 10); and calculate, based on the calibrated two or more scores, the likelihood that the test audio data sample represents speech performed by a particular human (“log-likelihood-ratios (LLRs) that can be used to make cost-effective Bayes decisions” for speaker verification processes; Ferrer, ¶ pg. 4, line 29 - pg. 5, line 10).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the spoof detection and speaker recognition systems of Chen to incorporate the teachings of Ferrer to include wherein the back-end model comprises a linear discriminant analysis (LDA) model, a probabilistic LDA (PLDA) model, and a calibration model, wherein to process the one or more embeddings to determine the likelihood, the processing circuitry is configured to: transform, by executing the LDA model, the one or more embeddings based on an LDA formula; normalize the transformed one or more embeddings based on a mean of the transformed one or more embeddings and a variance of the transformed one or more embeddings; determine, by executing the PLDA model according to a PLDA formula, two or more scores based on the transformed and normalized one or more embeddings; calibrate, by executing the calibration model, the two or more scores; and calculate, based on the calibrated two or more scores, the likelihood that the test audio data sample represents speech performed by a particular human. Chen teaches extracting neural embeddings using a front-end neural network and projecting those embeddings into a more discriminative subspace using linear discriminant analysis (LDA) to maximize inter-class variance between genuine and spoofed audio. However, Chen relies on a simple distance scoring module to calculate the final verification score (e.g., FIG. 7, element 716). A known limitation in the art of biometric authentication is that raw distance scores are uncalibrated and do not represent a true statistical likelihood or probability, making it difficult to set reliable authentication thresholds across different users and environments. See, for example, Non-Patent Literature to Dehak (Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., and Ouellet, P., 2010. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), pp. 788-798, hereinafter Dehak), which establishes the deficiencies of cosine distance, indicating it fundamentally “does not explicitly model the within speaker variability.” (Dehak, pg. 793, Section V-B). Ferrer teaches that to solve this issue and achieve reliable probabilistic scoring, the “standard method” in the art for processing neural embeddings involves taking LDA-transformed embeddings and processing them through a backend composed of PLDA and global logistic regression score calibration. PLDA is universally recognized in the art as the standard mathematical framework for calculating robust log-likelihood ratios (probabilities) rather than raw distances. It would have been obvious to one having ordinary skill in the art to complete the classification pipeline of Chen by applying the remainder of Ferrer’s standard PLDA and calibration backend to those LDA-projected embeddings to achieve the predictable result of highly robust, mathematically calibrated performance in unseen conditions, as recognized by Ferrer. (Ferrer, ¶ pg. 5, inclusive).
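For illustration only, and not forming part of the prior art mapping above, the standard backend pipeline Ferrer describes (LDA projection, mean/variance normalization, PLDA scoring, global calibration to log-likelihood ratios) can be sketched as follows. The projection matrix W, the bilinear form P standing in for the full PLDA log-likelihood-ratio computation, and the calibration parameters a and b are all hypothetical, simplified stand-ins for trained models:

```python
import numpy as np

def lda_transform(embeddings, W):
    # Project speaker embeddings with a (pre-trained, here hypothetical)
    # LDA projection matrix W.
    return embeddings @ W

def mean_var_normalize(X):
    # Normalize each dimension of the transformed embeddings to
    # zero mean and unit variance.
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

def plda_score(enroll, test, P):
    # Simplified PLDA-style trial score: a bilinear form enroll^T P test
    # stands in for the full PLDA log-likelihood ratio.
    return float(enroll @ P @ test)

def calibrate_scores(scores, a=1.0, b=0.0):
    # Global linear calibration mapping raw PLDA scores to LLRs.
    return [a * s + b for s in scores]

def likelihood_from_llrs(llrs):
    # Combine the calibrated LLRs into a single probability-like value
    # via the logistic function applied to the mean LLR.
    m = sum(llrs) / len(llrs)
    return 1.0 / (1.0 + np.exp(-m))
```

The calibration stage is what converts uncalibrated trial scores into LLRs on which threshold-based Bayes decisions can be made consistently across users and conditions.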
Regarding claim 19, the rejection of claim 12 is incorporated. Claim 19 is substantially the same as claim 10 and is therefore rejected under the same rationale as above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Kaskari (U.S. Pat. App. Pub. No. 2022/0093106) discloses systems and methods for speaker verification that comprise optimizing a neural network, including generating a plurality of embedding vectors configured to differentiate audio samples by speaker, computing a generalized negative log-likelihood loss (GNLL) value for a training batch based, at least in part, on the embedding vectors, and modifying weights of the neural network to reduce the GNLL value.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sean E. Serraguard whose telephone number is (313)446-6627. The examiner can normally be reached 07:00-17:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Sean E Serraguard/Primary Examiner, Art Unit 2657