Prosecution Insights
Last updated: April 19, 2026
Application No. 18/046,441

VOICE-BASED AUTHENTICATION
Non-Final OA: §101, §103

Filed: Oct 13, 2022
Examiner: CHUNG, DANIEL WONSUK
Art Unit: 2659
Tech Center: 2600 (Communications)
Assignee: Collegenet Inc.
OA Round: 2 (Non-Final)

Grant Probability: 54% (Moderate)
Expected OA Rounds: 2-3
Time to Grant: 2y 10m
Grant Probability with Interview: 92%
Examiner Intelligence

Career Allow Rate: 54% (24 granted / 44 resolved; -7.5% vs TC avg)
Interview Lift: +37.5% (strong; based on resolved cases with interview)
Avg Prosecution: 2y 10m
Currently Pending: 33
Career History: 77 total applications across all art units
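The headline figures on this card follow from simple arithmetic over the examiner's record. A minimal sketch, assuming the stated interview lift is the with-interview grant probability minus the career allow rate (which reproduces the +37.5% shown):

```python
# Figures taken from the card: 24 granted of 44 resolved cases,
# and a 92% predicted grant probability with an interview.
granted, resolved = 24, 44
with_interview = 92.0

allow_rate = round(granted / resolved * 100, 1)   # career allow rate, in percent
# Assumption: "interview lift" = with-interview rate minus career rate.
interview_lift = round(with_interview - allow_rate, 1)

print(allow_rate)      # 54.5 (shown rounded to 54% on the card)
print(interview_lift)  # 37.5
```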

Statute-Specific Performance

§101: 25.2% (-14.8% vs TC avg)
§103: 52.3% (+12.3% vs TC avg)
§102: 17.3% (-22.7% vs TC avg)
§112: 5.2% (-34.8% vs TC avg)

Tech Center averages are estimates. Based on career data from 44 resolved cases.

Office Action

Rejections: §101, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-20 are pending and have been examined.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claims 1 and 9, the limitations of “presenting a user with a series of dictionary words”, “recording first uttered words from the user corresponding to the series of dictionary words”, “assigning a confidence score to each of the first uttered words based on a comparison of the first uttered words with standard pronunciations of corresponding words in the series, wherein the comparison of the first uttered words with the standard pronunciations is performed by a speech recognition engine and comprises performing acoustic feature extraction on the first uttered words and using one or more acoustic models or language models in a decoding process to identify the first uttered words, and wherein at least some of the first uttered words have confidence scores in a lower range and are deemed to be low-confidence words”, “receiving voice input in the form of second uttered words in response to a challenge prompt including at least one of the low-confidence words”, “assigning an authentication score for each of the second uttered words based on a comparison of the second uttered words with the standard pronunciations of corresponding words in the challenge prompt”, and “granting access to a resource based at least in part on a determination that at least one of the second uttered words has an authentication score within a predefined range of the confidence score of the at least one low-confidence word”, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. More specifically, they correspond to the mental process of a human writing dictionary words on paper using pen or pencil, hearing words corresponding to the words written, assigning a confidence score to the uttered words by applying rules or instructions comparing dialects with standard pronunciations, writing a prompt on paper and hearing uttered words, assigning an authentication score to those words, and writing specific words on paper based on the authentication score and confidence score. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
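The claims 1 and 9 flow recited above can be sketched as enrollment scoring followed by a challenge decision. This is an illustrative reading only, not the applicant's implementation: the function names, the 0.6 low-confidence cutoff, and the 0.1 tolerance are hypothetical, and the recited speech recognition engine (acoustic feature extraction plus acoustic/language-model decoding) is stubbed out as pre-computed scores.

```python
def identify_low_confidence(confidence, low_cutoff=0.6):
    # Words whose enrollment confidence score falls in the lower
    # range are deemed low-confidence words.
    return [word for word, score in confidence.items() if score < low_cutoff]

def grant_access(auth_scores, confidence, low_words, tolerance=0.1):
    # Grant access if at least one re-uttered low-confidence word has an
    # authentication score within a predefined range of that word's
    # enrollment confidence score.
    return any(abs(auth_scores[w] - confidence[w]) <= tolerance
               for w in low_words if w in auth_scores)

# Hypothetical enrollment scores from reading a series of dictionary words.
enroll = {"squirrel": 0.42, "table": 0.91, "rural": 0.55}
low = identify_low_confidence(enroll)    # ["squirrel", "rural"]

# The challenge prompt includes a low-confidence word; the user re-utters it.
challenge_scores = {"squirrel": 0.47}
print(grant_access(challenge_scores, enroll, low))  # True: |0.47 - 0.42| <= 0.1
```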
Regarding claim 10, the limitations of “receiving voice input in the form of uttered words from a user”, “comparing the uttered words of the voice input with an authentication text including a plurality of high-confidence corpus words and one or more low-confidence corpus words, wherein the comparing of the uttered words with the authentication text is performed by a speech recognition engine and comprises performing acoustic feature extraction on the uttered words and using one or more acoustic models or language models in a decoding process to identify the uttered words”, “determining similarity scores for the individual uttered words based on the comparing of the uttered words with the authentication text”, “identifying a plurality of high-confidence uttered words and at least one low-confidence uttered word based on the similarity scores”, “comparing the high-confidence uttered words with a threshold”, “determining that the at least one low-confidence uttered word corresponds to any of the one or more low-confidence corpus words”, and “granting access to a resource based at least in part on a comparison of the high-confidence uttered words with a threshold and on the determination that the at least one low-confidence uttered word corresponds to any of the one or more low-confidence corpus words in a challenge prompt”, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. More specifically, they correspond to the mental process of a human hearing words, comparing the heard words with an authentication text by applying rules or instructions comparing dialects with standard pronunciations, determining similarity scores in the mind, and writing specific words on paper based on a threshold score.
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.

Regarding claim 20, the limitations of “receiving voice input in the form of uttered words from a user”, “comparing the uttered words of the voice input with an authentication text including a plurality of high-confidence corpus words and a plurality of low-confidence corpus words, wherein the comparing of the uttered words with the authentication text is performed by a speech recognition engine and comprises performing acoustic feature extraction on the uttered words and using one or more acoustic models or language models in a decoding process to identify the uttered words;”, “identifying a plurality low-confidence uttered words based on the comparison of the uttered words with the authentication text”, “determining that the low-confidence uttered words correspond to the low-confidence corpus words of the authentication text”, and “granting access to a resource based at least in part on the determination that the low-confidence uttered words correspond to the low-confidence corpus words of the authentication text”, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. More specifically, they correspond to the mental process of a human hearing words, comparing the heard words with an authentication text by applying rules or instructions comparing dialects with standard pronunciations, and writing specific words on paper based on whether the heard words correspond to the authentication text.
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.

This judicial exception is not integrated into a practical application because the recitations of a non-transitory computer-readable medium in claims 9 and 10, and of a computer system in claim 20, read on generalized computer components, based on the claim interpretation wherein the structure is interpreted using P0051-P0058 of the specification. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.

The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using generalized computer components to write words on paper, hear words, determine a confidence score or authentication score, and determine whether the words heard correspond to the challenge prompt or authentication text amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claims are not patent eligible.

With respect to claim 2, the claim recites “wherein the challenge prompt comprises a pre-defined script”, which reads on a human writing prompt words on paper using a pen or pencil. No additional limitations are present.
With respect to claim 3, the claim recites “wherein the challenge prompt includes words that are randomly selected from the series of dictionary words”, which reads on a human writing random dictionary words on paper using a pen or pencil. No additional limitations are present.

With respect to claim 4, the claim recites “wherein the granting access to the resource is performed as part of a multi-factor authentication process”, which reads on a human deciding to reveal words on paper depending on multiple factors. No additional limitations are present.

With respect to claim 5, the claim recites “performing analysis of a pitch, rhythm or speaking speed of the voice input, wherein granting access to the resource is further based on the analysis”, which reads on a human deciding to reveal words on paper depending on analysis of rhythm of voice heard. No additional limitations are present.

With respect to claim 6, the claim recites “wherein the second uttered words are spoken”, which reads on a human hearing words that are spoken. No additional limitations are present.

With respect to claim 7, the claim recites “wherein the second uttered words are sung”, which reads on a human hearing words that are sung. No additional limitations are present.

With respect to claim 8, the claim recites “wherein the resource comprises a user account, a document, a building, or a vehicle”, which reads on a human thinking of words in a document. No additional limitations are present.

With respect to claim 11, the claim recites “wherein the uttered words are uttered in response to a challenge prompt”, which reads on a human hearing words after a challenge prompt. No additional limitations are present.

With respect to claim 12, the claim recites “the steps further comprising generating the authentication text for presentation as part of a challenge prompt”, which reads on a human writing prompt words on paper using a pen or pencil. No additional limitations are present.
With respect to claim 13, the claim recites “comparing a percentage of the uttered words in the voice input identified as high-confidence with a corresponding percentage threshold” and “comparing the number of uttered words in the voice input identified as high-confidence with a corresponding number threshold”, which reads on a human comparing heard words with words in the authentication text in the mind. No additional limitations are present.

With respect to claim 14, the claim recites “wherein the authentication text is randomly selected from the corpus of words”, which reads on a human randomly selecting words in the mind. No additional limitations are present.

With respect to claim 15, the claim recites “wherein the steps are performed as part of a multi-factor authentication process”, which reads on a human deciding to reveal words on paper depending on multiple factors. No additional limitations are present.

With respect to claim 16, the claim recites “wherein the steps further comprise analyzing a pitch, rhythm or speaking speed of the voice input, and wherein granting access to the resource is further based on the pitch, rhythm, or speaking speed”, which reads on a human deciding to reveal words on paper depending on analysis of rhythm of voice heard. No additional limitations are present.

With respect to claim 17, the claim recites “wherein the uttered words are spoken”, which reads on a human hearing words that are spoken. No additional limitations are present.

With respect to claim 18, the claim recites “wherein the uttered words are sung”, which reads on a human hearing words that are sung. No additional limitations are present.

With respect to claim 19, the claim recites “wherein the resource comprises a user account, a document, a building, or a vehicle”, which reads on a human thinking of words in a document. No additional limitations are present.
These claims do not remedy the failure to integrate the judicial exception into a practical application and fail to include additional elements that are sufficient to amount to significantly more than the judicial exception.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 8-17, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ortiz et al. (U.S. PG Pub. No. 2021/0173916), hereinafter Ortiz, in view of Bohra et al. (U.S. PG Pub. No. 2023/0138820), hereinafter Bohra, and in further view of Zeljkovic et al. (U.S. PG Pub. No. 2012/0130714), hereinafter Zeljkovic.

Regarding claims 1 and 9, Ortiz teaches:

(Claim 1) A computer-implemented method for staging and carrying out authentication of a user, the method comprising, by a computer system: (P0049, Systems, devices, methods, and computer readable media are provided in various embodiments having regard to authentication using dynamic passphrases.)
(Claim 9) A non-transitory computer-readable medium having stored thereon computer-executable instructions configured to cause a computer system to authenticate a user by performing steps comprising: (P0454, Computing devices having at least one processor configured to execute software instructions stored on a computer readable tangible, non-transitory medium.; P0049, Systems, devices, methods, and computer readable media are provided in various embodiments having regard to authentication using dynamic passphrases.)

presenting a user with a series of dictionary words; (P0022, The dynamically generated passphrases can, in a simpler embodiment, be dictionary words.; P0030, A requesting individual who is not authenticated seeks access to computer resources associated with an authenticated individual (e.g., Tom's bank account). The system controlling access to the computer resources (e.g., a bank server), dynamically generates, requests, or selects a passphrase (e.g., from a dictionary, a book, etc.) having a plurality of phonemes and presents the passphrase to the user (e.g., transmits the passphrase to the requesting individual's device, or displays the passphrase, etc.).)

recording first uttered words from the user corresponding to the series of dictionary words; (P0029, The dynamically generated passphrase(s), when spoken, require an individual to adjust their features (e.g., facial or auditory) to speak a first set of words (i.e., dynamically generated passphrase(s)) including a plurality of phonemes that are captured in audio and/or video.; P0030, The system controlling access to the computer resources (e.g., a bank server), dynamically generates, requests, or selects a passphrase (e.g., from a dictionary, a book, etc.) having a plurality of phonemes.)
assigning an authentication score for each of the second uttered words based on a comparison of the second uttered words with the standard pronunciations of corresponding words in the challenge prompt, wherein the comparison of the second uttered words with the standard pronunciations comprises performing acoustic feature extraction on the second uttered words and using one or more acoustic models or language models in a decoding process to identify the first uttered words; and (P0277, At step 1740, the predicted word is compared to the provided word (i.e. actual word) and a determination is made as to if they match. At step 1760, a match confidence score may be generated. If the score is too low, the person in the video may be determined to be an imposter at step 1750.; P0278, If the score is above a certain threshold, at step 1770, then the person in the video may be determined to be a real person matching the provided identity.; P0139, The neural network may use the video to train one or more transfer functions therein (e.g., for each token or transition) to predict whether image, video, or audio data contains Alice (as opposed to another individual) saying the particular phoneme, phoneme transition, or syllable. In an embodiment, the neural network is originally a baseline neural network that is then trained individually for each authenticated individual (e.g., customized using the reading of the set of words one or more times, for example, until a threshold accuracy or confidence level is reached).)

granting access to a resource based at least in part on a determination that at least one of the second uttered words has an authentication score within a predefined range of the confidence score of the at least one low-confidence word. (P0118, The system can be used to secure physical or electronic resources, such as door access, access to online resources (e.g., online banking, cloud drives) protected by a challenge mechanism established using at least the dynamically generated passphrases to establish answers that would be readily apparent to an individual.; P0427, The compared feature vectors may be compared to determine whether the feature vectors are sufficiently similar (e.g., satisfying a threshold indicative of feature similarity). For example, similar to the determination of distances between entry vectors in regards to cluster analysis as described herein, the threshold indicative of feature similarity may be based on a distance or orientation between the two feature vectors.; P0432, In addition to an amount of matching phonemes determined, in example embodiments, the system may require a particular degree of confidence in order to authenticate the individual. For example, the model data architecture may use an activation function that is more difficult to trigger in order to output a classification. In example embodiments, the degrees of confidence may be described as the distance between the reference data feature vectors and the feature vectors in the requesting information. For example, the degree of confidence may require that the distance is limited to a certain percentage of the length of the reference feature vector.)
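Ortiz's P0427/P0432 passages quoted above frame the confidence check as a distance between feature vectors that must stay within a percentage of the reference vector's length. A small sketch of that idea, with illustrative vectors and an assumed 20% bound (not Ortiz's actual parameters):

```python
import math

def within_confidence(reference, candidate, pct=0.2):
    # Degree of confidence: the distance between the reference feature
    # vector and the candidate feature vector must be limited to a
    # certain percentage of the reference vector's length.
    dist = math.dist(reference, candidate)
    ref_len = math.hypot(*reference)
    return dist <= pct * ref_len

ref = [3.0, 4.0]                            # reference vector, length 5.0
print(within_confidence(ref, [3.3, 4.4]))   # True: distance 0.5 <= 1.0
print(within_confidence(ref, [6.0, 8.0]))   # False: distance 5.0 > 1.0
```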
Ortiz does not specifically teach:

assigning a confidence score to each of the first uttered words based on a comparison of the first uttered words with standard pronunciations of corresponding words in the series, wherein the comparison of the first uttered words with the standard pronunciations is performed by a speech recognition engine and comprises performing acoustic feature extraction on the first uttered words and using one or more acoustic models or language models in a decoding process to identify the first uttered words, and wherein at least some of the first uttered words have confidence scores in a lower range and are deemed to be low-confidence words;

receiving voice input in the form of second uttered words in response to a challenge prompt including at least one of the low-confidence words;

Bohra, however, teaches:

assigning a confidence score to each of the first uttered words based on a comparison of the first uttered words with standard pronunciations of corresponding words in the series, wherein the comparison of the first uttered words with the standard pronunciations is performed by a speech recognition engine and comprises performing acoustic feature extraction on the first uttered words and using one or more acoustic models or language models in a decoding process to identify the first uttered words, and wherein at least some of the first uttered words have confidence scores in a lower range and are deemed to be low-confidence words; (P0004, The name mispronunciation feature can obtain a reference pronunciation for the name using a user identifier associated with the name; and can obtain a pronunciation score for the name based on a comparison between the reference pronunciation for the name and the portion of the audio input corresponding to the name.; P0054, The matching algorithm can use speech features to do matching between reference pronunciation and audio of the name from the online meeting to produce acoustic scores. The features can include, but are not limited to, emphasis, prosody and phoneme level features and other acoustic signals like a signal-to-noise ratio (SNR), confidence coming out of an audio model part of speech recognition. Each of these acoustic scores (e.g., the SNR, the confidence score, and the phoneme level scores) can be combined to produce the pronunciation score.; P0055, The pronunciation score can identify how much the extracted audio matches the reference pronunciation. For example, the pronunciation score can be a value, such as a confidence level that the name was pronounced correctly.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to assign a confidence score to each uttered word according to standard pronunciations. It would have been obvious to combine the references because the assignment of confidence scores to words is a known technique that yields the predictable result of detecting words that are mispronounced. (Bohra P0055)

Ortiz in view of Bohra does not specifically teach:

receiving voice input in the form of second uttered words in response to a challenge prompt including at least one of the low-confidence words;

Zeljkovic, however, teaches:

receiving voice input in the form of second uttered words in response to a challenge prompt including at least one of the low-confidence words; (P0057, Computing device typically includes a microphone or a connector to a microphone to capture the user speech.; P0033, The system can use a very large vocabulary and number of sentences, thus increasing the difficulty for a thief to steal sufficient quantities of appropriate speech and generate utterances in real time. The system can optimize custom vocabulary and prosody for distinctiveness.
The system can customize the lexicon for each individual speaker to optimize for phonemes (individual speech sounds), phoneme clusters, words, and prosodic patterns (speech tunes) that provide the best speaker discrimination on an individual, demographic group, societal, or other group basis.; P0066, Lexicon for an English professor may include more advanced words from a very large vocabulary. Further, the specific stress patterns, intonation, pronunciation, and so forth of the various words in the lexicon can be used to more quickly and/or accurately verify the speaker.)

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to include a low-confidence word in a challenge prompt. It would have been obvious to combine the references because including a low-confidence word maximizes the ability to discriminate between speakers while minimizing length to reduce the time commitment for authentication. (Zeljkovic P0070)

Regarding claim 2, Ortiz in view of Bohra and in further view of Zeljkovic teaches claim 1. Ortiz further teaches:

wherein the challenge prompt comprises a pre-defined script. (P0349, The reference phrase may be selected from a predefined set of reference phrases.)

Regarding claim 3, Ortiz in view of Bohra and in further view of Zeljkovic teaches claim 1. Ortiz further teaches:

wherein the challenge prompt includes words that are randomly selected from the series of dictionary words. (P0385, For example, the passphrase generation unit may have a dictionary of words or source, and include an algorithm to select from the stored dictionary. Continuing the example, the passphrase generation unit may be trained to select passages from a novel (e.g., Dracula) randomly, and assess whether the words of the passage include the phonemes or phoneme transitions present in the reference data set.)
Regarding claim 4, Ortiz in view of Bohra and in further view of Zeljkovic teaches claim 1. Ortiz further teaches:

wherein the granting access to the resource is performed as part of a multi-factor authentication process. (P0127, The verification question and the dynamically generated passphrase allow for efficient and user-friendly two-factor authentication: the first factor requiring the captured media data to match a reference media data, and the second factor requiring knowledge of information stored in the profile.)

Regarding claim 5, Ortiz in view of Bohra and in further view of Zeljkovic teaches claim 1. Ortiz further teaches:

performing analysis of a pitch, rhythm or speaking speed of the voice input, wherein granting access to the resource is further based on the analysis. (P0346, Features of an individual, when speaking a phoneme, may be sufficiently unique to provide for an accurate authentication mechanism when comparing between authenticated samples and submitted samples. … Auditory features such as pitch, tone, etc. (e.g., a particular regional accent may pronounce certain phonemes differently).)

Regarding claim 6, Ortiz in view of Bohra and in further view of Zeljkovic teaches claim 1. Ortiz further teaches:

wherein the second uttered words are spoken. (P0029, In the embodiments described herein, the dynamically generated passphrase(s), when spoken, require an individual to adjust their features (e.g., facial or auditory) to speak a first set of words (i.e., dynamically generated passphrase(s)) including a plurality of phonemes that are captured in audio and/or video.)

Regarding claim 8, Ortiz in view of Bohra and in further view of Zeljkovic teaches claim 1. Ortiz further teaches:

wherein the resource comprises a user account, a document, a building, or a vehicle.
(P0008, The system can be used to secure physical or electronic resources, such as door access, access to online resources (e.g., online banking, cloud drives) protected by a challenge mechanism established using at least the dynamically generated passphrases to establish answers that would be readily apparent to an individual.)

Regarding claim 10, Ortiz teaches:

A non-transitory computer-readable medium having stored thereon computer-executable instructions configured to cause a computer system to authenticate a user by performing steps comprising: (P0454, Computing devices having at least one processor configured to execute software instructions stored on a computer readable tangible, non-transitory medium.; P0049, Systems, devices, methods, and computer readable media are provided in various embodiments having regard to authentication using dynamic passphrases.)

receiving voice input in the form of uttered words from a user; (P0029, The dynamically generated passphrase(s), when spoken, require an individual to adjust their features (e.g., facial or auditory) to speak a first set of words (i.e., dynamically generated passphrase(s)) including a plurality of phonemes that are captured in audio and/or video.)

comparing the uttered words of the voice input with an authentication text including a plurality of high-confidence corpus words and one or more low-confidence corpus words, wherein the comparing of the uttered words with the authentication text is performed by a speech recognition engine and comprises performing acoustic feature extraction on the uttered words and using one or more acoustic models or language models in a decoding process to identify the uttered words; (P0277, At step 1740, the predicted word is compared to the provided word (i.e. actual word) and a determination is made as to if they match. At step 1760, a match confidence score may be generated. If the score is too low, the person in the video may be determined to be an imposter at step 1750.; P0278, If the score is above a certain threshold, at step 1770, then the person in the video may be determined to be a real person matching the provided identity.; P0139, The neural network may use the video to train one or more transfer functions therein (e.g., for each token or transition) to predict whether image, video, or audio data contains Alice (as opposed to another individual) saying the particular phoneme, phoneme transition, or syllable. In an embodiment, the neural network is originally a baseline neural network that is then trained individually for each authenticated individual (e.g., customized using the reading of the set of words one or more times, for example, until a threshold accuracy or confidence level is reached).)

determining that the at least one low-confidence uttered word corresponds to any of the one or more low-confidence corpus words; and (P0277, At step 1730, the trained AI algorithm may predict a word spoken by the authenticated individual in the video. At step 1740, the predicted word is compared to the provided word (i.e. actual word) and a determination is made as to if they match. At step 1760, a match confidence score may be generated. If the score is too low, the person in the video may be determined to be an imposter at step 1750.)

granting access to a resource based at least in part on a comparison of the high-confidence uttered words with a threshold and on the determination that the at least one low-confidence uttered word corresponds to any of the one or more low-confidence corpus words associated with the challenge prompt. (P0118, The system can be used to secure physical or electronic resources, such as door access, access to online resources (e.g., online banking, cloud drives) protected by a challenge mechanism established using at least the dynamically generated passphrases to establish answers that would be readily apparent to an individual.; P0277, At step 1730, the trained AI algorithm may predict a word spoken by the authenticated individual in the video. At step 1740, the predicted word is compared to the provided word (i.e. actual word) and a determination is made as to if they match. At step 1760, a match confidence score may be generated. If the score is too low, the person in the video may be determined to be an imposter at step 1750.)

Ortiz does not specifically teach:

comparing the uttered words of the voice input with an authentication text including a plurality of high-confidence corpus words and one or more low-confidence corpus words, wherein the comparing of the uttered words with the authentication text is performed by a speech recognition engine and comprises performing acoustic feature extraction on the uttered words and using one or more acoustic models or language models in a decoding process to identify the uttered words;

determining similarity scores for the individual uttered words based on the comparing of the uttered words with the authentication text;

identifying a plurality of high-confidence uttered words and at least one low-confidence uttered word based on the similarity scores;

comparing the high-confidence uttered words with a threshold;

Bohra, however, teaches:

determining similarity scores for the individual uttered words based on the comparing of the uttered words with the authentication text; (P0052, The name mispronunciation feature can obtain a pronunciation score for the name based on a comparison between the reference pronunciation for the name and the portion of the audio input corresponding to the name.)
identifying a plurality of high-confidence uttered words and at least one low-confidence uttered word based on the similarity scores; (P0052, The name mispronunciation feature can obtain a pronunciation score for the name based on a comparison between the reference pronunciation for the name and the portion of the audio input corresponding to the name.; P0058, Mispronunciation feature can determine whether the pronunciation score is below a threshold.) comparing the high-confidence uttered words with a threshold; (P0058, Mispronunciation feature can determine whether the pronunciation score is below a threshold.) It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to assign a similarity score to each uttered word according to standard pronunciations. It would have been obvious to combine the references because the assignment of confidence scores to words is a known technique to yield a predictable result of detecting words that are mispronounced.
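The scoring-and-thresholding step mapped to Bohra above (per-word similarity scores, then a split into high- and low-confidence words against a threshold) can be sketched as follows. This is an illustrative sketch only; the function name, the 0.7 threshold, and the sample scores are hypothetical, not taken from the record.

```python
# Illustrative sketch of per-word confidence partitioning.
# All names and values here are hypothetical, not from the cited references.

def partition_by_confidence(scores, threshold=0.7):
    """Split a {word: similarity_score} mapping into high- and
    low-confidence word lists using a fixed threshold."""
    high = [w for w, s in scores.items() if s >= threshold]
    low = [w for w, s in scores.items() if s < threshold]
    return high, low

# Hypothetical similarity scores against standard pronunciations.
scores = {"squirrel": 0.42, "banana": 0.91, "rural": 0.55, "table": 0.88}
high, low = partition_by_confidence(scores)
# Words at or above the threshold are high-confidence; the rest are
# low-confidence candidates for a later challenge prompt.
```

A system along these lines could then build the challenge prompt from the low-confidence list, as the claims recite.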
(Bohra P0055) Ortiz in view of Bohra does not specifically teach: comparing the uttered words of the voice input with an authentication text including a plurality of high-confidence corpus words and one or more low-confidence corpus words, wherein the comparing of the uttered words with the authentication text is performed by a speech recognition engine and comprises performing acoustic feature extraction on the uttered words and using one or more acoustic models or language models in a decoding process to identify the uttered words; Zeljkovic, however, teaches: comparing the uttered words of the voice input with an authentication text including a plurality of high-confidence corpus words and one or more low-confidence corpus words, wherein the comparing of the uttered words with the authentication text is performed by a speech recognition engine and comprises performing acoustic feature extraction on the uttered words and using one or more acoustic models or language models in a decoding process to identify the uttered words; (P0057, Computing device typically includes a microphone or a connector to a microphone to capture the user speech.; P0033, The system can use a very large vocabulary and number of sentences, thus increasing the difficulty for a thief to steal sufficient quantities of appropriate speech and generate utterances in real time. The system can optimize custom vocabulary and prosody for distinctiveness. The system can customize the lexicon for each individual speaker to optimize for phonemes (individual speech sounds), phoneme clusters, words, and prosodic patterns (speech tunes) that provide the best speaker discrimination on an individual, demographic group, societal, or other group basis.; P0066, Lexicon for an English professor may include more advanced words from a very large vocabulary. 
Further, the specific stress patterns, intonation, pronunciation, and so forth of the various words in the lexicon can be used to more quickly and/or accurately verify the speaker.) It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to include high- and low-confidence words in a challenge prompt. It would have been obvious to combine the references because including high- and low-confidence words maximizes the ability to discriminate between speakers while minimizing length to reduce the time commitment for authentication. (Zeljkovic P0070) Regarding claim 11 Ortiz in view of Bohra and further view of Zeljkovic teach claim 10: Ortiz further teaches: wherein the uttered words are uttered in response to a challenge prompt. (P0435, The system may generate a new dynamically generated password, and prompt the individual to again authenticate via a video of the individual speaking the new passphrase.) Regarding claim 12 Ortiz in view of Bohra and further view of Zeljkovic teach claim 10: Ortiz further teaches: the steps further comprising generating the authentication text for presentation as part of a challenge prompt. (P0030, The system controlling access to the computer resources (e.g., a bank server), dynamically generates, requests, or selects a passphrase (e.g., from a dictionary, a book, etc.) having a plurality of phonemes and presents the passphrase to the user (e.g., transmits the passphrase to the requesting individual's device, or displays the passphrase, etc.).) Regarding claim 13 Ortiz in view of Bohra and further view of Zeljkovic teach claim 10: Ortiz further teaches: comparing a percentage of the uttered words in the voice input identified as high-confidence with a corresponding percentage threshold; or comparing the number of uttered words in the voice input identified as high-confidence with a corresponding number threshold.
(P0045, The third party computing device validates the challenge response data set by validating against the facial representation extracted by the model data architecture to establish that the challenged individual speaking the passphrase satisfies an output of the model data architecture at an acceptable confidence threshold value (e.g., a pre-defined value).) Regarding claim 14 Ortiz in view of Bohra and further view of Zeljkovic teach claim 10: Ortiz further teaches: wherein the authentication text is randomly selected from the corpus of words. (P0385, For example, the passphrase generation unit may have a dictionary of words or source, and include an algorithm to select from the stored dictionary. Continuing the example, the passphrase generation unit may be trained to select passages from a novel (e.g., Dracula) randomly, and assess whether the words of the passage include the phonemes or phoneme transitions present in the reference data set.) Regarding claim 15 Ortiz in view of Bohra and further view of Zeljkovic teach claim 10: Ortiz further teaches: wherein the steps are performed as part of a multi-factor authentication process. (P0127, The verification question and the dynamically generated passphrase allow for efficient and user-friendly two factor authentication—the first factor requiring the captured media data to match a reference media data, and the second factor requiring knowledge of information stored in the profile.) Regarding claim 16 Ortiz in view of Bohra and further view of Zeljkovic teach claim 10: Ortiz further teaches: wherein the steps further comprise analyzing a pitch, rhythm or speaking speed of the voice input, and wherein granting access to the resource is further based on the pitch, rhythm, or speaking speed. (P0346, Features of an individual, when speaking a phoneme, may be sufficiently unique to provide for an accurate authentication mechanism when comparing between authenticated samples and submitted samples.
… Auditory features such as pitch, tone, etc. (e.g., a particular regional accent may pronounce certain phonemes differently).) Regarding claim 17 Ortiz in view of Bohra and further view of Zeljkovic teach claim 10: Ortiz further teaches: wherein the uttered words are spoken. (P0029, In the embodiments described herein, the dynamically generated passphrase(s), when spoken, require an individual to adjust their features (e.g., facial or auditory) to speak a first set of words (i.e., dynamically generated passphrase(s)) including a plurality of phonemes that are captured in audio and/or video.) Regarding claim 19 Ortiz in view of Bohra and further view of Zeljkovic teach claim 10: Ortiz further teaches: wherein the resource comprises a user account, a document, a building, or a vehicle. (P0008, The system can be used to secure physical or electronic resources, such as door access, access to online resources (e.g., online banking, cloud drives) protected by a challenge mechanism established using at least the dynamically generated passphrases to establish answers that would be readily apparent to an individual.) Regarding claim 20 Ortiz teaches: A computer system comprising a memory and a processor, the computer system being programmed to perform steps comprising: (P0111, Embodiments of methods, systems, and apparatus are described through reference to the drawings.; P0153, A processor or processing device can execute instructions stored in memory to configure or implement various components.) receiving voice input in the form of uttered words from a user; (P0029, The dynamically generated passphrase(s), when spoken, require an individual to adjust their features (e.g., facial or auditory) to speak a first set of words (i.e., dynamically generated passphrase(s)) including a plurality of phonemes that are captured in audio and/or video.)
comparing the uttered words of the voice input with an authentication text including a plurality of high-confidence corpus words and a plurality of low-confidence corpus words, wherein the comparing of the uttered words with the authentication text is performed by a speech recognition engine and comprises performing acoustic feature extraction on the uttered words and using one or more acoustic models or language models in a decoding process to identify the uttered words; (P0277, At step 1740, the predicted word is compared to the provided word (i.e., actual word) and a determination is made as to if they match. At step 1760, a match confidence score may be generated. If the score is too low, the person in the video may be determined to be an imposter at step 1750.; P0278, If the score is above a certain threshold, at step 1770, then the person in the video may be determined to be a real person matching the provided identity.; P0139, The neural network may use the video to train one or more transfer functions therein (e.g., for each token or transition) to predict whether image, video, or audio data contains Alice (as opposed to another individual) saying the particular phoneme, phoneme transition, or syllable.
In an embodiment, the neural network is originally a baseline neural network that is then trained individually for each authenticated individual (e.g., customized using the reading of the set of words one or more times, for example, until a threshold accuracy or confidence level is reached).); P0004, The name mispronunciation feature can obtain a reference pronunciation for the name using a user identifier associated with the name; and can obtain a pronunciation score for the name based on a comparison between the reference pronunciation for the name and the portion of the audio input corresponding to the name.; P0054, The matching algorithm can use speech features to do matching between reference pronunciation and audio of the name from the online meeting to produce acoustic scores. The features can include, but are not limited to, emphasis, prosody and phoneme level features and other acoustic signals like a signal-to-noise ratio (SNR), confidence coming out of an audio model part of speech recognition. Each of these acoustic scores (e.g., the SNR, the confidence score, and the phoneme level scores) can be combined to produce the pronunciation score.; P0055, The pronunciation score can identify how much the extracted audio matches the reference pronunciation. For example, the pronunciation score can be a value, such as a confidence level that the name was pronounced correctly.) determining that the low-confidence uttered words correspond to the low-confidence corpus words of the authentication text; and (P0277, At step 1730, the trained AI algorithm may predict a word spoken by the authenticated individual in the video. At step 1740, the predicted word is compared to the provided word (i.e., actual word) and a determination is made as to if they match. At step 1760, a match confidence score may be generated. If the score is too low, the person in the video may be determined to be an imposter at step 1750.)
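The score combination described in the Bohra passages quoted above (P0054-P0055: SNR, recognizer confidence, and phoneme-level scores combined into one pronunciation score) could be sketched as below. The weighting scheme, the SNR normalization range, and all names are illustrative assumptions; Bohra does not specify a particular formula.

```python
# Hedged sketch of combining acoustic signals into a pronunciation score,
# in the spirit of Bohra P0054-P0055. Weights and normalization are
# assumptions for illustration, not taken from the reference.

def pronunciation_score(snr_db, asr_confidence, phoneme_scores,
                        weights=(0.2, 0.4, 0.4)):
    """Combine SNR, recognizer confidence, and phoneme-level scores
    into a single score in [0, 1] via a weighted sum."""
    snr_norm = max(0.0, min(1.0, snr_db / 30.0))  # map ~0-30 dB onto [0, 1]
    phoneme_avg = sum(phoneme_scores) / len(phoneme_scores)
    w_snr, w_conf, w_phon = weights
    return w_snr * snr_norm + w_conf * asr_confidence + w_phon * phoneme_avg

# Hypothetical inputs: clean audio, high recognizer confidence,
# three per-phoneme match scores.
score = pronunciation_score(24.0, 0.85, [0.9, 0.8, 0.95])
```

Under this sketch, a downstream check such as the one in Bohra P0058 would simply compare `score` against a threshold to flag a mispronounced (low-confidence) word.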
granting access to a resource based at least in part on the determination that the low-confidence uttered words correspond to the low-confidence corpus words of the authentication text. (P0118, The system can be used to secure physical or electronic resources, such as door access, access to online resources (e.g., online banking, cloud drives) protected by a challenge mechanism established using at least the dynamically generated passphrases to establish answers that would be readily apparent to an individual.; P0277, At step 1730, the trained AI algorithm may predict a word spoken by the authenticated individual in the video. At step 1740, the predicted word is compared to the provided word (i.e., actual word) and a determination is made as to if they match. At step 1760, a match confidence score may be generated. If the score is too low, the person in the video may be determined to be an imposter at step 1750.) Ortiz does not specifically teach: comparing the uttered words of the voice input with an authentication text including a plurality of high-confidence corpus words and a plurality of low-confidence corpus words, wherein the comparing of the uttered words with the authentication text is performed by a speech recognition engine and comprises performing acoustic feature extraction on the uttered words and using one or more acoustic models or language models in a decoding process to identify the uttered words; identifying a plurality of low-confidence uttered words based on the comparison of the uttered words with the authentication text; Bohra, however, teaches: identifying a plurality of low-confidence uttered words based on the comparison of the uttered words with the authentication text; (P0052, The name mispronunciation feature can obtain a pronunciation score for the name based on a comparison between the reference pronunciation for the name and the portion of the audio input corresponding to the name.; P0058, Mispronunciation feature can determine whether the
pronunciation score is below a threshold.) It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to assign a confidence score to each uttered word according to standard pronunciations. It would have been obvious to combine the references because the assignment of confidence scores to words is a known technique to yield a predictable result of detecting words that are mispronounced. (Bohra P0055) Ortiz in view of Bohra does not specifically teach: comparing the uttered words of the voice input with an authentication text including a plurality of high-confidence corpus words and a plurality of low-confidence corpus words, wherein the comparing of the uttered words with the authentication text is performed by a speech recognition engine and comprises performing acoustic feature extraction on the uttered words and using one or more acoustic models or language models in a decoding process to identify the uttered words; Zeljkovic, however, teaches: comparing the uttered words of the voice input with an authentication text including a plurality of high-confidence corpus words and a plurality of low-confidence corpus words, wherein the comparing of the uttered words with the authentication text is performed by a speech recognition engine and comprises performing acoustic feature extraction on the uttered words and using one or more acoustic models or language models in a decoding process to identify the uttered words; (P0057, Computing device typically includes a microphone or a connector to a microphone to capture the user speech.; P0033, The system can use a very large vocabulary and number of sentences, thus increasing the difficulty for a thief to steal sufficient quantities of appropriate speech and generate utterances in real time. The system can optimize custom vocabulary and prosody for distinctiveness.
The system can customize the lexicon for each individual speaker to optimize for phonemes (individual speech sounds), phoneme clusters, words, and prosodic patterns (speech tunes) that provide the best speaker discrimination on an individual, demographic group, societal, or other group basis.; P0066, Lexicon for an English professor may include more advanced words from a very large vocabulary. Further, the specific stress patterns, intonation, pronunciation, and so forth of the various words in the lexicon can be used to more quickly and/or accurately verify the speaker.) It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to include high- and low-confidence words in a challenge prompt where the comparison is performed through acoustic feature extraction. It would have been obvious to combine the references because including high- and low-confidence words maximizes the ability to discriminate between speakers while minimizing length to reduce the time commitment for authentication. (Zeljkovic P0070) Claims 7 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ortiz in view of Bohra, in view of Zeljkovic, and further view of McClintock et al. (U.S. Patent No. 10,587,594). Regarding claim 7 Ortiz in view of Bohra and further view of Zeljkovic teach claim 1. Ortiz in view of Bohra and further view of Zeljkovic does not specifically teach: wherein the second uttered words are sung. McClintock, however, teaches: wherein the second uttered words are sung. (Col. 10, Lines 9-32, The user may select particular lyrics to sing during authentication.) It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have words sung. It would have been obvious to combine the references because the authentication text can be words from a song and the system allows for authentication words to be sung during input. (McClintock Col.
6, Lines 64-68, Col. 7, Lines 1-17) Regarding claim 18 Ortiz in view of Bohra and further view of Zeljkovic teach claim 10. Ortiz in view of Bohra and further view of Zeljkovic does not specifically teach: wherein the uttered words are sung. McClintock, however, teaches: wherein the uttered words are sung. (Col. 10, Lines 9-32, The user may select particular lyrics to sing during authentication.) It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have words sung. It would have been obvious to combine the references because the authentication text can be words from a song and the system allows for authentication words to be sung during input. (McClintock Col. 6, Lines 64-68, Col. 7, Lines 1-17) Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL WONSUK CHUNG whose telephone number is (571)272-1345. The examiner can normally be reached Monday - Friday (7am-4pm PT). Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, PIERRE-LOUIS DESIR, can be reached at (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /DANIEL W CHUNG/Examiner, Art Unit 2659 /PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659

Prosecution Timeline

Oct 13, 2022
Application Filed
May 16, 2025
Non-Final Rejection — §101, §103
Nov 19, 2025
Response Filed
Mar 07, 2026
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12579471
DATA AUGMENTATION AND BATCH BALANCING METHODS TO ENHANCE NEGATION AND FAIRNESS
2y 5m to grant Granted Mar 17, 2026
Patent 12493892
METHOD AND SYSTEM FOR EXTRACTING CONTEXTUAL PRODUCT FEATURE MODEL FROM REQUIREMENTS SPECIFICATION DOCUMENTS
2y 5m to grant Granted Dec 09, 2025
Patent 12400078
INTERPRETABLE EMBEDDINGS
2y 5m to grant Granted Aug 26, 2025
Patent 12387000
PRIVACY-PRESERVING AVATAR VOICE TRANSMISSION
2y 5m to grant Granted Aug 12, 2025
Patent 12380875
SPEECH SYNTHESIS WITH FOREIGN FRAGMENTS
2y 5m to grant Granted Aug 05, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

2-3
Expected OA Rounds
54%
Grant Probability
92%
With Interview (+37.5%)
2y 10m
Median Time to Grant
Moderate
PTA Risk
Based on 44 resolved cases by this examiner. Grant probability derived from career allow rate.
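How these projection figures relate to one another can be sketched as follows. The derivation is an assumption about this tool's methodology (it states only that grant probability is "derived from career allow rate"); the variable names are hypothetical.

```python
# Sketch of the apparent arithmetic behind the projections panel.
# Assumption: the with-interview figure adds the observed interview
# lift (in percentage points) to the career allow rate.

granted, resolved = 24, 44            # examiner's resolved career cases
allow_rate = granted / resolved       # ≈ 0.545, displayed as 54%
interview_lift = 0.375                # +37.5 percentage points
with_interview = allow_rate + interview_lift  # ≈ 0.920, displayed as 92%
```

Note this treats the lift as additive in percentage points; the dashboard does not say whether the underlying model is this simple.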
