DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on December 29, 2025, has been entered.
Response to Arguments
Applicant’s arguments, filed December 29, 2025, regarding the rejection of claims 12 and 14 – 38 under 35 U.S.C. 103 have been considered but they are not persuasive.
On page 11 of Applicant’s response, Applicant argues “Further, claim 12 now includes that in response to a score meeting or exceeding a threshold, the system will "perform an action based on natural language processing of the audio signal of the user, the action including one or more of authentication, grant of physical or virtual access, and enable purchase of goods or services." The remaining independent claims include similar features. Zhang does not mention or suggest such features and the other cited references do not cure such deficiencies.”
However, Zhang et al. (US Patent Application Publication No. 2021/0256979), hereinafter Zhang, recites, in paragraph 0007, lines 7-14, "When voiceprint recognition results of the two pieces of voice information both match that of the authorized user, it may be determined that a current voicing user is the authorized user. It is clearly that compared with a voiceprint recognition process of one piece of voice information, the dual voiceprint recognition process of the two pieces of voice information can significantly improve accuracy and security during user identity authentication.", and recites, in paragraph 0038, lines 1-12, "In a possible design method, the recognition unit is specifically configured to: calculate a first degree of matching between the first voice component and the first voiceprint model of the authorized user; if the first matching degree is greater than a first threshold, determine that the first voice component matches the first voiceprint model of the authorized user; calculate a second degree of matching between the second voice component and the second voiceprint model of the authorized user; and if the second matching degree is greater than a second threshold, determine that the second voice component matches the second voiceprint model of the authorized user.", disclosing a score, based on analysis of the bone conduction signal and the audio signal, being greater than or equal to a preselected threshold. Zhang further recites, in paragraph 0049, lines 1-4, "In a possible design method, the sending unit is further configured to send an authentication success message or an unlocking instruction to the terminal if the voicing user is an authorized user." 
Zhang further recites, in paragraph 0141, lines 1-12, "Through the authentication process in step S406, if the mobile phone determines that the voicing user entering the voice information in step S402 is the authorized user, the mobile phone may generate the operation instruction corresponding to the voice information. For example, when the voice information is “Xiao E, pay by using WeChat”, the operation instruction corresponding to the voice information is displaying a payment screen of a WeChat APP. In this way, after generating the operation instruction for displaying the payment screen on the WeChat APP, the mobile phone may automatically enable the WeChat APP, and display the payment screen on the WeChat APP.", disclosing performing an action based on natural language processing of the audio signal of the user, the action including one or more of authentication, grant of physical or virtual access, and enable purchase of goods or services, where generating an operation instruction corresponding to voice information reads on performing an action based on natural language processing of the audio signal of the user, sending an authentication success message reads on the action being authentication, and displaying a payment screen of a WeChat APP reads on the action being enabling the purchase of goods or services.
Therefore, the rejections of claims 12 and 14 – 38 under 35 U.S.C. 103 are maintained.
Applicant’s remaining arguments, filed December 29, 2025, with respect to claims 1 – 12 and 14 – 38 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1 – 8 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (US Patent Application Publication No. 2021/0256979), hereinafter Zhang, in view of Zhong et al. (US Patent No. 10,535,364), hereinafter Zhong, Lesso (US Patent No. 10,878,825), and Blouet et al. (US Patent Application Publication No. 2023/0153408), hereinafter Blouet.
Regarding claim 1, Zhang discloses a system for two-way authentication of a user (Paragraph 0068, lines 1-4, "As shown in FIG. 1, a voice control method provided in an embodiment of this application may be applied to a voice control system including a wearable device 11 and a terminal 12."), the system comprising:
a wearable device including: a sensor to receive a bone conduction signal based on a user's speech (Paragraph 0070, lines 1-14, "As shown in FIG. 2, the wearable device 11 may specifically include a first voice sensor 201 disposed outside the wearable device 11 and a second voice sensor 202 disposed inside the wearable device 11. An inside of the wearable device 11 refers to a side that is directly in contact with a user when the user uses the wearable device 11, and an outside of the wearable device 11 refers to a side that is not directly in contact with the user. For example, the first voice sensor 201 may be an air conduction microphone, and the second voice sensor 202 may be a sensor capable of collecting a vibration signal generated when the user makes a sound, such as a bone conduction microphone, an optical vibration sensor, an acceleration sensor, or an air conduction microphone."),
and a communication circuitry configured to transmit the bone conduction signal (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition.");
and a smart device including: (a) a microphone configured to receive an audio signal from the user's speech (Paragraph 0074, lines 1-2, "As shown in FIG. 3, the terminal 12 in the voice control system may be specifically a mobile phone 100."; Paragraph 0086, lines 1-3, "The audio circuit 109, a speaker 113, and a microphone 114 may provide an audio interface between the user and the mobile phone 100."),
and (b) a bone and audio authentication circuitry configured to: receive the bone conduction signal from the communication circuitry (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition."),
analyze the bone conduction signal and the audio signal (Paragraph 0007, lines 1-7, "It can be learned that when collecting the voice information of the voicing user, the wearable device collects two pieces of voice information (that is, the first voice component and the second voice component) by using two voice sensors. In this way, the terminal may separately perform voiceprint recognition on the two pieces of voice information."; Paragraph 0021, lines 6-8, "For example, the first voice sensor is an air conduction microphone, and the second voice sensor is a bone conduction microphone."),
and authenticate the user in response to analysis of the bone conduction signal and the audio signal (Paragraph 0007, lines 7-14, "When voiceprint recognition results of the two pieces of voice information both match that of the authorized user, it may be determined that a current voicing user is the authorized user. It is clearly that compared with a voiceprint recognition process of one piece of voice information, the dual voiceprint recognition process of the two pieces of voice information can significantly improve accuracy and security during user identity authentication.").
Zhang does not specifically disclose: convert the bone conduction signal to a bone conduction binary waveform and the audio signal to an audio signal binary waveform, determine a consistency score based on a comparison of the bone conduction binary waveform and the audio signal binary waveform, in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval.
Zhong teaches:
convert the bone conduction signal to a bone conduction binary waveform and the audio signal to an audio signal binary waveform (Column 5, lines 13-25, "The voice activity data 122 may comprise information indicative of whether the user 102 wearing the HMWD 106 is speaking at a particular time. For example, voice activity data 122 may include a single bit binary value in which that a “0” represents no speech by the user 102 and a “1” indicates that the user 102 is speaking. In some implementations, the voice activity data 122 may include a timestamp. For example, the timestamp may be indicative of the time for which the determination of the voice activity data 122 is deemed to be relevant, such as the time of data acquisition, time of processing, and so forth. One or more of the BC signal data 116 or the AC signal data 118 are processed to determine presence of speech."; Column 8, lines 30-39, "At 210, BC energy data 214 is generated. The BC energy data 214 may be generated by determining the energy is greater than a threshold minimum value and less than a threshold maximum value. In one implementation, the BC energy data 214 may comprise a single bit binary value or flag in which a “1” indicates the portion of the BC signal data 116 assessed has an energy value that is within the threshold range, while a “0” indicates the portion of the BC signal data 116 assessed has an energy value that is outside of this threshold range."; Column 9, lines 31-39, "At 310, the value of the energy is compared to a threshold value and AC energy data 312 is generated. For example, the value of the energy of the AC signal data 118 may be determined to be greater than the threshold energy value. 
In one implementation, the AC energy data 312 may comprise a single bit binary value or flag in which a “1” indicates the AC signal data 118 has an energy that is within the threshold range, while a “0” indicates the AC signal data 118 has a BC energy value that is outside of this range."; Determining a single bit binary value in which a “1” indicates a portion of a bone conduction (BC) signal has an energy value that is within a threshold range and a “0” indicates the portion of the BC signal has an energy value that is outside of the threshold range reads on converting the bone conduction signal to a bone conduction binary waveform, and determining a single bit binary value in which a “1” indicates a portion of an air conduction (AC) signal has an energy value that is within a threshold range and a “0” indicates the portion of the AC signal has an energy value that is outside of the threshold range reads on converting the audio signal to an audio signal binary waveform.),
determine a consistency score based on a comparison of the bone conduction binary waveform and the audio signal binary waveform (Column 11, line 26 – Column 12, line 9, "The comparison data 412 may comprise a single bit binary value or flag in which a “1” indicates the two signals are correlated sufficiently to be deemed indicative of the same source, while a “0” indicates the two signals are not indicative of the same source. In other implementations, the comparison data 412 may include the flag, information indicative of the degree of correlation, and so forth. At 414, voice activity data 122 is determined. This determination is based on one or more of the comparison data 412, the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, the AC energy data 312, and so forth. For example, if the comparison data 412 indicates that the two signals are highly correlated (that is above a threshold and indicative of the same source), and the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, and the AC energy data 312 are all indicative of speech being present within signals, voice activity data 122 may be generated that indicates the user 102 wearing the HMWD 106 is speaking."; Comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal reads on determining a consistency score, and the comparison data being determined based on the BC energy data and the AC energy data reads on determining a consistency score based on a comparison of the bone conduction binary waveform and the audio signal binary waveform.),
in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval (Column 2, line 57 - Column 3, line 4, "The BC signal data and the AC signal data are processed to determine comparison data that is indicative of the extent of similarity between the two. For example, a cross-correlation algorithm may be used to generate comparison data that is indicative of the correlation between the BC signal data and the AC signal data. If the comparison data indicates a similarity that exceeds a threshold value, voice activity data is generated that indicates the user wearing the BC microphone is speaking. The voice activity data may be used to trigger other activities by the device or a system in communication with the device. For example, after determining that the user is speaking, the AC signal data may be processed by a speech recognition module, used for a voice over internet protocol (VOIP) call, and so forth." ; Column 10, line 46 - Column 11, line 7, "In some implementations, the signal comparison 406 may utilize a time window to account for delays associated with the operation or relative position of one or more of the BC microphone 110 or AC microphone 112. The center of the time window may be determined based on a time difference between the propagation of signals with respect to the BC microphone 110 and the AC microphone 112. For example, the AC microphone travel time may be determined by the propagation time of the sound waves from the mouth of the user 102 to the AC microphone 112. The BC microphone travel time may be determined by the propagation time of the vibrations from a vocal tract of the user 102 (such as larynx, throat, mouth, sinuses, etc.) to the location of the BC microphone 110. 
The width of the time window may be determined by the variation of the time difference among a population of users 102. Portions of the signal data that have timestamps outside of a specified time window may be disregarded from the determination of similarity. For example, the time window may be used to determine which samples in the frames from the BC signal data 116 and the AC signal data 118 are to be assessed using the cross-correlation function. In one implementation, the duration of the time window may be determined based at least in part on the physical distance between the BC microphone 110 and the AC microphone 112 and based on the speed of sound in the ambient atmosphere. The time window may be fixed, while in other implementations, the time window may vary. For example, the time window may vary based at least in part on the noise data 212."; Generating voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where the voice activity data is used to trigger other activities by the device, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold, and utilizing a time window to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal, where signal data outside of a specified time window is disregarded from the determination of similarity, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval.).
Zhong is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang to incorporate the teachings of Zhong to determine a single bit binary value in which a “1” indicates a portion of a bone conduction (BC) signal has an energy value that is within a threshold range and a “0” indicates the portion of the BC signal has an energy value that is outside of the threshold range, determine a single bit binary value in which a “1” indicates a portion of an air conduction (AC) signal has an energy value that is within a threshold range and a “0” indicates the portion of the AC signal has an energy value that is outside of the threshold range, determine comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal, where the comparison data is determined based on the BC energy data and the AC energy data, and generate voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where a time window is utilized to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal and signal data outside of a specified time window is disregarded from the determination of similarity, and where the voice activity data is used to trigger other activities by the device. Doing so would allow a user of a wearable computing device, such as a head-mounted wearable device (HMWD), to provide verbal input in environments with ambient noise, where the ambient noise is recognized as being distinct from the voice of the wearer and ignored (Zhong; Column 3, lines 5-16).
Zhang in view of Zhong does not specifically disclose authenticating a user in response to determining that a bone conduction signal and an audio signal are consistent.
Lesso teaches:
authenticating a user in response to determining a bone conduction signal and an audio signal are consistent (Column 9, line 65 - Column 10, line 8, "Additionally, or alternatively, the enable module 306 may cross-correlate the bone-conducted audio signal with the air-conducted audio signal. Upon a determination that the bone-conducted audio signal comprises a voice, the enable module 306 may cross-correlate the bone-conducted audio signal (and particularly that portion of the bone-conducted audio signal comprising the voice) with the air-conducted audio signal (particularly that portion of the air-conducted audio signal which is concurrent with the portion of the bone-conducted audio signal comprising the voice), to determine a level of correlation between the two signals."; Column 12, lines 27-35, "If the two audio signals do not correlate, it may be assumed that the voice microphone has detected significant levels of noise (e.g. the presence of other speakers). In this case, it may be inappropriate for the stored voice template to be updated based on the air-conducted voice signal, and thus the method proceeds to step 406 and ends. If the audio signals do correlate, the method proceeds to step 412, in which the biometric system determines whether the user is authenticated as the authorised user or not.”).
Lesso is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Zhang in view of Zhong to incorporate the teachings of Lesso to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal, and if the signals are correlated, determine whether the user is authenticated as the authorized user, as taught by Lesso, where the method of Zhong is used to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal, and the method of Zhang is used to determine whether the user is authenticated as the authorized user. Doing so would allow for using a bone-conducted audio signal to determine when an air-conducted audio signal comprises a voice of the user of the system (Lesso; Column 8, line 59 - Column 9, line 6).
Zhang in view of Zhong and Lesso does not specifically disclose: analyze, via a trained machine learning model trained with prior user data and with an enrollment response of the user, the bone conduction signal and the audio signal.
Blouet teaches:
analyze, via a trained machine learning model trained with prior user data and with an enrollment response of the user, the bone conduction signal and the audio signal (Paragraph 0011, lines 1-21, "An aspect of the present disclosure provides method of training a machine learning model to conduct authentication transactions that includes the step of obtaining, by an electronic device, a training dataset of audio signals. Each audio signal includes voice biometric data of a user and information for a passphrase spoken by the respective user and belongs to a same or different data class. Each data class includes a user identity and a passphrase identifier. Moreover, the method includes the steps of creating, using a machine learning model being trained, at least one embedding for each audio signal. The machine learning model includes parameters. Furthermore, the method includes calculating, by a machine learning algorithm using the embeddings, a loss, and updating parameters of the machine learning model based on the calculated loss. In response to determining criteria defining an end of training have been satisfied, deeming the machine learning model to be operable for use in simultaneously successfully verifying the identity of a user based on voice biometric data and verifying a passphrase spoken by the user matches a secret passphrase during authentication transactions."; Voice biometric data of a user reads on prior user data, and information for a passphrase spoken by the respective user reads on an enrollment response of the user.).
Blouet is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang in view of Zhong and Lesso to incorporate the teachings of Blouet to train a machine learning model to conduct authentication transactions by obtaining a training dataset of audio signals including voice biometric data of a user and information for a passphrase spoken by the respective user, where the trained machine learning model verification of Blouet is applied to the bone conduction signal and the audio signal of Zhang. Doing so would allow for conducting verification transactions that require fewer calculations (Blouet; Paragraph 0010, lines 1-8).
Regarding claim 2, Zhang in view of Zhong, Lesso, and Blouet discloses the system as claimed in claim 1.
Zhang further discloses:
wherein the wearable device comprises one or more of a headphone, a headset, an earbud, a virtual reality (VR) headset, or an augmented reality (AR) headset (Paragraph 0069, lines 1-4, "The wearable device 11 may be a device that has a voice collection function, such as a wireless headset, a wired headset, smart glasses, a smart helmet, or a smart wristwatch.").
Regarding claim 3, Zhang in view of Zhong, Lesso, and Blouet discloses the system as claimed in claim 1.
Zhang further discloses:
wherein the smart device comprises one or more of a computing device, a smart phone, a tablet, a remote control, a television, a video game entertainment system, a household internet enabled device, a household smart device, a voice controlled intelligent personal assistant device, or a voice controlled intelligent personal vehicle management device (Paragraph 0069, lines 4-9, "The terminal 12 may be a device such as a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (Ultra-mobile Personal Computer, UMPC), or a personal digital assistant (Personal Digital Assistant, PDA). This is not limited in the embodiments of this application.").
Regarding claim 4, Zhang in view of Zhong, Lesso, and Blouet discloses the system as claimed in claim 1.
Zhang further discloses:
wherein the bone and audio authentication circuitry, during the analysis of the bone conduction signal and the audio signal, is configured to: pre-process the bone conduction signal and the audio signal (Paragraph 0031, lines 1-11, "With reference to any one of the second aspect and the possible design methods of the second aspect, in a tenth possible design method of the second aspect, before the performing, by the wearable device, voiceprint recognition on the first voice component and the second voice component, the method further includes: performing, by the wearable device, noise reduction processing on the first voice component and the second voice component; and/or canceling, by the wearable device, an echo signal in each of the first voice component and the second voice component by using an echo cancellation algorithm."; Performing noise reduction processing before performing voiceprint recognition reads on pre-processing.);
generate a bone conduction feature vector and an audio conduction feature vector (Paragraph 0123, lines 1-9, "In step S405, after obtaining the first voice component and the second voice component in the voice information, the mobile phone may separately extract an audio feature of each of the first voice component and the second voice component, and then match the first voiceprint model of the authorized user 1 with the audio feature of the first voice component, and match the second voiceprint model of the authorized user 1 with the audio feature of the second voice component."; Extracting a feature reads on generating a feature vector.);
and determine, using the trained machine learning model, a probability that the reduced bone conduction feature vector and the reduced audio conduction feature vector originate from the user (Paragraph 0119, lines 1-9, "Voiceprint models of one or more authorized users may be pre-stored on the mobile phone. Each authorized user has two voiceprint models, one is a first voiceprint model established based on a voice feature of the user collected when the air conduction microphone (in other words, the first voice sensor) works, and the other is a second voiceprint model established based on a voice feature of the user collected when the bone conduction microphone (in other words, the second voice sensor) works."; Paragraph 0120, lines 1-3, "There are two phases for establishing the first voiceprint model and the second voiceprint model. The first phase is a background model training phase."; Paragraph 0121, lines 1-5, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice."; Paragraph 0123, lines 9-19, "For example, the mobile phone may calculate, by using a specific algorithm, a first matching degree (that is, a first voiceprint recognition result) between the first voiceprint model and the first voice component, and a second matching degree (that is, the second voiceprint recognition result) between the second voiceprint model and the second voice component. 
Usually, a higher matching degree indicates more similarity between the audio feature of the voice information and the audio feature of the authorized user 1, and a higher possibility that the user entering the voice information is the authorized user 1."; A matching degree between a voiceprint model and a voice feature of the user collected with a bone conduction microphone reads on a probability that the reduced bone conduction feature vector originates from the user, a matching degree between a voiceprint model and a voice feature of the user collected with an air conduction microphone reads on a probability that the reduced audio conduction feature vector originates from the user, and establishing the voiceprint models with a model training phase reads on a trained machine learning model.).
Blouet further teaches:
reduce the bone conduction feature vector and the audio conduction feature vector to generate a reduced bone conduction feature vector and a reduced audio conduction feature vector (Paragraph 0110, lines 1-19, "In step S9, the software 40 executed by the processor 18 causes the computing device 10 to create a feature matrix for each captured audio signal and, in step S10, to input the matrices into a machine learning model trained to conduct authentication transactions, for example, as described herein with regard to the flowchart illustrated in FIG. 4. The trained machine learning model software is executed by the processor 18 to cause the computing device 10 to calculate one or more embeddings for each matrix. Next, in step S11, the software 40 executed by the processor 18 causes the computing device 10 to create an enrolment template from the embeddings and store the template in the memory 20. The enrolment template can be any set of the embeddings or any function of the set of embeddings such as, but not limited to, the mean embedding vector or the reduced embedding vector. Thus, for example, when the secret passphrase is spoken three times, three matrices are created, embeddings are calculated for the matrices, and an enrollment template is created from the embeddings."; Calculating a reduced embedding vector from a feature matrix for a captured audio signal reads on reducing feature vectors.).
Blouet is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Zhang in view of Zhong and Lesso to further incorporate the teachings of Blouet to calculate a reduced embedding vector from a feature matrix for a captured audio signal. Doing so would allow for conducting verification transactions that require fewer calculations (Blouet; Paragraph 0010, lines 1-8).
Regarding claim 5, Zhang in view of Zhong, Lesso, and Blouet discloses the system as claimed in claim 4.
Zhang further discloses:
wherein the bone and audio authentication circuitry is further configured to, prior to analysis of the bone conduction signal and the audio signal, prompt the user to enroll with the enrollment response in relation to the smart device (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”.").
Regarding claim 6, Zhang in view of Zhong, Lesso, and Blouet discloses the system as claimed in claim 5.
Zhang further discloses:
wherein the enrollment response includes speaking a random or specified phrase to the smart device and with the wearable device to thereby generate an enrollment template (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”.").
Regarding claim 7, Zhang in view of Zhong, Lesso, and Blouet discloses the system as claimed in claim 6.
Zhang further discloses:
wherein the bone and audio authentication circuitry, during analysis of the bone conduction signal and the audio signal, is configured to: compare the reduced bone conduction feature vector and the reduced audio conduction feature vector to the enrollment template to determine authentication of the user (Paragraph 0119, lines 1-9, "Voiceprint models of one or more authorized users may be pre-stored on the mobile phone. Each authorized user has two voiceprint models, one is a first voiceprint model established based on a voice feature of the user collected when the air conduction microphone (in other words, the first voice sensor) works, and the other is a second voiceprint model established based on a voice feature of the user collected when the bone conduction microphone (in other words, the second voice sensor) works."; Paragraph 0123, lines 9-19, "For example, the mobile phone may calculate, by using a specific algorithm, a first matching degree (that is, a first voiceprint recognition result) between the first voiceprint model and the first voice component, and a second matching degree (that is, the second voiceprint recognition result) between the second voiceprint model and the second voice component. 
Usually, a higher matching degree indicates more similarity between the audio feature of the voice information and the audio feature of the authorized user 1, and a higher possibility that the user entering the voice information is the authorized user 1."; Calculating a matching degree between a voiceprint model and a voice feature of the user collected with a bone conduction microphone and a matching degree between a voiceprint model and a voice feature of the user collected with an air conduction microphone, where a higher matching degree indicates a higher possibility that the user entering the voice information is the authorized user, reads on comparing the reduced bone conduction feature vector and the reduced audio conduction feature vector to the enrollment template to determine authentication of the user.).
Regarding claim 8, Zhang in view of Zhong, Lesso, and Blouet discloses the system as claimed in claim 1.
Zhang further discloses:
wherein authentication of the user enables the user to access functionality of the smart device (Paragraph 0018, lines 1-7, "In a possible design method, if the voicing user is an authorized user, the method further includes: automatically executing, by the terminal, an unlock operation. In this way, the user only needs to enter the voice information once to complete a series of operations such as user identity authentication, mobile phone unlocking, and enabling a function of the mobile phone.").
Claims 12, 14 – 22, 24 and 26 – 38 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Zhong and Lesso.
Regarding claim 12, Zhang discloses a smart device for two-way authentication of a user (Paragraph 0068, lines 1-4, "As shown in FIG. 1, a voice control method provided in an embodiment of this application may be applied to a voice control system including a wearable device 11 and a terminal 12."; Paragraph 0074, lines 1-2, "As shown in FIG. 3, the terminal 12 in the voice control system may be specifically a mobile phone 100."), the smart device comprising:
a communications interface configured to receive a bone conduction signal from the user via a wearable device (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition.");
a microphone configured to receive an audio signal from the user (Paragraph 0086, lines 1-3, "The audio circuit 109, a speaker 113, and a microphone 114 may provide an audio interface between the user and the mobile phone 100.");
analyze the bone conduction signal and the audio signal (Paragraph 0007, lines 1-7, "It can be learned that when collecting the voice information of the voicing user, the wearable device collects two pieces of voice information (that is, the first voice component and the second voice component) by using two voice sensors. In this way, the terminal may separately perform voiceprint recognition on the two pieces of voice information."; Paragraph 0021, lines 6-8, "For example, the first voice sensor is an air conduction microphone, and the second voice sensor is a bone conduction microphone.");
and in response to a score, based on analysis of the bone conduction signal and the audio signal, being greater than or equal to a preselected threshold, perform an action based on natural language processing of the audio signal of the user, the action including one or more of authentication, grant of physical or virtual access, and enable purchase of goods or services (Paragraph 0007, lines 7-14, "When voiceprint recognition results of the two pieces of voice information both match that of the authorized user, it may be determined that a current voicing user is the authorized user. It is clearly that compared with a voiceprint recognition process of one piece of voice information, the dual voiceprint recognition process of the two pieces of voice information can significantly improve accuracy and security during user identity authentication."; Paragraph 0038, lines 1-12, "In a possible design method, the recognition unit is specifically configured to: calculate a first degree of matching between the first voice component and the first voiceprint model of the authorized user; if the first matching degree is greater than a first threshold, determine that the first voice component matches the first voiceprint model of the authorized user; calculate a second degree of matching between the second voice component and the second voiceprint model of the authorized user; and if the second matching degree is greater than a second threshold, determine that the second voice component matches the second voiceprint model of the authorized user."; Paragraph 0049, lines 1-4, "In a possible design method, the sending unit is further configured to send an authentication success message or an unlocking instruction to the terminal if the voicing user is an authorized user."; Paragraph 0141, lines 1-12, "Through the authentication process in step S406, if the mobile phone determines that the voicing user entering the voice information in step S402 is the authorized user, the mobile phone may generate the operation instruction corresponding to the voice information. For example, when the voice information is “Xiao E, pay by using WeChat”, the operation instruction corresponding to the voice information is displaying a payment screen of a WeChat APP. In this way, after generating the operation instruction for displaying the payment screen on the WeChat APP, the mobile phone may automatically enable the WeChat APP, and display the payment screen on the WeChat APP."; Generating an operation instruction corresponding to voice information reads on performing an action based on natural language processing of the audio signal of the user, sending an authentication success message reads on the action being authentication, and displaying a payment screen of a WeChat APP reads on the action being enabling the purchase of goods or services.).
Zhang does not specifically disclose: a bone and audio authentication circuitry configured to: determine a consistency score based on a comparison of the bone conduction signal and the audio signal, in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval.
Zhong teaches:
a bone and audio authentication circuitry configured to: determine a consistency score based on a comparison of the bone conduction signal and the audio signal (Column 11, line 26 – Column 12, line 9, "The comparison data 412 may comprise a single bit binary value or flag in which a “1” indicates the two signals are correlated sufficiently to be deemed indicative of the same source, while a “0” indicates the two signals are not indicative of the same source. In other implementations, the comparison data 412 may include the flag, information indicative of the degree of correlation, and so forth. At 414, voice activity data 122 is determined. This determination is based on one or more of the comparison data 412, the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, the AC energy data 312, and so forth. For example, if the comparison data 412 indicates that the two signals are highly correlated (that is above a threshold and indicative of the same source), and the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, and the AC energy data 312 are all indicative of speech being present within signals, voice activity data 122 may be generated that indicates the user 102 wearing the HMWD 106 is speaking."; Comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal reads on determining a consistency score based on a comparison of the bone conduction signal and the audio signal.),
in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval (Column 2, line 57 - Column 3, line 4, "The BC signal data and the AC signal data are processed to determine comparison data that is indicative of the extent of similarity between the two. For example, a cross-correlation algorithm may be used to generate comparison data that is indicative of the correlation between the BC signal data and the AC signal data. If the comparison data indicates a similarity that exceeds a threshold value, voice activity data is generated that indicates the user wearing the BC microphone is speaking. The voice activity data may be used to trigger other activities by the device or a system in communication with the device. For example, after determining that the user is speaking, the AC signal data may be processed by a speech recognition module, used for a voice over internet protocol (VOIP) call, and so forth." ; Column 10, line 46 - Column 11, line 7, "In some implementations, the signal comparison 406 may utilize a time window to account for delays associated with the operation or relative position of one or more of the BC microphone 110 or AC microphone 112. The center of the time window may be determined based on a time difference between the propagation of signals with respect to the BC microphone 110 and the AC microphone 112. For example, the AC microphone travel time may be determined by the propagation time of the sound waves from the mouth of the user 102 to the AC microphone 112. The BC microphone travel time may be determined by the propagation time of the vibrations from a vocal tract of the user 102 (such as larynx, throat, mouth, sinuses, etc.) to the location of the BC microphone 110. 
The width of the time window may be determined by the variation of the time difference among a population of users 102. Portions of the signal data that have timestamps outside of a specified time window may be disregarded from the determination of similarity. For example, the time window may be used to determine which samples in the frames from the BC signal data 116 and the AC signal data 118 are to be assessed using the cross-correlation function. In one implementation, the duration of the time window may be determined based at least in part on the physical distance between the BC microphone 110 and the AC microphone 112 and based on the speed of sound in the ambient atmosphere. The time window may be fixed, while in other implementations, the time window may vary. For example, the time window may vary based at least in part on the noise data 212."; Generating voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where the voice activity data is used to trigger other activities by the device, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold, and utilizing a time window to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal, where signal data outside of a specified time window is disregarded from the determination of similarity, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval.).
Zhong is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang to incorporate the teachings of Zhong to determine comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal, and generate voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where a time window is utilized to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal and signal data outside of a specified time window is disregarded from the determination of similarity, and where the voice activity data is used to trigger other activities by the device. Doing so would allow the user of a wearable computing device such as a head-mounted wearable device (HMWD) to provide verbal input in environments with ambient noise, where ambient noise is recognized as being distinct from the voice of the wearer and ignored (Zhong; Column 3, lines 5-16).
Zhang in view of Zhong does not specifically disclose authenticating a user in response to determining a bone conduction signal and an audio signal are consistent.
Lesso teaches:
authenticating a user in response to determining a bone conduction signal and an audio signal are consistent (Column 9, line 65 - Column 10, line 8, "Additionally, or alternatively, the enable module 306 may cross-correlate the bone-conducted audio signal with the air-conducted audio signal. Upon a determination that the bone-conducted audio signal comprises a voice, the enable module 306 may cross-correlate the bone-conducted audio signal (and particularly that portion of the bone-conducted audio signal comprising the voice) with the air-conducted audio signal (particularly that portion of the air-conducted audio signal which is concurrent with the portion of the bone-conducted audio signal comprising the voice), to determine a level of correlation between the two signals."; Column 12, lines 27-35, "If the two audio signals do not correlate, it may be assumed that the voice microphone has detected significant levels of noise (e.g. the presence of other speakers). In this case, it may be inappropriate for the stored voice template to be updated based on the air-conducted voice signal, and thus the method proceeds to step 406 and ends. If the audio signals do correlate, the method proceeds to step 412, in which the biometric system determines whether the user is authenticated as the authorised user or not.”).
Lesso is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Zhang in view of Zhong to incorporate the teachings of Lesso to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal, and if the signals are correlated, determine whether the user is authenticated as the authorized user, where the method of Zhong is used to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal, and the method of Zhang is used to determine whether the user is authenticated as the authorized user. Doing so would allow for using a bone-conducted audio signal to determine when an air-conducted audio signal comprises a voice of the user of the system (Lesso; Column 8, line 59 - Column 9, line 6).
Regarding claim 14, Zhang in view of Zhong and Lesso discloses the smart device as claimed in claim 12.
Zhang further discloses:
wherein the bone and audio authentication circuitry is configured to, prior to analysis, prompt the user to submit an enrollment template or response (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”."; The registration voice reads on an enrollment response.).
Regarding claim 15, Zhang in view of Zhong and Lesso discloses the smart device as claimed in claim 14.
Zhang further discloses:
wherein the enrollment template or response includes one or more specific phrases to be spoken by the user (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”."; The registration voice reads on an enrollment response, and “Hello Xiao E” reads on a specific phrase to be spoken by the user.).
Regarding claim 16, Zhang in view of Zhong and Lesso discloses the smart device as claimed in claim 15.
Zhang further discloses:
wherein the smart device is configured to prompt the user to submit the enrollment template or response for the wearable device (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”."; The Bluetooth headset reads on the wearable device.).
Regarding claim 17, Zhang discloses a system for two-way authentication of a user (Paragraph 0068, lines 1-4, "As shown in FIG. 1, a voice control method provided in an embodiment of this application may be applied to a voice control system including a wearable device 11 and a terminal 12."), the system comprising:
one or more wearable devices, each of the one or more wearable devices including: a sensor to receive a bone conduction signal based on a user's speech (Paragraph 0070, lines 1-14, "As shown in FIG. 2, the wearable device 11 may specifically include a first voice sensor 201 disposed outside the wearable device 11 and a second voice sensor 202 disposed inside the wearable device 11. An inside of the wearable device 11 refers to a side that is directly in contact with a user when the user uses the wearable device 11, and an outside of the wearable device 11 refers to a side that is not directly in contact with the user. For example, the first voice sensor 201 may be an air conduction microphone, and the second voice sensor 202 may be a sensor capable of collecting a vibration signal generated when the user makes a sound, such as a bone conduction microphone, an optical vibration sensor, an acceleration sensor, or an air conduction microphone."),
and a communication circuitry configured to wirelessly transmit the bone conduction signal (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition.");
and a smart device including; (a) a microphone configured to receive an audio signal from the user's speech (Paragraph 0074, lines 1-2, "As shown in FIG. 3, the terminal 12 in the voice control system may be specifically a mobile phone 100."; Paragraph 0086, lines 1-3, "The audio circuit 109, a speaker 113, and a microphone 114 may provide an audio interface between the user and the mobile phone 100."),
and (b) bone and audio authentication circuitry configured to: receive the bone conduction signal from the communication circuitry (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition."),
determine, using a first verification model, a first probability indicating that the audio signal originates from the user, determine, using a second verification model, a second probability indicating that the bone conduction signal originates from the user (Paragraph 0123, lines 9-19, "For example, the mobile phone may calculate, by using a specific algorithm, a first matching degree (that is, a first voiceprint recognition result) between the first voiceprint model and the first voice component, and a second matching degree (that is, the second voiceprint recognition result) between the second voiceprint model and the second voice component. Usually, a higher matching degree indicates more similarity between the audio feature of the voice information and the audio feature of the authorized user 1, and a higher possibility that the user entering the voice information is the authorized user 1."; A matching degree between a voiceprint model and a voice feature of the user collected with an air conduction microphone reads on a probability indicating that the audio signal originates from the user, and a matching degree between a voiceprint model and a voice feature of the user collected with a bone conduction microphone reads on a probability indicating that the bone conduction signal originates from the user.),
and in response to the first probability indicating that the audio signal originates from the user being greater than or equal to a second preselected threshold and the second probability indicating that the bone conduction signal originates from the user being greater than or equal to a third preselected threshold: perform an action based on natural language processing of the audio signal of the user, the action including one or more of authentication, grant of physical or virtual access, and enable purchase of goods or services (Paragraph 0007, lines 7-14, "When voiceprint recognition results of the two pieces of voice information both match that of the authorized user, it may be determined that a current voicing user is the authorized user. It is clearly that compared with a voiceprint recognition process of one piece of voice information, the dual voiceprint recognition process of the two pieces of voice information can significantly improve accuracy and security during user identity authentication."; Paragraph 0038, lines 1-12, "In a possible design method, the recognition unit is specifically configured to: calculate a first degree of matching between the first voice component and the first voiceprint model of the authorized user; if the first matching degree is greater than a first threshold, determine that the first voice component matches the first voiceprint model of the authorized user; calculate a second degree of matching between the second voice component and the second voiceprint model of the authorized user; and if the second matching degree is greater than a second threshold, determine that the second voice component matches the second voiceprint model of the authorized user."; Paragraph 0049, lines 1-4, "In a possible design method, the sending unit is further configured to send an authentication success message or an unlocking instruction to the terminal if the voicing user is an authorized user."; Paragraph 0141, lines 1-12, "Through the authentication process in step S406, if the mobile phone determines that the voicing user entering the voice information in step S402 is the authorized user, the mobile phone may generate the operation instruction corresponding to the voice information. For example, when the voice information is “Xiao E, pay by using WeChat”, the operation instruction corresponding to the voice information is displaying a payment screen of a WeChat APP. In this way, after generating the operation instruction for displaying the payment screen on the WeChat APP, the mobile phone may automatically enable the WeChat APP, and display the payment screen on the WeChat APP."; The first matching degree being greater than a first threshold reads on the first probability indicating that the audio signal originates from the user being greater than or equal to a second preselected threshold, the second matching degree being greater than a second threshold reads on the second probability indicating that the bone conduction signal originates from the user being greater than or equal to a third preselected threshold, generating an operation instruction corresponding to voice information reads on performing an action based on natural language processing of the audio signal of the user, sending an authentication success message reads on the action being authentication, and displaying a payment screen of a WeChat APP reads on the action being enabling the purchase of goods or services.).
Zhang does not specifically disclose: convert the bone conduction signal to a bone conduction binary waveform and the audio signal to an audio signal binary waveform, determine, using a consistency evaluation algorithm, a consistency score based on a comparison of the bone conduction binary waveform and the audio signal binary waveform, in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval.
Zhong teaches:
convert the bone conduction signal to a bone conduction binary waveform and the audio signal to an audio signal binary waveform (Column 5, lines 13-25, "The voice activity data 122 may comprise information indicative of whether the user 102 wearing the HMWD 106 is speaking at a particular time. For example, voice activity data 122 may include a single bit binary value in which that a “0” represents no speech by the user 102 and a “1” indicates that the user 102 is speaking. In some implementations, the voice activity data 122 may include a timestamp. For example, the timestamp may be indicative of the time for which the determination of the voice activity data 122 is deemed to be relevant, such as the time of data acquisition, time of processing, and so forth. One or more of the BC signal data 116 or the AC signal data 118 are processed to determine presence of speech."; Column 8, lines 30-39, "At 210, BC energy data 214 is generated. The BC energy data 214 may be generated by determining the energy is greater than a threshold minimum value and less than a threshold maximum value. In one implementation, the BC energy data 214 may comprise a single bit binary value or flag in which a “1” indicates the portion of the BC signal data 116 assessed has an energy value that is within the threshold range, while a “0” indicates the portion of the BC signal data 116 assessed has an energy value that is outside of this threshold range."; Column 9, lines 31-39, "At 310, the value of the energy is compared to a threshold value and AC energy data 312 is generated. For example, the value of the energy of the AC signal data 118 may be determined to be greater than the threshold energy value. 
In one implementation, the AC energy data 312 may comprise a single bit binary value or flag in which a “1” indicates the AC signal data 118 has an energy that is within the threshold range, while a “0” indicates the AC signal data 118 has a BC energy value that is outside of this range."; Determining a single bit binary value in which a “1” indicates a portion of a bone conduction (BC) signal has an energy value that is within a threshold range and a “0” indicates the portion of the BC signal has an energy value that is outside of the threshold range reads on converting the bone conduction signal to a bone conduction binary waveform, and determining a single bit binary value in which a “1” indicates a portion of an air conduction (AC) signal has an energy value that is within a threshold range and a “0” indicates the portion of the AC signal has an energy value that is outside of the threshold range reads on converting the audio signal to an audio signal binary waveform.),
determine, using a consistency evaluation algorithm, a consistency score based on a comparison of the bone conduction binary waveform and the audio signal binary waveform (Column 11, line 26 – Column 12, line 9, "The comparison data 412 may comprise a single bit binary value or flag in which a “1” indicates the two signals are correlated sufficiently to be deemed indicative of the same source, while a “0” indicates the two signals are not indicative of the same source. In other implementations, the comparison data 412 may include the flag, information indicative of the degree of correlation, and so forth. At 414, voice activity data 122 is determined. This determination is based on one or more of the comparison data 412, the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, the AC energy data 312, and so forth. For example, if the comparison data 412 indicates that the two signals are highly correlated (that is above a threshold and indicative of the same source), and the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, and the AC energy data 312 are all indicative of speech being present within signals, voice activity data 122 may be generated that indicates the user 102 wearing the HMWD 106 is speaking."; Comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal reads on determining a consistency score, and the comparison data being determined based on the BC energy data and the AC energy data reads on determining a consistency score based on a comparison of the bone conduction binary waveform and the audio signal binary waveform.),
in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval (Column 2, line 57 - Column 3, line 4, "The BC signal data and the AC signal data are processed to determine comparison data that is indicative of the extent of similarity between the two. For example, a cross-correlation algorithm may be used to generate comparison data that is indicative of the correlation between the BC signal data and the AC signal data. If the comparison data indicates a similarity that exceeds a threshold value, voice activity data is generated that indicates the user wearing the BC microphone is speaking. The voice activity data may be used to trigger other activities by the device or a system in communication with the device. For example, after determining that the user is speaking, the AC signal data may be processed by a speech recognition module, used for a voice over internet protocol (VOIP) call, and so forth." ; Column 10, line 46 - Column 11, line 7, "In some implementations, the signal comparison 406 may utilize a time window to account for delays associated with the operation or relative position of one or more of the BC microphone 110 or AC microphone 112. The center of the time window may be determined based on a time difference between the propagation of signals with respect to the BC microphone 110 and the AC microphone 112. For example, the AC microphone travel time may be determined by the propagation time of the sound waves from the mouth of the user 102 to the AC microphone 112. The BC microphone travel time may be determined by the propagation time of the vibrations from a vocal tract of the user 102 (such as larynx, throat, mouth, sinuses, etc.) to the location of the BC microphone 110. 
The width of the time window may be determined by the variation of the time difference among a population of users 102. Portions of the signal data that have timestamps outside of a specified time window may be disregarded from the determination of similarity. For example, the time window may be used to determine which samples in the frames from the BC signal data 116 and the AC signal data 118 are to be assessed using the cross-correlation function. In one implementation, the duration of the time window may be determined based at least in part on the physical distance between the BC microphone 110 and the AC microphone 112 and based on the speed of sound in the ambient atmosphere. The time window may be fixed, while in other implementations, the time window may vary. For example, the time window may vary based at least in part on the noise data 212."; Generating voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where the voice activity data is used to trigger other activities by the device, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold, and utilizing a time window to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal, where signal data outside of a specified time window is disregarded from the determination of similarity, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval.).
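For illustration only, the comparison described in the passages quoted above (cross-correlating the two signals, restricting the search to a time window that accounts for propagation delay, and comparing the result to a threshold) can be sketched as follows. This is not code from the reference; the function names, window width, and threshold value are hypothetical:

```python
import math

def consistency_score(bc, ac, max_lag):
    """Peak normalized cross-correlation between a bone-conduction and an
    air-conduction binary waveform, searched only over lags within
    +/- max_lag samples (the "time window"). Illustrative sketch of the
    comparison step quoted from Zhong; names and values are hypothetical."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(dot(bc, bc) * dot(ac, ac)) or 1.0
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            c = dot(bc[lag:], ac[:len(ac) - lag])
        else:
            c = dot(bc[:lag], ac[-lag:])
        best = max(best, c / norm)
    return best

bc = [1, 1, 0, 1, 1, 0, 0, 1]
ac = [0, 1, 1, 0, 1, 1, 0, 0]  # same pattern delayed by one sample
score = consistency_score(bc, ac, max_lag=2)
speaking = score >= 0.8        # predetermined threshold (illustrative value)
```

Lags outside `max_lag` are never evaluated, which corresponds to disregarding signal data whose timestamps fall outside the specified time window.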
Zhong is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang to incorporate the teachings of Zhong to determine a single bit binary value in which a “1” indicates a portion of a bone conduction (BC) signal has an energy value that is within a threshold range and a “0” indicates the portion of the BC signal has an energy value that is outside of the threshold range, determine a single bit binary value in which a “1” indicates a portion of an air conduction (AC) signal has an energy value that is within a threshold range and a “0” indicates the portion of the AC signal has an energy value that is outside of the threshold range, determine comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal, where the comparison data is determined based on the BC energy data and the AC energy data, and generate voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where a time window is utilized to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal and signal data outside of a specified time window is disregarded from the determination of similarity, and where the voice activity data is used to trigger other activities by the device. Doing so would allow the user of a wearable computing device, such as a head-mounted wearable device (HMWD), to provide verbal input in environments with ambient noise, where ambient noise is recognized as being distinct from the voice of the wearer and ignored (Zhong; Column 3, lines 5-16).
Zhang in view of Zhong does not specifically disclose authenticating a user in response to determining a bone conduction signal and an audio signal are consistent.
Lesso teaches:
authenticating a user in response to determining a bone conduction signal and an audio signal are consistent (Column 9, line 65 - Column 10, line 8, "Additionally, or alternatively, the enable module 306 may cross-correlate the bone-conducted audio signal with the air-conducted audio signal. Upon a determination that the bone-conducted audio signal comprises a voice, the enable module 306 may cross-correlate the bone-conducted audio signal (and particularly that portion of the bone-conducted audio signal comprising the voice) with the air-conducted audio signal (particularly that portion of the air-conducted audio signal which is concurrent with the portion of the bone-conducted audio signal comprising the voice), to determine a level of correlation between the two signals."; Column 12, lines 27-35, "If the two audio signals do not correlate, it may be assumed that the voice microphone has detected significant levels of noise (e.g. the presence of other speakers). In this case, it may be inappropriate for the stored voice template to be updated based on the air-conducted voice signal, and thus the method proceeds to step 406 and ends. If the audio signals do correlate, the method proceeds to step 412, in which the biometric system determines whether the user is authenticated as the authorised user or not.”).
Lesso is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Zhang in view of Zhong to incorporate the teachings of Lesso to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal, and if the signals are correlated, determine whether the user is authenticated as the authorized user, as taught by Lesso, where the method of Zhong is used to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal, and the method of Zhang is used to determine whether the user is authenticated as the authorized user. Doing so would allow for using a bone-conducted audio signal to determine when an air-conducted audio signal comprises a voice of the user of the system (Lesso; Column 8, line 59 - Column 9, line 6).
Regarding claim 18, Zhang in view of Zhong and Lesso discloses the system as claimed in claim 17.
Zhang further discloses:
wherein the smart device is configured to, prior to authentication, prompt the user to submit an enrollment response (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”."; The registration voice reads on an enrollment response.).
Regarding claim 19, Zhang in view of Zhong and Lesso discloses the system as claimed in claim 18.
Zhang further discloses:
wherein the smart device is configured to prompt the user to submit the enrollment response for each of the one or more wearable devices (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”."; The Bluetooth headset reads on the wearable device.).
Regarding claim 20, Zhang in view of Zhong and Lesso discloses the system as claimed in claim 19.
Zhang further discloses:
wherein the first verification model and the second verification model are trained using data from prior users and using a user's enrollment response (Paragraph 0120, lines 1-19, "There are two phases for establishing the first voiceprint model and the second voiceprint model. The first phase is a background model training phase. In the first phase, a. developer may collect voices of related texts (for example, “Hello, Xiao E”) generated when a large quantity of speakers wearing the Bluetooth headset make a sound. Further, after performing filtering and noise reduction on the voices of the related texts, the mobile phone may extract an audio feature (for example, a time-frequency noise spectrum graph, or a gammatone-like spectrogram) in a background sound, and a background model of voiceprint recognition is established by using a machine learning algorithm such as a GMM (gaussian mixed model, Gaussian mixture model), an SVM (support vector machines, support vector machine), or a deep neural network framework. The mobile phone or the Bluetooth headset may establish, based on the background model and a registration voice entered by a user, a first voiceprint model and a second voiceprint model belonging to the user."; A registration voice entered by a user reads on a user's enrollment response, and voices of related texts generated from a large quantity of speakers reads on data from prior users.).
Regarding claim 21, Zhang in view of Zhong and Lesso discloses the system as claimed in claim 20.
Zhang further discloses:
wherein the user's enrollment response includes one or more bone conduction signals and corresponding one or more audio signals (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”."; Establishing the first voiceprint model and the second voiceprint model belonging to the user by entering the registration voice reads on the user's enrollment response including one or more bone conduction signals and corresponding one or more audio signals.).
Regarding claim 22, Zhang in view of Zhong and Lesso discloses the system as claimed in claim 21.
Zhang further discloses:
wherein the enrollment response further includes an indicator to identify one of the one or more wearable devices corresponding to at least one of the one or more bone conduction signals and at least one of the one or more audio signals (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”."; A user wearing a Bluetooth headset and saying a registration voice reads on an indicator to identify one of the one or more wearable devices corresponding to at least one of the one or more bone conduction signals and at least one of the one or more audio signals.).
Regarding claim 24, Zhang discloses a system for two-way authentication of a user (Paragraph 0068, lines 1-4, "As shown in FIG. 1, a voice control method provided in an embodiment of this application may be applied to a voice control system including a wearable device 11 and a terminal 12."), the system comprising:
a wearable device including: a first sensor to receive a bone conduction signal based on a user's speech (Paragraph 0070, lines 1-14, "As shown in FIG. 2, the wearable device 11 may specifically include a first voice sensor 201 disposed outside the wearable device 11 and a second voice sensor 202 disposed inside the wearable device 11. An inside of the wearable device 11 refers to a side that is directly in contact with a user when the user uses the wearable device 11, and an outside of the wearable device 11 refers to a side that is not directly in contact with the user. For example, the first voice sensor 201 may be an air conduction microphone, and the second voice sensor 202 may be a sensor capable of collecting a vibration signal generated when the user makes a sound, such as a bone conduction microphone, an optical vibration sensor, an acceleration sensor, or an air conduction microphone."),
and a first communication circuitry configured to transmit the bone conduction signal (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition.");
a device including: a second sensor configured to receive an identification signal from a user (Paragraph 0074, lines 1-2, "As shown in FIG. 3, the terminal 12 in the voice control system may be specifically a mobile phone 100."; Paragraph 0086, lines 1-3, "The audio circuit 109, a speaker 113, and a microphone 114 may provide an audio interface between the user and the mobile phone 100."; A microphone reads on a sensor configured to receive an identification signal from a user.),
and a second communication circuitry configured to transmit the identification signal (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition."; A voice component reads on the identification signal.);
and an authentication circuitry configured to: receive the bone conduction signal from the first communication circuitry, receive the identification signal from the second communication circuitry (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition."; The two voice components reads on the bone conduction signal and the identification signal.),
analyze the bone conduction signal and the identification signal (Paragraph 0007, lines 1-7, "It can be learned that when collecting the voice information of the voicing user, the wearable device collects two pieces of voice information (that is, the first voice component and the second voice component) by using two voice sensors. In this way, the terminal may separately perform voiceprint recognition on the two pieces of voice information."; Paragraph 0021, lines 6-8, "For example, the first voice sensor is an air conduction microphone, and the second voice sensor is a bone conduction microphone."),
and perform an action based on natural language processing of the audio signal of the user, the action including one or more of authentication, grant of physical or virtual access, and enable purchase of goods or services (Paragraph 0007, lines 7-14, "When voiceprint recognition results of the two pieces of voice information both match that of the authorized user, it may be determined that a current voicing user is the authorized user. It is clearly that compared with a voiceprint recognition process of one piece of voice information, the dual voiceprint recognition process of the two pieces of voice information can significantly improve accuracy and security during user identity authentication."; Paragraph 0049, lines 1-4, "In a possible design method, the sending unit is further configured to send an authentication success message or an unlocking instruction to the terminal if the voicing user is an authorized user."; Paragraph 0141, lines 1-12, "Through the authentication process in step S406, if the mobile phone determines that the voicing user entering the voice information in step S402 is the authorized user, the mobile phone may generate the operation instruction corresponding to the voice information. For example, when the voice information is “Xiao E, pay by using WeChat”, the operation instruction corresponding to the voice information is displaying a payment screen of a WeChat APP. 
In this way, after generating the operation instruction for displaying the payment screen on the WeChat APP, the mobile phone may automatically enable the WeChat APP, and display the payment screen on the WeChat APP."; Generating an operation instruction corresponding to voice information reads on performing an action based on natural language processing of the audio signal of the user, sending an authentication success message reads on the action being authentication, and displaying a payment screen of a WeChat APP reads on the action being enabling the purchase of goods or services.).
Zhang does not specifically disclose: determine, using a consistency evaluation algorithm, a consistency score based on a comparison of the bone conduction signal and the identification signal, and in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the identification signal being less than a preselected time interval.
Zhong teaches:
determine, using a consistency evaluation algorithm, a consistency score based on a comparison of the bone conduction signal and the identification signal (Column 11, line 26 – Column 12, line 9, "The comparison data 412 may comprise a single bit binary value or flag in which a “1” indicates the two signals are correlated sufficiently to be deemed indicative of the same source, while a “0” indicates the two signals are not indicative of the same source. In other implementations, the comparison data 412 may include the flag, information indicative of the degree of correlation, and so forth. At 414, voice activity data 122 is determined. This determination is based on one or more of the comparison data 412, the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, the AC energy data 312, and so forth. For example, if the comparison data 412 indicates that the two signals are highly correlated (that is above a threshold and indicative of the same source), and the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, and the AC energy data 312 are all indicative of speech being present within signals, voice activity data 122 may be generated that indicates the user 102 wearing the HMWD 106 is speaking."; Comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal reads on determining a consistency score based on a comparison of a bone conduction signal and an identification signal.),
and in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the identification signal being less than a preselected time interval (Column 2, line 57 - Column 3, line 4, "The BC signal data and the AC signal data are processed to determine comparison data that is indicative of the extent of similarity between the two. For example, a cross-correlation algorithm may be used to generate comparison data that is indicative of the correlation between the BC signal data and the AC signal data. If the comparison data indicates a similarity that exceeds a threshold value, voice activity data is generated that indicates the user wearing the BC microphone is speaking. The voice activity data may be used to trigger other activities by the device or a system in communication with the device. For example, after determining that the user is speaking, the AC signal data may be processed by a speech recognition module, used for a voice over internet protocol (VOIP) call, and so forth." ; Column 10, line 46 - Column 11, line 7, "In some implementations, the signal comparison 406 may utilize a time window to account for delays associated with the operation or relative position of one or more of the BC microphone 110 or AC microphone 112. The center of the time window may be determined based on a time difference between the propagation of signals with respect to the BC microphone 110 and the AC microphone 112. For example, the AC microphone travel time may be determined by the propagation time of the sound waves from the mouth of the user 102 to the AC microphone 112. The BC microphone travel time may be determined by the propagation time of the vibrations from a vocal tract of the user 102 (such as larynx, throat, mouth, sinuses, etc.) to the location of the BC microphone 110. 
The width of the time window may be determined by the variation of the time difference among a population of users 102. Portions of the signal data that have timestamps outside of a specified time window may be disregarded from the determination of similarity. For example, the time window may be used to determine which samples in the frames from the BC signal data 116 and the AC signal data 118 are to be assessed using the cross-correlation function. In one implementation, the duration of the time window may be determined based at least in part on the physical distance between the BC microphone 110 and the AC microphone 112 and based on the speed of sound in the ambient atmosphere. The time window may be fixed, while in other implementations, the time window may vary. For example, the time window may vary based at least in part on the noise data 212."; Generating voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where the voice activity data is used to trigger other activities by the device, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold, and utilizing a time window to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal, where signal data outside of a specified time window is disregarded from the determination of similarity, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of a bone conduction signal and an identification signal being less than a preselected time interval.).
Zhong is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang to incorporate the teachings of Zhong to determine comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal, and generate voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where a time window is utilized to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal and signal data outside of a specified time window is disregarded from the determination of similarity, and where the voice activity data is used to trigger other activities by the device. Doing so would allow the user of a wearable computing device, such as a head-mounted wearable device (HMWD), to provide verbal input in environments with ambient noise, where ambient noise is recognized as being distinct from the voice of the wearer and ignored (Zhong; Column 3, lines 5-16).
Zhang in view of Zhong does not specifically disclose authenticating a user in response to determining a bone conduction signal and an identification signal are consistent.
Lesso teaches:
authenticating a user in response to determining a bone conduction signal and an identification signal are consistent (Column 9, line 65 - Column 10, line 8, "Additionally, or alternatively, the enable module 306 may cross-correlate the bone-conducted audio signal with the air-conducted audio signal. Upon a determination that the bone-conducted audio signal comprises a voice, the enable module 306 may cross-correlate the bone-conducted audio signal (and particularly that portion of the bone-conducted audio signal comprising the voice) with the air-conducted audio signal (particularly that portion of the air-conducted audio signal which is concurrent with the portion of the bone-conducted audio signal comprising the voice), to determine a level of correlation between the two signals."; Column 12, lines 27-35, "If the two audio signals do not correlate, it may be assumed that the voice microphone has detected significant levels of noise (e.g. the presence of other speakers). In this case, it may be inappropriate for the stored voice template to be updated based on the air-conducted voice signal, and thus the method proceeds to step 406 and ends. If the audio signals do correlate, the method proceeds to step 412, in which the biometric system determines whether the user is authenticated as the authorised user or not.”).
Lesso is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Zhang in view of Zhong to incorporate the teachings of Lesso to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal, and if the signals are correlated, determine whether the user is authenticated as the authorized user, as taught by Lesso, where the method of Zhong is used to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal, and the method of Zhang is used to determine whether the user is authenticated as the authorized user. Doing so would allow for using a bone-conducted audio signal to determine when an air-conducted audio signal comprises a voice of the user of the system (Lesso; Column 8, line 59 - Column 9, line 6).
Regarding claim 26, Zhang discloses a smart device for two-way authentication of a user (Paragraph 0068, lines 1-4, "As shown in FIG. 1, a voice control method provided in an embodiment of this application may be applied to a voice control system including a wearable device 11 and a terminal 12."; Paragraph 0074, lines 1-2, "As shown in FIG. 3, the terminal 12 in the voice control system may be specifically a mobile phone 100."), the smart device comprising:
a communications interface configured to receive a bone conduction signal from a user (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition.");
a microphone configured to receive an audio signal from the user (Paragraph 0086, lines 1-3, "The audio circuit 109, a speaker 113, and a microphone 114 may provide an audio interface between the user and the mobile phone 100.");
analyze the bone conduction signal and the audio signal (Paragraph 0007, lines 1-7, "It can be learned that when collecting the voice information of the voicing user, the wearable device collects two pieces of voice information (that is, the first voice component and the second voice component) by using two voice sensors. In this way, the terminal may separately perform voiceprint recognition on the two pieces of voice information."; Paragraph 0021, lines 6-8, "For example, the first voice sensor is an air conduction microphone, and the second voice sensor is a bone conduction microphone.");
and in response to a score, based on analysis of the bone conduction signal and the audio signal, being greater than or equal to a preselected threshold, perform an action based on natural language processing of the audio signal of the user, the action including one or more of authentication, grant of physical or virtual access, and enable purchase of goods or services (Paragraph 0007, lines 7-14, "When voiceprint recognition results of the two pieces of voice information both match that of the authorized user, it may be determined that a current voicing user is the authorized user. It is clearly that compared with a voiceprint recognition process of one piece of voice information, the dual voiceprint recognition process of the two pieces of voice information can significantly improve accuracy and security during user identity authentication."; Paragraph 0038, lines 1-12, "In a possible design method, the recognition unit is specifically configured to: calculate a first degree of matching between the first voice component and the first voiceprint model of the authorized user; if the first matching degree is greater than a first threshold, determine that the first voice component matches the first voiceprint model of the authorized user; calculate a second degree of matching between the second voice component and the second voiceprint model of the authorized user; and if the second matching degree is greater than a second threshold, determine that the second voice component matches the second voiceprint model of the authorized user."; Paragraph 0049, lines 1-4, "In a possible design method, the sending unit is further configured to send an authentication success message or an unlocking instruction to the terminal if the voicing user is an authorized user."; Paragraph 0141, lines 1-12, "Through the authentication process in step S406, if the mobile phone determines that the voicing user entering the voice information in step S402 is the authorized user, the mobile phone may generate the operation instruction corresponding to the voice information. For example, when the voice information is “Xiao E, pay by using WeChat”, the operation instruction corresponding to the voice information is displaying a payment screen of a WeChat APP. In this way, after generating the operation instruction for displaying the payment screen on the WeChat APP, the mobile phone may automatically enable the WeChat APP, and display the payment screen on the WeChat APP."; Generating an operation instruction corresponding to voice information reads on performing an action based on natural language processing of the audio signal of the user, sending an authentication success message reads on the action being authentication, and displaying a payment screen of a WeChat APP reads on the action being enabling the purchase of goods or services.).
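For illustration only, the dual-threshold matching logic that Zhang describes in paragraphs 0007 and 0038 can be sketched as follows; the function name and threshold values are hypothetical and are not taken from Zhang's disclosure:

```python
def authenticate(first_match: float, second_match: float,
                 first_threshold: float = 0.8,
                 second_threshold: float = 0.8) -> bool:
    """Dual voiceprint check per Zhang, paragraph 0038: the voicing
    user is treated as the authorized user only if the first (air
    conduction) matching degree exceeds the first threshold AND the
    second (bone conduction) matching degree exceeds the second
    threshold."""
    return first_match > first_threshold and second_match > second_threshold

print(authenticate(0.91, 0.87))  # both components match: True
print(authenticate(0.91, 0.42))  # only one component matches: False
```

As Zhang notes in paragraph 0007, requiring both matching degrees to clear their thresholds is what improves accuracy and security over single-signal voiceprint recognition.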
Zhang does not specifically disclose: a bone and audio authentication circuitry configured to: determine, using a consistency evaluation algorithm, a consistency score based on a comparison of the bone conduction signal and the audio signal, in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval.
Zhong teaches:
a bone and audio authentication circuitry configured to: determine, using a consistency evaluation algorithm, a consistency score based on a comparison of the bone conduction signal and the audio signal (Column 11, line 26 – Column 12, line 9, "The comparison data 412 may comprise a single bit binary value or flag in which a “1” indicates the two signals are correlated sufficiently to be deemed indicative of the same source, while a “0” indicates the two signals are not indicative of the same source. In other implementations, the comparison data 412 may include the flag, information indicative of the degree of correlation, and so forth. At 414, voice activity data 122 is determined. This determination is based on one or more of the comparison data 412, the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, the AC energy data 312, and so forth. For example, if the comparison data 412 indicates that the two signals are highly correlated (that is above a threshold and indicative of the same source), and the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, and the AC energy data 312 are all indicative of speech being present within signals, voice activity data 122 may be generated that indicates the user 102 wearing the HMWD 106 is speaking."; Comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal reads on determining a consistency score based on a comparison of the bone conduction signal and the audio signal.),
in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval (Column 2, line 57 - Column 3, line 4, "The BC signal data and the AC signal data are processed to determine comparison data that is indicative of the extent of similarity between the two. For example, a cross-correlation algorithm may be used to generate comparison data that is indicative of the correlation between the BC signal data and the AC signal data. If the comparison data indicates a similarity that exceeds a threshold value, voice activity data is generated that indicates the user wearing the BC microphone is speaking. The voice activity data may be used to trigger other activities by the device or a system in communication with the device. For example, after determining that the user is speaking, the AC signal data may be processed by a speech recognition module, used for a voice over internet protocol (VOIP) call, and so forth." ; Column 10, line 46 - Column 11, line 7, "In some implementations, the signal comparison 406 may utilize a time window to account for delays associated with the operation or relative position of one or more of the BC microphone 110 or AC microphone 112. The center of the time window may be determined based on a time difference between the propagation of signals with respect to the BC microphone 110 and the AC microphone 112. For example, the AC microphone travel time may be determined by the propagation time of the sound waves from the mouth of the user 102 to the AC microphone 112. The BC microphone travel time may be determined by the propagation time of the vibrations from a vocal tract of the user 102 (such as larynx, throat, mouth, sinuses, etc.) to the location of the BC microphone 110. 
The width of the time window may be determined by the variation of the time difference among a population of users 102. Portions of the signal data that have timestamps outside of a specified time window may be disregarded from the determination of similarity. For example, the time window may be used to determine which samples in the frames from the BC signal data 116 and the AC signal data 118 are to be assessed using the cross-correlation function. In one implementation, the duration of the time window may be determined based at least in part on the physical distance between the BC microphone 110 and the AC microphone 112 and based on the speed of sound in the ambient atmosphere. The time window may be fixed, while in other implementations, the time window may vary. For example, the time window may vary based at least in part on the noise data 212."; Generating voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where the voice activity data is used to trigger other activities by the device, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold, and utilizing a time window to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal, where signal data outside of a specified time window is disregarded from the determination of similarity, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval.).
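A simplified sketch of the windowed cross-correlation Zhong describes (Column 2, line 57 - Column 3, line 4; Column 10, line 46 - Column 11, line 7) is shown below; the sample data and the lag window size are invented for illustration, and Zhong's implementation is not limited to this form:

```python
def cross_correlation(bc, ac, max_lag):
    """Return the highest normalized cross-correlation between the
    bone conduction (bc) and air conduction (ac) frames, considering
    only lags inside the +/- max_lag window; samples outside the
    window are disregarded, per Zhong's time-window approach."""
    def norm_corr(x, y):
        n = min(len(x), len(y))
        x, y = x[:n], y[:n]
        mx, my = sum(x) / n, sum(y) / n
        num = sum((a - mx) * (b - my) for a, b in zip(x, y))
        den = (sum((a - mx) ** 2 for a in x)
               * sum((b - my) ** 2 for b in y)) ** 0.5
        return num / den if den else 0.0

    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        best = max(best, norm_corr(bc[lag:], ac) if lag >= 0
                   else norm_corr(bc, ac[-lag:]))
    return best

# The two frames are offset by one sample; they correlate strongly
# once the lag inside the permitted window is searched.
bc = [0, 1, 2, 1, 0, -1, -2, -1, 0, 1]
ac = [1, 2, 1, 0, -1, -2, -1, 0, 1, 0]
print(cross_correlation(bc, ac, max_lag=2) > 0.9)  # True
```

The `max_lag` window plays the role of Zhong's propagation-delay time window: lags outside it are never evaluated, so signal data outside the specified window does not contribute to the similarity determination.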
Zhong is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang to incorporate the teachings of Zhong to determine comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal, and to generate voice activity data indicating that a user wearing a bone conduction (BC) microphone is speaking when the comparison data indicates a similarity between the BC signal and the AC signal that exceeds a threshold value, where a time window is utilized to account for delays between the BC signal and the AC signal, signal data outside of the specified time window is disregarded from the determination of similarity, and the voice activity data is used to trigger other activities by the device. Doing so would allow the user of a wearable computing device, such as a head-mounted wearable device (HMWD), to provide verbal input in environments with ambient noise, where the ambient noise is recognized as being distinct from the voice of the wearer and ignored (Zhong; Column 3, lines 5-16).
Zhang in view of Zhong does not specifically disclose authenticating a user in response to determining that a bone conduction signal and an audio signal are consistent.
Lesso teaches:
authenticating a user in response to determining a bone conduction signal and an audio signal are consistent (Column 9, line 65 - Column 10, line 8, "Additionally, or alternatively, the enable module 306 may cross-correlate the bone-conducted audio signal with the air-conducted audio signal. Upon a determination that the bone-conducted audio signal comprises a voice, the enable module 306 may cross-correlate the bone-conducted audio signal (and particularly that portion of the bone-conducted audio signal comprising the voice) with the air-conducted audio signal (particularly that portion of the air-conducted audio signal which is concurrent with the portion of the bone-conducted audio signal comprising the voice), to determine a level of correlation between the two signals."; Column 12, lines 27-35, "If the two audio signals do not correlate, it may be assumed that the voice microphone has detected significant levels of noise (e.g. the presence of other speakers). In this case, it may be inappropriate for the stored voice template to be updated based on the air-conducted voice signal, and thus the method proceeds to step 406 and ends. If the audio signals do correlate, the method proceeds to step 412, in which the biometric system determines whether the user is authenticated as the authorised user or not.”).
Lesso is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Zhang in view of Zhong to incorporate the teachings of Lesso to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal and, if the signals are correlated, determine whether the user is authenticated as the authorized user, as taught by Lesso, where the method of Zhong is used to determine the level of correlation between the bone-conducted audio signal and the air-conducted audio signal, and the method of Zhang is used to determine whether the user is authenticated as the authorized user. Doing so would allow the bone-conducted audio signal to be used to determine when the air-conducted audio signal comprises the voice of the user of the system (Lesso; Column 8, line 59 - Column 9, line 6).
Regarding claim 27, Zhang in view of Zhong and Lesso discloses the smart device as claimed in claim 26.
Zhang further discloses:
wherein a wearable device transmits the bone conduction signal to the smart device (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition.").
Regarding claim 28, Zhang in view of Zhong and Lesso discloses the smart device as claimed in claim 27.
Zhang further discloses:
wherein the bone and audio authentication circuitry is configured to, prior to the analysis, prompt the user to submit an enrollment template or response (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”."; The registration voice reads on an enrollment response.).
Regarding claim 29, Zhang in view of Zhong and Lesso discloses the smart device as claimed in claim 28.
Zhang further discloses:
wherein the enrollment template or response includes one or more specific phrases to be spoken by the user (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”."; The registration voice reads on an enrollment response, and “Hello Xiao E” reads on a specific phrase to be spoken by the user.).
Regarding claim 30, Zhang in view of Zhong and Lesso discloses the smart device as claimed in claim 29.
Zhang further discloses:
wherein the smart device is configured to prompt the user to submit the enrollment template or response for the wearable device (Paragraph 0121, lines 1-9, "The second phase is a process in which when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice. For example, when an authorized user 1 uses, for the first time, a voice assistant APP installed on the mobile phone, the voice assistant APP may prompt the user to wear a Bluetooth headset and say a registration voice “Hello Xiao E”."; The Bluetooth headset reads on the wearable device.).
Regarding claim 31, Zhang discloses an authentication device for two-way authentication of a user (Paragraph 0068, lines 1-4, "As shown in FIG. 1, a voice control method provided in an embodiment of this application may be applied to a voice control system including a wearable device 11 and a terminal 12."; Paragraph 0074, lines 1-2, "As shown in FIG. 3, the terminal 12 in the voice control system may be specifically a mobile phone 100."), the authentication device comprising:
a communications interface configured to receive: a first signal from a user; and a second signal from the user, the second signal being a different type of signal than the first signal (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition.");
analyze the first signal and the second signal (Paragraph 0007, lines 1-7, "It can be learned that when collecting the voice information of the voicing user, the wearable device collects two pieces of voice information (that is, the first voice component and the second voice component) by using two voice sensors. In this way, the terminal may separately perform voiceprint recognition on the two pieces of voice information."; Paragraph 0021, lines 6-8, "For example, the first voice sensor is an air conduction microphone, and the second voice sensor is a bone conduction microphone.");
and perform an action based on natural language processing of the audio signal of the user, the action including one or more of authentication, grant of physical or virtual access, and enable purchase of goods or services (Paragraph 0007, lines 7-14, "When voiceprint recognition results of the two pieces of voice information both match that of the authorized user, it may be determined that a current voicing user is the authorized user. It is clearly that compared with a voiceprint recognition process of one piece of voice information, the dual voiceprint recognition process of the two pieces of voice information can significantly improve accuracy and security during user identity authentication."; Paragraph 0049, lines 1-4, "In a possible design method, the sending unit is further configured to send an authentication success message or an unlocking instruction to the terminal if the voicing user is an authorized user."; Paragraph 0141, lines 1-12, "Through the authentication process in step S406, if the mobile phone determines that the voicing user entering the voice information in step S402 is the authorized user, the mobile phone may generate the operation instruction corresponding to the voice information. For example, when the voice information is “Xiao E, pay by using WeChat”, the operation instruction corresponding to the voice information is displaying a payment screen of a WeChat APP. 
In this way, after generating the operation instruction for displaying the payment screen on the WeChat APP, the mobile phone may automatically enable the WeChat APP, and display the payment screen on the WeChat APP."; Generating an operation instruction corresponding to voice information reads on performing an action based on natural language processing of the audio signal of the user, sending an authentication success message reads on the action being authentication, and displaying a payment screen of a WeChat APP reads on the action being enabling the purchase of goods or services.).
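Zhang's paragraph 0141 example of generating an operation instruction from recognized voice information can be sketched as a simple command mapping; the mapping table and function name below are illustrative assumptions, not part of Zhang's disclosure:

```python
def operation_instruction(voice_text):
    """Map recognized voice information to an operation instruction,
    mirroring Zhang's paragraph 0141 example in which "Xiao E, pay by
    using WeChat" results in displaying a payment screen. The mapping
    table here is hypothetical."""
    commands = {
        "pay by using WeChat": "display WeChat payment screen",
        "unlock": "send unlocking instruction",
    }
    for phrase, action in commands.items():
        if phrase in voice_text:
            return action
    return "no action"

print(operation_instruction("Xiao E, pay by using WeChat"))
# display WeChat payment screen
```

In Zhang, this step runs only after the dual voiceprint authentication of step S406 succeeds, which is why generating the instruction reads on performing an action in response to verification.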
Zhang does not specifically disclose: an authentication circuitry configured to: determine, using a consistency evaluation algorithm, a consistency score based on a comparison of the first signal and the second signal, in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the first signal and the second signal being less than a preselected time interval.
Zhong teaches:
an authentication circuitry configured to: determine, using a consistency evaluation algorithm, a consistency score based on a comparison of the first signal and the second signal (Column 11, line 26 – Column 12, line 9, "The comparison data 412 may comprise a single bit binary value or flag in which a “1” indicates the two signals are correlated sufficiently to be deemed indicative of the same source, while a “0” indicates the two signals are not indicative of the same source. In other implementations, the comparison data 412 may include the flag, information indicative of the degree of correlation, and so forth. At 414, voice activity data 122 is determined. This determination is based on one or more of the comparison data 412, the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, the AC energy data 312, and so forth. For example, if the comparison data 412 indicates that the two signals are highly correlated (that is above a threshold and indicative of the same source), and the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, and the AC energy data 312 are all indicative of speech being present within signals, voice activity data 122 may be generated that indicates the user 102 wearing the HMWD 106 is speaking."; Comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal reads on determining a consistency score based on a comparison of the first signal and the second signal.),
in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the first signal and the second signal being less than a preselected time interval (Column 2, line 57 - Column 3, line 4, "The BC signal data and the AC signal data are processed to determine comparison data that is indicative of the extent of similarity between the two. For example, a cross-correlation algorithm may be used to generate comparison data that is indicative of the correlation between the BC signal data and the AC signal data. If the comparison data indicates a similarity that exceeds a threshold value, voice activity data is generated that indicates the user wearing the BC microphone is speaking. The voice activity data may be used to trigger other activities by the device or a system in communication with the device. For example, after determining that the user is speaking, the AC signal data may be processed by a speech recognition module, used for a voice over internet protocol (VOIP) call, and so forth." ; Column 10, line 46 - Column 11, line 7, "In some implementations, the signal comparison 406 may utilize a time window to account for delays associated with the operation or relative position of one or more of the BC microphone 110 or AC microphone 112. The center of the time window may be determined based on a time difference between the propagation of signals with respect to the BC microphone 110 and the AC microphone 112. For example, the AC microphone travel time may be determined by the propagation time of the sound waves from the mouth of the user 102 to the AC microphone 112. The BC microphone travel time may be determined by the propagation time of the vibrations from a vocal tract of the user 102 (such as larynx, throat, mouth, sinuses, etc.) to the location of the BC microphone 110. 
The width of the time window may be determined by the variation of the time difference among a population of users 102. Portions of the signal data that have timestamps outside of a specified time window may be disregarded from the determination of similarity. For example, the time window may be used to determine which samples in the frames from the BC signal data 116 and the AC signal data 118 are to be assessed using the cross-correlation function. In one implementation, the duration of the time window may be determined based at least in part on the physical distance between the BC microphone 110 and the AC microphone 112 and based on the speed of sound in the ambient atmosphere. The time window may be fixed, while in other implementations, the time window may vary. For example, the time window may vary based at least in part on the noise data 212."; Generating voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where the voice activity data is used to trigger other activities by the device, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold, and utilizing a time window to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal, where signal data outside of a specified time window is disregarded from the determination of similarity, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval.).
Zhong is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang to incorporate the teachings of Zhong to determine comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal, and to generate voice activity data indicating that a user wearing a bone conduction (BC) microphone is speaking when the comparison data indicates a similarity between the BC signal and the AC signal that exceeds a threshold value, where a time window is utilized to account for delays between the BC signal and the AC signal, signal data outside of the specified time window is disregarded from the determination of similarity, and the voice activity data is used to trigger other activities by the device. Doing so would allow the user of a wearable computing device, such as a head-mounted wearable device (HMWD), to provide verbal input in environments with ambient noise, where the ambient noise is recognized as being distinct from the voice of the wearer and ignored (Zhong; Column 3, lines 5-16).
Zhang in view of Zhong does not specifically disclose authenticating a user in response to determining that a bone conduction signal and an audio signal are consistent.
Lesso teaches:
authenticating a user in response to determining a bone conduction signal and an audio signal are consistent (Column 9, line 65 - Column 10, line 8, "Additionally, or alternatively, the enable module 306 may cross-correlate the bone-conducted audio signal with the air-conducted audio signal. Upon a determination that the bone-conducted audio signal comprises a voice, the enable module 306 may cross-correlate the bone-conducted audio signal (and particularly that portion of the bone-conducted audio signal comprising the voice) with the air-conducted audio signal (particularly that portion of the air-conducted audio signal which is concurrent with the portion of the bone-conducted audio signal comprising the voice), to determine a level of correlation between the two signals."; Column 12, lines 27-35, "If the two audio signals do not correlate, it may be assumed that the voice microphone has detected significant levels of noise (e.g. the presence of other speakers). In this case, it may be inappropriate for the stored voice template to be updated based on the air-conducted voice signal, and thus the method proceeds to step 406 and ends. If the audio signals do correlate, the method proceeds to step 412, in which the biometric system determines whether the user is authenticated as the authorised user or not.”).
Lesso is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Zhang in view of Zhong to incorporate the teachings of Lesso to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal and, if the signals are correlated, determine whether the user is authenticated as the authorized user, as taught by Lesso, where the method of Zhong is used to determine the level of correlation between the bone-conducted audio signal and the air-conducted audio signal, and the method of Zhang is used to determine whether the user is authenticated as the authorized user. Doing so would allow the bone-conducted audio signal to be used to determine when the air-conducted audio signal comprises the voice of the user of the system (Lesso; Column 8, line 59 - Column 9, line 6).
Regarding claim 32, Zhang in view of Zhong and Lesso discloses the authentication device as claimed in claim 31.
Zhang further discloses:
wherein the first signal includes one or more of an audio signal, a bone conduction signal, or signals associated with a badge scan, identification scan, retinal scan, fingerprint scan, facial scan, or gesture scan (Paragraph 0070, lines 1-14, "As shown in FIG. 2, the wearable device 11 may specifically include a first voice sensor 201 disposed outside the wearable device 11 and a second voice sensor 202 disposed inside the wearable device 11. An inside of the wearable device 11 refers to a side that is directly in contact with a user when the user uses the wearable device 11, and an outside of the wearable device 11 refers to a side that is not directly in contact with the user. For example, the first voice sensor 201 may be an air conduction microphone, and the second voice sensor 202 may be a sensor capable of collecting a vibration signal generated when the user makes a sound, such as a bone conduction microphone, an optical vibration sensor, an acceleration sensor, or an air conduction microphone.").
Regarding claim 33, Zhang in view of Zhong and Lesso discloses the authentication device as claimed in claim 31.
Zhang further discloses:
wherein the second signal includes one or more of an audio signal, a bone conduction signal, or signals associated with a badge scan, identification scan, retinal scan, fingerprint scan, facial scan, or gesture scan (Paragraph 0070, lines 1-14, "As shown in FIG. 2, the wearable device 11 may specifically include a first voice sensor 201 disposed outside the wearable device 11 and a second voice sensor 202 disposed inside the wearable device 11. An inside of the wearable device 11 refers to a side that is directly in contact with a user when the user uses the wearable device 11, and an outside of the wearable device 11 refers to a side that is not directly in contact with the user. For example, the first voice sensor 201 may be an air conduction microphone, and the second voice sensor 202 may be a sensor capable of collecting a vibration signal generated when the user makes a sound, such as a bone conduction microphone, an optical vibration sensor, an acceleration sensor, or an air conduction microphone.").
Regarding claim 34, Zhang in view of Zhong and Lesso discloses the authentication device as claimed in claim 33.
Zhang further discloses:
wherein analysis of the first signal and the second signal includes (a) a determination of whether the first signal and the second signal are from the user, and (b) a determination, if the first signal and the second signal are from the user, of the score based on a verification that each of the first signal and the second signal are authentic (Paragraph 0007, lines 7-14, "When voiceprint recognition results of the two pieces of voice information both match that of the authorized user, it may be determined that a current voicing user is the authorized user. It is clearly that compared with a voiceprint recognition process of one piece of voice information, the dual voiceprint recognition process of the two pieces of voice information can significantly improve accuracy and security during user identity authentication.").
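Zhang does not specify the algorithm used to compute each voiceprint matching degree; for illustration only, one common choice (entirely an assumption, not Zhang's disclosure) is cosine similarity between a stored voiceprint model vector and a voice-component feature vector:

```python
def matching_degree(model, component):
    """Cosine similarity between a stored voiceprint model vector and
    a voice component feature vector. Zhang refers only to 'a specific
    algorithm,' so this particular measure is an illustrative
    assumption."""
    dot = sum(a * b for a, b in zip(model, component))
    norm_m = sum(a * a for a in model) ** 0.5
    norm_c = sum(b * b for b in component) ** 0.5
    return dot / (norm_m * norm_c) if norm_m and norm_c else 0.0

# Hypothetical 4-dimensional feature vectors: a higher matching degree
# indicates more similarity between the voice information and the
# enrolled voiceprint of the authorized user.
enrolled = [0.2, 0.8, 0.5, 0.1]
sample = [0.25, 0.75, 0.55, 0.05]
print(matching_degree(enrolled, sample) > 0.9)  # True
```

Under Zhang's scheme, this degree would be computed once per voice component (air conduction and bone conduction) and each result compared against its own threshold.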
Regarding claim 35, Zhang discloses a method for two-way authentication of a user (Paragraph 0068, lines 1-4, "As shown in FIG. 1, a voice control method provided in an embodiment of this application may be applied to a voice control system including a wearable device 11 and a terminal 12."; Paragraph 0074, lines 1-2, "As shown in FIG. 3, the terminal 12 in the voice control system may be specifically a mobile phone 100."), the method comprising:
in response to reception of an audio signal from a user, determining, via a smart device, whether a corresponding bone conduction signal is received from one of one or more separate wearable devices (Paragraph 0010, lines 13-17, "In other words, after collecting the first voice component and the second voice component in the voice information of the voicing user, the wearable device may send the two voice components to the terminal for voiceprint recognition.");
verifying, via the smart device, the audio signal; verifying, via the smart device, the corresponding bone conduction signal (Paragraph 0123, lines 9-19, "For example, the mobile phone may calculate, by using a specific algorithm, a first matching degree (that is, a first voiceprint recognition result) between the first voiceprint model and the first voice component, and a second matching degree (that is, the second voiceprint recognition result) between the second voiceprint model and the second voice component. Usually, a higher matching degree indicates more similarity between the audio feature of the voice information and the audio feature of the authorized user 1, and a higher possibility that the user entering the voice information is the authorized user 1."; Calculating a matching degree between a voiceprint model and a voice feature of the user collected with an air conduction microphone reads on verifying the audio signal, and calculating a matching degree between a voiceprint model and a voice feature of the user collected with a bone conduction microphone reads on verifying the corresponding bone conduction signal.);
in response to verification of the audio signal and the corresponding bone conduction signal, performing an action based on natural language processing of the audio signal of the user, the action including one or more of authentication, grant of physical or virtual access, and enable purchase of goods or services (Paragraph 0007, lines 7-14, "When voiceprint recognition results of the two pieces of voice information both match that of the authorized user, it may be determined that a current voicing user is the authorized user. It is clearly that compared with a voiceprint recognition process of one piece of voice information, the dual voiceprint recognition process of the two pieces of voice information can significantly improve accuracy and security during user identity authentication."; Paragraph 0038, lines 1-12, "In a possible design method, the recognition unit is specifically configured to: calculate a first degree of matching between the first voice component and the first voiceprint model of the authorized user; if the first matching degree is greater than a first threshold, determine that the first voice component matches the first voiceprint model of the authorized user; calculate a second degree of matching between the second voice component and the second voiceprint model of the authorized user; and if the second matching degree is greater than a second threshold, determine that the second voice component matches the second voiceprint model of the authorized user."; Paragraph 0049, lines 1-4, "In a possible design method, the sending unit is further configured to send an authentication success message or an unlocking instruction to the terminal if the voicing user is an authorized user."; Paragraph 0141, lines 1-12, "Through the authentication process in step S406, if the mobile phone determines that the voicing user entering the voice information in step S402 is the authorized user, the mobile phone may generate the operation instruction corresponding 
to the voice information. For example, when the voice information is “Xiao E, pay by using WeChat”, the operation instruction corresponding to the voice information is displaying a payment screen of a WeChat APP. In this way, after generating the operation instruction for displaying the payment screen on the WeChat APP, the mobile phone may automatically enable the WeChat APP, and display the payment screen on the WeChat APP."; Generating an operation instruction corresponding to voice information reads on performing an action based on natural language processing of the audio signal of the user, sending an authentication success message reads on the action being authentication, and displaying a payment screen of a WeChat APP reads on the action being enabling the purchase of goods or services.).
Zhang does not specifically disclose: in response to a determination that the corresponding bone conduction signal is received: determining, via the smart device, a consistency score based on a comparison between the audio signal and the corresponding conduction signal; in response to a determination that the consistency score exceeds a preselected threshold and in response to a difference in reception time of the conduction signal and the audio signal being less than a preselected time interval.
Zhong teaches:
in response to a determination that the corresponding bone conduction signal is received: determining, via the smart device, a consistency score based on a comparison between the audio signal and the corresponding conduction signal (Column 11, line 26 – Column 12, line 9, "The comparison data 412 may comprise a single bit binary value or flag in which a “1” indicates the two signals are correlated sufficiently to be deemed indicative of the same source, while a “0” indicates the two signals are not indicative of the same source. In other implementations, the comparison data 412 may include the flag, information indicative of the degree of correlation, and so forth. At 414, voice activity data 122 is determined. This determination is based on one or more of the comparison data 412, the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, the AC energy data 312, and so forth. For example, if the comparison data 412 indicates that the two signals are highly correlated (that is above a threshold and indicative of the same source), and the BC ZCR data 206, the BC energy data 214, the AC ZCR data 306, and the AC energy data 312 are all indicative of speech being present within signals, voice activity data 122 may be generated that indicates the user 102 wearing the HMWD 106 is speaking."; Column 5, lines 53-57, “In some implementations, the HMWD 106 may communicate with another mobile device. For example, the HMWD 106 may use a personal area network (PAN) such as Bluetooth® to communicate with a smartphone.”; Column 21, lines 5-9, “In another implementation, the VAD module 120 may be implemented at the server 128. 
For example, the server 128 may receive the BC signal data 116 and the AC signal data 118, and may generate the voice activity data 122 separately from the HMWD 106.”; Comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal reads on determining a consistency score based on a comparison between the audio signal and the corresponding conduction signal.),
in response to a determination that the consistency score exceeds a preselected threshold and in response to a difference in reception time of the conduction signal and the audio signal being less than a preselected time interval (Column 2, line 57 - Column 3, line 4, "The BC signal data and the AC signal data are processed to determine comparison data that is indicative of the extent of similarity between the two. For example, a cross-correlation algorithm may be used to generate comparison data that is indicative of the correlation between the BC signal data and the AC signal data. If the comparison data indicates a similarity that exceeds a threshold value, voice activity data is generated that indicates the user wearing the BC microphone is speaking. The voice activity data may be used to trigger other activities by the device or a system in communication with the device. For example, after determining that the user is speaking, the AC signal data may be processed by a speech recognition module, used for a voice over internet protocol (VOIP) call, and so forth." ; Column 10, line 46 - Column 11, line 7, "In some implementations, the signal comparison 406 may utilize a time window to account for delays associated with the operation or relative position of one or more of the BC microphone 110 or AC microphone 112. The center of the time window may be determined based on a time difference between the propagation of signals with respect to the BC microphone 110 and the AC microphone 112. For example, the AC microphone travel time may be determined by the propagation time of the sound waves from the mouth of the user 102 to the AC microphone 112. The BC microphone travel time may be determined by the propagation time of the vibrations from a vocal tract of the user 102 (such as larynx, throat, mouth, sinuses, etc.) to the location of the BC microphone 110. 
The width of the time window may be determined by the variation of the time difference among a population of users 102. Portions of the signal data that have timestamps outside of a specified time window may be disregarded from the determination of similarity. For example, the time window may be used to determine which samples in the frames from the BC signal data 116 and the AC signal data 118 are to be assessed using the cross-correlation function. In one implementation, the duration of the time window may be determined based at least in part on the physical distance between the BC microphone 110 and the AC microphone 112 and based on the speed of sound in the ambient atmosphere. The time window may be fixed, while in other implementations, the time window may vary. For example, the time window may vary based at least in part on the noise data 212."; Generating voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where the voice activity data is used to trigger other activities by the device, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold, and utilizing a time window to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal, where signal data outside of a specified time window is disregarded from the determination of similarity, reads on taking an action in response to the consistency score being greater than or equal to a predetermined threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval.).
Zhong is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang to incorporate the teachings of Zhong to determine comparison data including information indicative of the degree of correlation between a bone conduction (BC) signal and an air conduction (AC) signal, and generate voice activity data that indicates a user wearing a bone conduction (BC) microphone is speaking when comparison data indicating a similarity between a bone conduction (BC) signal and an air conduction (AC) signal exceeds a threshold value, where a time window is utilized to account for delays between the bone conduction (BC) signal and the air conduction (AC) signal and signal data outside of a specified time window is disregarded from the determination of similarity, and where the voice activity data is used to trigger other activities by the device. Doing so would allow the user of a wearable computing device, such as a head-mounted wearable device (HMWD), to provide verbal input in environments with ambient noise, where ambient noise is recognized as being distinct from the voice of the wearer and ignored (Zhong; Column 3, lines 5-16).
Zhang in view of Zhong does not specifically disclose authenticating a user in response to determining a bone conduction signal and an audio signal are consistent.
Lesso teaches:
authenticating a user in response to determining a bone conduction signal and an audio signal are consistent (Column 9, line 65 - Column 10, line 8, "Additionally, or alternatively, the enable module 306 may cross-correlate the bone-conducted audio signal with the air-conducted audio signal. Upon a determination that the bone-conducted audio signal comprises a voice, the enable module 306 may cross-correlate the bone-conducted audio signal (and particularly that portion of the bone-conducted audio signal comprising the voice) with the air-conducted audio signal (particularly that portion of the air-conducted audio signal which is concurrent with the portion of the bone-conducted audio signal comprising the voice), to determine a level of correlation between the two signals."; Column 12, lines 27-35, "If the two audio signals do not correlate, it may be assumed that the voice microphone has detected significant levels of noise (e.g. the presence of other speakers). In this case, it may be inappropriate for the stored voice template to be updated based on the air-conducted voice signal, and thus the method proceeds to step 406 and ends. If the audio signals do correlate, the method proceeds to step 412, in which the biometric system determines whether the user is authenticated as the authorised user or not.”).
Lesso is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Zhang in view of Zhong to incorporate the teachings of Lesso to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal, and if the signals are correlated, determine whether the user is authenticated as the authorized user, as taught by Lesso, where the method of Zhong is used to determine a level of correlation between a bone-conducted audio signal and an air-conducted audio signal, and the method of Zhang is used to determine whether the user is authenticated as the authorized user. Doing so would allow for using a bone-conducted audio signal to determine when an air-conducted audio signal comprises a voice of the user of the system (Lesso; Column 8, line 59 - Column 9, line 6).
Regarding claim 36, Zhang in view of Zhong and Lesso discloses the method as claimed in claim 35.
Zhang further discloses:
further comprising, upon authentication of the user, allowing the user to (a) access or utilize the smart device and (b) access data on the smart device (Paragraph 0144, lines 1-9, " It can be learned that according to the voice control method provided in this embodiment of this application, the mobile phone may be unlocked and execute the operation instruction corresponding to the voice information while identifying the user identity by using a voiceprint. In other words, the user only needs to enter the voice information once to complete a series of operations such as user identity authentication, mobile phone unlocking, and enabling a function of the mobile phone."; Unlocking a mobile phone and enabling a function of the mobile phone reads on allowing the user to access the smart device and access data on the smart device.).
Regarding claim 37, Zhang in view of Zhong and Lesso discloses the method as claimed in claim 35.
Zhang further discloses:
further comprising, upon authentication of the user, allowing the user to (a) access or utilize the smart device and (b) access data on the smart device associated with the user (Paragraph 0141, lines 1-12, "Through the authentication process in step S406, if the mobile phone determines that the voicing user entering the voice information in step S402 is the authorized user, the mobile phone may generate the operation instruction corresponding to the voice information. For example, when the voice information is “Xiao E, pay by using WeChat”, the operation instruction corresponding to the voice information is displaying a payment screen of a WeChat APP. In this way, after generating the operation instruction for displaying the payment screen on the WeChat APP, the mobile phone may automatically enable the WeChat APP, and display the payment screen on the WeChat APP."; Determining that the voicing user entering the voice information is the authorized user and generating the operation instruction corresponding to the voice information reads on allowing the user to access the smart device and access data on the smart device associated with the user.).
Regarding claim 38, Zhang in view of Zhong and Lesso discloses the method as claimed in claim 35.
Zhang further discloses:
further comprising, upon authentication of the user, allowing the user to (a) access or utilize the smart device and (b) access personal data on the smart device associated with the user (Paragraph 0141, lines 1-12, "Through the authentication process in step S406, if the mobile phone determines that the voicing user entering the voice information in step S402 is the authorized user, the mobile phone may generate the operation instruction corresponding to the voice information. For example, when the voice information is “Xiao E, pay by using WeChat”, the operation instruction corresponding to the voice information is displaying a payment screen of a WeChat APP. In this way, after generating the operation instruction for displaying the payment screen on the WeChat APP, the mobile phone may automatically enable the WeChat APP, and display the payment screen on the WeChat APP."; Determining that the voicing user entering the voice information is the authorized user and generating the operation instruction corresponding to the voice information reads on allowing the user to access the smart device, and displaying a payment screen of a WeChat APP of the user reads on allowing the user to access personal data on the smart device associated with the user.).
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Zhong, Lesso, and Blouet, and further in view of Devine et al. (US Patent No. 11,170,085), hereinafter Devine.
Regarding claim 9, Zhang in view of Zhong, Lesso, and Blouet discloses the system as claimed in claim 1, but does not specifically disclose: wherein the smart device is configured to prompt an unauthenticated user to give at least one additional attempt to obtain authentication from the smart device.
Devine teaches:
wherein the smart device is configured to prompt an unauthenticated user to give at least one additional attempt to obtain authentication from the smart device (Column 47, lines 5-12, "For example, the secure element optionally requires detection that the electronic device is being worn, detection of a button press, detection of entry of a passcode, detection of a touch, detection of one or more option selections (e.g., received while interacting with an application), detection of a fingerprint signature, detection of a voice or voice command, and or detection of a gesture or movement (e.g., rotation or acceleration)."; Column 55, lines 45-56, "In some examples, the indication (e.g., 714A-I) of the error condition includes (814) an indication of a user action (e.g., visible indication (e.g., graphic or text)) that can be performed to correct the error condition (e.g., for a subsequent authentication attempt). In some examples, the indication of the user action indicates how to correct the error condition for a subsequent authentication attempt. Displaying an indication of a user action that can be performed to correct the error condition provides feedback to the user as to what course of action to take so that the user can be biometrically authenticated in a subsequent authentication attempt."; Column 60, lines 3-12, "In some examples, in accordance with a determination that the biometric information captured using the one or more biometric sensors does not correspond to or does not match the authorization credentials, the electronic device (e.g., 100, 300, 500, 700) maintains the locked state of the electronic device. 
Displaying an animation of the lock icon shaking provides the user with feedback about the current state of the device (e.g., that biometric authentication has failed) and prompts the user to take further action."; Prompting the user to take further action when the biometric information captured using the one or more biometric sensors does not correspond to or does not match the authorization credentials, where the course of action is a subsequent authentication attempt, reads on prompting an unauthenticated user to give at least one additional attempt to obtain authentication.).
Devine is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang in view of Zhong, Lesso, and Blouet to incorporate the teachings of Devine to prompt a user to take further action when the biometric information captured using one or more biometric sensors does not correspond to or does not match the authorization credentials, where the course of action is a subsequent authentication attempt. Doing so would allow for enhancing the operability of a device and making the user-device interface more efficient (Devine; Column 60, lines 12-19).
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Zhong, Lesso, and Blouet, and further in view of Bae (US Patent No. 11,437,046).
Regarding claim 10, Zhang in view of Zhong, Lesso, and Blouet discloses the system as claimed in claim 1.
Zhang further discloses:
wherein an unauthenticated user comprises an attacker, and wherein the smart device is configured to, in response to unauthenticated bone and audio signals, lock access to personal user data (Paragraph 0008, lines 1-6, "In addition, if the second voice component is collected by a bone conduction microphone of the wearable device, it indicates that the user wears the wearable device when making a sound. This avoids a case in which an unauthorized user maliciously controls a terminal of the authorized user by using a recording of the authorized user."; An unauthorized user maliciously controlling a terminal of the authorized user by using a recording of the authorized user reads on an unauthenticated user comprises an attacker.);
unauthenticated bone and audio signals (Paragraph 0134, lines 6-18, "During user identity authentication, if the first matching degree and the second matching degree meet a preset authentication policy, for example, the authentication policy is that when the first matching degree is greater than a first threshold and the second matching degree is greater than a second threshold (the second threshold is the same as or different from the first threshold), the mobile phone determines that the user sending the first voice component and the second voice component is the authorized user, or otherwise, the mobile phone may determine that the user sending the first voice component and the second voice component is an unauthorized user.").
Zhang in view of Zhong, Lesso, and Blouet does not specifically disclose: in response to unauthenticated signals, lock access to personal user data.
Bae teaches:
in response to unauthenticated signals, lock access to personal user data (Column 10, lines 13-19, "The above-described example only describes the process of terminating the voice recognition service when the second authentication is failed. However, when the second authentication is failed, the processor 120 may change the electronic apparatus 100 to a lock state, or terminate the providing of the tasks and/or applications corresponding to other services other than the voice recognition service.").
Bae is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang in view of Zhong, Lesso, and Blouet to incorporate the teachings of Bae to change an electronic apparatus to a lock state when authentication fails. Doing so would allow for improving security for speaker authentication (Bae; Column 2, lines 8-14).
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Zhong, Lesso, Blouet, and Bae, and further in view of Goto et al. (US Patent No. 12,077,129), hereinafter Goto.
Regarding claim 11, Zhang in view of Zhong, Lesso, Blouet, and Bae discloses the system as claimed in claim 10.
Bae further teaches:
wherein the smart device is configured to, in response to the unauthenticated bone and audio signals: lock functionality of the smart device (Column 10, lines 13-19, "The above-described example only describes the process of terminating the voice recognition service when the second authentication is failed. However, when the second authentication is failed, the processor 120 may change the electronic apparatus 100 to a lock state, or terminate the providing of the tasks and/or applications corresponding to other services other than the voice recognition service.").
Bae is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang in view of Zhong, Lesso, Blouet, and Bae to further incorporate the teachings of Bae to change an electronic apparatus to a lock state when authentication fails. Doing so would allow for improving security for speaker authentication (Bae; Column 2, lines 8-14).
Zhang in view of Zhong, Lesso, Blouet, and Bae does not specifically disclose: notify the user via a pre-selected alternative form of communication.
Goto teaches:
notify the user via a pre-selected alternative form of communication (Column 13, line 62 - Column 14, line 6, "The notifier 26 executes the predetermined notification. The notification is executed for the purpose of informing the user that the authentication using the first authentication function (biometric authentication function) has failed. Examples of a notification message include “The biometric authentication function is OFF.” and “The biometric authentication function is ON but the authentication has failed.” For example, the notification can be executed by displaying a text or icon on a screen of a display device such as an information display or by outputting voice via a loudspeaker. When the notification is executed, the process proceeds to Step S610.").
Goto is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang in view of Zhong, Lesso, Blouet, and Bae to incorporate the teachings of Goto to execute a notification for the purpose of informing the user that the authentication using a biometric authentication function has failed. Doing so would allow for informing the user whether the authentication using a biometric authentication function is successful (Goto; Column 14, lines 20-28).
Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Zhong and Lesso, and further in view of Fawaz et al. (US Patent No. 10,566,007), hereinafter Fawaz.
Regarding claim 23, Zhang in view of Zhong and Lesso discloses the system as claimed in claim 17, but does not specifically disclose: wherein the smart device is configured to, prior to determination of the consistency score, align the bone conduction signal and the audio signal, and wherein the consistency score is based on the aligned bone conduction signal and audio signal.
Fawaz teaches:
wherein the smart device is configured to, prior to determination of the consistency score, align the bone conduction signal and the audio signal, and wherein the consistency score is based on the aligned bone conduction signal and audio signal (Column 5, line 51 - Column 6, line 1, "The voice authentication (VA) module 112 may be configured to process, filter, analyze, and/or compare recorded speech signals from the microphone 110 and recorded vibration data from the accelerometer device 140. For example, the VA module 112 may synchronize the recorded speech signals with the recorded vibrations. The VA module 112 may perform the synchronization by aligning time shifts of the speech signal and vibration data such that there is maximum cross correlation between both signals. This operation may be vital for any comparisons, as the speech signals and the vibration data may not be received and/or recorded at the same time. Further, the VA module 112 may be configured to run various algorithms to determine if the speech signals originated from the user corresponding to the vibration data. In one embodiment, the VA module 112 may compare the speech signals to the vibration data to determine if they are correlated."; Synchronizing the recorded speech signals with the recorded vibrations by aligning time shifts of the speech signal and vibration data reads on aligning the bone conduction signal and the audio signal, and comparing the speech signals to the vibration data to determine if they are correlated reads on determining a consistency score based on the aligned bone conduction signal and audio signal.).
Fawaz is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang in view of Zhong and Lesso to incorporate the teachings of Fawaz to synchronize the recorded speech signals with the recorded vibrations by aligning time shifts of the speech signal and vibration data, and compare the speech signals to the vibration data to determine if they are correlated. Doing so would allow for comparing vibrations recorded from an accelerometer with speech signals recorded from a microphone, and identifying and matching individual speech segments from each signal to prevent an adversary from injecting commands (Fawaz; Column 3, lines 13-35).
Claim 25 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Zhong and Lesso, and further in view of Jorasch et al. (US Patent No. 11,269,426), hereinafter Jorasch.
Regarding claim 25, Zhang in view of Zhong and Lesso discloses the system as claimed in claim 24, but does not specifically disclose: wherein the identification signal includes a signal generated by scanning a badge or identification card.
Jorasch teaches:
wherein the identification signal includes a signal generated by scanning a badge or identification card (Column 143, lines 36-41, "For example, employee identification readers can be RFID readers to scan an employee badge, a camera to identify the employee via face recognition, a scanner to identify an employee by a carried user device, a microphone for voice recognition, or other employee identification technology."; Column 385, lines 20-39, "The use of computer technologies to perform financial and other high value transactions continues to increase. Because of the nature of these transactions, authentication of the parties involved in the transaction may be important. Authentication traditionally takes one of three forms, referred to as factors—something the user is (e.g., fingerprint, retinal scan, etc.), something the user has (e.g., smart card, ID card, computing device, etc), or something the user knows (e.g., PIN, password, etc.). Certain transactions, e.g., financial transactions, require multiple authentication factors (referred to as multi-factor authentication). For example, a user may have to present a smartcard and input a PIN to gain access to a system or specific service. Biometric authentication is sometimes considered a particularly strong form of authentication due to the complexities of spoofing a valid biometric signature for a user. Biometric authentication uses physical or behavioral characteristics of a user for authentication purposes. Examples of biometrics include fingerprints, eye retinas and irises, and voice patterns.").
Jorasch is considered to be analogous to the claimed invention because it is in the same field of voice authentication. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang in view of Zhong and Lesso to incorporate the teachings of Jorasch to scan an employee badge to identify an employee. Doing so would allow for improving control of presentation content (Jorasch; Column 1, lines 38-48).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James Boggs whose telephone number is (571)272-2968. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JAMES BOGGS/Examiner, Art Unit 2657