Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 08/28/2025 has been entered.
Response to Amendment
In the response filed on 08/28/2025, Applicant amended claims 1, 12, and 19. No claims were added.
Response to Arguments
With respect to the 35 U.S.C. § 103 rejections:
Applicant's arguments filed on 08/28/2025 have been received and entered. Applicant's arguments with respect to the newly amended independent claims ("Claim Rejections - 35 USC § 103," Remarks, pages 7-9) have been considered but are moot because the claim amendments introduce new claim limitations that have not previously been considered. Therefore, the new ground of rejection under 35 U.S.C. 103 relies on a new combination of references, as presented below.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3, 6-11, and 19-22 are rejected under 35 U.S.C. 103 as being unpatentable over Fawaz (US 20180068671 A1) in view of Zhang (US 20210256979 A1).
Regarding claim 1, Fawaz teaches a method for two-way authentication of a user, the method comprising:
receiving a bone conduction signal from the user via one or more wearable devices (Fawaz, Fig 2, The method 200 may begin after the system records vibration data via an accelerometer device (Block 202) (i.e., bone conduction signal); vibration data corresponds to recorded vibrations related to speech from a user associated with the accelerometer device. In an embodiment, a user may be associated with the accelerometer device if the user is wearing the device as a necklace, headset, or other suitable equipment (i.e., one or more wearable devices), [0036]) [Examiner interprets the system recording body vibrations corresponding to a user's speech, by using an accelerometer worn as a necklace or headset device, as receiving a bone conduction signal from the user via one or more wearable devices];
receiving an audio signal from the user via a microphone separate from the one or more wearable devices, the audio signal to correspond with the bone conduction signal (Fawaz, The system collects various data associated with a voice command, such as a speech signal and vibrations of the head, neck and/or chest of a user corresponding to the speech signals. The data is collected from at least one accelerometer and one microphone disposed in one or more devices. For example, the accelerometer may be disposed in a necklace worn by a user and communicatively coupled to a smartphone, including a microphone, that implements a voice assistant. As the user speaks, the wearable device utilizes an accelerometer to record and then transmit the vibration data to the smartphone, which is simultaneously recording the speech signals, [0017] The client device 130 may include a microphone 110 to record speech signals. However, in other embodiments, the microphone 110 may be separate from the client device 130, [0029]) [Examiner interprets the system having an accelerometer in a necklace and a microphone in a smartphone (i.e., a microphone separate from the wearable device), the smartphone recording speech signals (i.e., the audio signal) while the wearable device simultaneously records vibrations (i.e., the audio signal corresponding to the bone conduction signal), as teaching the limitation above];
determining a consistency score for the audio signal in relation to the corresponding bone conduction signal, the consistency score comprising a number indicative of a probability that the audio signal and bone conduction signal originate from the user (Fawaz, the VA module 112 may be configured to run various algorithms to determine if the speech signals originated from the user corresponding to the vibration data…the VA module 112 may compare the speech signals to the vibration data to determine if they are correlated. Depending on the criteria of the system, the correlation may need to be within a certain percentage to authenticate the voice command. In other words, if the VA module 112 determines that the speech signals have a high enough correlation to the vibration data, then the speech signals can be attributed to the user and, thus, a voice command corresponding to the speech signals is considered authenticated, [0034] The matching algorithm may include receiving speech signals and vibration data along with the corresponding sampling frequencies. The matching algorithm may conclude by producing a decision value indicating whether there is a match between the speech signals and the vibration data (i.e. authenticate or deny), [0042] The final post-processing step may include measuring the signal similarity between the accelerometer and microphone signals by using the normalized cross correlation. The system analyzes whether the normalized cross correlation has a significant peak and that the maximum value is outside the range [−0.4, 0.4]. This indicates that the two signals are included within each other as shown in the plot 540 of FIG. 5. In this case, the system will conclude that the resulting microphone signal matches the recorded vibration data, [0052]) [Examiner interprets the VA module calculating a normalized cross correlation between the speech and vibration signals and producing a decision value that is compared against a threshold, which under BRI is a numeric similarity/decision value (i.e., a consistency score) for the audio signal in relation to the corresponding bone conduction signal, and deciding based on this value whether the command originated from the same user (i.e., indicative of a probability), as teaching the limitation above];
in response to the consistency score being greater than or equal to a consistency threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval (Fawaz, the VA module 112 may synchronize the recorded speech signals with the recorded vibrations. The VA module 112 may perform the synchronization by aligning time shifts of the speech signal and vibration data such that there is maximum cross correlation between both signals. This operation may be vital for any comparisons, as the speech signals and the vibration data may not be received and/or recorded at the same time, [0033] the system records vibration data via an accelerometer device (Block 202)….At approximately the same time, the system may also record speech signals via a microphone (Block 204). If speech signals and vibration data are recorded at disparate points of time, then clearly the speech signals are disassociated from the vibration data and authentication automatically fails. Thus, the method 200 only continues with speech signals and vibration data which are recorded at substantially the same time, [0036-0037] Comparison of the speech signals to the vibration data may also include performing a correlation analysis to determine the correlation between the two. Acceptable correlation levels between the vibration data and speech signals may depend on a number of factors such as the quality of the recordings, the necessary level of security, usability/wearability, etc. If the speech signals and vibration data are sufficiently correlated, then the speech signal was originated from the user corresponding to the vibration data. Thus, the voice command corresponding to the speech signal is authenticated, [0040] The final post-processing step may include measuring the signal similarity between the accelerometer and microphone signals by using the normalized cross correlation. The system analyzes whether the normalized cross correlation has a significant peak and that the maximum value is outside the range [−0.4, 0.4]. This indicates that the two signals are included within each other as shown in the plot 540 of FIG. 5. In this case, the system will conclude that the resulting microphone signal matches the recorded vibration data, [0052]) [Examiner interprets the system using a minimum correlation requirement, such as a correlation peak outside [−0.4, 0.4] (i.e., the consistency score compared against a threshold), and requiring the signals to be recorded at substantially the same time, failing authentication when they are recorded at disparate times, with alignment via time shifts, as teaching the limitation above];
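For illustration of the cited mechanism only, a minimal sketch of the normalized cross-correlation decision described in Fawaz [0034, 0052] follows (the function and variable names and the handling of the 0.4 cut-off are the Examiner's assumptions for exposition; Fawaz discloses no source code):

```python
import numpy as np

def consistency_score(mic: np.ndarray, vib: np.ndarray) -> float:
    """Peak magnitude of the normalized cross correlation between the
    microphone (speech) signal and the accelerometer (vibration) signal."""
    n = min(len(mic), len(vib))
    mic = (mic[:n] - mic[:n].mean()) / (mic[:n].std() + 1e-12)
    vib = (vib[:n] - vib[:n].mean()) / (vib[:n].std() + 1e-12)
    xcorr = np.correlate(mic, vib, mode="full") / n
    return float(np.max(np.abs(xcorr)))

def authenticate(mic: np.ndarray, vib: np.ndarray, cutoff: float = 0.4) -> bool:
    # Per Fawaz [0052], the match is accepted when the correlation peak
    # falls outside [-0.4, 0.4], i.e., its magnitude exceeds the cut-off.
    return consistency_score(mic, vib) > cutoff
```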
Although Fawaz teaches that the VA module uses defined algorithms, such as filters, per-segment tests, and correlation, to clean and evaluate the microphone signal and decide whether it matches vibrations from the user [0033, 0047, 0052]; explicit logic applied to the vibration signal, such as rejecting segments that do not correspond to human speech or do not match speech segments (length filters, glottal pulses, pitch constraints), to verify whether the remaining vibration data comes from the speaking user [0024, 0046, 0049]; and that, when both channels match (i.e., both signals are satisfied), the command is authenticated for access [0017, 0041],
Fawaz does not explicitly teach:
verifying, using an audio conduction model (AC model), that the audio signal is associated with the user; verifying, using a bone conduction model (BC model), that the bone conduction signal is associated with the user; in response to verification of the audio signal and the bone conduction signal, enabling, for the user, access to a smart device
However, Zhang teaches:
verifying, using an audio conduction model (AC model), that the audio signal is associated with the user (Zhang, When the first voice component matches the first voiceprint model, it indicates that the voice information collected by the Bluetooth headset at this time is entered by the authorized user. A higher matching degree indicates more similarity between the voice component and the corresponding voiceprint model, and a higher possibility that a voicing user is the authorized user. [0092]);
verifying, using a bone conduction model (BC model), that the bone conduction signal is associated with the user (Zhang, When the second voice component matches the second voiceprint model, it indicates that the voice information collected by the Bluetooth headset at this time is entered by the authorized user. A higher matching degree indicates more similarity between the voice component and the corresponding voiceprint model, and a higher possibility that a voicing user is the authorized user. [0092]); and
in response to verification of the audio signal and the bone conduction signal, enabling, for the user, access to a smart device (Zhang, Bluetooth headset determine that the first voice component matches the first voiceprint model, and the second voice component matches the second voiceprint model and sends to the mobile phone (i.e., a smart device), an operation instruction corresponding to the voice information, for example, an unlock instruction, a power-off instruction, or an instruction for calling a specific contact. In this way, the mobile phone performs a corresponding operation based on the operation instruction, so that the user can control the mobile phone by using a voice (i.e., access to a smart device), [0092]) [Examiner interprets that checking whether the voice components match their respective voiceprint models in order to authorize the user to perform operations on the mobile phone, such as access, as in response to verification of the audio signal and the bone conduction signal, enabling, for the user, access to a smart device].
Therefore, it would have been obvious to a person having ordinary skill in the art (PHOSITA) before the effective filing date to modify the teaching of Fawaz to include the concept of verifying, using an audio conduction model (AC model), that the audio signal is associated with the user; verifying, using a bone conduction model (BC model), that the bone conduction signal is associated with the user; and, in response to verification of the audio signal and the bone conduction signal, enabling, for the user, access to a smart device, as taught by Zhang, for the purpose of determining that the first voice component matches the first voiceprint model and the second voice component matches the second voiceprint model and sending an operation instruction to the mobile phone (i.e., a smart device), so that the mobile phone performs a corresponding operation based on the operation instruction and the user can control the mobile phone by using a voice (i.e., access to a smart device) [Zhang: 0092].
Regarding claim 3, Fawaz and Zhang further teach the method of claim 1, further comprising, prior to determining the consistency score, pre-processing the bone conduction signal (Fawaz, Beginning with the pre-processing step, in an example embodiment, the module 112 may re-sample both vibration data and speech signals to the same sampling rate while applying a low-pass filter to prevent aliasing. In some embodiments, the low-pass filter may be 4 kHz, 8 kHz, or any other suitable frequency that preserves most of the acoustic features of the speech signals while reducing the processing load… The algorithm may refine the raw data 320 and 330 by normalizing the magnitude of the data to have a maximum magnitude of unity, which necessitates removal of the spikes in the data. The matching algorithm may continue pre-processing by identifying the energy envelope of the vibration data and respectively applying it to the speech signal. Pre-processing of the matching algorithm concludes when the VA module 112 overlays the vibration data envelope on the speech signal so that it removes all portions of the speech signal that did not result from vibrations of the user, [0044-0047]) [Examiner interprets the system performing preprocessing operations on the vibration data, such as low-pass filtering, normalization, and envelope detection, before the final cross correlation used for authentication, as teaching the limitation above].
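For illustration only, the pre-processing steps cited from Fawaz [0044-0047] (resampling with anti-aliasing, unit-peak normalization, and envelope extraction) can be sketched as follows (a non-limiting sketch; the target rate and the Hilbert-based envelope are the Examiner's assumptions, as Fawaz does not specify an implementation):

```python
import numpy as np
from scipy.signal import resample_poly, hilbert

def preprocess(signal: np.ndarray, fs: int, target_fs: int = 8000) -> np.ndarray:
    """Resample to a common rate (resample_poly low-pass filters internally
    to prevent aliasing), then normalize to a maximum magnitude of unity."""
    resampled = resample_poly(signal, target_fs, fs)
    return resampled / (np.max(np.abs(resampled)) + 1e-12)

def energy_envelope(signal: np.ndarray) -> np.ndarray:
    """Energy envelope of the vibration data, here via the analytic signal."""
    return np.abs(hilbert(signal))
```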
Regarding claim 6, Fawaz and Zhang further teach the method of claim 1, further comprising, prior to determination of the consistency score and verification, prompting the user to submit an initial or enrollment (1) bone conduction signal and (2) audio signal (Zhang, Fig 4, S403: If being in the wearing state, the Bluetooth headset collects, by using the first voice sensor, a first voice component in the voice information entered by the user, and collects a second voice component in the voice information by using the second voice sensor (i.e., an initial or enrollment (1) bone conduction signal and (2) audio signal), [0108]. For example, the first voice sensor is an air conduction microphone, and the second voice sensor is a bone conduction microphone. In a process of using the Bluetooth headset, the user may enter voice information "Xiao F, pay by using WeChat". In this case, because the air conduction microphone is exposed in the air, the Bluetooth headset may receive, by using the air conduction microphone, a vibration signal (in other words, the first voice component in the voice information) generated by air vibration after the user makes a sound. In addition, because the bone conduction microphone can be in contact with an ear bone of the user through the skin, the Bluetooth headset may receive, by using the bone conduction microphone, a vibration signal (in other words, the second voice component in the voice information) generated by vibration of the ear bone and the skin after the user makes a sound, [0109] S405: The mobile phone separately performs voiceprint recognition on the first voice component and the second voice component, to obtain a first voiceprint recognition result corresponding to the first voice component and a second voiceprint recognition result corresponding to the second voice component, [0118]) [Examiner interprets collecting the first and second voice components (i.e., an initial or enrollment (1) bone conduction signal and (2) audio signal) by prompting the user to enter voice, before voiceprint recognition is performed (i.e., prior to determination of the consistency score and verification), as teaching the limitation above].
Regarding claim 7, Fawaz and Zhang further teach the method of claim 6, further comprising:
training the AC model with a plurality of responses prior to submission of the initial or enrollment audio signal, and training the AC model with the submitted initial or enrollment audio signal (Zhang, Voiceprint models of one or more authorized users are pre-stored on the mobile phone. Each authorized user has two voiceprint models; one is a first voiceprint model established based on a voice feature of the user collected when the air conduction microphone (in other words, the first voice sensor) works, [0119] and Fig 4, S403: If being in the wearing state, the Bluetooth headset collects, by using the first voice sensor, a first voice component in the voice information entered by the user (i.e., an initial or enrollment audio signal), [0108]. S405: The mobile phone separately performs voiceprint recognition on the first voice component to obtain a first voiceprint recognition result corresponding to the first voice component, [0118] The second phase is a process in which, when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model belonging to the user is established by entering the registration voice (i.e., an initial or enrollment audio signal), [0121]) [Examiner interprets the user's pre-stored voiceprint models on the phone as training the AC model with a plurality of responses prior to submission of the initial or enrollment audio signal, and the voiceprint recognition performed after collection of the first voice component entered by the user (i.e., the initial or enrollment audio signal) as training the AC model with the submitted initial or enrollment audio signal].
Regarding claim 8, Fawaz and Zhang further teach the method of claim 6, wherein the BC model comprises a convolutional neural network (CNN) (Zhang, The first phase is establishing the first voiceprint model and the second voiceprint model. The first phase is a background model training phase. In the first phase, a background model of voiceprint recognition is established by using a machine learning algorithm such as a GMM (gaussian mixed model, Gaussian mixture model), an SVM (support vector machines, support vector machine), or a deep neural network framework. The mobile phone or the Bluetooth headset may establish, based on the background model and a registration voice entered by a user, a first voiceprint model and a second voiceprint model belonging to the user. The deep neural network framework includes but is not limited to a DNN (deep neural network, deep neural network) algorithm, an RNN (recurrent neural network, recurrent neural network) algorithm, an LSTM (long short-term memory, long short-term memory) algorithm, and the like, [0120]) [In light of the specification, Examiner interprets, under the broadest reasonable interpretation of the BC model as an image classifier based on a CNN, the second voiceprint model comprising a deep neural network as the BC model comprising a convolutional neural network (CNN); see instant application at spec. [0056, 0057]].
Regarding claim 9, Fawaz and Zhang further teach the method of claim 8, further comprising training the CNN with a stored/pre-collected bone conduction dataset (Zhang, The first phase is a background model training phase. In the first phase, a developer may collect voices of related texts (for example, "Hello, Xiao E") generated when a large quantity of speakers wearing the Bluetooth headset make a sound. Further, after performing filtering and noise reduction on the voices of the related texts, the mobile phone may extract an audio feature (for example, a time-frequency noise spectrum graph, or a gammatone-like spectrogram) in a background sound, and a background model of voiceprint recognition is established by using a machine learning algorithm such as a GMM (gaussian mixed model, Gaussian mixture model), an SVM (support vector machines, support vector machine), or a deep neural network framework. The mobile phone or the Bluetooth headset may establish, based on the background model and a registration voice entered by a user, a first voiceprint model and a second voiceprint model belonging to the user. The deep neural network framework includes but is not limited to a DNN (deep neural network, deep neural network) algorithm, an RNN (recurrent neural network, recurrent neural network) algorithm, an LSTM (long short-term memory, long short-term memory) algorithm, and the like, [0120]) [Examiner interprets collecting voices of related texts generated when a large quantity of speakers wearing the Bluetooth headset make a sound, for training a background model (i.e., the second voiceprint model), as training the CNN with a stored/pre-collected bone conduction dataset].
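For illustration of the interpretation above only, a toy CNN voiceprint classifier over spectrogram inputs might look as follows (purely hypothetical; Zhang names GMM, SVM, DNN, RNN, and LSTM frameworks [0120] rather than this specific architecture, and the layer sizes are the Examiner's assumptions):

```python
import torch
import torch.nn as nn

class BoneConductionCNN(nn.Module):
    """Toy speaker classifier over log-spectrogram 'images' of bone
    conduction segments (input shape: batch x 1 x freq_bins x time_frames)."""
    def __init__(self, n_speakers: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_speakers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Returns per-speaker logits for training against enrollment labels.
        return self.classifier(self.features(x))
```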
Regarding claim 10, Fawaz and Zhang further teach the method of claim 1, further comprising, if one or more of the audio signal or the bone conduction signal are not verified, preventing the user from accessing the smart device (Fawaz, If the system has authenticated the voice command, the system may then transmit the voice command to a voice assistant (Block 212). If the speech signals and vibration data are not sufficiently correlated, then the voice command is not authenticated. The system may restart the method each time vibration data is recorded, thus enabling continuous voice authentication and added security for voice assistant devices, [0041]) [Examiner interprets the failure to authenticate the user as preventing the user from accessing the smart device].
Regarding claim 11, Fawaz and Zhang further teach the method of claim 1, further comprising, prior to determining the consistency score, delaying an earliest received of the bone conduction signal and the audio signal to align the bone conduction signal and the audio signal (Fawaz, the device may include one or more components to process recorded vibrations prior to transmitting the recordings. For example, the device may insert metadata such as timestamps to help the voice authentication module (discussed below) sync the recorded vibrations with recorded speech signals for comparison. Further, the device may edit the recorded vibrations to apply a filter, remove noise, adjust frequency, and/or otherwise prepare the recorded vibrations in a format compatible with other components of the system, [0023] the VA module 112 may synchronize the recorded speech signals with the recorded vibrations. The VA module 112 may perform the synchronization by aligning time shifts of the speech signal and vibration data such that there is maximum cross correlation between both signals. This operation may be vital for any comparisons, as the speech signals and the vibration data may not be received and/or recorded at the same time, [0033] Graphs 340 and 350 of FIG. 3 illustrate normalized speech signals and vibration data, respectively. Further, the graphs 340 and 350 have been aligned by finding the time shift that results in the maximum cross correlation of both graphs. In some embodiments, the graphs 340 and 350 may also be truncated so that the two graphs are on the same signal duration, [0045]) [Examiner interprets the system synchronizing via time shifts (i.e., delaying the earlier-received signal) to align the signals for maximum cross correlation as teaching the limitation above].
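For illustration only, the cited time-shift alignment (Fawaz [0033, 0045]) reduces to shifting by the lag that maximizes the cross correlation and truncating both signals to a common duration; a minimal sketch (all names are the Examiner's assumptions):

```python
import numpy as np

def align(mic: np.ndarray, vib: np.ndarray):
    """Truncate both signals to the overlap that maximizes their cross
    correlation, effectively delaying the earlier-received signal."""
    n = min(len(mic), len(vib))
    mic, vib = mic[:n], vib[:n]
    lag = int(np.argmax(np.correlate(mic, vib, mode="full"))) - (n - 1)
    if lag > 0:          # mic trails vib by `lag` samples
        return mic[lag:], vib[:n - lag]
    if lag < 0:          # vib trails mic by `-lag` samples
        return mic[:n + lag], vib[-lag:]
    return mic, vib
```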
Regarding claim 19, claim 19 recites subject matter commensurate with claim 1; therefore, it is rejected for the same reasons, except for the following additional elements:
Fawaz further teaches:
a non-transitory machine-readable storage medium storing processor-executable instructions that, when executed by at least one processor, cause the at least one processor (Fawaz, The method 200 may include one or more functions and/or routines in the form of non-transitory computer-executable instructions that are stored in a tangible computer-readable storage medium and/or executed using a processor of a computing device (e.g., the client device 130 and/or the server 120), [0035]) to:
Regarding claim 20, Fawaz and Zhang further teach the non-transitory machine-readable storage medium of claim 19, wherein the smart device includes the microphone (Fawaz, The client device 130 (i.e., the smart device) may include a microphone 110 to record speech signals. However, in other embodiments, the microphone 110 may be separate from the client device 130. In these embodiments, the microphone 110 may be configured to transmit recorded speech signals to the client device 130, [0029]).
Regarding claim 21, Fawaz and Zhang further teach the non-transitory machine-readable storage medium of claim 20, wherein the bone conduction signal is received via wireless communication (Fawaz, The device may also be capable of transmitting recorded vibrations through wireless signals, [0021] The system may then transmit the recorded vibration data (Block 206). In one embodiment, the device housing the accelerometer may implement Bluetooth™ technology to transmit the recorded vibration data, [0031]).
Regarding claim 22, Fawaz and Zhang further teach the non-transitory machine-readable storage medium of claim 19, wherein the consistency threshold is based on similarities between bone conduction signals and audio signals that indicate the bone conduction signal and the audio signal are from the user (Fawaz, the VA module 112 may be configured to run various algorithms to determine if the speech signals originated from the user corresponding to the vibration data…the VA module 112 may compare the speech signals to the vibration data to determine if they are correlated. Depending on the criteria of the system, the correlation may need to be within a certain percentage to authenticate the voice command. In other words, if the VA module 112 determines that the speech signals have a high enough correlation to the vibration data, then the speech signals can be attributed to the user and, thus, a voice command corresponding to the speech signals is considered authenticated, [0034] The matching algorithm may include receiving speech signals and vibration data along with the corresponding sampling frequencies. The matching algorithm may conclude by producing a decision value indicating whether there is a match between the speech signals and the vibration data (i.e. authenticate or deny), [0042] The final post-processing step may include measuring the signal similarity between the accelerometer and microphone signals by using the normalized cross correlation. The system analyzes whether the normalized cross correlation has a significant peak and that the maximum value is outside the range [−0.4, 0.4]. This indicates that the two signals are included within each other as shown in the plot 540 of FIG. 5. In this case, the system will conclude that the resulting microphone signal matches the recorded vibration data, [0052]) [Examiner interprets the VA module calculating a normalized cross correlation between the speech and vibration signals and producing a decision value that is compared against a threshold, which under BRI is a numeric similarity/decision value (i.e., a consistency score) for the audio signal in relation to the corresponding bone conduction signal, and deciding based on this value whether the command originated from the same user, as teaching the limitation above].
Claims 12-18 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang (US 20210256979 A1) in view of Fawaz (US 20180068671 A1).
Regarding claim 12, Zhang teaches a method for two-way authentication of a user (Zhang, voice control method to complete a series of operations such as user identity authentication, mobile phone unlocking, and enabling a function of the mobile phone, [0144]), the method comprising:
prompting the user to submit initial or enrollment (a) audio signals and (b) bone conduction signals (Zhang, Fig 4, S403: If being in the wearing state, the Bluetooth headset collects, by using the first voice sensor, a first voice component in the voice information entered by the user, and collects a second voice component in the voice information by using the second voice sensor (i.e., the initial or enrollment (a) audio signals and (b) bone conduction signals), [0108]. For example, the first voice sensor is an air conduction microphone, and the second voice sensor is a bone conduction microphone. In a process of using the Bluetooth headset, the user may enter voice information "Xiao F, pay by using WeChat". In this case, because the air conduction microphone is exposed in the air, the Bluetooth headset may receive, by using the air conduction microphone, a vibration signal (in other words, the first voice component in the voice information) generated by air vibration after the user makes a sound. In addition, because the bone conduction microphone can be in contact with an ear bone of the user through the skin, the Bluetooth headset may receive, by using the bone conduction microphone, a vibration signal (in other words, the second voice component in the voice information) generated by vibration of the ear bone and the skin after the user makes a sound, [0109]) [Examiner interprets collecting the first and second voice components (i.e., the initial or enrollment (a) audio signals and (b) bone conduction signals) by prompting the user to enter voice as prompting the user to submit initial or enrollment (a) audio signals and (b) bone conduction signals];
updating an audio conduction model (AC model) and a bone conduction model (BC model) based on the received initial or enrollment (a) audio signals and (b) bone conduction signals (Zhang, Voiceprint models of one or more authorized users may be pre-stored on the mobile phone. Each authorized user has two voiceprint models; one is a first voiceprint model established based on a voice feature of the user collected when the air conduction microphone (in other words, the first voice sensor) works, and the other is a second voiceprint model established based on a voice feature of the user collected when the bone conduction microphone (in other words, the second voice sensor) works, [0119] S405: The mobile phone separately performs voiceprint recognition on the first voice component and the second voice component, to obtain a first voiceprint recognition result corresponding to the first voice component and a second voiceprint recognition result corresponding to the second voice component, [0118] The second phase is a process in which, when the user uses a voice control function on the mobile phone for the first time, the first voiceprint model and the second voiceprint model belonging to the user are established by entering the registration voice (i.e., the initial or enrollment (a) audio signals and (b) bone conduction signals), [0121]) [Examiner interprets collecting the registration voice and performing voiceprint recognition after collection of the voice components entered by the user (i.e., the initial or enrollment audio and bone conduction signals) as updating an audio conduction model (AC model) and a bone conduction model (BC model) based on the received initial or enrollment (a) audio signals and (b) bone conduction signals];
after reception of the initial or enrollment (a) audio signals and (b) bone conduction signals (Zhang, Fig 4, S403: If being in the wearing state, the Bluetooth headset collects, by using the first voice sensor, a first voice component in the voice information entered by the user, and collects a second voice component in the voice information by using the second voice sensor (i.e., the initial or enrollment (a) audio signals and (b) bone conduction signals), [0108]. For example, the first voice sensor is an air conduction microphone, and the second voice sensor is a bone conduction microphone. In a process of using the Bluetooth headset, the user may enter voice information "Xiao F, pay by using WeChat". In this case, because the air conduction microphone is exposed in the air, the Bluetooth headset may receive, by using the air conduction microphone, a vibration signal (in other words, the first voice component in the voice information) generated by air vibration after the user makes a sound. In addition, because the bone conduction microphone can be in contact with an ear bone of the user through the skin, the Bluetooth headset may receive, by using the bone conduction microphone, a vibration signal (in other words, the second voice component in the voice information) generated by vibration of the ear bone and the skin after the user makes a sound, [0109]);
receiving a bone conduction signal from the user via one or more wearable devices (Zhang, Fig 2, the second voice sensor 202 of wearable device 11 collects the voice information sent by the user after bone propagation (i.e., a bone conduction signal), [0071] Fig 4, S404: The Bluetooth headset sends the second voice component to the mobile phone by using the Bluetooth connection. S405: The mobile phone separately performs voiceprint recognition on the second voice component, to obtain a second voiceprint recognition result corresponding to the second voice component, [0118]) [Examiner interprets the mobile phone receiving the second voice component (i.e., a bone conduction signal) from the user via the wearable device 11, which collects the bone conduction signal for identity authentication, as receiving a bone conduction signal from the user via one or more wearable devices];
receiving an audio signal from the user via a microphone, the audio signal corresponding to the bone conduction signal (Zhang, the first voice sensor 201 is an air conduction microphone, and when the user wearing the wearable device 11 speaks, the wearable device 11 may collect, by using the first voice sensor 201, voice information sent by the user after air propagation, [0071] Fig 4, S404: The Bluetooth headset sends the first voice component (i.e., audio signal) to the mobile phone by using the Bluetooth connection. S405: The mobile phone separately performs voiceprint recognition on the first voice component to obtain a first voiceprint recognition result corresponding to the first voice component, [0118]) [Examiner interprets the mobile phone receiving the air conduction signal (i.e., the audio signal) from the user via the microphone, which collects the air conduction signal for identity authentication, as receiving an audio signal from the user via a microphone, the audio signal corresponding to the bone conduction signal];
determining a consistency score for the audio signal in relation to the corresponding bone conduction signal (Zhang, the Bluetooth headset (i.e., the wearable device 11) separately collects the voice information by using the externally disposed first voice sensor 201 and the internally disposed second voice sensor 202. For example, the voice information collected by the first voice sensor 201 is a first voice component (i.e., the audio signal), and the voice information collected by the second voice sensor 202 is a second voice component (i.e., the bone conduction signal), [0090] the Bluetooth headset performs voiceprint recognition on the first voice component and the second voice component, to obtain a first voiceprint recognition result corresponding to the first voice component and a second voiceprint recognition result corresponding to the second voice component, [0091] and When the first voice component matches the first voiceprint model, and the second voice component matches the second voiceprint model, it indicates that the voice information collected by the Bluetooth headset at this time is entered by the authorized user. For example, the Bluetooth headset calculates, by using a specific algorithm, a first degree of matching between the first voice component and the first voiceprint model and a second degree of matching between the second voice component and the second voiceprint model. A higher matching degree indicates more similarity between the voice component and the corresponding voiceprint model, and a higher possibility that a voicing user is the authorized user. When an average value of the first matching degree and the second matching degree (i.e., the average consistency score of the first component and the second component) is greater than 80 scores (i.e., the consistency threshold), the Bluetooth headset may determine that the first voice component matches the first voiceprint model, and the second voice component matches the second voiceprint model, [0092]) [Under BRI, Examiner interprets that a "consistency score" between two signals need only be a single quantitative measure that reflects how well they agree, and interprets computing a first matching degree between the air conduction (i.e., audio) signal and the first voiceprint model and a second matching degree between the bone conduction signal and the bone voiceprint model, then taking the average of those two degrees (i.e., a single consistency score) and comparing it to a threshold, as determining a consistency score for the audio signal in relation to the corresponding bone conduction signal];
in response to the consistency score being greater than or equal to a consistency threshold (Zhang, when an average value of the first matching degree and the second matching degree (i.e., the average consistency score of first component and second component) is greater than 80 scores (i.e., consistency threshold), the Bluetooth headset may determine that the first voice component matches the first voiceprint model, and the second voice component matches the second voiceprint model, [0092]):
verifying, using an audio conduction model (AC model), that the audio signal is associated with the user (Zhang, When the first voice component matches the first voiceprint model, it indicates that the voice information collected by the Bluetooth headset at this time is entered by the authorized user. A higher matching degree indicates more similarity between the voice component and the corresponding voiceprint model, and a higher possibility that a voicing user is the authorized user. [0092]);
verifying, using a bone conduction model (BC model), that the bone conduction signal is associated with the user (Zhang, When the second voice component matches the second voiceprint model, it indicates that the voice information collected by the Bluetooth headset at this time is entered by the authorized user. A higher matching degree indicates more similarity between the voice component and the corresponding voiceprint model, and a higher possibility that a voicing user is the authorized user. [0092]); and
in response to verification of the audio signal and the bone conduction signal, enabling, for the user, access to a smart device (Zhang, Bluetooth headset determine that the first voice component matches the first voiceprint model, and the second voice component matches the second voiceprint model and sends to the mobile phone (i.e., a smart device), an operation instruction corresponding to the voice information, for example, an unlock instruction, a power-off instruction, or an instruction for calling a specific contact. In this way, the mobile phone performs a corresponding operation based on the operation instruction, so that the user can control the mobile phone by using a voice (i.e., access to a smart device), [0092]) [Examiner interprets that checking whether the voice components match their respective voiceprint models in order to authorize the user to perform operations on the mobile phone, such as access, as in response to verification of the audio signal and the bone conduction signal, enabling, for the user, access to a smart device];
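For illustration only, the cited dual voiceprint decision in Zhang [0092] amounts to averaging the two matching degrees and comparing the average against a score of 80; a minimal sketch (the function and parameter names are the Examiner's assumptions):

```python
def dual_voiceprint_match(first_degree: float, second_degree: float,
                          threshold: float = 80.0) -> bool:
    """Zhang [0092]: the voice information is attributed to the authorized
    user when the average of the air conduction (first) and bone conduction
    (second) matching degrees exceeds the threshold."""
    return (first_degree + second_degree) / 2 > threshold
```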
Although the average value of the first matching degree and the second matching degree (i.e., the average consistency score of the first component and the second component) being greater than a score of 80 (i.e., a consistency threshold) can be considered a consistency score for the audio signal in relation to the corresponding bone conduction signal under BRI [0092],
Zhang does not explicitly teach:
receiving an audio signal from the user via a microphone separate from the one or more wearable devices; determining a consistency score for the audio signal in relation to the corresponding bone conduction signal, the consistency score comprising a number indicative of a probability that the audio signal and bone conduction signal originate from the user; in response to the consistency score being greater than or equal to a consistency threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval;
However, Fawaz teaches:
receiving an audio signal from the user via a microphone separate from the one or more wearable devices (Fawaz, The system collects various data associated with a voice command, such as a speech signal and vibrations of the head, neck and/or chest of a user corresponding to the speech signals. The data is collected from at least one accelerometer and one microphone disposed in one or more devices. For example, the accelerometer may be disposed in a necklace worn by a user and communicatively coupled to a smartphone, including a microphone, that implements a voice assistant. As the user speaks, the wearable device utilizes an accelerometer to record and then transmit the vibration data to the smartphone, which is simultaneously recording the speech signals, [0017] The client device 130 may include a microphone 110 to record speech signals. However, in other embodiments, the microphone 110 may be separate from the client device 130, [0029]) [Examiner interprets the system having an accelerometer in a necklace and a microphone in a smartphone (i.e., a microphone separate from the wearable device), the smartphone recording speech signals (i.e., the audio signal) while the wearable device simultaneously records vibrations (i.e., the audio signal corresponding to the bone conduction signal), as teaching the limitation above];
determining a consistency score for the audio signal in relation to the corresponding bone conduction signal, the consistency score comprising a number indicative of a probability that the audio signal and bone conduction signal originate from the user (Fawaz, the VA module 112 may be configured to run various algorithms to determine if the speech signals originated from the user corresponding to the vibration data…the VA module 112 may compare the speech signals to the vibration data to determine if they are correlated. Depending on the criteria of the system, the correlation may need to be within a certain percentage to authenticate the voice command. In other words, if the VA module 112 determines that the speech signals have a high enough correlation to the vibration data, then the speech signals can be attributed to the user and, thus, a voice command corresponding to the speech signals is considered authenticated, [0034] The matching algorithm may include receiving speech signals and vibration data along with the corresponding sampling frequencies. The matching algorithm may conclude by producing a decision value indicating whether there is a match between the speech signals and the vibration data (i.e. authenticate or deny), [0042] The final post-processing step may include measuring the signal similarity between the accelerometer and microphone signals by using the normalized cross correlation. The system analyzes whether the normalized cross correlation has a significant peak and that the maximum value is outside the range [−0.4, 0.4]. This indicates that the two signals are included within each other as shown in the plot 540 of FIG. 5. In this case, the system will conclude that the resulting microphone signal matches the recorded vibration data, [0052]) [Examiner interprets the VA module calculating a normalized cross correlation between the speech and vibration signals and producing a decision value that is compared against a threshold, which under BRI is a numeric similarity/decision value (i.e., a consistency score) for the audio signal in relation to the corresponding bone conduction signal, and deciding based on this value whether the command originated from the same user (i.e., indicative of a probability), as teaching the limitation above];
in response to the consistency score being greater than or equal to a consistency threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval (Fawaz, the VA module 112 may synchronize the recorded speech signals with the recorded vibrations. The VA module 112 may perform the synchronization by aligning time shifts of the speech signal and vibration data such that there is maximum cross correlation between both signals. This operation may be vital for any comparisons, as the speech signals and the vibration data may not be received and/or recorded at the same time, [0033] the system records vibration data via an accelerometer device (Block 202)….At approximately the same time, the system may also record speech signals via a microphone (Block 204). If speech signals and vibration data are recorded at disparate points of time, then clearly the speech signals are disassociated from the vibration data and authentication automatically fails. Thus, the method 200 only continues with speech signals and vibration data which are recorded at substantially the same time, [0036-0037] Comparison of the speech signals to the vibration data may also include performing a correlation analysis to determine the correlation between the two. Acceptable correlation levels between the vibration data and speech signals may depend on a number of factors such as the quality of the recordings, the necessary level of security, usability/wearability, etc. If the speech signals and vibration data are sufficiently correlated, then the speech signal was originated from the user corresponding to the vibration data. Thus, the voice command corresponding to the speech signal is authenticated, [0040] The final post-processing step may include measuring the signal similarity between the accelerometer and microphone signals by using the normalized cross correlation. The system analyzes whether the normalized cross correlation has a significant peak and that the maximum value is outside the range [−0.4, 0.4]. This indicates that the two signals are included within each other as shown in the plot 540 of FIG. 5. In this case, the system will conclude that the resulting microphone signal matches the recorded vibration data, [0052]) [Examiner interprets the system using a minimum correlation requirement, such as a correlation peak outside [−0.4, 0.4] (i.e., the consistency score compared against a threshold), and requiring the signals to be recorded at substantially the same time, failing authentication when they are recorded at disparate times, with alignment via time shifts, as teaching the limitation above];
Therefore, it would have been obvious to a person having ordinary skill in the art (PHOSITA) before the effective filing date to modify the teaching of Zhang to include the concept of receiving an audio signal from the user via a microphone separate from the one or more wearable devices; determining a consistency score for the audio signal in relation to the corresponding bone conduction signal, the consistency score comprising a number indicative of a probability that the audio signal and bone conduction signal originate from the user; and, in response to the consistency score being greater than or equal to a consistency threshold and in response to a difference in reception time of the bone conduction signal and the audio signal being less than a preselected time interval, as taught by Fawaz, for the purpose of performing synchronization by aligning time shifts of the speech signal and vibration data such that there is maximum cross correlation between both signals [Fawaz: 0033], comparing the speech signals and vibration data to determine whether they are sufficiently correlated, such that the speech signal originated from the user corresponding to the vibration data, for authentication [Fawaz: 0040], and improving the accuracy of the authentication of voice commands by, for example, employing machine learning methods to dynamically learn the cut-off thresholds [Fawaz: 0054].
Regarding claim 13, Zhang and Fawaz further teaches the method of claim 12, wherein the initial or enrollment audio signal is received via the microphone (Zhang, Fig 4, S403: If being in the wearing state, th