Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Response to Arguments
Applicant’s arguments, see pages 3-4, filed October 08, 2025, with respect to the rejections of claims 1-20 under 35 U.S.C. 102(a)(1) as being anticipated by KHOURY et al. (US 20210326421 A1, hereinafter "KHOURY") have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground of rejection is made in view of PARK et al. (US 20200286489 A1, hereinafter "PARK").
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over KHOURY et al. (US 20210326421 A1, hereinafter "KHOURY") in view of PARK et al. (US 20200286489 A1, hereinafter "PARK").
As per claim 1, KHOURY discloses a device for user authentication of an occupant in a vehicle, the device comprising:
a memory configured to store a reference embedding set ([0039] Voice biometrics for speaker recognition and other operations (e.g., authentication) typically rely upon models or feature vectors (sometimes called “embeddings”) generated from a universe of speaker samples and samples of a particular speaker)
including a feature embedding for a registered user's utterance ([0039] The machine-learning architecture outputs certain results according to corresponding inputs and evaluates the results according to a loss function by comparing the expected output against the observed output [0040] After training the machine-learning architecture, the server can further refine and develop the machine-learning architecture to recognize a particular speaker during enrollment operations for the particular speaker. The speech recognition engine can generate an enrollee model embedding (sometimes called a "voiceprint") using embeddings extracted from enrollee audio signals having utterances of the speaker. [0116] The analytics server 102 receives the audio signal from the content server 111 and extracts various types of features from the audio signal. The analytics server 102 performs audio event detection or other voice activity detection to differentiate between background noise, silence, and speakers in the audio signal. For example, the analytics server 102 may pre-process the audio data (e.g., filtering the audio signal to reduce noise, parsing the audio signal into frames or sub-frames, performing various normalizations or scaling operations), execute voice activity detection (VAD) software or VAD machine learning, and/or extract features (e.g., one or more spectro-temporal features) from portions (e.g., frames, segments) or from substantially all of the audio signal. The features extracted from the audio signal may include Mel frequency spectrum coefficients (MFCCs), Mel Filter banks, Linear filter banks, bottleneck features, and the like. [0119] the analytics server 102 may train a machine learning architecture to perform VAD operations parsing a set of speech portions and a set of non-speech portions from the audio signal. When the VAD is applied to the features extracted from the audio signal, the VAD may output binary results (e.g., speech detection, no speech detection) or continuous values (e.g., probabilities of speech occurring) for each frame (or sub-frame) of the audio signal. The speech portions of the audio signal may be called utterances. The audio signal may include utterances of multiple speakers. The audio signal may also include overlapping sounds (e.g., utterances and ambient background noise). The analytics server 102 may determine the beginning and end of an utterance using speaker detection or other conventional speaker segmentation solutions);
a user interface configured to receive input audio including an utterance of an occupant of the vehicle and noise ([0082] the end-user device 114 actively captures user input data, where the end-user actively interacts with the end-user device 114 (e.g., speaking a “wake” word, pressing a button, making a gesture). In some cases, the end-user device 114 passively captures the user input data, where the end-user passively interacts with the end-user device 114 (e.g., speak to another user, the end-user device 114 automatically capturing utterances without user's affirmative action). Various types of inputs represent the ways that users interact with end-user devices 114, such as sound or audio data captured by a microphone of the end-user device 114 or user inputs entered via a user interface presented by the end-user device 114. The captured sound includes the background noise (e.g., ambient noises) and/or utterances of one or more speaker-users. [0139] The analytics server 102 may also determine speaker characteristics based upon information inputted by the speaker, received from the content server, or identified by executing the various machine-learning models. The speaker characteristics may include, for example, the age of the speaker, the gender of the speaker, an emotional state of the speaker, the dialect of the speaker, the accent of the speaker, and the diction of the speaker, among others); and
a processor configured to transform the input audio to an input embedding, and determine whether the occupant is a registered user based on a comparison between the input embedding and the reference embedding set ([0171-0172] Authenticating Users and Parental Controls: The analytics server 102 may determine the identity of a user interacting with the end-user device 114 by comparing the similarity of the extracted embedding to embeddings/voiceprints stored in the analytics database 104. The analytics server 102, upon identifying one or more users, may transmit a user identifier, user characteristics (e.g., age, gender, emotion, dialect, accent, and the like), user-independent characteristics, and/or metadata to the content server 111. In some configurations, the analytics server 102 (or content server 111 using the information transmitted from the analytics server 102) may authenticate the identified users using the transmitted information. For example, a user's age may authenticate the user to watch content over a certain age limit).
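For illustration of the operations mapped above (spectro-temporal feature extraction per KHOURY [0116] and comparison of an extracted embedding against stored voiceprints per KHOURY [0171]-[0172]), a minimal sketch follows. This is an editorial illustration, not code from the reference; the use of librosa, the cosine similarity metric, and the threshold value of 0.7 are all assumptions.

```python
# Minimal sketch of the claim 1 mapping; not code from KHOURY.
# The metric (cosine similarity) and threshold (0.7) are assumed values.
import numpy as np
import librosa

def extract_features(signal, sr=16000, n_mfcc=20):
    """Spectro-temporal features (MFCCs, per KHOURY [0116]); one column per frame."""
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_occupant(input_embedding, reference_embeddings, threshold=0.7):
    """Compare an input embedding to the reference embedding set (KHOURY [0171]).

    reference_embeddings: dict mapping user_id -> stored 1-D embedding.
    Returns the best-matching user_id, or None if no score clears the threshold.
    """
    scores = {uid: cosine_similarity(input_embedding, ref)
              for uid, ref in reference_embeddings.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```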
KHOURY does not explicitly disclose that the feature embedding is for synthesis results between the registered user's utterance and a plurality of environmental noises. PARK, however, in analogous art, discloses a feature embedding for synthesis results between the registered user's utterance and a plurality of environmental noises ([0087] In FIG. 2, additive noise may be generated from a source different from that of a speech signal, and may have no correlation with the speech signal. The additive noise may be added to a speech signal through addition. The additive noise may include, for example, a door closing sound, a horn sound, and ambient noise. In addition, channel noise may be a type of noise detected in a conversion or transformation process. The channel noise may include, for example, a room impulse response. However, the types of noise described in the foregoing description are provided merely as examples of the additive noise and the channel noise, and thus the additive noise and the channel noise may include other various types of noise. [0091] A feature vector generator 260 of the registration apparatus generates a feature vector based on the synthesized signal of the speech signal and the noise signal. The feature vector generator 260 recognizes the synthesized signal and outputs the feature vector corresponding to the recognized synthesized signal. The feature vector may include information distinguishing each recognition element, for example, time-based frequency information having compressed components in the speech signal that are needed for recognition. [0093] In operation 320, the registration apparatus synthesizes the received speech signal and a noise signal. The noise signal may be a signal preset to be similar to noise that may occur in a test process and include, for example, an additive noise signal, a channel noise signal, or a combination thereof. [0099] The type of noise refers to a type of source of noise including, for example, a babbling sound occurring nearby, a door closing sound, and a horn sound. The type of noise may differ from one noise to another noise based on a length of noise, even though the noises are generated from a same source. The timing of noise refers to a start point and/or an end point at which noise is synthesized with a speech signal. The SNR refers to a relative volume difference between a speech signal and noise. [0100] The registration apparatus generates a feature vector 450 by inputting, to a feature vector generator, the synthesized signal of the speech signal 410 and the additive noise signal 420. The feature vector generator transforms a domain of the synthesized signal, and extracts the feature vector 450 from a result of the transforming. The synthesized signal is generated by adding the additive noise signal 420 to the speech signal 410, and thus may include time-domain sound information. The feature vector generator transforms a time-domain synthesized signal into a form of a frequency domain including image information, and generates the feature vector 450 based on the transformation).
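For illustration of PARK's registration-time synthesis ([0087], [0093], [0099]-[0100]), the sketch below mixes a speech signal with additive noise at a chosen SNR and applies channel noise as convolution with a room impulse response. This is a standard construction assumed by the editor; PARK does not publish its mixing code, and the SNR values shown are examples only.

```python
# Illustrative sketch of additive-noise and channel-noise synthesis as
# described by PARK [0087], [0093], [0099]-[0100]; not code from the
# reference. The SNR mixing formula is a standard construction.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then add."""
    noise = np.resize(noise, speech.shape)      # tile/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12       # avoid division by zero
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def apply_channel_noise(speech, rir):
    """Simulate channel noise (e.g., a room impulse response, PARK [0087])."""
    return np.convolve(speech, rir)[: len(speech)]

# Registration could synthesize one copy per noise type/timing/SNR, e.g.:
# synthesized = [mix_at_snr(speech, n, snr) for n in noises for snr in (5, 10, 20)]
```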
PARK further discloses ([0121] In the registration process, a registration apparatus receives the speech signal 810 of a speaker, and synthesizes the speech signal 810 and additive noise 815 or channel noise 820. After performing segmentation in stage 825 on the synthesized signal obtained through the synthesizing, the registration apparatus generates the registered feature vector 835 by inputting a result of the segmentation to a feature vector generator 830, and extracts the representative registered feature vector 840 to construct a registration DB. To store all the registered feature vector 835 in the registration DB, the extracting of the representative registered feature vector 840 may be omitted. [0127] In stage 865, the recognition apparatus compares at least one input feature vector 860 to at least one registered feature vector 835 or representative registered feature vector 840 of a registered user stored in the registration DB constructed in the registration process. The registered user described herein may be a speaker corresponding to a registered feature vector or a representative registered feature vector that is stored in the registration DB. [0128] The recognition apparatus verifies the speaker in stage 870 based on a result of the comparing obtained in stage 865. In stage 870, the recognition apparatus verifies the speaker based on a similarity score between the input feature vector 860 and the registered feature vector 835, or the representative registered feature vector 840. For example, when a representative value of similarity scores is greater than or equal to a preset threshold value, the recognition apparatus determines the verifying of the speaker performed in stage 870 to be successful, or determines that the speaker is successfully verified. Alternatively, only when the number of registered feature vectors 835 or representative registered feature vectors 840 having respective similarity scores with the input feature vector 860 being greater than or equal to the threshold value is greater than or equal to a preset number, the recognition apparatus may determine the verifying of the speaker performed in stage 870 to be successful). Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the device disclosed by KHOURY so that the feature embedding is for synthesis results between the registered user's utterance and a plurality of environmental noises, as disclosed by PARK. A person having ordinary skill in the art would have been motivated to make this modification by the desire to provide a speaker recognition system that verifies or identifies a speaker based on the speaker's voice or speech, that can be applied to various situations and fields of application, for example, meetings, conferences, and identification in a dialog or conversation, and that can further be applied to vehicles, buildings, and bank accounts for access control for purposes of security, as suggested by PARK ([0003-0005]).
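For illustration of the verification step PARK describes at [0127]-[0128] (comparing an input feature vector against registered feature vectors and accepting when enough similarity scores meet a preset threshold), a minimal sketch follows; the metric and both threshold values are editorial assumptions.

```python
# Minimal sketch of PARK's stage 865/870 verification ([0127]-[0128]);
# not code from the reference. score_threshold and min_matches are assumed.
import numpy as np

def verify_speaker(input_vec, registered_vecs, score_threshold=0.7, min_matches=3):
    """Return True when enough registered feature vectors match the input."""
    scores = [
        float(np.dot(input_vec, r) / (np.linalg.norm(input_vec) * np.linalg.norm(r)))
        for r in registered_vecs
    ]
    matches = sum(s >= score_threshold for s in scores)
    return matches >= min_matches
```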
As per claim 2, KHOURY in view of PARK discloses the device of claim 1, wherein the user interface comprises a microphone for capturing the utterance of the occupant (KHOURY [0082]; [0104] The end-user devices 114 may comprise a processor and/or software capable of using communication features of a paired or otherwise networked device. The end-user devices 114 may be configured with a microphone, accelerometer, gyroscope, camera, fingerprint scanner, interaction buttons (such as directional buttons, numeric buttons), joysticks, or any combination, and the like. The end-user devices 114 may comprise hardware (e.g., microphone) and/or software (e.g., codec) for detecting and converting sound (e.g., spoken utterance, ambient noise) into electrical audio signals).
As per claim 3, KHOURY in view of PARK discloses the device of claim 1, wherein the processor is configured to transform the input audio into an input spectrogram and transform the input spectrogram into the input embedding using an embedding model ([0218] The input audio signal includes an utterance of a speaker-user, where the input audio signal is a data file (e.g., WAV file, MP3 file) or a data stream. The server performs various pre-processing operations on the input audio signal, such as parsing the audio signal into segments or frames of speech or performing one or more transformation operations (e.g., Fast-Fourier Transform), among other potential operations. [0232] The server extracts various types of features from the input audio signal, such as spectro-temporal features or metadata. Additionally or alternatively, the server performs various pre-processing operations on the input audio signal, such as parsing the audio signal into segments or frames of speech or performing one or more transformation operations (e.g., Fast-Fourier Transform), among other potential operations. [0059-0060] To use hierarchical clustering, for example, a voice biometrics system may access each of the stored utterances in the system and shuffle the utterances between the clusters. The system may compare each of the utterances of the clusters with each other and cluster the utterances together that have the highest similarity. Additionally or alternatively, the system compares the voiceprints against one another to combine those voiceprints with the highest similarity scores that also match a voiceprint similarity threshold. Because each clustering methodology has its own advantages and disadvantages (e.g., incremental clustering may be faster but less accurate while organizational clustering may be more accurate but require a large amount of computer resources), using a combination of the two methodologies over time may cover the deficiencies of both methods and enable the system to create mature and accurate speaker embedding models. The system may execute incremental clustering operations intermittently with organizational clustering operations to improve the accuracy of the speaker embedding models while avoiding using organizational clustering too often to save processing resources. The combination ensures efficient and accurate clustering that is appropriate for passive and continuous enrollment and authentication).
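For illustration of the claim 3 mapping (transforming input audio into a spectrogram via a Fourier transform, per KHOURY [0218]/[0232], and then into an embedding), a minimal sketch follows; the frame and hop lengths are assumptions, and the embedding model is deliberately stubbed because neither its architecture nor its parameters are given by the reference.

```python
# Minimal sketch: audio -> magnitude spectrogram -> embedding model.
# Frame/hop lengths are assumed; the embedding model is a stub.
import numpy as np

def spectrogram(signal, frame_len=400, hop_len=160):
    """Magnitude spectrogram via a short-time Fourier transform (FFT per frame)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # shape: (n_frames, freq_bins)

def to_embedding(spec, embedding_model):
    """Pass the spectrogram through a trained embedding model (hypothetical stub)."""
    return embedding_model(spec)
```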
As per claim 4, KHOURY in view of PARK discloses the device of claim 1, wherein the processor is configured to calculate a similarity score between the input embedding and the reference embedding set, and determine whether the occupant is the registered user based on the similarity score ([0146] If the analytics server 102 determines that the maximum similarity score satisfies the low similarity threshold, then the analytics server 102 may identify (or authenticate) the speaker. In addition, the analytics server 102 determines that the embedding involved in the similarity score is a weak embedding based on a weak utterance. Weak embeddings lack enough similarity with the corresponding voiceprint for immediately characterizing the weak embedding as part of the particular speaker cluster. The analytics server 102 may store and/or update a set of weak embeddings in the analytics database 104).
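For illustration of the two-threshold logic KHOURY describes at [0146] (with the high similarity threshold of [0303], quoted under claim 6 below), a minimal sketch follows: a maximum similarity score above a low threshold identifies the speaker, while a score below a high threshold marks the embedding as weak. Both threshold values are editorial assumptions.

```python
# Minimal sketch of KHOURY's weak-embedding logic ([0146], [0303]);
# not code from the reference. Threshold values 0.5/0.8 are assumed.
import numpy as np

def score_and_classify(input_emb, voiceprints, low=0.5, high=0.8):
    """Return (identified_user_or_None, is_weak_embedding)."""
    sims = {uid: float(np.dot(input_emb, vp) /
                       (np.linalg.norm(input_emb) * np.linalg.norm(vp)))
            for uid, vp in voiceprints.items()}
    uid = max(sims, key=sims.get)
    best = sims[uid]
    if best < low:
        return None, False      # max score fails the low threshold: no identification
    return uid, best < high     # identified; weak if it fails the high threshold
```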
As per claim 5, KHOURY in view of PARK discloses the device of claim 1, wherein the plurality of environmental noises include noises with different acoustic characteristics ([0378] The computer may be further configured to extract one or more features from the inbound audio signal; and calculate a spoofing score indicating a likelihood that the inbound audio signal includes a spoofing condition based upon the one or more acoustic features by applying a second machine-learning model. [0410] The method may further comprise extracting, by the device, one or more acoustic features from the inbound audio signal; and calculating, by the device, a spoofing score indicating a likelihood that the inbound audio signal includes a spoofing condition based upon the one or more acoustic features by applying a second machine-learning model).
As per claim 6, KHOURY in view of PARK discloses the device of claim 1, wherein the registered user is registered in an environment in which a noise level before or after the registered user's utterance is lower than a noise threshold ([0303] In determination step 926, the server determines whether the max similarity score satisfies a high similarity threshold, if the server determines that the max similarity score satisfies the low threshold (in step 922). In some cases, the high similarity threshold is a preconfigured default value or an adaptive threshold tailored for the particular voiceprint and the putative enrolled registered speaker (as in FIG. 4)).
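For illustration only, the claimed enrollment condition of claim 6 (a noise level before or after the registered user's utterance lower than a noise threshold) might be checked as sketched below; the RMS metric, margin length, and threshold value are editorial assumptions that appear in neither reference.

```python
# Hypothetical sketch of the claim 6 condition; the RMS metric, 0.5 s
# margin, and threshold are assumptions, not taken from either reference.
import numpy as np

def quiet_enough_for_enrollment(signal, utt_start, utt_end, sr=16000,
                                margin_s=0.5, noise_threshold=0.005):
    """Check RMS level in the margins just before and after the utterance."""
    margin = int(margin_s * sr)
    before = signal[max(0, utt_start - margin):utt_start]
    after = signal[utt_end:utt_end + margin]
    rms = lambda x: float(np.sqrt(np.mean(x ** 2))) if len(x) else 0.0
    return rms(before) < noise_threshold and rms(after) < noise_threshold
```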
As per claim 7, KHOURY in view of PARK discloses the device of claim 1, wherein the processor is configured to receive the input audio from the vehicle and transmit a determination result of the occupant's registration to the vehicle ([0160] The analytics server 102 may identify an environment setting representing the speaker or speakers current circumstances and environment. The machine-learning model executed by the analytics server 102 may include an audio event classification model and/or an environment classification model, such as background noise or specific sounds that are classifiable (e.g., dishwasher, trucks) or include overwhelming amount of energy to the inbound signal. The analytics server 102 may transmit the speaker identifiers to the content server 111 along with an indicator of the environment setting associated with certain content characteristics. For example, a speaker interacting with a smart TV 114a at a restaurant or party with only adult speakers may cause the content server 111 to generate different suggested content to the end-user device 114 from a different circumstance where the speaker interacts with a smart TV 114a in a living room including child speakers).
As per claim 8, KHOURY in view of PARK discloses a vehicle comprising the device of claim 1 ([0328] FIGS. 11A-11B shows components of a system 1100 employing audio-processing machine-learning operations, where the machine-learning models are implemented by a vehicle or other edge device (e.g., car, home assistant device, smart appliance). [0329] The vehicle comprises a microphone 1108 configured to capture audio waves 1110 containing speech and convert audio waves 1110 into audio signals for audio processing operations. The vehicle comprises computing hardware and software components (shown as analytics computer 1102 and speaker database 1104) configured to perform the various audio processing operations described herein. The components and operations described in the system 1100 are similar to those of FIGS. 1-2. The system 100 of FIG. 1 placed much of the machine-learning audio-processing operations on the analytics server 102, though the content system 110 might perform certain operations in some embodiments. The system 200 of FIG. 2 placed much of the machine-learning audio-processing operations on the end-user device 214, though the end-user device 214 could still rely upon the analytics system 201 or content system 210 for various operations and database information. The vehicle-based system 1100 of FIG. 11, however, seeks to encapsulate much of the audio-processing operations and data within the vehicle-based system 1100, with relatively less reliance upon the devices of the external system infrastructures. [0330] The analytics computer 1102 receives input data signals from the microphone 1108 and performs various pre-processing operations, such as VAD and ASR to identify utterances. The analytics computer 1102 can apply any number of machine-learning models to extract features, extract embeddings, and compare the embeddings against voiceprints stored in the speaker database 1104. The analytics computer 1102 is coupled to various electronics components of the vehicle, such as the infotainment system, engine, door locks, and other components of the vehicle. The analytics computer 1102 receives voice instructions from the driver or passengers to activate or adjust the various options of the vehicle. [0333] In some embodiments, the analytics computer 1102 employs static enrollment configuration, whereby the analytics computer 1102 does not accept unknown speaker embeddings as new enrollments. In addition, the analytics computer 1102 performs authentication functions that rejects authentication of unrecognized voiceprints and does not permit the speaker from accessing certain functions of the vehicle. For example, the analytics computer 1102 could be employed in livery vehicles (e.g., police cars, delivery trucks) to limit unauthorized access to the vehicle and vehicle operation).
As per claims 9-15, the claims are directed to a vehicle comprising the device of claims 1-7 and recite limitations substantially similar to those of claims 1-7, respectively. Therefore, claims 9-15 are rejected with the same rationale given above for claims 1-7, respectively.
As per claims 16-20, the claims are directed to a non-transitory computer-readable medium containing program instructions executed by a processor, and recite limitations corresponding to those of claims 1 and 3-6, respectively. Therefore, claims 16-20 are rejected with the same rationale given above for the corresponding limitations of claims 1 and 3-6, respectively.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See the Notice of References Cited (form PTO-892) for additional prior art.
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TECHANE GERGISO whose telephone number is (571)272-3784. The examiner can normally be reached from 9:30 am to 6:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, LINGLAN EDWARDS, can be reached at (571) 270-5440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TECHANE GERGISO/Primary Examiner, Art Unit 2408