Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1, 11, and 17 are independent.
Claims 2-10 depend from Claim 1.
Claims 12-16 depend from Claim 11.
Claims 18-19 depend from Claim 17.
This Application was published as U.S. 2024/0412735.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 3 Jun 2024, 14 Jan 2025, and 18 Sep 2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Drawings
The drawings are objected to because Fig 16 refers to “Processor(s) 102” and “Pre-trained ML Model 106,” while Par [0113] refers to “pre-trained ML model 102.”
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description:
Fig 16 refers to reference character “208” (User Reference audio sample), which is not mentioned in the specification.
Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-6, 8, 10, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Heigold et al. (US2017/0069327, hereinafter Heigold) in view of Min et al. (US2020/0320995, hereinafter Min), and further in view of Wang et al. (US2021/0110833, hereinafter Wang).
With regards to claim 1, Heigold teaches:
A computer-implemented method, performed by a server, for personalising a trained speaker verification machine learning, ML, model for specific users, the method comprising: [Heigold Fig 1 teaches system of trained speaker verification with ML model (144) for specific users that can be implemented on back end server, middleware application server, or front end component, “or any combination of such back end, middleware, or front end components” (Par [0118])]
obtaining at least one audio sample of the voice of a specific user; [Heigold Fig 1 teaches verification utterance (154) or enrollment utterance (152) as audio sample for a user]
With regards to claim 1, Heigold fails to teach:
identifying, using the at least one audio sample, a group of users that have a similar voice to the voice of the specific user; selecting, from a database, a set of audio samples of voices corresponding to the identified group of users; and
With regards to claim 1, Min teaches:
identifying, using the at least one audio sample, a group of users that have a similar voice to the voice of the specific user; selecting, from a database, a set of audio samples of voices corresponding to the identified group of users; and [Min Fig 2 teaches using the utterance from receiver (101) and “determination unit 102 that determines whether the first utterer is a registered user by comparing the first utterance of the first utterer against the speech samples of the respective registered users stored in the storage unit 301” (Par [0037]) where audio samples are selected from a storage unit or database where the registered user is part of a group and the “small group of users may be a family, and the users may be members of the family.” (Par [0037]) where it is known that family members can have similar voice characteristics (see Sharifi et al. (US2022/0157321) Par [0007])]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the speaker verification system as taught by Heigold with the personalized system using voiceprints of members in a group as taught by Min. The motivation to combine the teachings of Heigold and Min is that Min teaches to “provide a technique of implementing multiple AI assistants on one platform (e.g., in one AI speaker) by installing multiple AI assistants modules in one AI speaker so that the AI speaker can identify and verify each of multiple user” (Par [0007]), which increases the capabilities of the invention of Heigold to be used with multiple users on the same system.
With regards to claim 1, Heigold in view of Min fails to teach:
transmitting the selected set of audio samples to a user device used by the specific user, for personalising the trained ML model for the user using the set of audio samples.
With regards to claim 1, Wang teaches:
transmitting the selected set of audio samples to a user device used by the specific user, for personalising the trained ML model for the user using the set of audio samples. [Wang teaches transmitting the first model which includes positive and negative samples to create a second model for personalized speaker verification where the “second model may be deployed in a remote server, cloud, client-side device, etc” (Par [0045])]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the speaker verification system as taught by Heigold and Min with the personalized speaker verification system as taught by Wang. The motivation to combine the teachings of Heigold and Min with Wang is that Wang teaches an improvement over a traditional speaker verification system that “uses a general speaker recognition model for all users (one-for-all) without any personalized update for target user, and hence lacks robustness and flexibility.” (Par [0004]), which increases the capabilities of the invention of Heigold and Min to be more flexible and robust.
With regards to claim 2, Heigold in view of Min and Wang teaches: All the limitations of claim 1
wherein identifying a group of users comprises using a classifier to: process the at least one audio sample of the voice of the specific user to determine characteristics of the voice of the specific user, and [Min Fig 2 teaches a classifier (201) “compares the first utterance against speech samples stored in the storage unit 301 and determines whether the first utterer is a registered user” (Par [0043]) which is processing the audio sample to determine if the characteristics of the voice matches a registered user]
identify, based on the determined characteristics of the voice of the specific user, a group of users, from a plurality of groups, that have a similar voice to the voice of the specific user. [Min Fig 3 step 305 teaches comparing the voice characteristic and if the “value of the i-vector indicating similarity is 0.8 or more, the first utterer is determined as a user belong to the group” (Par [0049]) in order to determine “similarity in voice between the speaker and each of the family members is calculated, and a user who is not registered and has similarity in voice to the family members, the user is asked to be registered.” (Par [0066])]
With regards to claim 3, Heigold in view of Min and Wang teaches: All the limitations of claim 1
wherein selecting a set of audio samples comprises selecting a set of audio samples from the identified group of users which are most similar to the voice of the specific user. [Min Fig 3 step 305 teaches comparing the voice characteristic and if the “value of the i-vector indicating similarity is 0.8 or more, the first utterer is determined as a user belong to the group” (Par [0049]) in order to determine “similarity in voice between the speaker and each of the family members is calculated, and a user who is not registered and has similarity in voice to the family members, the user is asked to be registered.” (Par [0066])]
With regards to claim 4, Heigold in view of Min and Wang teaches: All the limitations of claim 2
further comprising: updating the classifier using parameters of the personalised ML models received from user devices. [Heigold Fig 6 item 608 teaches “speaker model for the enrolled user may then be updated based on the verification utterance” (Par [0099]) where updating the model is updating parameters using the verification utterance provided on the user device]
With regards to claim 5, Heigold in view of Min and Wang teaches: All the limitations of claim 4
wherein updating the classifier comprises: aggregating parameters of the personalised trained ML model received from a plurality of user devices. [Heigold teaches “enrolling one or more new users” (Par [0046]) which is aggregating parameters for the model received from the user devices. While training of the model does not have to be done for each new user, “enrollment, verification, or both, may be provided to the computing system 120 and added to the training data so that the neural network (and thus the speaker verification model) may be regularly updated based using newly collected training data.” (Par [0046])]
With regards to claim 6, Heigold in view of Min and Wang teaches: All the limitations of claim 5
wherein aggregating parameters comprises aggregating parameters received from a plurality of user devices in the identified group of users. [Heigold teaches “enrolling one or more new users” (Par [0046]) which is aggregating parameters for the model received from the user devices in the group of users]
With regards to claim 8, Heigold in view of Min and Wang teaches: All the limitations of claim 5
wherein aggregating parameters comprises aggregating parameters received from a plurality of groups of users. [Heigold teaches “enrolling one or more new users” (Par [0046]) which is aggregating parameters for the model received from user devices of a plurality of groups]
With regards to claim 10, Heigold in view of Min and Wang teaches: All the limitations of claim 4
wherein updating the classifier comprises: receiving, from at least one user device, an embedding corresponding to a positively-verified audio input and a pseudo-label corresponding to the positively-verified audio input; and [Heigold teaches “a reference vector or other set of values corresponding to the user 102. The reference vector or other set of values may constitute a speaker model that characterizes distinctive features of the user's voice” (Par [0056]) where a vector or other value is an embedding that characterizes the user’s voice or audio input and includes a label for enrolled data]
retraining the classifier using the received embedding and pseudo-label.
[Wang Fig 2 teaches deploying the second model on the user device (Par [0045]) where the “positive/negative sample vectors are output from the embedding layer 208 of the first model” (Par [0098]) which is used to retrain the first model to create a second model]
With regards to claim 19, Heigold teaches:
A system for personalising a trained speaker verification machine learning, ML, model for specific users, the system comprising: [Heigold Fig 1 teaches system of trained speaker verification with ML model (144) for specific users that can be implemented on back end server, middleware application server, or front end component, “or any combination of such back end, middleware, or front end components” (Par [0118])]
a central server comprising at least one processor coupled to memory for: [Heigold teaches computing device (102) that may “include one or more processors (e.g., a digital processor, an analog processor … and one or more memories (e.g., permanent memory, temporary memory, non-transitory computer-readable storage medium)” (Par [0040]) where the computing system (102) can be on a central server (Par [0038])]
obtaining at least one audio sample of the voice of each specific user of a plurality of user devices; [Heigold Fig 1 teaches verification utterance (154) or enrollment utterance (152) as audio sample for a specific user and the model may be used on “many different client devices” (Par [0055])]
a plurality of user devices, each user device comprising at least one processor coupled to memory for: [Heigold Fig 1 teaches the model may be used on “many different client devices” (Par [0055]) where the “client device 110 can be, for example, a desktop computer, laptop computer, a tablet computer, a watch, a wearable computer, a cellular phone, a smart phone, a music player, an e-book reader, a navigation system, or any other appropriate computing device that a user may interact with”(Par [0045])]
obtaining, from the central server, and storing the trained speaker verification ML model; [Heigold Fig 1 teaches obtaining speaker verification model (144) from computing device (120)]
With regards to claim 19, Heigold fails to teach:
identifying, using the at least one audio sample, a group of users that have a similar voice to the voice of each specific user; selecting, from a database, a set of audio samples of voices corresponding to the identified group of users; and
With regards to claim 19, Min teaches:
identifying, using the at least one audio sample, a group of users that have a similar voice to the voice of each specific user; selecting, from a database, a set of audio samples of voices corresponding to the identified group of users; and [Min Fig 2 teaches using the utterance from receiver (101) and “determination unit 102 that determines whether the first utterer is a registered user by comparing the first utterance of the first utterer against the speech samples of the respective registered users stored in the storage unit 301” (Par [0037]) where audio samples are selected from a storage unit or database where the registered user is part of a group and the “small group of users may be a family, and the users may be members of the family.” (Par [0037]) where it is known that family members can have similar voice characteristics (see Sharifi et al. (US2022/0157321) Par [0007])]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the speaker verification system as taught by Heigold with the personalized system using voiceprints of members in a group as taught by Min. The motivation to combine the teachings of Heigold and Min is that Min teaches to “provide a technique of implementing multiple AI assistants on one platform (e.g., in one AI speaker) by installing multiple AI assistants modules in one AI speaker so that the AI speaker can identify and verify each of multiple user” (Par [0007]), which increases the capabilities of the invention of Heigold to be used with multiple users on the same system.
With regards to claim 19, Heigold in view of Min fails to teach:
transmitting the selected set of audio samples to a user device used by the specific user, for personalising the trained ML model for the user using the set of audio samples; and
receiving and storing the selected set of audio samples, the set of audio samples comprising voices that are similar to the voice of the specific user of the user device; and
personalising the trained speaker verification ML model using at least one reference audio sample comprising the voice of the specific user and the obtained selected set of audio samples.
With regards to claim 19, Wang teaches:
transmitting the selected set of audio samples to a user device used by the specific user, for personalising the trained ML model for the user using the set of audio samples; and [Wang teaches transmitting the first model which includes positive and negative samples to create a second model for personalized speaker verification where the “second model may be deployed in a remote server, cloud, client-side device, etc” (Par [0045])]
receiving and storing the selected set of audio samples, the set of audio samples comprising voices that are similar to the voice of the specific user of the user device; and [Wang teaches transmitting the first model which includes positive and negative samples to create a second model for personalized speaker verification, where a positive sample is, e.g., “speech data of a target speaker for personalizing speaker verification” (Par [0024]), which is similar to the voice of the specific user]
personalising the trained speaker verification ML model using at least one reference audio sample comprising the voice of the specific user and the obtained selected set of audio samples. [Wang teaches transmitting the first model which includes positive and negative samples to create a second model for personalized speaker verification where the “second model may be deployed in a remote server, cloud, client-side device, etc” (Par [0045])]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the speaker verification system as taught by Heigold and Min with the personalized speaker verification system as taught by Wang. The motivation to combine the teachings of Heigold and Min with Wang is that Wang teaches an improvement over a traditional speaker verification system that “uses a general speaker recognition model for all users (one-for-all) without any personalized update for target user, and hence lacks robustness and flexibility.” (Par [0004]), which increases the capabilities of the invention of Heigold and Min to be more flexible and robust.
Claims 11-18 are rejected under 35 U.S.C. 103 as being unpatentable over Heigold et al. (US2017/0069327) in view of Wang et al. (US2021/0110833).
With regards to claim 11, Heigold teaches:
A computer-implemented method, performed by a user device, for personalising a trained speaker verification machine learning, ML, model for a specific user of the user device, the method comprising: [Heigold Fig 1 teaches trained speaker verification system that can be implemented on back end server, middleware application server, or “a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here)” (Par [0118])]
obtaining and storing a trained speaker verification ML model; [Heigold teaches “speaker verification model 144 based on the trained neural network 140 is transmitted from the computing system 120 to the client device 110” (Par [0053]) where the ML model is stored]
With regards to claim 11, Heigold fails to teach:
obtaining and storing a selected set of audio samples, the set of audio samples comprising voices that are similar to the voice of the specific user; and
personalising the trained speaker verification ML model using at least one reference audio sample comprising the voice of the specific user and the obtained selected set of audio samples.
With regards to claim 11, Wang teaches:
obtaining and storing a selected set of audio samples, the set of audio samples comprising voices that are similar to the voice of the specific user; and [Wang teaches transmitting the first model which includes positive and negative samples to create a second model for personalized speaker verification, where a positive sample is, e.g., “speech data of a target speaker for personalizing speaker verification” (Par [0024]), which is similar to the voice of the specific user]
personalising the trained speaker verification ML model using at least one reference audio sample comprising the voice of the specific user and the obtained selected set of audio samples. [Wang teaches transmitting the first model which includes positive and negative samples to create a second model for personalized speaker verification where the “second model may be deployed in a remote server, cloud, client-side device, etc” (Par [0045])]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the speaker verification system as taught by Heigold with the personalized speaker verification system as taught by Wang. The motivation to combine the teachings of Heigold with Wang is that Wang teaches an improvement over a traditional speaker verification system that “uses a general speaker recognition model for all users (one-for-all) without any personalized update for target user, and hence lacks robustness and flexibility.” (Par [0004]), which increases the capabilities of the invention of Heigold to be more flexible and robust.
With regards to claim 12, Heigold in view of Wang teaches:
All the limitations of claim 11
wherein personalising the trained speaker verification ML model comprises optimising a contrastive loss by: minimising a distance between the at least one audio sample and the at least one reference audio sample; and maximising a distance between the set of audio samples and the at least one reference audio sample. [Heigold Fig 2 teaches adjusting the weight values or other parameters of the neural network (206) by optimizing the loss to “maximize the similarity score for matching speakers samples or to optimize a score output by the logistic regression, and the neural network 206 may also be optimized so as to minimize the similarity score for non-matching speakers samples or to optimize the score output by the logistic regression” (Par [0072])]
With regards to claim 13, Heigold in view of Wang teaches:
All the limitations of claim 11
further comprising: sharing parameters of the personalised ML model with a central server. [Heigold teaches trained speaker verification system that can be implemented on “back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server) … or any combination of such back end, middleware, or front end components.” (Par [0118])]
With regards to claim 14, Heigold in view of Wang teaches:
All the limitations of claim 11
wherein the user device is part of a group of user devices, and the method further comprises: sharing parameters of the personalised ML model with a second user device of the group of user devices, wherein the second user device aggregates the parameters received from user devices in the group and transmits the aggregated parameters to a central server. [Heigold teaches “enrolling one or more new users” (Par [0046]) and their associated devices where the parameters of the speaker verification model including “utterances of a given user that are provided for enrollment, verification, or both, may be provided to the computing system 120 and added to the training data” (Par [0046]) are aggregated for the central server]
With regards to claim 15, Heigold in view of Wang teaches:
All the limitations of claim 11
wherein the user device is part of a group of user devices, and the method further comprises: receiving ML model parameters of the personalised ML model from a plurality of user devices in the group; aggregating the received parameters; and transmitting the aggregated parameters to a central server. [Heigold teaches “enrolling one or more new users” (Par [0046]) and their associated devices where the parameters of the speaker verification model including “utterances of a given user that are provided for enrollment, verification, or both, may be provided to the computing system 120 and added to the training data” (Par [0046]) are aggregated and transmitted to the central server]
With regards to claim 16, Heigold in view of Wang teaches:
All the limitations of claim 11
further comprising: transmitting, to a central server, an embedding corresponding to a positively-verified audio input and a pseudo-label corresponding to the positively-identified audio input. [Wang Fig 2B teaches training the speaker verification system with embeddings (Par [0035], [0056]) where the “method may be performed by one or more components of the system 100, such as the computing system 102 and/or the computing device 10” (Par [0054]) and where the computing system (102) can be on a central server (Par [0038])]
With regards to claim 17, Heigold teaches:
A computer-implemented method, performed by a user device, for performing speaker verification for a user of the user device, the method comprising: [Heigold Fig 1 teaches trained speaker verification system that can be implemented on back end server, middleware application server, or “a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here)” (Par [0118])]
receiving a request to access a function or service which requires speaker verification; [Heigold Fig 1 teaches “stage (E), the user 102 attempts to gain access to the client device 110 using voice authentication” (Par [0057])]
receiving an audio input containing a voice; [Heigold Fig 1 teaches verification utterance (154) (Par [0057])]
granting access to the function or service to the user when the ML model verifies that the voice in the audio input is the voice of the user of the client device. [Heigold Fig 6 teaches using ML model (604) to verify the voice of the user and take action (612) or grant access]
With regards to claim 17, Heigold fails to teach:
processing, using a personalised trained speaker verification machine learning, ML, model, the received audio input; and
With regards to claim 17, Wang teaches:
processing, using a personalised trained speaker verification machine learning, ML, model, the received audio input; and [Wang Fig 1 teaches transmitting the first model which includes positive and negative samples to create a second model for personalized speaker verification (Par [0045]) where voice input (126) is the received audio input]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the speaker verification system as taught by Heigold with the personalized speaker verification system as taught by Wang. The motivation to combine the teachings of Heigold with Wang is that Wang teaches an improvement over a traditional speaker verification system that “uses a general speaker recognition model for all users (one-for-all) without any personalized update for target user, and hence lacks robustness and flexibility.” (Par [0004]), which increases the capabilities of the invention of Heigold to be more flexible and robust.
With regards to claim 18, Heigold in view of Wang teaches:
All the limitations of claim 17
wherein when the ML model verifies that the voice is the voice of the user, the method further comprises: generating, using the ML model, an embedding and a pseudo-label for the received audio input; and [Heigold teaches “a reference vector or other set of values corresponding to the user 102. The reference vector or other set of values may constitute a speaker model that characterizes distinctive features of the user's voice” (Par [0056]) where a vector or other value is an embedding that characterizes the user’s voice or audio input and includes a label for enrolled data]
transmitting, to a central server, the generated embedding and pseudo-label. [Wang Fig 2B teaches training the speaker verification system with embeddings (Par [0035], [0056]) where the “method may be performed by one or more components of the system 100, such as the computing system 102 and/or the computing device 10” (Par [0054]) and where the computing system (102) can be on a central server (Par [0038])]
Allowable Subject Matter
Claims 7 and 9 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Joseph J Yamamoto whose telephone number is (571)272-4020. The examiner can normally be reached M-F 1000-1800 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
JOSEPH J. YAMAMOTO
Examiner
Art Unit 2656
/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656