Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings were received on 5/10/2024. These drawings are accepted.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 21-23, 26-27, 29-35, and 37-41 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Heigold et al. (US Publication No. 2021/0366491).
Regarding claim 21, Heigold et al. discloses
receiving, at a speech module, a set of instructions to perform voice recognition (Fig. 1, labels utterance pools, enrollment utterances, and verification utterances, wherein the utterances are considered a set of instructions from the user received via a speech module such as 110.);
classifying, by the speech module, the set of instructions as one of training instructions or matching instructions (Fig. 1, the utterance pools, enrollment utterances, and verification utterances are classified as either training instructions (e.g., the utterance pools used for training) or matching instructions (e.g., the verification utterances). The separation of the types of utterances indicates a classification of the set of instructions or utterances.);
determining the classification as training instructions to train a speech model (Fig. 1, labels utterance pools and 112 are determined as training instructions, indicating classification of the set of instructions.);
retrieving a model based on the set of instructions (Fig. 1, label 122 as the retrieved model of simulated utterances for verification and enrollment, retrieved or selected based on the utterance pools or set of instructions.);
generating a new speech model based on the retrieved model (Fig. 2 shows the process of training. Fig. 2, labels 210a-n, 214 as the generated new speech model.);
training the new speech model using training data (Fig. 2, label 222 adjusts or trains the model 206 to generate the new speech model, labels 210a-n, 214. Paragraph 74 discloses training of 206 based on 222. Paragraph 4 discloses "the neural network may be trained in iterations that each simulate speaker enrollment and verification of an utterance. For example, in each training iteration, a speaker representation generated by the neural network for a given utterance may be evaluated with respect to a speaker model. Based on a comparison ..., the parameters of the neural network may be updated so as to optimize the ability of the speaker verification model to classify a given utterance ...". Such disclosure indicates training is iterative, wherein updates can occur in each iteration. When updates occur, the new speech model is generated and trained; see the illustrative sketch following this claim mapping.), the training data including at least one audio signal (Fig. 1, label 122, Fig. 2, labels 202, 204a-n);
transmitting the trained speech model for use at a user device (Fig. 1, the speaker verification model is shown being transmitted from the server to the user device for use. Paragraph 62 discloses the speaker verification model uses the speaker representation to generate a speaker model for the speaker during the enrollment phase or to verify an identity of the speaker during the verification phase. Paragraph 53 discloses generation of the speaker verification model 140, 144 is described in Figs. 2, 4a-b, and 5a-b.).
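By way of illustration only, the following minimal Python sketch models the kind of iterative enroll-and-verify training loop quoted above from paragraph 4 of Heigold et al. The toy linear "network", the logistic scoring, and all identifiers are assumptions for illustration and are not drawn from the reference.

```python
# Illustrative only: a toy enroll/verify training iteration in the style of
# the passage quoted from paragraph 4. Names and the model are assumptions.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 16))  # toy "network": one linear projection

def represent(utterance):
    """Map raw utterance features to a speaker representation."""
    return W @ utterance

for step in range(200):  # each iteration simulates enrollment + verification
    speaker = rng.normal(size=16)                    # a simulated speaker
    enroll = [speaker + 0.1 * rng.normal(size=16) for _ in range(3)]
    same = rng.random() < 0.5                        # half are impostor trials
    base = speaker if same else rng.normal(size=16)
    test = base + 0.1 * rng.normal(size=16)          # verification utterance

    model = np.mean([represent(u) for u in enroll], axis=0)  # speaker model
    rep = represent(test)                            # speaker representation
    score = 1.0 / (1.0 + np.exp(-(model @ rep)))     # match probability

    # Logistic-loss gradient w.r.t. W through both the enrollment-side model
    # and the verification-side representation; then update the parameters.
    err = score - float(same)
    mean_enroll = np.mean(enroll, axis=0)
    W -= 0.05 * err * (np.outer(rep, mean_enroll) + np.outer(model, test))
```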
Regarding claim 22, Heigold et al. discloses wherein the set of instructions includes at least one of: commands to perform a voice recognition process, commands to train a speech model, commands to convert speech to text, commands to verify a voice, commands to identify an unknown speaker, or commands to recognize a known speaker (Fig. 1, label 121 indicates the set of instructions includes commands to train a speech model, such as training data. Label 154, the verification utterances, are commands or triggers to perform a voice recognition process, such as identifying or recognizing a speaker via voice input and voice recognition.).
Regarding claim 23, Heigold et al. discloses wherein the retrieved speech model is a machine learning model retrieved from a model storage (Fig. 1, label 122, the simulated verification and enrollment utterances, as the retrieved speech model. Paragraph 52 discloses that training samples 122 may be selected from a repository of training data or models.).
Regarding claim 26, Heigold et al. discloses wherein the training further includes optimizing a model parameter associated with the trained speech model and at least one hyperparameter (Paragraphs 4, 53, and 74 disclose that optimizing or training the model updates the model's parameters, wherein a hyperparameter is a type of parameter of the neural network.).
Regarding claim 27, Heigold et al. discloses wherein the model parameter includes one of a model weight, a coefficient, or an offset (Paragraphs 4, 53, and 74 disclose that the model parameters include weights.).
Regarding claim 29, Heigold et al. discloses further comprising determining a performance metric (Paragraph 82 discloses "training may continue until tests on the held-out set indicate that the neural network has achieved at least the target performance level.").
Regarding claim 30, Heigold et al. discloses further comprising updating a model index of the trained speech model by recording the performance metric (Paragraph 82 discloses "the process 300 may continue until a target performance level is reached. For example, after a number of training iterations, the neural network may be tested against a held-out set of data that was not used during the training process 300. Training may continue until tests on the held-out set indicate that the neural network has achieved at least the target performance level." Such disclosure indicates recording of the performance level for each iteration in which the model is trained (a model index), since iterations are performed until at least the target performance level is reached.).
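As an illustration of recording a performance metric per training iteration until a target level is reached (the bookkeeping the claim terms a model index), the following Python sketch uses placeholder callables; it is not the reference's implementation, and all names are hypothetical.

```python
# Illustrative only: train until a held-out metric reaches a target level,
# recording the metric each iteration. All names are hypothetical.
def train_until_target(model, held_out, train_one_iteration, evaluate,
                       target=0.95, max_iters=1000):
    index = []                                 # (iteration, metric) records
    for it in range(max_iters):
        train_one_iteration(model)             # one round of parameter updates
        metric = evaluate(model, held_out)     # test on data unseen in training
        index.append((it, metric))             # record the performance metric
        if metric >= target:                   # stop at target performance
            break
    return model, index

# Toy usage: a stand-in "model" whose accuracy improves each iteration.
state = {"acc": 0.5}
_, index = train_until_target(
    state, held_out=None,
    train_one_iteration=lambda m: m.update(acc=m["acc"] + 0.05),
    evaluate=lambda m, _: m["acc"],
)
print(index[-1])  # final (iteration, metric) pair, e.g. (8, 0.95)
```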
Regarding claim 31, Heigold et al. discloses
receiving, at a speech module, a set of instructions to perform voice recognition (Fig. 1, labels utterance pools, enrollment utterances, and verification utterances, wherein the utterances are considered a set of instructions from the user received via a speech module such as 110.);
classifying, by the speech module, the set of instructions as one of training instructions or matching instructions (Fig. 1, the utterance pools, enrollment utterances, and verification utterances are classified as either training instructions (e.g., the utterance pools used for training) or matching instructions (e.g., the verification utterances). The separation of the types of utterances indicates a classification of the set of instructions or utterances.);
determining the classification as matching instructions to match a received voice signal to a user (Fig. 1, label verification indicates the utterances or instructions of the set of instructions are classified or determined as a request for verification, i.e., matching instructions to match the received utterances to a user. Fig. 6 shows the verification phase with matching or similarity between the user's voice and the enrolled user's voice.);
retrieving a model based on the set of instructions (Fig. 1, label 140 retrieved for verification. Paragraph 100 discloses “a speaker model is accessed on the phone.”);
receiving input data, the input data comprising audio data (Fig. 1, label verification utterance.);
generating a match result based on the input data and the model by applying the speech model to the input data to determine a user identity (Paragraph 100 discloses "the speaker representation that was generated at stage 606 based on the verification utterance is compared to the speaker model." See the illustrative sketch following this claim mapping.);
updating, by the speech module, the speech model based on the match (Paragraph 101 discloses “if the speaker verification model determines with sufficient confidence that the verification utterance was spoken by the enrolled speaker, the speaker model for the enrolled user may then be updated based on the verification utterance.”); and
transmitting the match result to a user device associated with the user (Fig. 1, label verification result.).
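For illustration of the matching flow mapped above (compare a representation to the enrolled speaker model, emit a match result, and update the model on a confident match, cf. paragraphs 100-101), a minimal Python sketch follows. The cosine comparison, threshold, and update rate are assumptions; Heigold et al. does not specify this implementation.

```python
# Illustrative only: compare a verification representation to the enrolled
# speaker model, return a match result, and fold a confident match back into
# the model. The threshold and update-rate values are assumptions.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(speaker_model, verification_rep, threshold=0.8, update_rate=0.1):
    score = cosine(speaker_model, verification_rep)
    matched = score >= threshold
    if matched:  # update the speaker model only on a confident match
        speaker_model = ((1 - update_rate) * speaker_model
                         + update_rate * verification_rep)
    return {"match": matched, "score": score}, speaker_model
```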
Regarding claim 32, Heigold et al. discloses further comprising retrieving user data based on the set of instructions (Paragraphs 100-101 disclose that the enrolled speaker, which includes user data, is retrieved based on the set of instructions, such as the verification utterance.), the user data including one or more reference audio data (Paragraph 100 discloses the "enrolled user's voice" is compared to the user's voice, such as the verification utterance.).
Regarding claim 33, Heigold et al. discloses wherein the input data includes user profile data (Fig. 1, label verification utterance, where properties such as user profile data are found in the sound or speech of the utterance.).
Regarding claim 34, Heigold et al. discloses wherein the match result further comprises identifying a speech component (Paragraphs 55 and 58 disclose that enrollment includes a phrase spoken by the user, and the verification phase includes verifying the user according to a verification utterance of the same phrase that was spoken during enrollment.).
Regarding claim 35, Heigold et al. discloses wherein the match result further comprises identifying, in the input data, a stored key phrase associated with the user (Paragraph 58 discloses identification of a phrase uttered in the verification utterance (the input data) and the enrollment utterance, where the phrase or key phrase of the enrollment utterance is stored.).
Regarding claim 37, Heigold et al. discloses wherein the operations further comprise training a speech model based on the match (Paragraph 101 discloses "if the speaker verification model determines with sufficient confidence that the verification utterance was spoken by the enrolled speaker, the speaker model for the enrolled user may then be updated based on the verification utterance.").
Regarding claim 38, Heigold et al. discloses wherein the audio data further comprises voice data (Fig. 1, label verification utterance spoken by the user, label 102.).
Regarding claim 39, Heigold et al. discloses wherein the training further comprises optimizing model parameters (Paragraph 4 discloses that optimizing or training the model updates the model's parameters, wherein hyperparameters are types of parameters of the neural network.).
Regarding claim 40, Heigold et al. discloses wherein the optimization further comprises using model drift data based on at least one of a user voice change over time, temporary illness, or tiredness (Paragraph 31 discloses the training data includes different first utterances of a first speaker for the respective set of utterances and a second utterance of either the first speaker or a second speaker other than the first speaker. This indicates the training data captures the condition of the voice at the time of speaking, such as changes over time. Paragraph 52 discloses simulated utterances for verification and enrollment are used as training data. Depending on the simulation of the utterances, the training data can include voice changes over time, temporary illness, tiredness, etc.).
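The rejection reads the simulated utterances of paragraph 52 as capable of covering such drift conditions. Purely as a sketch of how drift might be simulated in training data, the perturbations below are hypothetical; the reference does not describe these specific transforms.

```python
# Illustrative only: hypothetical perturbations standing in for the drift
# sources named in claim 40 (voice change over time, illness, tiredness).
import numpy as np

rng = np.random.default_rng(1)

def simulate_drift(features, months=0, ill=False, tired=False):
    u = np.asarray(features, dtype=float).copy()
    u += 0.01 * months * rng.normal(size=u.shape)  # gradual change over time
    if ill:
        u[: u.size // 4] *= 0.8                    # damp low "bands" (hoarseness)
    if tired:
        u *= 0.9                                   # lower overall energy
    return u
```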
Regarding claim 41, Heigold et al. discloses
a hardware processor (Paragraph 104, Fig. 7, label 702); and
a memory comprising instructions (Fig. 7, label 704, paragraph 106) that, when executed by the at least one hardware processor, cause the hardware processor to perform the steps of (Fig. 7, labels 704, 702, paragraphs 104-106):
receiving, at a speech module, a set of instructions to perform voice recognition (Fig. 1, labels utterance pools, enrollment utterances, and verification utterances, wherein the utterances are considered a set of instructions from the user received via a speech module such as 110.);
classifying, by the speech module, the set of instructions as one of training instructions or matching instructions (Fig. 1, the utterance pools, enrollment utterances, and verification utterances are classified as either training instructions (e.g., the utterance pools used for training) or matching instructions (e.g., the verification utterances). The separation of the types of utterances indicates a classification of the set of instructions or utterances.);
based on determining that the classification is for training instructions (Based on the set of instructions, such as label 121, the verification utterances, and the enrollment utterances, actions will be performed per the classification or type of utterance.), performing steps for training (Fig. 1, label training, Fig. 2), comprising:
retrieving a model based on the set of instructions (Fig. 1, label 122 as the retrieved model of simulated utterances for verification and enrollment, retrieved or selected based on the utterance pools or set of instructions.);
generating a new speech model based on the retrieved model (Fig. 2 shows the process of training. Fig. 2, labels 210a-n, 214 as the generated new speech model.);
training the new speech model using training data (Fig. 2, label 222 adjusts or trains the model 206 to generate the new speech model, labels 210a-n, 214. Paragraph 74 discloses training of 206 based on 222. Paragraph 4 discloses "the neural network may be trained in iterations that each simulate speaker enrollment and verification of an utterance. For example, in each training iteration, a speaker representation generated by the neural network for a given utterance may be evaluated with respect to a speaker model. Based on a comparison ..., the parameters of the neural network may be updated so as to optimize the ability of the speaker verification model to classify a given utterance ...". Such disclosure indicates training is iterative, wherein updates can occur in each iteration. When updates occur, the new speech model is generated and trained.), the training data including at least one audio signal (Fig. 1, label 122, Fig. 2, labels 202, 204a-n);
transmitting the trained speech model for use at a user device (Fig. 1, the speaker verification model is shown being transmitted from the server to the user device for use. Paragraph 62 discloses the speaker verification model uses the speaker representation to generate a speaker model for the speaker during the enrollment phase or to verify an identity of the speaker during the verification phase. Paragraph 53 discloses generation of the speaker verification model 140, 144 is described in Figs. 2, 4a-b, and 5a-b.);
based on determining that the classification is for matching instructions (Based on the set of instructions, such as label 121, the verification utterances, and the enrollment utterances, actions will be performed per the classification or type of utterance. See the branching sketch following this claim mapping.), performing steps for matching (Fig. 1, label verification utterance of stage E.), comprising:
retrieving a second model based on the set of instructions (Paragraph 54 discloses "The model 144 may be configured to provide data characterizing an utterance of the user 102 as input to the trained neural network 140, in order to generate a speaker representation for the user 102 that indicates distinctive features of the user's voice. The speaker representation can then be compared to a model of the user's voice that has been previously determined." The cited portion indicates retrieval of a second model, a model of the user's voice previously determined. The retrieval of such a model is based on the verification utterance or set of instructions.);
receiving input data, the input data comprising audio data (Fig. 1, label verification utterance);
generating a match result based on the input data and the second model by applying the speech model to the input data to determine a user identity (Paragraph 100 discloses “the speaker representation that was generated at stage 606 based on the verification utterance is compared to the speaker model.” Paragraph 54 discloses “The model 144 may be configured to provide data characterizing an utterance of the user 102 as input to the trained neural network 140, in order to generate a speaker representation for the user 102 that indicates distinctive features of the user’s voice. The speaker representation can then be compared to a model of the user’s voice that has been previously determined. If the speaker representation is sufficiently similar to the user's speaker model, then the speaker verification model 144 can output an indication that the identity of the user 102 is valid.”);
updating, by the speech module, the speech model based on the match (Paragraph 101 discloses “if the speaker verification model determines with sufficient confidence that the verification utterance was spoken by the enrolled speaker, the speaker model for the enrolled user may then be updated based on the verification utterance.”); and
transmitting the match result to a user device (Fig. 1, label verification result.).
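Claim 41 branches on the classification of the instruction set. As a structural illustration only, the sketch below dispatches to training steps or matching steps based on an assumed instruction tag; neither the "type" field nor the handlers come from Heigold et al.

```python
# Illustrative only: branch on the classified instruction type, mirroring the
# claim 41 structure. The "type" field and handlers are hypothetical.
def handle(instructions, train_steps, match_steps):
    kind = instructions.get("type")       # assumed tag: "train" or "match"
    if kind == "train":
        return train_steps(instructions)  # retrieve, generate, train, transmit
    if kind == "match":
        return match_steps(instructions)  # retrieve, score, update, transmit
    raise ValueError(f"unclassified instruction set: {kind!r}")

# Toy usage with stub handlers:
result = handle({"type": "match"},
                train_steps=lambda i: "trained model",
                match_steps=lambda i: "match result")
```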
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 24 and 36 are rejected under 35 U.S.C. 103 as being unpatentable over Heigold et al. (US Publication No. 2021/0366491) in view of Walters et al. (US Patent No. 10,459,954).
Regarding claim 24, Heigold et al. discloses server or cloud communication with the user, where the neural network is trained at the server or cloud (Fig. 1, labels 120, 110), but fails to disclose wherein the method is performed by at least one ephemeral container instance.
Walters et al. discloses one or more cloud services designed to generate one or more ephemeral container instances (col. 4, lines 57-67). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the server of Heigold et al. to generate one or more ephemeral container instances, as disclosed by Walters et al., so as to perform functions such as speech recognition or authentication of a user (col. 8, lines 23-31).
Regarding claim 36, Heigold et al. discloses server or cloud communication with the user, where the neural network is trained at the server or cloud (Fig. 1, labels 120, 110), but fails to disclose wherein the method is performed by at least one ephemeral container instance.
Walters et al. discloses one or more cloud services designed to generate one or more ephemeral container instances (col. 4, lines 57-67). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the server of Heigold et al. to generate one or more ephemeral container instances, as disclosed by Walters et al., so as to perform functions such as speech recognition or authentication of a user (col. 8, lines 23-31).
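For general context on ephemeral container instances (not Walters et al.'s cited implementation), the sketch below launches a task in a container that is discarded on exit; the image name and command are hypothetical.

```python
# Generic illustration only: `docker run --rm` removes the container when the
# process exits, making each invocation ephemeral. Names are hypothetical.
import subprocess

def run_in_ephemeral_container(image, command):
    return subprocess.run(
        ["docker", "run", "--rm", image, *command],
        capture_output=True, text=True, check=True,
    )

# Hypothetical usage: run a one-off speech-recognition job and let the
# container be removed afterwards (requires a local Docker daemon).
# run_in_ephemeral_container("speech-worker:latest", ["python", "recognize.py"])
```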
Claim 25 is rejected under 35 U.S.C. 103 as being unpatentable over Heigold et al. (US Publication No. 2021/0366491) in view of Ranjan et al. (US Publication No. 2021/0012768).
Regarding claim 25, Heigold et al. discloses training data (Fig. 1, label training, 122), but fails to disclose wherein the training data further includes metadata labeling a speaker or audio data that contains a passphrase.
Ranjan et al. discloses training data that further includes metadata labeling a speaker or audio data that contains a passphrase (Paragraph 43 discloses "train the computing device (e.g. computing device 140) to recognize speech by repeating certain words and/or phrases that may be utilized as labeled data for a training model."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Heigold et al.'s training data by incorporating labels of the phrase associated with the training data, as disclosed by Ranjan et al., so as to improve training of the model.
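As a sketch of what such labeled training data could look like, one record per audio sample is shown below; the field names are hypothetical and are not drawn from Ranjan et al.

```python
# Illustrative only: a training record carrying metadata that labels the
# speaker and any passphrase in the audio. Field names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingSample:
    audio: bytes               # the raw audio signal
    speaker_id: str            # metadata labeling the speaker
    passphrase: Optional[str]  # the repeated word/phrase, if present

sample = TrainingSample(audio=b"...", speaker_id="user-01",
                        passphrase="open sesame")
```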
Claim 28 is rejected under 35 U.S.C. 103 as being unpatentable over Heigold et al. (US Publication No. 2021/0366491) in view of Stefani et al. (US Patent No. 10,777,186).
Regarding claim 28, Heigold et al. discloses updating the neural network via updating parameters (paragraphs 4, 53, and 74), but fails to disclose wherein the at least one hyperparameter includes one of a learning size, a batch size, or an architectural parameter.
Stefani et al. discloses training a neural network wherein the at least one hyperparameter includes one of a learning size, a batch size, or an architectural parameter (col. 6, lines 41-55 discloses that during training of a neural network, one or more parameters such as hyperparameters may be set; hyperparameters can be a learning rate, batch size, hidden units, maximum iterations, etc.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Heigold et al.'s neural network training by incorporating the setting of hyperparameters, as disclosed by Stefani et al., so as to improve training of the model and optimize the model's performance.
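For illustration of the hyperparameter kinds Stefani et al. names (learning rate, batch size, hidden units, maximum iterations), a minimal configuration sketch follows; the concrete values are arbitrary assumptions.

```python
# Illustrative only: hyperparameters of the kinds cited from Stefani et al.
# (col. 6, lines 41-55); the concrete values are arbitrary assumptions.
hyperparameters = {
    "learning_rate": 1e-3,     # step size used in each parameter update
    "batch_size": 32,          # utterances processed per training step
    "hidden_units": 256,       # architectural parameter: layer width
    "max_iterations": 10_000,  # upper bound on training iterations
}

# E.g., a gradient step would scale by the learning rate:
# W -= hyperparameters["learning_rate"] * grad
```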
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LINDA WONG whose telephone number is (571)272-6044. The examiner can normally be reached from 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C Flanders can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LINDA WONG/Primary Examiner, Art Unit 2655