Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory obviousness-type double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); and In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on a nonstatutory double patenting ground provided the conflicting application or patent either is shown to be commonly owned with this application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement.
Effective January 1, 1994, a registered attorney or agent of record may sign a terminal disclaimer. A terminal disclaimer signed by the assignee must fully comply with 37 CFR 3.73(b).
Claims 1, 10, and 19, and the claims dependent thereon, are rejected on the ground of nonstatutory obviousness-type double patenting as being unpatentable over claims 1, 14, and 15, and any dependent claims thereof, of U.S. Patent No. 12073824. Although the conflicting claims are not identical, they are not patentably distinct from each other because said claims of the instant application include all of the features of said claims of U.S. Patent No. 12073824. It would have been obvious to one of ordinary skill in the art to omit the step of using RNN-T, recited as an optional ("and/or") step, amounting to a broader representation than the patented claims. See In re Karlson, 136 USPQ 184 (CCPA 1963): "Omission of an element and its function is an obvious expedient if the remaining elements perform the same functions as before."
Claim comparison: present invention and U.S. Patent No. 12073824
1. A method implemented by one or more processors, the method comprising: training a two-pass automatic speech recognition (ASR) model to generate a text representation of a spoken utterance, wherein training the ASR model comprises: training a first-pass portion of the ASR model, wherein training the first-pass portion of the ASR model comprises updating one or more portions of a shared encoder portion of the ASR model and/or updating one or more portions of a recurrent neural network transformer (RNN-T) decoder portion of the first-pass portion of the ASR model based on processing a plurality of training instances; and training a second-pass portion of the ASR model, wherein training the second-pass portion of the ASR model comprises updating one or more portions of an additional encoder portion of the ASR model and/or updating one or more portions of a listen attend spell (LAS) decoder based on processing the plurality of training instances; subsequent to training the ASR model, processing audio data capturing a spoken utterance using the ASR model to generate a text representation of the spoken utterance; and causing a client device to perform one or more actions based on the text representation of the spoken utterance.
2. The method of claim 1, wherein training the first-pass portion of the ASR model further comprises: for each of the plurality of training instances and until one or more conditions are satisfied: processing an instance of training audio data portion of the training instance using the shared encoder to generate shared encoder training output, wherein the training audio data captures a spoken training utterance; processing the shared encoder training output using the RNN-T decoder portion to generate predicted RNN-T training output; determining a loss based on comparing the predicted RNN-T training output and a ground truth text representation of the training utterance; updating the one or more portions of the shared encoder and/or updating the one or more portions of the RNN-T decoder based on the loss.
3. The method of claim 2, wherein training the second-pass portion of the ASR model further comprises: for each of the plurality of training instances and until one or more second conditions are satisfied: processing the instance of training audio data portion of the training instance using the shared encoder to generate second shared encoder training output; processing the second shared encoder training output using the additional encoder to generate additional encoder training output; processing the additional encoder training output using the LAS decoder to generate LAS training output; determining a second loss based on comparing the LAS training output and the ground truth representation of the training utterance; and updating one or more portions of the additional encoder based on the second loss and/or updating one or more portions of the LAS decoder based on the second loss.
4. The method of claim 3, wherein training the ASR model further comprises: for each of the plurality of training instances and until one or more third conditions are satisfied: processing the instance of training audio data portion of the training instance using the shared encoder to generate third shared encoder training output; processing the third shared encoder training output using the RNN-T decoder to generate third RNN-T training output; determining a RNN-T training loss based on comparing the RNN-T training output and the ground truth representation of the training utterance; processing the third shared encoder training output using the additional encoder to generate third additional encoder training output; processing the third additional encoder training output using the LAS decoder to generate third LAS training output; determining a LAS training loss based on comparing the third LAS training output and the ground truth representation of the training utterance; determining a common loss based on comparing the RNN-T training loss and the LAS training loss; and updating one or more portions of the shared encoder based on the common loss and/or updating one or more portions of the additional encoder based on the common loss and/or updating one or more portions of the RNN-T decoder based on the common loss and/or updating one or more portions of the LAS decoder based on the common loss.
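For illustration only (not part of the claimed subject matter), the common-loss training recited in claim 4 combines the first-pass (RNN-T) loss and second-pass (LAS) loss into a single loss used to update the shared components. A minimal sketch, assuming a simple weighted-sum combination; the interpolation weight `lam` is an illustrative assumption, since the claim only requires the common loss to be determined based on the two losses:

```python
def common_loss(rnnt_loss, las_loss, lam=0.5):
    """Combine the first-pass RNN-T loss and the second-pass LAS loss.

    lam is an assumed interpolation weight controlling how strongly each
    decoder's loss drives updates to the shared encoder.
    """
    return lam * rnnt_loss + (1.0 - lam) * las_loss

# With lam=0.25 the LAS loss dominates the combination.
loss = common_loss(2.0, 1.0, lam=0.25)
```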
5. The method of claim 4, wherein training the ASR model further comprises training the LAS decoder using mean word error rate training.
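Mean word error rate (MWER) training, recited in claim 5, minimizes the expected number of word errors over an n-best list of hypotheses rather than the likelihood of the reference transcript. A minimal illustrative sketch with hypothetical probabilities and error counts; the mean-error baseline subtraction is a common variance-reduction choice, not something required by the claim:

```python
def mwer_loss(probs, word_errors):
    """Expected word errors over an n-best list, minus the mean error.

    probs: hypothesis probabilities, renormalized over the n-best list.
    word_errors: word errors of each hypothesis versus the reference.
    """
    total = sum(probs)
    norm = [p / total for p in probs]
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_err) for p, e in zip(norm, word_errors))

# Hypothetical 3-best list; mass on the 0-error hypothesis gives a low loss.
loss = mwer_loss([0.5, 0.3, 0.2], [0, 1, 3])
```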
6. The method of claim 2, wherein the RNN-T training output includes an end of query token indicating a human speaker has finished speaking using the first-pass portion of the ASR model.
7. The method of claim 6, wherein determining the human speaker has finished speaking the utterance comprises determining the human speaker has finished speaking the utterance in response to identifying the end of query token in the RNN-T training output.
8. The method of claim 7, wherein training the ASR model comprises penalizing the RNN-T decoder portion for generating the end of query token too early or too late.
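The penalty recited in claim 8 for generating the end of query token too early or too late can take many forms; the claims do not fix one. One illustrative sketch, with per-frame weights that are pure assumptions (early emission is weighted harder here on the assumption that it truncates the utterance):

```python
def eoq_penalty(emitted_frame, reference_frame, early_w=2.0, late_w=1.0):
    """Penalize end-of-query emission before or after the reference endpoint."""
    delta = emitted_frame - reference_frame
    # Early emission: delta < 0; late emission: delta > 0.
    return early_w * max(0, -delta) + late_w * max(0, delta)

p_early = eoq_penalty(8, 10)   # 2 frames early
p_late = eoq_penalty(13, 10)   # 3 frames late
p_exact = eoq_penalty(10, 10)  # on time
```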
9. The method of claim 1, wherein the audio data capturing the spoken utterance comprises a sequences of segments, and wherein processing the audio data capturing the spoken utterance using the ASR model to generate the text representation of the spoken utterance comprises: for each of the segments, and in the sequence: processing the segment using the first-pass portion of the trained ASR model to generate RNN-T output, wherein processing the segment using the first-pass portion of the ASR model comprises: processing the segment using the shared encoder to generate shared encoder output; adding the shared encoder output as the next item in a shared encoder buffer, and processing the shared encoder output using the RNN-T decoder to generate a corresponding portion of RNN-T output; determining one or more first-pass candidate text representations of the utterance based on the RNN-T output; determining a human speaker of the utterance has finished speaking the utterance; in response to determining the human speaker has finished speaking the utterance: processing the shared encoder output from the shared encoder buffer using an additional encoder to generate additional encoder output; generating LAS output based on processing the additional encoder output using the LAS portion of the ASR model along with at least one of (a) the RNN-T output or (b) the one or more first-pass candidate text representations of the utterance; and generating a final text representation of the utterance based on the LAS output.
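The two-pass decoding flow of claim 9 — stream each segment through the shared encoder and RNN-T decoder while buffering the shared encoder output, then run the additional encoder and LAS decoder over the buffer once the speaker has finished — can be sketched as follows. All model components here are hypothetical stand-ins for illustration, not the patented implementation:

```python
def two_pass_decode(segments, shared_enc, rnnt_dec, add_enc, las_dec, is_eoq):
    """First pass runs per segment; second pass runs once at end of query."""
    buffer, rnnt_out = [], []
    for seg in segments:
        enc = shared_enc(seg)           # shared encoder output for this segment
        buffer.append(enc)              # buffered for the second pass
        rnnt_out.append(rnnt_dec(enc))  # streaming first-pass output
        if is_eoq(rnnt_out):            # speaker finished (e.g. </s> emitted)
            break
    add_out = add_enc(buffer)           # additional encoder over the buffer
    return las_dec(add_out, rnnt_out)   # final text from the LAS second pass

# Toy stand-ins: uppercase each segment, join everything at the end.
final = two_pass_decode(
    ["hello", "world", "</s>"],
    shared_enc=str.upper,
    rnnt_dec=lambda e: e,
    add_enc=lambda buf: buf,
    las_dec=lambda add, first: " ".join(add),
    is_eoq=lambda outs: outs[-1] == "</S>",
)
```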
10. A computing device comprising: memory storing instructions; and one or more processors that execute the instructions, stored in the memory, to: train a two-pass automatic speech recognition (ASR) model to generate a text representation of a spoken utterance, wherein training the ASR model comprises: train a first-pass portion of the ASR model, wherein training the first-pass portion of the ASR model comprises updating one or more portions of a shared encoder portion of the ASR model and/or updating one or more portions of a recurrent neural network transformer (RNN-T) decoder portion of the first-pass portion of the ASR model based on processing a plurality of training instances; and train a second-pass portion of the ASR model, wherein training the second-pass portion of the ASR model comprises updating one or more portions of an additional encoder portion of the ASR model and/or updating one or more portions of a listen attend spell (LAS) decoder based on processing the plurality of training instances; subsequent to training the ASR model, processing audio data capturing a spoken utterance using the ASR model to generate a text representation of the spoken utterance; and cause a client device to perform one or more actions based on the text representation of the spoken utterance.
11. The computing device of claim 10, wherein the instructions for causing the computing device to train the first-pass portion of the ASR model further comprises: for each of the plurality of training instances and until one or more conditions are satisfied: process an instance of training audio data portion of the training instance using the shared encoder to generate shared encoder training output, wherein the training audio data captures a spoken training utterance; process the shared encoder training output using the RNN-T decoder portion to generate predicted RNN-T training output; determine a loss based on comparing the predicted RNN-T training output and a ground truth text representation of the training utterance; update the one or more portions of the shared encoder and/or update the one or more portions of the RNN-T decoder based on the loss.
12. The computing device of claim 11, wherein the instructions for causing the computing device to train the second-pass portion of the ASR model further comprises: for each of the plurality of training instances and until one or more second conditions are satisfied: process the instance of training audio data portion of the training instance using the shared encoder to generate second shared encoder training output; process the second shared encoder training output using the additional encoder to generate additional encoder training output; process the additional encoder training output using the LAS decoder to generate LAS training output; determine a second loss based on comparing the LAS training output and the ground truth representation of the training utterance; and update one or more portions of the additional encoder based on the second loss and/or update one or more portions of the LAS decoder based on the second loss.
13. The computing device of claim 12, wherein the instructions for causing the computing device to train the ASR model further comprises: for each of the plurality of training instances and until one or more third conditions are satisfied: process the instance of training audio data portion of the training instance using the shared encoder to generate third shared encoder training output; process the third shared encoder training output using the RNN-T decoder to generate third RNN-T training output; determine a RNN-T training loss based on comparing the RNN-T training output and the ground truth representation of the training utterance; process the third shared encoder training output using the additional encoder to generate third additional encoder training output; process the third additional encoder training output using the LAS decoder to generate third LAS training output; determine a LAS training loss based on comparing the third LAS training output and the ground truth representation of the training utterance; determine a common loss based on comparing the RNN-T training loss and the LAS training loss; and update one or more portions of the shared encoder based on the common loss and/or update one or more portions of the additional encoder based on the common loss and/or update one or more portions of the RNN-T decoder based on the common loss and/or updating one or more portions of the LAS decoder based on the common loss.
14. The computing device of claim 13, wherein the instructions causing the computing device to train the ASR model further comprises training the LAS decoder using mean word error rate training.
15. The computing device of claim 11, wherein the RNN-T training output includes an end of query token indicating a human speaker has finished speaking using the first-pass portion of the ASR model.
16. The computing device of claim 15, wherein the instructions further comprise: determine the human speaker has finished speaking the utterance in response to identifying the end of query token in the RNN-T training output.
17. The computing device of claim 15, wherein training the ASR model comprises penalizing the RNN-T decoder portion for generating the end of query token too early or too late.
18. The computing device of claim 10, wherein the audio data capturing the spoken utterance comprises a sequences of segments, and wherein the instructions causing the computing device to process the audio data capturing the spoken utterance using the ASR model to generate the text representation of the spoken utterance comprise: for each of the segments, and in the sequence: process the segment using the first-pass portion of the trained ASR model to generate RNN-T output, wherein processing the segment using the first-pass portion of the ASR model comprises: process the segment using the shared encoder to generate shared encoder output; add the shared encoder output as the next item in a shared encoder buffer, and process the shared encoder output using the RNN-T decoder to generate a corresponding portion of RNN-T output; determine one or more first-pass candidate text representations of the utterance based on the RNN-T output; determine a human speaker of the utterance has finished speaking the utterance; in response to determining the human speaker has finished speaking the utterance: process the shared encoder output from the shared encoder buffer using an additional encoder to generate additional encoder output; generate LAS output based on processing the additional encoder output using the LAS portion of the ASR model along with at least one of (a) the RNN-T output or (b) the one or more first-pass candidate text representations of the utterance; and generate a final text representation of the utterance based on the LAS output.
19. A non-transitory computer-readable storage medium storing instructions executable by one or more processors of a computing system to perform a method comprising: training a two-pass automatic speech recognition (ASR) model to generate a text representation of a spoken utterance, wherein training the ASR model comprises: training a first-pass portion of the ASR model, wherein training the first-pass portion of the ASR model comprises updating one or more portions of a shared encoder portion of the ASR model and/or updating one or more portions of a recurrent neural network transformer (RNN-T) decoder portion of the first-pass portion of the ASR model based on processing a plurality of training instances; and training a second-pass portion of the ASR model, wherein training the second-pass portion of the ASR model comprises updating one or more portions of an additional encoder portion of the ASR model and/or updating one or more portions of a listen attend spell (LAS) decoder based on processing the plurality of training instances; subsequent to training the ASR model, processing audio data capturing a spoken utterance using the ASR model to generate a text representation of the spoken utterance; and causing a client device to perform one or more actions based on the text representation of the spoken utterance.
1. (Currently Amended) A method implemented by one or more processors, the method comprising: receiving audio data comprising a sequence of segments and capturing an utterance spoken by a human speaker; for each of the segments, and in the sequence: processing the segment using a first-pass portion of an automatic speech recognition ("ASR") model to generate recurrent neural network transformer ("RNN-T") output, wherein processing the segment using the first-pass portion of the ASR model comprises: processing the segment using a shared encoder portion to generate shared encoder output, adding the shared encoder output as the next item in a shared encoder buffer, and processing the shared encoder output using a RNN-T decoder portion to generate a corresponding portion of RNN-T output, and processing the shared encoder output using an additional encoder to generate additional encoder output; determining one or more first-pass candidate text representations of the utterance based on the RNN-T output; determining the human speaker has finished speaking the utterance; in response to determining the human speaker has finished speaking the utterance: processing the shared encoder output from the shared encoder buffer using an additional encoder to generate additional encoder output; generating listen attend spell ("LAS") output based on processing, using a second-pass LAS decoder portion of the ASR model, the additional encoder output along with at least one of (a) the RNN-T output or (b) the one or more first-pass candidate text representations of the utterance; and generating a final text representation of the utterance based on the LAS output.
2. (Original) The method of claim 1, wherein receiving the audio data comprising the sequence of segments and capturing the utterance spoken by the human speaker comprises capturing the audio data using one or more microphones of a client device.
3. (Previously Presented) The method of claim 1, wherein the one or more first-pass candidate text representations of the utterances is a first-pass lattice representation.
4. (Original) The method of claim 3, wherein generating LAS output based on processing, using the second-pass LAS decoder portion of the ASR model, the additional encoder output along with the one or more first-pass candidate text representations of the utterance comprises: for each lattice arc in the first-pass lattice representation, processing the lattice arc using the LAS decoder in a teacher-forcing mode with attention on the additional encoder output to update the probability of the first-pass candidate text representation corresponding to the arc; and generating the LAS output by selecting the candidate first-pass text representation with the highest updated probability.
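The second-pass rescoring in this claim 4 runs the LAS decoder in teacher-forcing mode over each first-pass candidate to update its probability, then selects the highest-scoring candidate. A minimal sketch; `las_score` is a hypothetical scorer standing in for the LAS decoder, with the additional encoder output it attends to assumed to be folded into the closure:

```python
def las_rescore(candidates, las_score):
    """Update each first-pass candidate's probability and pick the best.

    candidates: mapping of candidate text -> first-pass probability.
    las_score: hypothetical teacher-forcing scorer returning an updated
        probability for a candidate.
    """
    updated = {text: las_score(text, p) for text, p in candidates.items()}
    best = max(updated, key=updated.get)
    return best, updated

# Toy scorer: divide by candidate length so the shorter hypothesis wins.
best, scores = las_rescore(
    {"play music": 0.6, "play muse sick": 0.4},
    las_score=lambda text, p: p / len(text.split()),
)
```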
5. (Previously Presented) The method of claim 1, further comprising: generating a plurality of training instances, wherein generating each training instance comprises: selecting an instance of training audio data capturing a training utterance spoken by a training human speaker; determining a ground truth representation of the training utterance; and storing the training instance including the training audio data along with the ground truth text representation of the training utterance.
6. (Original) The method of claim 5, further comprising training the ASR model, wherein training the ASR model comprises: for each of the plurality of training instances and until one or more conditions are satisfied: processing the instance of training audio data using the shared encoder to generate shared encoder training output; processing the shared encoder training output using the RNN-T decoder to generate predicted RNN-T training output; determining a loss based on the predicted RNN-T training output and the ground truth representation of the training utterance; updating one or more portions of the shared encoder portion based on the determined loss and/or updating one or more portions of the RNN-T decoder portion based on the determined loss.
7. (Original) The method of claim 6, wherein training the ASR model further comprises: for each of the plurality of training instances and until one or more second conditions are satisfied: processing the instance of training audio data using the shared encoder to generate second shared encoder training output; processing the second shared encoder training output using the additional encoder to generate additional encoder training output; processing the additional encoder training output using the LAS decoder to generate LAS training output; determining a second loss based on the LAS training output and the ground truth representation of the training utterance; and updating one or more portions of the additional encoder based on the determined loss and/or updating one or more portions of the LAS decoder based on the determined loss.
8. (Original) The method of claim 7, wherein training the ASR model further comprises: for each of the plurality of training instances and until one or more third conditions are satisfied: processing the instance of training audio data using the shared encoder to generate third shared encoder training output; processing the third shared encoder training output using the RNN-T decoder to generate second RNN-T training output; determining a RNN-T loss based on the second RNN-T training output and the ground truth representation of the training utterance; processing the third shared encoder training output using the additional encoder to generate second additional encoder training output; processing the second additional encoder training output using the LAS decoder to generate second LAS training output; determining a LAS loss based on the second LAS training output and the ground truth representation of the training utterance; determining a common loss based on the RNN-T loss and the LAS loss; and updating one or more portions of the shared encoder based on the common loss and/or updating one or more portions of the additional encoder based on the common loss and/or updating one or more portions of the RNN-T decoder based on the common loss and/or updating one or more portions of the LAS decoder based on the common loss.
9. (Original) The method of claim 8, wherein training the ASR model further comprises training the LAS decoder using mean word error rate training.
10. (Previously Presented) The method of claim 1, wherein the RNN-T output includes an end of query token indicating the human speaker has finished speaking generated using the first-pass portion of the ASR model.
11. (Currently Amended) The method of claim 10, wherein determining the human speaker has finished speaking the utterance comprises determining the human speaker has finished speaking the utterance in response to identifying the end of query token in the RNN-T output.
12. (Original) The method of claim 11, wherein training the ASR model comprises penalizing the RNN-T decoder portion for generating the end of query token too early or too late.
13. (Canceled)
14. (Currently Amended) A client device comprising: one or more microphones; memory storing instructions; one or more processors that execute the instructions, stored in the memory, to: receive audio data detected via the microphone, the audio data comprising a sequence of segments and capturing an utterance spoken by a human speaker; for each of the segments, and in the sequence: process the segment using a first-pass portion of an automatic speech recognition ("ASR") model to generate recurrent neural network transformer ("RNN-T") output, wherein in processing the segment using the first-pass portion of the ASR model one or more of the processors are to: process the segment using a shared encoder portion to generate shared encoder output, add the shared encoder output as the next item in a shared encoder buffer, and process the shared encoder output using a RNN-T decoder portion to generate a corresponding portion of RNN-T output, and process the shared encoder output using an additional encoder to generate additional encoder output; determine one or more first-pass candidate text representations of the utterance based on the RNN-T output; determine the human speaker has finished speaking the utterance; in response to determining the human speaker has finished speaking the utterance: process the shared encoder output from the shared encoder buffer using an additional encoder to generate additional encoder output; generate listen attend spell ("LAS") output based on processing, using a second-pass LAS decoder portion of the ASR model, the additional encoder output along with at least one of (a) the RNN-T output or (b) the one or more first-pass candidate text representations of the utterance; and generate a final text representation of the utterance based on the LAS output.
15. (Currently Amended) A non-transitory computer-readable storage medium storing instructions executable by one or more processors of a computing system to perform a method comprising: receiving audio data comprising a sequence of segments and capturing an utterance spoken by a human speaker; for each of the segments, and in the sequence: processing the segment using a first-pass portion of an automatic speech recognition ("ASR") model to generate recurrent neural network transformer ("RNN-T") output, wherein processing the segment using the first-pass portion of the ASR model comprises: processing the segment using a shared encoder portion to generate shared encoder output, adding the shared encoder output as the next item in a shared encoder buffer, and processing the shared encoder output using a RNN-T decoder portion to generate a corresponding portion of RNN-T output, and processing the shared encoder output using an additional encoder to generate additional encoder output; determining one or more first-pass candidate text representations of the utterance based on the RNN-T output; determining the human speaker has finished speaking the utterance; in response to determining the human speaker has finished speaking the utterance: processing the shared encoder output from the shared encoder buffer using an additional encoder to generate additional encoder output; generating listen attend spell ("LAS") output based on processing, using a second-pass LAS decoder portion of the ASR model, the additional encoder output along with at least one of (a) the RNN-T output or (b) the one or more first-pass candidate text representations of the utterance; and generating a final text representation of the utterance based on the LAS output.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1, 10, and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US 20110054899 A1 Phillips; Michael S. et al. (hereinafter Phillips).
Re claim 1, Phillips teaches
1. A method implemented by one or more processors, the method comprising: (fig. 1, processor on hardware)
training a two-pass automatic speech recognition (ASR) model to generate a text representation of a spoken utterance, wherein training the ASR model comprises: (ASR operations using speech recognition models for training using past utterances, user history, context, and corrections thereof for instance 0071 fig. 1 and fig. 2… e.g. the first pass and second pass involving a user command with corrections or two separate interactions, both are used to train the model 0105 and fig. 7b-7c)
training a first-pass portion of the ASR model, wherein training the first-pass portion of the ASR model comprises updating one or more portions of a shared encoder portion of the ASR model (the first pass is a literal first interaction with an ASR model by a user such as a first command 0105 and fig. 7b-7c, an ASR model containing encoded data or an encoder per se as in 0177 and 0185, ASR operations using speech recognition models for training using past utterances, user history, context, and corrections thereof for instance 0071 fig. 1 and fig. 2) and/or updating one or more portions of a recurrent neural network transformer (RNN-T) decoder portion of the first-pass portion of the ASR model based on processing a plurality of training instances; and
training a second-pass portion of the ASR model, wherein training the second-pass portion of the ASR model comprises updating one or more portions of an additional encoder portion of the ASR model (the second pass can be a literal second interaction with an ASR model by a user such as the correction to a first input of a first command 0105 and fig. 7b-7c, an ASR model containing encoded data or an encoder per se as in 0177 and 0185, ASR operations using speech recognition models for training using past utterances, user history, context, and corrections thereof for instance 0071 fig. 1 and fig. 2) and/or updating one or more portions of a listen attend spell (LAS) decoder based on processing the plurality of training instances;
subsequent to training the ASR model, processing audio data capturing a spoken utterance using the ASR model to generate a text representation of the spoken utterance; and (ASR inherently and explicitly in Phillips, converts speech to text, particularly after iterations of model training, e.g. the first pass and second pass involving a user command 0105 and fig. 7b-7c, an ASR model containing encoded data or an encoder per se as in 0177 and 0185, ASR operations using speech recognition models for training using past utterances, user history, context, and corrections thereof for instance 0071 fig. 1 and fig. 2)
causing a client device to perform one or more actions based on the text representation of the spoken utterance. (in fig. 7c the action is opening and populating the app fields just by uttering a command… ASR inherently and explicitly in Phillips, converts speech to text, particularly after iterations of model training, e.g. the first pass and second pass involving a user command 0105 and fig. 7b-7c, an ASR model containing encoded data or an encoder per se as in 0177 and 0185, ASR operations using speech recognition models for training using past utterances, user history, context, and corrections thereof for instance 0071 fig. 1 and fig. 2)
Re claim 10, this claim is rejected as a broader or narrower representation of claim 1 that differs only by the general inclusion or omission of hardware (e.g. a processor, memory, and instructions), otherwise amounting to a virtually identical scope.
For instance, see fig. 1 and 0194, which contain the necessary memory and processors.
Re claim 19, this claim is rejected as a broader or narrower representation of claim 1 that differs only by the general inclusion or omission of hardware (e.g. a processor, memory, and instructions), otherwise amounting to a virtually identical scope.
For instance, see fig. 1 and 0194, which contain the necessary memory and processors.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 2 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over US 20110054899 A1 Phillips; Michael S. et al. (hereinafter Phillips) in view of US 20190318261 A1 Deng; Yue et al. (hereinafter Deng).
Re claims 2 and 11, Phillips teaches
2. The method of claim 1, wherein training the first-pass portion of the ASR model further comprises: (in fig. 7c the action is opening and populating the app fields just by uttering a command… ASR inherently and explicitly in Phillips, converts speech to text, particularly after iterations of model training, e.g. the first pass and second pass involving a user command 0105 and fig. 7b-7c)
for each of the plurality of training instances and until one or more conditions are satisfied: (confidence levels met 0083… in fig. 7c the action is opening and populating the app fields just by uttering a command… ASR inherently and explicitly in Phillips, converts speech to text, particularly after iterations of model training, e.g. the first pass and second pass involving a user command 0105 and fig. 7b-7c)
processing an instance of training audio data portion of the training instance using the shared encoder to generate shared encoder training output, wherein the training audio data captures a spoken training utterance; (an ASR model containing encoded data or an encoder per se as in 0177 and 0185, ASR operations using speech recognition models for training using past utterances, user history, context, and corrections thereof for instance 0071 fig. 1 and fig. 2)
However, while Phillips teaches encoder or encoding based ASR model learning, it fails to teach RNN concepts:
processing the shared encoder training output using the RNN-T decoder portion to generate predicted RNN-T training output; (Deng: a sequence-learning RNN on real-time data is analogous to an RNN-T under BRI, training by using sequence learning on real-time data through a loss comparing the prediction with the ground truth and updating encoders per se, 0055-0057 and 0065)
determining a loss based on comparing the predicted RNN-T training output and a ground truth text representation of the training utterance; (Deng: a sequence-learning RNN on real-time data is analogous to an RNN-T under BRI, training by using sequence learning on real-time data through a loss comparing the prediction with the ground truth and updating encoders per se, 0055-0057 and 0065)
updating the one or more portions of the shared encoder (Deng: a sequence-learning RNN on real-time data is analogous to an RNN-T under BRI, training by using sequence learning on real-time data through a loss comparing the prediction with the ground truth and updating encoders per se, 0055-0057 and 0065) and/or updating the one or more portions of the RNN-T decoder based on the loss.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Phillips to incorporate the above claim limitations as taught by Deng, as a simple substitution of one known element for another to obtain predictable results: substituting an RNN-based sequence model for the trainable encoding and ASR model of Phillips allows superior mapping of variable-length audio to text by leveraging sequential memory, improving both recognition accuracy and adaptability to new data. The encoder learns complex temporal dynamics and context from raw (e.g. streamed) audio, and the ground truth (transcriptions) is used to update the encoders, often in an end-to-end (E2E) fashion functionally analogous to an RNN operating on sequenced data.
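As a hedged illustration of the claim-2 training loop mapped onto Deng (encode, decode, determine a loss against the ground truth, update the encoder), the following toy scalar model shows a loss-driven update; all names and the arithmetic are assumptions for exposition, not the actual Phillips or Deng implementation:

```python
# Toy sketch of the claimed training loop: a scalar "shared encoder"
# weight scales audio features, a squared-error loss compares the
# prediction with the ground-truth targets, and a gradient step
# updates the encoder portion. All names are illustrative.

def train_step(weight, audio, target, lr=0.05):
    pred = [weight * a for a in audio]            # encoder + decoder pass
    loss = sum((p - t) ** 2 for p, t in zip(pred, target))
    grad = sum(2 * a * (p - t) for a, p, t in zip(audio, pred, target))
    return weight - lr * grad, loss               # updated weight, loss

weight, audio, target = 0.0, [1.0, 2.0], [2.0, 4.0]
history = []
for _ in range(50):                               # until a condition is met
    weight, loss = train_step(weight, audio, target)
    history.append(loss)
```

Each iteration mirrors the claimed sequence: generate predicted output from the encoder output, determine a loss by comparison with the ground truth, and update the encoder portion based on that loss; here the loss shrinks toward zero as the weight converges.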
Claims 6-8 and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over US 20110054899 A1 Phillips; Michael S. et al. (hereinafter Phillips) in view of US 20180166066 A1 Dimitriadis; Dimitrios B. et al. (hereinafter Dimitriadis).
Re claims 6 and 15, while Phillips teaches encoder or encoding based ASR model learning, it fails to teach speaking start/stop and RNN concepts:
6. The method of claim 2, wherein the RNN-T training output includes an end of query token indicating a human speaker has finished speaking using the first-pass portion of the ASR model. (Dimitriadis user start and stop areas determined by applying RNN to sequenced frames of real time audio data 0041 with 0063-0064 fig. 4-6 and fig. 9, evidenced further in 0050-0054 for training or having the RNN learn per se, a label analogous to a token with nth number of passes possible, to prevent erroneous speech)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Phillips to incorporate the above claim limitations as taught by Dimitriadis, as a simple substitution of one known element for another to obtain predictable results: combining the trainable encoding and ASR model of Phillips with an RNN and start/stop labeling over nth iterations of speech trains a model analogous to that of Phillips, with a weighting that enforces good learning and avoids or prevents garbage or wrong recognitions. This allows improved speaker diarization and accuracy as to who is talking and when they start and stop; the learning that prevents erroneous speech is analogous to the loss per se that RNNs utilize to handle arbitrary input sequences.
Re claims 7 and 16, while Phillips teaches encoder or encoding based ASR model learning, it fails to teach speaking start/stop and RNN concepts:
7. The method of claim 6, wherein determining the human speaker has finished speaking the utterance comprises determining the human speaker has finished speaking the utterance in response to identifying the end of query token in the RNN-T training output. (Dimitriadis user start and stop areas determined by applying RNN to sequenced frames of real time audio data 0041 with 0063-0064 fig. 4-6 and fig. 9, evidenced further in 0050-0054 for training or having the RNN learn per se, a label analogous to a token with nth number of passes possible, to prevent erroneous speech)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Phillips to incorporate the above claim limitations as taught by Dimitriadis, as a simple substitution of one known element for another to obtain predictable results: combining the trainable encoding and ASR model of Phillips with an RNN and start/stop labeling over nth iterations of speech trains a model analogous to that of Phillips, with a weighting that enforces good learning and avoids or prevents garbage or wrong recognitions. This allows improved speaker diarization and accuracy as to who is talking and when they start and stop; the learning that prevents erroneous speech is analogous to the loss per se that RNNs utilize to handle arbitrary input sequences.
Re claims 8 and 17, while Phillips teaches encoder or encoding based ASR model learning, it fails to teach speaking start/stop and RNN concepts:
8. The method of claim 7, wherein training the ASR model comprises penalizing the RNN-T decoder portion for generating the end of query token too early or too late. (Dimitriadis RNNs use loss to predict outcomes and to prevent erroneous speech 0073, user start and stop areas determined by applying RNN to sequenced frames of real time audio data 0041 with 0063-0064 fig. 4-6 and fig. 9, evidenced further in 0050-0054 for training or having the RNN learn per se, a label analogous to a token with nth number of passes possible)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Phillips to incorporate the above claim limitations as taught by Dimitriadis, as a simple substitution of one known element for another to obtain predictable results: combining the trainable encoding and ASR model of Phillips with an RNN and start/stop labeling, in which an RNN inherently uses loss, i.e. penalization under BRI, over nth iterations of speech trains a model analogous to that of Phillips, with a weighting that enforces good learning and avoids or prevents garbage or wrong recognitions. This allows improved speaker diarization and accuracy as to who is talking and when they start and stop; the learning that prevents erroneous speech is analogous to the loss per se that RNNs utilize to handle arbitrary input sequences.
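As a hedged illustration of the claim-8 concept of penalizing the RNN-T decoder portion for generating the end-of-query token too early or too late, the following hypothetical timing penalty is symmetric about the reference endpoint; the function name and linear weighting are assumptions for exposition, not Dimitriadis's actual formulation:

```python
# Hypothetical end-of-query (EOQ) timing penalty: the further the frame
# at which the model emits the EOQ token lies from the reference
# end-of-utterance frame, the larger the added loss, whether the
# emission is early or late.

def eoq_penalty(predicted_eoq_frame, reference_end_frame, weight=1.0):
    return weight * abs(predicted_eoq_frame - reference_end_frame)

early = eoq_penalty(8, 10)     # emitted 2 frames early
late = eoq_penalty(13, 10)     # emitted 3 frames late
on_time = eoq_penalty(10, 10)  # on-time emission adds no penalty
```

Because early and late emissions are penalized alike, training under such a term would discourage both premature endpointing (cutting the speaker off) and delayed endpointing (slow system response).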
Allowable Subject Matter
Claims 3-5, 9, 12-14, and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. After searching through patent and non-patent literature, there was no evidence of a limitation in direct relation, or an obvious variant, to such limitations as a whole as precisely limited. When searching for secondary prior art for the limitations as recited in the above claims, the most relevant material pertained to the same Inventor and Assignee but did not teach or suggest the aforementioned complex limitations as a whole as precisely limited.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20180247643 A1 BATTENBERG; Eric et al. (LAS concepts)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL C COLUCCI whose telephone number is (571)270-1847. The examiner can normally be reached on M-F 9 AM - 5 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHAEL COLUCCI/Primary Examiner, Art Unit 2655 (571)-270-1847
Examiner FAX: (571)-270-2847
Michael.Colucci@uspto.gov