Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Response to Arguments
Applicant's arguments with respect to claims 1-3, 5-14, and 16-22 have been considered but are moot in view of the new ground(s) of rejection. Applicant's arguments are directed to the amended subject matter; new citations within the existing prior art are provided below to address the claim amendments in context. As a result of the amendment and clarification of claim scope, and in light of further analysis of the prior art, claims 8-11 and 19-22 are now rejected.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-14, and 16-22 are rejected under 35 U.S.C. 103 as being unpatentable over US 20200219517 A1 Wang; Chong et al. (hereinafter Wang) in view of US 20200334538 A1 MENG; Zhong et al. (hereinafter MENG).
Re claim 1, Wang teaches
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising: (fig 1a)
receiving a plurality of training samples spanning multiple different domains, each corresponding training sample comprising audio data characterizing an utterance paired with a corresponding transcription of the utterance, the corresponding transcription comprising: (training based on multiple samples of ASR/transcribed-text and audio from multiple speakers in various contexts such as audio fig 1a with 0026, video with audio/captions extracted thereof, phone conferences, and any microphone based or audio based input 0028, for a full conversation segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating the text based on speaker label 0024-0026 and training thereof)
a whole transcript of all speech present in the corresponding audio data; and (the entire conversation must be processed prior to identifying who is speaking, “a transcription of a conversation between multiple co-workers (e.g., speakers 10) during a business meeting may be indexed by speaker to associate portions of the transcription with the respective speaker for identifying what each speaker said” as in 0026, based on initial training e.g. 0003, then real-time training, training based on multiple samples of ASR/transcribed-text and audio from multiple speakers in various contexts such as audio fig 1a with 0026, video with audio/captions extracted thereof, phone conferences, and any microphone based or audio based input 0028, for a full conversation segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating)
a primary transcript of only speech spoken by a primary speaker in the same corresponding audio data (“a transcription of a conversation between multiple co-workers (e.g., speakers 10) during a business meeting may be indexed by speaker to associate portions of the transcription with the respective speaker for identifying what each speaker said” as in 0026, further in 0025 the system is able to “not only identify who is speaking during a given segment 220, but also identify when speaker changes occur between adjacent segments”… the whole conversation once segmented, the transcript is pertinent to the speaker individually… training based on multiple samples of ASR/transcribed-text and audio from multiple speakers in various contexts such as audio fig 1a with 0026, video with audio/captions extracted thereof, phone conferences, and any microphone based or audio based input 0028, for a full conversation segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating the text based on speaker label 0024-0026 and training thereof)
for each corresponding training sample of the plurality of training samples, identifying one or more speaker tag boundaries based on the whole transcript and the primary transcript of the corresponding transcription of the utterance (as in fig. 1b with 0034, a form of re-labeling per se is shown based on initially labeled data 202 with speaker labels 250T labels for further learning or training a generative model which further predicts another label at 260 i.e. re-label, further in 0025 the system is able to “not only identify who is speaking during a given segment 220, but also identify when speaker changes occur between adjacent segments”, as time stamped to identify the beginning and end of each speaker segment… in an example for ‘a transcription of a conversation between multiple co-workers (e.g., speakers 10) during a business meeting may be indexed by speaker to associate portions of the transcription with the respective speaker for identifying what each speaker said” fig. 1a with 0026)
re-labeling each corresponding training sample of the plurality of training samples by annotating the whole transcription of the corresponding transcription of the utterance with one or more speaker tags based on the identified one or more speaker tag boundaries, each speaker tag indicating a respective segment of the whole transcription for speech that was spoken by a particular type of speaker; and (note: there is no initial “labeling” claimed so it is unclear as to which step “re-labeling” takes place at… as in fig. 1b with 0034, a form of re-labeling per se is shown based on initially labeled data 202 with speaker labels 250T labels for further learning or training a generative model which further predicts another label at 260 i.e. re-label, further in 0025 the system is able to “not only identify who is speaking during a given segment 220, but also identify when speaker changes occur between adjacent segments”, as time stamped to identify the beginning and end of each speaker segment… segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating or tagging the text based on speaker label 0024-0026 and training thereof… training based on multiple samples of ASR/transcribed-text and audio from multiple speakers in various contexts such as audio fig 1a with 0026, video with audio/captions extracted thereof, phone conferences, and any microphone based or audio based input 0028, for a full conversation segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating the text based on speaker label 0024-0026 and training thereof)
However, while the meaning of domain is broad and can be interpreted as the context of audio input or the unique input by a user, and Wang covers various types of inputs, it fails to teach other types of domains as follows:
domains (MENG different domains 0033, including a specific user voice 0002, ASR dictation and different domains 0035 handling queries/commands 0050 with fig 2a and 2c)
training a multi-domain speech recognition model on the re-labeled training samples to teach the multi-domain speech recognition model to learn to share parameters for recognizing speech across each of the multiple different domains. (MENG domains including as low as a specific user voice 0002, training a final model using all of the shared data across different domains 0033 ASR dictation and different domains 0035 handling queries/commands 0050 with fig 2a and 2c)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Wang to incorporate the above claim limitations as taught by MENG, allowing for a simple substitution of one known element (training a joint teacher-student model to share data across multiple explicit input domains) for another (Wang's multi-speaker model training across multiple types of use) to obtain predictable results, which expands the contexts of domain use and allows for enhanced modeling by combining student and teacher into a final shared model for cross-intent evaluation and fewer errors.
Re claim 12, this claim is rejected under the same rationale as claim 1, differing only in the general inclusion or omission of hardware (e.g. processor, memory, instructions) and otherwise amounting to a virtually identical scope. For instance, see the hardware shown in Wang's fig. 1a.
Re claims 2 and 13, Wang teaches
2. The computer-implemented method of claim 1, wherein the multiple different domains comprise:
a dictation domain. (i.e. ASR…training based on multiple samples of ASR/transcribed-text and audio from multiple speakers in various contexts such as audio fig 1a with 0026, video with audio/captions extracted thereof, phone conferences, and any microphone based or audio based input 0028, for a full conversation segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating the text based on speaker label 0024-0026 and training thereof)
However, while the meaning of domain is broad and can be interpreted as the context of audio input or the unique input by a user, and Wang covers various types of inputs, it fails to teach:
a short-form query domain; and (MENG i.e. a command/query… training a final model using all of the shared data across different domains 0033 ASR dictation and different domains 0035 handling queries/commands 0050 with fig 2a and 2c)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Wang to incorporate the above claim limitations as taught by MENG, allowing for a simple substitution of one known element (training a joint teacher-student model to share data across multiple explicit input domains) for another (Wang's multi-speaker model training across multiple types of use) to obtain predictable results, which expands the contexts of domain use and allows for enhanced modeling by combining student and teacher into a final shared model for cross-intent evaluation and fewer errors.
Re claims 3 and 14, Wang teaches
3. The computer-implemented method of claim 2, wherein the multiple different domains further comprise a captions domain. (video with audio/captions extracted thereof, phone conferences, and any microphone based or audio based input 0028, training based on multiple samples of ASR/transcribed-text and audio from multiple speakers in various contexts such as audio fig 1a with 0026, for a full conversation segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating the text based on speaker label 0024-0026 and training thereof)
Re claims 5 and 16, Wang teaches
5. The computer-implemented method of claim 4, wherein identifying the one or more speaker tag boundaries based on the whole transcript and the primary transcript of the corresponding transcription of the utterance comprises: (as in fig. 1b with 0034, a form of re-labeling per se is shown based on initially labeled data 202 with speaker labels 250T labels for further learning or training a generative model which further predicts another label at 260 i.e. re-label, further in 0025 the system is able to “not only identify who is speaking during a given segment 220, but also identify when speaker changes occur between adjacent segments”, as time stamped to identify the beginning and end of each speaker segment… in an example for ‘a transcription of a conversation between multiple co-workers (e.g., speakers 10) during a business meeting may be indexed by speaker to associate portions of the transcription with the respective speaker for identifying what each speaker said” fig. 1a with 0026)
performing a sub-sequence match between the whole transcript and the primary transcript to identify one or more speaker tag boundaries; and (the primary speaker is the user of interest…training based on multiple samples of ASR/transcribed-text and audio from multiple speakers in various contexts such as audio fig 1a with 0026, video with audio/captions extracted thereof, phone conferences, and any microphone based or audio based input 0028, for a full conversation segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating the text based on speaker label 0024-0026 and training thereof)
wherein re-labeling each corresponding training sample of the plurality of training samples comprises annotating the whole transcript with the one or more speaker tags based on the one or more speaker tag boundaries identified by performing the sub-sequence match between the whole transcript and the primary transcript. (claim 12, 0006, and 0010 for instance annotating, the whole conversation derives the individual speakers…once segmented, the transcript is pertinent to the speaker individually… training based on multiple samples of ASR/transcribed-text and audio from multiple speakers in various contexts such as audio fig 1a with 0026, video with audio/captions extracted thereof, phone conferences, and any microphone based or audio based input 0028, for a full conversation segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating the text based on speaker label 0024-0026 and training thereof)
Re claims 6 and 17, Wang teaches
6. The computer-implemented method of claim 1, wherein the particular type of speaker indicated by each speaker tag comprises a primary speaker or a non-primary speaker. (any other speaker but the speaker of interest… training based on multiple samples of ASR/transcribed-text and audio from multiple speakers in various contexts such as audio fig 1a with 0026, video with audio/captions extracted thereof, phone conferences, and any microphone based or audio based input 0028, for a full conversation segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating the text based on speaker label 0024-0026 and training thereof)
Re claims 7 and 18, Wang teaches
7. The computer-implemented method of claim 6, wherein:
speech spoken by the primary speaker corresponds to speech directed toward a target application; and (target application such as word processing or any application 0057…any other speaker but the speaker of interest… training based on multiple samples of ASR/transcribed-text and audio from multiple speakers in various contexts such as audio fig 1a with 0026, video with audio/captions extracted thereof, phone conferences, and any microphone based or audio based input 0028, for a full conversation segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating the text based on speaker label 0024-0026 and training thereof)
speech spoken by the non-primary speaker comprises at least one of: background speech spoken by a speaker other than the primary speaker; (any other speaker but the speaker of interest is background or other speakers… training based on multiple samples of ASR/transcribed-text and audio from multiple speakers in various contexts such as audio fig 1a with 0026, video with audio/captions extracted thereof, phone conferences, and any microphone based or audio based input 0028, for a full conversation segmented by labeled speakers and short utterance 0020 such as time stamped or isolated speaker boundaries/segments i.e. annotating the text based on speaker label 0024-0026 and training thereof)
recorded or broadcasted speech emanating from an audio output device; or
synthesized speech.
Re claims 8 and 19, Wang teaches
8. (Currently Amended) The computer-implemented method of claim 1, wherein the operations further comprise, for at least one… (“…the respective speaker for identifying what each speaker said” fig. 1a with 0026)
receiving[[ a]]the primary transcript of only speech spoken by[[ a]]the primary speaker in the corresponding audio data; and (in 0025 the system is able to “not only identify who is speaking during a given segment 220, but also identify when speaker changes occur between adjacent segments”, as time stamped to identify the beginning and end of each speaker segment… in an example for ‘a transcription of a conversation between multiple co-workers (e.g., speakers 10) during a business meeting may be indexed by speaker to associate portions of the transcription with the respective speaker for identifying what each speaker said” fig. 1a with 0026)
…the corresponding audio data to obtain[[ a]] the whole transcript of all speech present in the same corresponding audio data; (as in fig. 1b with 0034, a form of re-labeling per se is shown based on initially labeled data 202 with speaker labels 250T labels for further learning or training a generative model which further predicts another label at 260 i.e. re-label, further in 0025 the system is able to “not only identify who is speaking during a given segment 220, but also identify when speaker changes occur between adjacent segments”, as time stamped to identify the beginning and end of each speaker segment… in an example for ‘a transcription of a conversation between multiple co-workers (e.g., speakers 10) during a business meeting may be indexed by speaker to associate portions of the transcription with the respective speaker for identifying what each speaker said” fig. 1a with 0026)
However, while Wang uses a generative model to segment speaker transcripts e.g. with ASR, it does not specify a teacher model per se:
processing, using a general teacher speech recognition model… (MENG a teacher model for ASR i.e. transcription 0005 for a target speaker 0064 in a supervised manner 0043 including background noise analysis 0091)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Wang to incorporate the above claim limitations as taught by MENG, allowing for simple substitution of one known element for another to obtain predictable results, such as substituting MENG's teacher model for the generative model in Wang, to improve speaker identification, especially in noisy environments, by transferring knowledge from a robust "teacher" (trained on clean/large data) to a specialized student model, and further to reduce Diarization Error Rate (DER) in, for example, classroom settings, enabling effective, noise-robust speaker diarization.
Re claims 9 and 20, Wang teaches
9. (Original) The computer-implemented method of claim 8, wherein the general teacher speech recognition model is trained on a training data set to teach the general teacher speech recognition model to recognize primary speech, secondary speech… (primary speaker is at the user device, all other speakers are at their respective devices, as in fig. 1b with 0034, a form of re-labeling per se is shown based on initially labeled data 202 with speaker labels 250T labels for further learning or training a generative model which further predicts another label at 260 i.e. re-label, further in 0025 the system is able to “not only identify who is speaking during a given segment 220, but also identify when speaker changes occur between adjacent segments”, as time stamped to identify the beginning and end of each speaker segment… in an example for ‘a transcription of a conversation between multiple co-workers (e.g., speakers 10) during a business meeting may be indexed by speaker to associate portions of the transcription with the respective speaker for identifying what each speaker said” fig. 1a with 0026)
However, while Wang uses a generative model to segment speaker transcripts e.g. with ASR, it does not specify a teacher model per se:
…and background noise speech. (MENG a teacher model for ASR i.e. transcription 0005 for a target speaker 0064 in a supervised manner 0043 including background noise analysis 0091)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Wang to incorporate the above claim limitations as taught by MENG, allowing for simple substitution of one known element for another to obtain predictable results, such as substituting MENG's teacher model, which handles noise per se, for the generative model in Wang, to improve speaker identification, especially in noisy environments, by transferring knowledge from a robust "teacher" (trained on clean/large data) to a specialized student model, and further to reduce Diarization Error Rate (DER) in, for example, classroom settings, enabling effective, noise-robust speaker diarization.
Re claims 10 and 21, Wang teaches
10. (Currently Amended) The computer-implemented method of claim 1, wherein the operations further comprise, for at least one… (the system is able to “not only identify who is speaking during a given segment 220, but also identify when speaker changes occur between adjacent segments”, as time stamped to identify the beginning and end of each speaker segment… in an example for “a transcription of a conversation between multiple co-workers (e.g., speakers 10) during a business meeting may be indexed by speaker to associate portions of the transcription with the respective speaker for identifying what each speaker said” fig. 1a with 0026)
receiving[[ a]]the whole transcript of all speech present in the corresponding audio data, and (from the entire conversation, in 0025 the system is able to “not only identify who is speaking during a given segment 220, but also identify when speaker changes occur between adjacent segments”, as time stamped to identify the beginning and end of each speaker segment… in an example for ‘a transcription of a conversation between multiple co-workers (e.g., speakers 10) during a business meeting may be indexed by speaker to associate portions of the transcription with the respective speaker for identifying what each speaker said” fig. 1a with 0026)
…the corresponding audio data to obtain[[ a]] the primary transcript of only speech spoken by[[ a]] the primary speaker in the same corresponding audio data- (as in fig. 1b with 0034, a form of re-labeling per se is shown based on initially labeled data 202 with speaker labels 250T labels for further learning or training a generative model which further predicts another label at 260 i.e. re-label, further in 0025 the system is able to “not only identify who is speaking during a given segment 220, but also identify when speaker changes occur between adjacent segments”, as time stamped to identify the beginning and end of each speaker segment… in an example for ‘a transcription of a conversation between multiple co-workers (e.g., speakers 10) during a business meeting may be indexed by speaker to associate portions of the transcription with the respective speaker for identifying what each speaker said” fig. 1a with 0026)
However, while Wang uses a generative model to segment speaker transcripts e.g. with ASR, it does not specify a teacher model per se:
processing, using a primary teacher speech recognition model… (MENG a teacher model for ASR i.e. transcription 0005 for a target speaker 0064 in a supervised manner 0043 including background noise analysis 0091)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Wang to incorporate the above claim limitations as taught by MENG, allowing for simple substitution of one known element for another to obtain predictable results, such as substituting MENG's teacher model for the generative model in Wang, to improve Target Speaker Extraction (TSE) and speaker identification by enabling a smaller, specialized "student" model to learn to isolate a target voice from complex, noisy, multi-speaker mixtures, which improves accuracy in identifying a specific speaker even when that speaker's characteristics are similar to others in the group.
Re claims 11 and 22, while Wang uses a generative model to segment speaker transcripts e.g. with ASR, it does not specify a teacher model per se:
11. (Original) The computer-implemented method of claim 10, wherein the primary teacher speech recognition model is trained on supervised data obtained from domains that require only a primary speaker transcript. (MENG e.g. a single domain of a user's voice 0002… a teacher model for ASR i.e. transcription 0005 for a target speaker 0064 in a supervised manner 0043 including background noise analysis 0091)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Wang to incorporate the above claim limitations as taught by MENG, allowing for simple substitution of one known element for another to obtain predictable results, such as substituting MENG's teacher model for the generative model in Wang, to improve Target Speaker Extraction (TSE) and speaker identification by enabling a smaller, specialized "student" model to learn to isolate a target voice from complex, noisy, multi-speaker mixtures, which improves accuracy in identifying a specific speaker even when that speaker's characteristics are similar to others in the group.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20230065468 A1 LU; Yunzhao et al. (classifying speakers)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL COLUCCI whose telephone number is (571)270-1847. The examiner can normally be reached on M-F 9 AM - 7 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHAEL COLUCCI/Primary Examiner, Art Unit 2655 (571)-270-1847
Examiner FAX: (571)-270-2847
Michael.Colucci@uspto.gov