Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
2. Applicant’s arguments with respect to claims 1, 8 and 15 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
3. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
4. Claims 1-4, 6-11, 13-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (US 2022/0301573) in view of Tripathi (US 2021/0343273).
Regarding Claim 1:
Wang discloses a system for speech separation (Wang: Fig. 1 discloses a system for voice separation) comprising:
data processing hardware (Wang: p[0016] discloses a system implemented on a client computing device that performs speaker-conditioned speech separation using a voice filter model);
memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations (Wang: p[0016] discloses a system implemented on a client computing device including a memory and processor) comprising:
generating a two-dimensional representation of a speech mixture (Wang: p[0051]-p[0053], p[0063] discloses generating frequency representations of audio using FFT or filter bank, both of which are 2D representations (time and frequency) of a source signal);
separating the speech mixture into an initial separation (Wang: Fig. 3, p[0050]-p[0053], p[0066] discloses initial separations done with a frequency representation (generated using automatic speech recognition and is a preliminary separation step) which then goes through power compression and normalization);
supplying the initial separation and speaker representations to a refinement module (Wang: Fig. 3, p[0050]-p[0053], p[0066] discloses supplying an initial representation (normalized frequency representation 306) together with speaker embeddings 218 to a refinement module (voice filter model 112), which refines the initial representation by generating and applying a predicted mask 322 to later be used for completed separation);
refining the initial separation based on the initial separation and the speaker representations (Wang: Fig. 3 the initial separation (normalized representation 306) is supplied to the refinement module (voice filter model) which is then used to produce the predicted mask (a refined representation));
estimating a mask per speaker (Wang: Fig. 3 308 shows the convolving of the initial input and the predicted mask (i.e., the mask is used on the speaker input));
and applying the masks to the two-dimensional representation to create two-dimensional, per-speaker representations (Wang: Fig. 3 308 shows the convolving of the initial input and the predicted mask to produce a revised frequency result (a completely isolated speaker representation)).
Wang does not explicitly disclose:
generating per-frame speaker representations of the initial separation;
comparing the per-frame speaker representations to stored speaker representations in a speaker embedding table and enforcing, during training, that the per-frame speaker representations are consistent with the stored speaker representations;
supplying feature projections of the per frame speaker representations; and
refining the initial separation based on the feature projections of the per frame speaker representation;
However, Tripathi discloses:
generating per-frame speaker representations of the initial separation (Tripathi: ¶[0008] produces speaker-specific representations at each frame, the per-frame speaker representations correspond to the respective masked audio embeddings generated per frame for each speaker);
comparing the per-frame speaker representations to stored speaker representations in a speaker embedding table and enforcing, during training, that the per-frame speaker representations are consistent with the stored speaker representations (Tripathi: ¶[0050]-[0051]: the comparing of the per-frame speaker representations is satisfied by computing speaker embeddings derived from the per-frame masked embeddings and comparing them as part of EmbLoss; the so-called speaker embedding table reads on stored reference speaker embeddings (e.g., speaker embedding vectors used as training targets), and the per-frame speaker representations are enforced to remain consistent with the stored/reference embeddings via the embedding loss);
supplying feature projections of the per frame speaker representations (Tripathi: ¶[0031]-[0033] discloses these embeddings are projected outputs of the encoder network, wherein the encoder output embedding is a feature projection (projection from time-domain or Mel features into an embedding space)); and
refining the initial separation based on the feature projections of the per frame speaker representation (Tripathi: ¶[0033] discloses an initial separation at the feature level: the audio encoder generates encoded embeddings from a monophonic mixture, and these embeddings still represent mixed speakers, i.e., an initial separation. ¶[0008] discloses per-frame encoder embeddings and ¶[0041] discloses speaker embeddings supplied per frame, which are explicitly feature-space projections representing speaker identity at the frame level. The masking model refines the encoded embeddings based on per-frame encoder embeddings and speaker embeddings/condition inputs to generate masked audio embeddings. During training, embedding loss is applied to enforce that masked embeddings correspond to only one speaker).
Wang is directed to speech separation using speaker-conditioned masking and refinement, while Tripathi is directed to end-to-end multi-speaker speech recognition that refines speaker-specific representations at a per-frame level using embedding-based losses. These references are in the same field of endeavor (speech separation) and address the same problem of improving speaker-specific representations in overlapping speech. Tripathi expressly discloses refining speaker-specific representations by enforcing consistency of per-frame speaker embeddings during training, stating that “the embedding loss to each of (i) the respective masked audio embedding generated for the first speaker to enforce that an entirety of the respective masked audio embedding generated for the first speaker corresponds to only audio spoken by the first speaker and (ii) the respective masked audio embedding generated for the second speaker to enforce that an entirety of the respective masked audio embedding generated for the second speaker corresponds to only audio spoken by the second speaker” in ¶[0004].
Regarding Claim 2:
The proposed combination of Wang and Tripathi further discloses the system of Claim 1, wherein generating a two-dimensional representation of the speech mixture includes passing the speech mixture through an encoder (Wang: p[0050] states that the automatic speech recognition (ASR) engine 114 can process audio to determine a frequency representation, and that this engine may use a Fast Fourier Transform (FFT) or a filter bank. Both are forms of encoding, and therefore the ASR engine as a whole is acting as an encoder, as is well understood in this field of endeavor).
Regarding Claim 3:
The proposed combination of Wang and Tripathi further discloses the system of Claim 1, further comprising supplying at least one feature projection to the refinement module for use by the refinement module in refining the initial separation (Wang: p[0063]-p[0065]: frequency representation 302 and speaker embedding 318 are both optionally normalized and/or power compressed before entering the voice filter model; this normalization, especially of the speaker embedding (which is interpreted as the speaker representations), is equivalent to the claimed supplying of at least one feature projection).
Regarding Claim 4:
The proposed combination of Wang and Tripathi further discloses the system of Claim 1, further comprising using a speaker embedding table to refine the speaker representations (Wang: p[0005] discloses pre-generated speaker embeddings. The sequence of audio data can be associated with the pre-generated speaker embedding after verification of the first human speaker. This is voice printing, which satisfies Applicant’s claimed speaker embedding table as described in p[0048] of Applicant’s disclosure, because the pre-enrolled audio is mapped to a speaker embedding, just as in a table).
Regarding Claim 6:
The proposed combination of Wang and Tripathi further discloses the system of Claim 1, further comprising a microphone in communication with the data processing hardware and configured to detect the speech mixture (Wang: p[0002] discloses a microphone of a client device).
Regarding Claim 7:
The proposed combination of Wang and Tripathi further discloses a vehicle incorporating the microphone of Claim 6 (Wang: p[0048] the client device may be an in-vehicle system).
Regarding Claim 8:
Claim 8 has been analyzed with regard to claim 1 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 9:
Claim 9 has been analyzed with regard to claim 2 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 10:
Claim 10 has been analyzed with regard to claim 3 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 11:
Claim 11 has been analyzed with regard to claim 4 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 13:
Claim 13 has been analyzed with regard to claim 6 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 14:
Claim 14 has been analyzed with regard to claim 7 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 15:
Claim 15 has been analyzed with regard to claim 1 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 16:
Claim 16 has been analyzed with regard to claim 2 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 17:
Claim 17 has been analyzed with regard to claim 4 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 19:
Claim 19 has been analyzed with regard to claim 6 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 20:
Claim 20 has been analyzed with regard to claim 7 (see rejection above) and is rejected for the same reasons of obviousness used above.
5. Claims 5, 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Tripathi and further in view of Yang (US 2024/0404509).
Regarding Claim 5:
The proposed combination of Wang and Tripathi discloses the system of Claim 1, but does not explicitly disclose further comprising passing the two-dimensional, per-speaker representations through a decoder to generate per-speaker waveforms.
However, Yang discloses this limitation (Yang: p[0034] discloses that an intermediate representation and an embedding can be input into a decoder to obtain a final representation, which is converted into a waveform by using the decoder to generate speech).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply Yang to the per-speaker frequency representations of Wang and Tripathi in order to generate per-speaker waveforms. Decoding spectrogram or frequency-domain features back into waveforms (e.g., via inverse STFT or Griffin-Lim) was a well-known and predictable step in speech processing at the time of the invention. A person of ordinary skill in the art would recognize that producing per-speaker waveforms enables additional uses such as playback, user experience, real-time responsiveness or downstream processing beyond ASR transcription (see Yang p[0003]), and Wang’s system is completely capable of all these tasks but chooses to stick to ASR as a matter of personal implementation (see the last line of Wang p[0003], where it states that the voice filter model can be used for automatic speech recognition without [emphasis added] reconstructing the audio).
Regarding Claim 12:
Claim 12 has been analyzed with regard to claim 5 (see rejection above) and is rejected for the same reasons of obviousness used above.
Regarding Claim 18:
Claim 18 has been analyzed with regard to claim 5 (see rejection above) and is rejected for the same reasons of obviousness used above.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IAN SCOTT MCLEAN whose telephone number is (703)756-4599. The examiner can normally be reached "Monday - Friday 8:00-5:00 EST, off Every 2nd Friday".
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan, can be reached at (571) 272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/IAN SCOTT MCLEAN/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654