DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
This application includes one or more claim limitations that do not use the word “means” or “step” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are:
"A first module" in Claims 1-3, 11, 13-15, and 20, as shown in Fig. 3:310 and ¶0016, ¶0078, ¶0080-0086.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Allowable Subject Matter
Claims 9-12 and 19-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-8 and 13-18 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (US 2023/0097197) in view of Mansour (US Patent No. 11,380,312).
Regarding Claim 1, Huang discloses a method of operating a cascade audio spotting system (title, abstract, Figs. 1-5), comprising:
receiving, by a first module of the cascade audio spotting system, an audio stream from one or more audio streams (Huang ¶0033 discloses the first stage hotword detector 210 detects the hotword in the streaming multi-channel audio 118; Figs. 1-2B);
processing, by the first module, the audio stream to detect a first target sound activity in the audio stream (Huang ¶0033 discloses the DSP 110 provides chomped multi-channel raw audio data 212, 212a-n to the AP 120. Optionally, the DSP 110 may provide another signal or instruction that triggers/invokes the AP 120 to transition from the sleep mode to the hotword detection mode);
providing a first signal by the first module in response to detecting the first target sound activity in the audio stream (Huang Figs. 1-2C: 212, 212a-212n);
in response to the first signal being provided by the first module:
receiving the one or more audio streams by a high-power subsystem (Huang Figs. 1-2C: application processor 120); and
processing the one or more audio streams by the high-power subsystem to detect a second target sound activity in the one or more audio streams (Huang Figs. 1-2C: 2nd stage hotword detector 220, 220a, 220b), wherein processing the one or more audio streams by the high-power subsystem includes:
wherein performing MCNR (Huang Fig. 4: “processing, by the second processor, using a first noise cleaning algorithm, each channel of the chomped multi-channel raw audio data to generate a clean monophonic audio chomp” 408) on the one or more echo canceled audio streams includes:
estimating a first direction of a first portion of sound activity with reference to the cascade audio spotting system (Huang ¶0022 discloses a voice-enabled device [e.g., a user device executing a voice assistant] allows a user to speak a query or a command out loud and field and answer the query and/or perform a function based on the command. Through the use of a "hotword" [also referred to as a "keyword", "attention word", "wake-up phrase/word", "trigger phrase", "invocation phrase", or "voice action initiation command"], in which by agreement a predetermined term/phrase that is spoken to invoke attention for the voice enabled device is reserved, the voice enabled device is able to discern between utterances directed to the system [i.e., to initiate a wake-up process for processing one or more terms following the hotword in the utterance] and utterances directed to an individual in the environment);
generating a first MCNR output for the first portion of sound activity based on the first direction (Huang ¶0013 discloses the first noise cleaning algorithm can apply a first finite impulse response [FIR] including a first filter length on each channel of the chomped multi-channel raw audio data to generate the chomped monophonic clean audio data);
estimating a second direction of a second portion of sound activity with reference to the cascade audio spotting system (Huang ¶0022 discloses a voice-enabled device [e.g., a user device executing a voice assistant] allows a user to speak a query or a command out loud and field and answer the query and/or perform a function based on the command. Through the use of a "hotword" [also referred to as a "keyword", "attention word", "wake-up phrase/word", "trigger phrase", "invocation phrase", or "voice action initiation command"], in which by agreement a predetermined term/phrase that is spoken to invoke attention for the voice enabled device is reserved, the voice enabled device is able to discern between utterances directed to the system [i.e., to initiate a wake-up process for processing one or more terms following the hotword in the utterance] and utterances directed to an individual in the environment); and
generating a second MCNR output for the second portion of sound activity based on the second direction (Huang ¶0013 discloses the first noise cleaning algorithm can apply a second FIR including a second filter length on each channel of the streaming multi-channel audio to generate the monophonic clean audio stream); and
detecting whether the second target sound activity is included in the one or more MCNR outputs (Huang ¶0028 discloses the first and second processors 110, 120 provide the cascade hotword detection architecture 200 in which a first stage hotword detector 210 runs on the first processor 110 and a second stage hotword detector 220 runs on the second processor 120 for cooperatively detecting the presence of a hotword in streaming multi-channel audio 118 in a manner that optimizes power consumption, latency, and noise robustness. ¶0033 discloses the cleaner 250 employs a noise cleaning algorithm to provide adaptive noise cancellation to the multi-channel noisy audio; Figs. 1-3).
Huang may not explicitly disclose wherein processing the one or more audio streams by the high-power subsystem includes: performing echo cancellation on the one or more audio streams based on a reference signal to generate one or more echo canceled audio streams; performing multiple channel noise reduction (MCNR) on the one or more echo canceled audio streams to generate one or more MCNR outputs.
However, Mansour (title, abstract, Figs. 1-12) teaches wherein processing the one or more audio streams by the high-power subsystem (Mansour Fig. 1: near-end speech s(t); microphone audio data Xm(t); AEC 120; reference generator 130 [i.e., high-power system]) includes:
performing echo cancellation on the one or more audio streams based on a reference signal to generate one or more echo canceled audio streams (Mansour Fig. 1: AEC 120; perform echo cancellation 148; Col. 3, line 65 to Col. 4, line 3 discloses the device 110 can perform acoustic echo cancellation [AEC], adaptive interference cancellation [AIC], residual echo suppression [RES], and other audio processing to isolate local speech captured by the microphone(s) 112 and to suppress unwanted audio data [e.g., echoes and noise]);
performing multiple channel noise reduction (MCNR) on the one or more echo canceled audio streams to generate one or more MCNR outputs (Mansour Col. 8, lines 4-9 discloses as the input audio data is not limited to the echo signal, the ARA [Adaptive Reference Algorithm] processing can remove other acoustic noise represented in the input audio data in addition to removing the echo. Therefore, the ARA processing can be referred to as performing AIC, adaptive noise cancellation [ANC], AEC, and the like).
Huang and Mansour are analogous art as they both pertain to cascade keyword spotting. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the keyword spotting device (as taught by Huang) to perform AIC using the ARA processing to isolate the speech in the input audio data, since the target signal(s) and/or the reference signal(s) may be continually changing over time based on speech, acoustic noise(s), ambient noise(s), and/or the like in an environment around the device (as taught by Mansour, Col. 8, lines 11-18), in order to improve wakeword detection by selectively rectifying a portion of an audio signal based on energy statistics corresponding to a keyword (Mansour, Col. 2, lines 8-11).
Regarding Claim 2, Huang in view of Mansour discloses the method of claim 1, further comprising
switching the high-power subsystem from a low power mode to an active mode in response to the first signal being provided by the first module (Huang ¶0026 discloses the hotword detectors employed by battery-powered user devices must implement hotword detection algorithms that not only detect hotwords with a degree of accuracy, but must also achieve conflicting objectives of low latency, small memory footprint, and light computational load. To obtain these objectives, the user devices can employ a cascade hotword detection architecture that includes two hotword detectors: a first-stage hotword detector and a second stage hotword detector. Here, the first-stage hotword detector resides on a specialized DSP [e.g., a first processor], includes a small model size, and is computationally efficient for coarsely screening an input audio stream for hotword candidates. Detection of a hotword candidate in the input audio stream by the first-stage hotword detector triggers the DSP to pass/provide a small buffer/chomp of audio data of a duration suitable for safely containing the hotword to the second-stage hotword detector residing/executing on the device SoC. Then, the second-stage hotword detector on the device SoC [e.g., the main AP] includes a larger model size and provides more computational output than the first-stage hotword detector for providing a more accurate detection of the hotword, and thus, serve as the final arbitrator for deciding if the input audio stream does in fact include the hotword. This cascade architecture allows the more power consuming device SoC to operate in a sleep mode to reserve battery-life until the first-stage hotword detector running/executing on the DSP detects a candidate hotword in streaming input audio. Only once the candidate hotword is detected, does the DSP trigger the device SoC to transition from the sleep mode and into a hotword detection mode for running the second-stage hotword detector. 
¶0028 discloses the first processor 110 consumes less power while operating than the second processor 120 consumes while operating. The first and second processors 110, 120 provide the cascade hotword detection architecture 200 in which a first stage hotword detector 210 runs on the first processor 110 and a second stage hotword detector 220 runs on the second processor 120 for cooperatively detecting the presence of a hotword in streaming multi-channel audio 118 in a manner that optimizes power consumption, latency, and noise robustness. ¶0051 discloses the FAR [False Accept Rate, e.g., falsely detecting a hotword that is not present] of the first stage hotword detector 210 is set to a reasonable value such that the second stage hotword detector 220 will not be frequently triggered to mitigate power consumption by the AP 120).
Regarding Claim 3, Huang in view of Mansour discloses the method of claim 1, wherein the first module includes one of:
an analog voice activity detector (VAD), wherein the audio stream includes an analog audio stream; a digital VAD, wherein the audio stream includes a stream of digital audio frames converted from the analog audio stream; or a low-power trigger, wherein the audio stream includes the stream of digital audio frames converted from the analog audio stream (Huang ¶0026 discloses the hotword detectors employed by battery-powered user devices must implement hotword detection algorithms that not only detect hotwords with a degree of accuracy, but must also achieve conflicting objectives of low latency, small memory footprint, and light computational load. To obtain these objectives, the user devices can employ a cascade hotword detection architecture that includes two hotword detectors: a first-stage hotword detector and a second stage hotword detector. Here, the first-stage hotword detector resides on a specialized DSP [e.g., a first processor], includes a small model size, and is computationally efficient for coarsely screening an input audio stream for hotword candidates. Detection of a hotword candidate in the input audio stream by the first-stage hotword detector triggers the DSP to pass/provide a small buffer/chomp of audio data of a duration suitable for safely containing the hotword to the second-stage hotword detector residing/executing on the device SoC. Then, the second-stage hotword detector on the device SoC [e.g., the main AP] includes a larger model size and provides more computational output than the first-stage hotword detector for providing a more accurate detection of the hotword, and thus, serve as the final arbitrator for deciding if the input audio stream does in fact include the hotword. This cascade architecture allows the more power consuming device SoC to operate in a sleep mode to reserve battery-life until the first-stage hotword detector running/executing on the DSP detects a candidate hotword in streaming input audio. 
Only once the candidate hotword is detected, does the DSP trigger the device SoC to transition from the sleep mode and into a hotword detection mode for running the second-stage hotword detector).
Regarding Claim 4, Huang in view of Mansour discloses the method of claim 3,
wherein the low-power trigger includes a first set of one or more detection models to identify the first target sound activity in the audio stream (Huang ¶0033 discloses when the first stage hotword detector 210 detects the hotword in the streaming multi-channel audio 118, the DSP 110 provides chomped multi-channel raw audio data 212, 212a-n to the AP 120. Alternatively, the DSP 110 providing the chomped multi-channel raw audio data 212 to the AP 120 triggers/invokes the AP 120 to transition from the sleep mode to the hotword detection mode. Optionally, the DSP 110 can provide another signal or instruction that triggers/invokes the AP 120 to transition from the sleep mode to the hotword detection mode), wherein:
the first set of one or more detection models is associated with a first set of one or more hyperparameters for the low-power trigger (Huang ¶0033 discloses each channel of the chomped multi-channel raw audio data 212a-n corresponds to a respective channel 119a-n of the streaming multi-channel audio 118 and includes raw audio data chomped from the respective audio features of the respective channel 119 of the streaming multi-channel audio 118. Alternatively, each channel of the chomped multichannel raw audio data 212 includes an audio segment characterizing the hotword detected by the first stage hotword detector 210 in the streaming multi-channel audio 118. That is, the audio segment associated with each channel of the chomped multi-channel raw audio data 212 includes a duration sufficient to safely contain the detected hotword. Additionally, each channel of the chomped multi-channel raw audio data 212 includes a prefix segment 214 containing a duration of audio immediately preceding the point in time from when the first stage hotword detector 210 detects the hotword in the streaming multi-channel audio 118. A portion of the each channel for the chomped multi-channel raw audio data 212 may also include a suffix segment containing a duration of audio subsequent to the audio segment 213 containing the detected hotword.); and
the first target sound activity includes one or more spoken keywords in the audio stream (Huang ¶0041 discloses when the first stage hotword detector 210 detects the hotword in the streaming multi-channel audio 118 [e.g., in the first channel 119a], the DSP 110 triggers/fires the audio chomper 215 to generate and provide chomped multichannel raw audio data 212, 212a-b to the AP 120. ¶0043 discloses the AP 120 can process the clean monophonic audio chomp 260 at a first branch 220a of the second stage hotword detector 220 in parallel with processing the respective raw audio data 212a of the one channel of the chomped multi-channel raw audio data 212 at the second branch 220b of the second stage hotword detector 220. In the example shown, when a logical OR 270 operation indicates that the hotword is detected by the second stage hotword detector 220 in either one of the clean monophonic audio chomp 260 [e.g., at the first branch 220a] or the respective raw audio data 212a [e.g. at the second branch 220b], the AP 120 initiates a wake-up process on the user device 102 for processing the hotword and/or one or more other terms following the hotword in the streaming multi-channel audio 118; Figs. 2A-2C).
Regarding Claim 5, Huang in view of Mansour discloses the method of claim 4,
wherein the high-power subsystem includes a high-power trigger to detect a second target sound activity in the one or more audio streams (Huang Fig. 2C: 250b; ¶0052), wherein:
the high-power trigger includes a second set of one or more detection models to identify the second target sound activity (Huang ¶0052 discloses Fig. 2C includes the DSP 110 employing a first stage cleaner 250b [e.g., cleaner-lite] that executes a second noise cleaning algorithm to processes the respective audio features of each channel 119 of the streaming multi-channel audio 118 and generate a monophonic clean audio stream 255 prior to executing/running the first stage hotword detector 210 to determine whether the candidate hotword is detected in the monophonic clean audio stream 255);
the second set of one or more detection models is associated with a second set of one or more hyperparameters for the high-power trigger (Huang ¶0049 discloses detection of the hotword by the first stage hotword detector 210 also causes the DSP 110 to instruct the cleaner frontend 252 to provide the multi-channel cross-correlation matrix 254 to the cleaner engine 250a of the AP 120. Here, the cleaner engine 250a uses the multi-channel cross-correlation matrix 254 to compute cleaner filter coefficients 342 for the first noise cleaning algorithm; see Figs. 2C and 3: 252); and
the second target sound activity is the same as the first target sound activity (Huang ¶0051 discloses a hotword can be identified by any of the cascade hotword detection architectures 200 only when both the first stage hotword detector 210 and the second stage hotword detector 220 detect the hotword. Consequently, an overall FAR [False Accept Rate] of the cascade hotword detection architectures 200a, 200b is lower than either of the FARs of the first stage hotword detector 210 and the second stage hotword detector 220. Additionally, the overall FRR [False Reject Rate] of the cascade hotword detection architectures 200a, 200b is higher than either of the FRRs of the first stage hotword detector 210 and FRR of the second stage hotword detector 220. For example, when keeping the FRR of the first stage hotword detector 210 low, the overall FRR will be about the same as the FRR of the second stage hotword detector 220).
Regarding Claim 6, Huang in view of Mansour discloses the method of claim 5, wherein:
the second set of one or more detection models for the high-power trigger includes the first set of one or more detection models (Huang Fig. 2C: 250b; ¶0052); and
the set of one or more hyperparameters associated with the first set of one or more detection models for the high-power trigger differs from the first set of one or more hyperparameters (Huang ¶0053 discloses while a filter model for the first and second noise cleaning algorithms may be the same, or in the alternative, substantially similar, the second noise cleaning algorithm executing on the cleaner-lite 250b at the DSP 110 may include a shorter length [e.g., less filtering parameters] than the first noise cleaning algorithm executing on the cleaner engine 250a at the AP 120 since the DSP 110 includes a lower computational power than the computational power of the AP 120. For example, the first noise cleaning algorithm may apply a first finite impulse response [FIR] on each channel of the chomped multi-channel raw audio data 212 to generate the clean monophonic audio chomp 260, while the second noise cleaning algorithm may apply a second FIR on each channel 119 of the streaming multi-channel audio 118 to generate the monophonic clean audio stream 255. In this example, the first FIR at the cleaner engine 250a may include a first filter length and the second FIR at the cleaner-lite 250b may include a second filter length that is less than the first filter length. Accordingly, the cleaner-lite 250b employed by the DSP 110 sacrifices some performance [e.g., signal-to-noise ratio (SNR) performance] compared to the cleaner engine 250a employed by the AP 120, but still provides adequate noise robustness to improve the accuracy of the first stage hotword detector 210).
Regarding Claim 7, Huang in view of Mansour discloses the method of claim 5,
wherein the first set of one or more detection models and the second set of one or more detection models are stored in a shared memory for the low-power trigger and the high-power trigger (Huang ¶0047 discloses the matrix buffer 305 is in communication with the DSP 110 and may reside on the memory hardware 105 [Fig. 1] of the user device 102. Upon the first stage hotword detector 210 detecting the hotword to trigger/invoke the AP 120 to transition from the sleep mode to the hotword detection mode, the DSP 110 may pass the multichannel cross-correlation matrix 254 stored in the buffer to the cleaner engine 250a at the AP 120; see Figs. 1-3).
Regarding Claim 8, Huang in view of Mansour discloses the method of claim 1. But Huang may not explicitly disclose further comprising: receiving, by the high-power subsystem, the reference signal, wherein the reference signal is associated with the one or more audio streams and processing the one or more audio streams by the high-power subsystem includes: detecting whether the second target sound activity is included in the reference signal; and preventing detecting the second target sound activity in the one or more audio streams in response to detecting the second target sound activity in the reference signal.
However, Mansour (title, abstract, Figs. 1-12) teaches receiving, by the high-power subsystem, the reference signal, wherein the reference signal is associated with the one or more audio streams and processing the one or more audio streams by the high-power subsystem (Mansour Col. 16, lines 36-39 discloses for example, the system 100 can estimate the ERLE value using the microphone signal [e.g., microphone audio data 540] and the reference signal [e.g., reference audio data 510]; Figs. 1 and 5) includes:
detecting whether the second target sound activity is included in the reference signal (Mansour Col. 16, lines 14-27 discloses the device 110 can determine the ERLE value 620 for each microphone and each frequency band over a long-time window. For example, the device 110 can determine a ratio between the energy of the echo signal y(t) [e.g., estimated echo audio data 530] and the residual energy after the MCAEC component 520 [e.g., residual audio data 560]; equation 2); and
preventing detecting the second target sound activity in the one or more audio streams in response to detecting the second target sound activity in the reference signal (Mansour Col. 16, lines 27-30 discloses note that the existence of speech is assumed to be a sparse event and therefore its impact on long term ERLE estimate is ignored [i.e., prevented]).
Huang and Mansour are analogous art as they both pertain to cascade keyword spotting. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the keyword spotting device (as taught by Huang) to perform AIC using the ARA processing to isolate the speech in the input audio data, since the target signal(s) and/or the reference signal(s) may be continually changing over time based on speech, acoustic noise(s), ambient noise(s), and/or the like in an environment around the device (as taught by Mansour, Col. 8, lines 11-18), in order to improve wakeword detection by selectively rectifying a portion of an audio signal based on energy statistics corresponding to a keyword (Mansour, Col. 2, lines 8-11).
Claims 13-18 are rejected for the same reasons as set forth in Claims 1-6.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YOGESHKUMAR G PATEL whose telephone number is (571)272-3957. The examiner can normally be reached 7:30 AM-4 PM PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Duc Nguyen can be reached at (571) 272-7503. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/YOGESHKUMAR PATEL/Primary Examiner, Art Unit 2691