Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
This Office action is in response to application 18/676,014, which was filed 05/28/2024. Claims 1-20 are pending in the application and have been considered.
Specification
The abstract of the disclosure is objected to because it is over 150 words. Correction is required. See MPEP § 608.01(b).
Claim Objections
Claim 15 is objected to because of the following informalities: in line 5, “perfume” should be “perform”. Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
Claims 13-15 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
Claim 13 recites the limitation "the ASR module" in lines 2-3. There is insufficient antecedent basis for this limitation in the claim.
Claims 14 and 15 include the indefinite subject matter of claim 13 by virtue of their dependency on it, and do not remedy the indefiniteness. These claims are therefore also rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-3, 6, 16, 17, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lyon et al. (US 20180197533).
Consider claim 1, Lyon discloses a system for an automated voice command processing within a smart home (automatic processing of spoken commands by smart home devices in a house, [0036], [0037]) comprising:
a processor of a voice command processing server node configured to host a machine learning (ML) module and connected to at least one audio capture entity node and to at least one target node over a wireless network connection (voice activated electronic device 190 has CPU 502, Fig. 2A; processes voice commands as server nodes, [0068-0069], Fig. 1, hosts PCEN frontend in voice processing module 538, [0078], which is a machine learning module, [0142-0146], and is connected to microphone in input devices nodes, [0077], Fig. 2A, and control target nodes such as appliances and media systems operated by voice commands, [0045], over wireless network, [0058]); and a
memory on which are stored machine-readable instructions that when executed by the processor (memory 506, [0077], Fig. 2A), cause the processor to:
acquire raw audio data comprising an audio signal from the at least one audio capture entity node (voice activated electronic device receives audio from microphone 516, [0078], e.g. “OK Google, play cat videos on my Living room TV”, [0052]);
normalize the audio signal for volume consistency (normalization module using PCEN normalizes the received audio, [0078], and performs range compression and gain control, [0132], [0138]; this is considered to normalize the audio signal for perceived loudness, i.e. for a type of “volume consistency”);
convert the normalized audio signal into a spectrogram (the frontend stacks the PCEN features horizontally into a spectrogram, [0159]);
extract a set of classifying features from the spectrogram (features of the spectrogram are input to a neural network, i.e. extracted from the spectrogram as a series for the input layer, [0159]);
provide the set of classifying features to the ML module configured to generate a predictive model based on a neural network for producing at least one wake word parameter (these features are input to a CNN for keyword spotting, [0159], which produces a prediction of whether the audio contained “OK Google” based on the input features, [0151]; a probability distribution or score, i.e. a wake word parameter, for detection/not detection is inherent in this CNN architecture);
detect a wake word based on the at least one wake word parameter (when the output of the CNN indicates the wake word is present, e.g. “OK Google”, [0151]); and
switch the voice command processing server node to an active listening mode for processing subsequent user audio commands through the at least one audio capture entity node (upon detecting the wake word, the device is “awakened” by putting it into a state where the device is ready to receive voice requests to the voice assistant service, [0061]).
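By way of illustration only (this sketch is not part of the record and is not Lyon's implementation; all parameter values and the stand-in scoring function are assumptions), the PCEN-normalize, stack-into-spectrogram, and score-against-threshold pipeline mapped above might look as follows:

```python
import numpy as np

def pcen(frames, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Simplified per-channel energy normalization (PCEN).

    frames: (num_frames, num_bands) non-negative filterbank energies.
    A first-order IIR smoother tracks per-band energy; each frame is
    then gain-normalized and range-compressed.
    """
    m = np.zeros_like(frames[0])
    out = np.empty_like(frames)
    for t, e in enumerate(frames):
        m = (1.0 - s) * m + s * e                      # smoothed energy estimate
        out[t] = (e / (eps + m) ** alpha + delta) ** r - delta ** r
    return out

# Toy filterbank energies: 50 frames x 40 bands, with a loud segment.
rng = np.random.default_rng(0)
energies = rng.uniform(0.0, 1.0, size=(50, 40))
energies[20:30] *= 100.0                               # simulated loud burst

# The PCEN features, stacked frame by frame, form the "spectrogram".
spectrogram = pcen(energies)

# Stand-in for the keyword-spotting CNN: any score in [0, 1] compared
# against a confidence threshold to yield a wake word detection verdict.
score = float(spectrogram.mean() / (1.0 + spectrogram.mean()))
wake_word_detected = score > 0.5
```

The stand-in score is purely a placeholder for the CNN output; the point of the sketch is the front-end ordering (normalize, stack, classify, threshold) that the claim mapping relies on.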
Consider claim 16, Lyon discloses a method for an automated voice command processing within a smart home (automatic processing of spoken commands by smart home devices in a house, [0036], [0037]), comprising:
acquiring, by a voice command processing server (VCPS) node, raw audio data comprising an audio signal from the at least one audio capture entity node (voice activated electronic device receives audio from microphone 516, [0078], e.g. “OK Google, play cat videos on my Living room TV”, [0052]);
normalizing, by the VCPS node, the audio signal for volume consistency (normalization module using PCEN normalizes the received audio, [0078], and performs range compression and gain control, [0132], [0138]; this is considered to normalize the audio signal for perceived loudness, i.e. for a type of “volume consistency”);
converting, by the VCPS node, the normalized audio signal into a spectrogram (the frontend stacks the PCEN features horizontally into a spectrogram, [0159]);
extracting, by the VCPS node, a set of classifying features from the spectrogram (features of the spectrogram are input to a neural network, i.e. extracted from the spectrogram as a series for the input layer, [0159]);
providing, by the VCPS node, the set of classifying features to the ML module configured to generate a predictive model based on a neural network for producing at least one wake word parameter (these features are input to a CNN for keyword spotting, [0159], which produces a prediction of whether the audio contained “OK Google” based on the input features, [0151]; a probability distribution or score, i.e. a wake word parameter, for detection/not detection is inherent in this CNN architecture);
detecting, by the VCPS node, a wake word based on the at least one wake word parameter (when the output of the CNN indicates the wake word is present, e.g. “OK Google”, [0151]); and
switching, by the VCPS node, the voice command processing server node to an active listening mode for processing subsequent user audio commands through the at least one audio capture entity node (upon detecting the wake word, the device is “awakened” by putting it into a state where the device is ready to receive voice requests to the voice assistant service, [0061]).
Consider claim 20, Lyon discloses a non-transitory computer-readable medium comprising instructions, that when read by a processor (non-transitory computer-readable medium with instructions executed by a processor, [0078]), cause the processor to perform:
acquiring raw audio data comprising an audio signal from the at least one audio capture entity node (voice activated electronic device receives audio from microphone 516, [0078], e.g. “OK Google, play cat videos on my Living room TV”, [0052]);
normalizing the audio signal for volume consistency (normalization module using PCEN normalizes the received audio, [0078], and performs range compression and gain control, [0132], [0138]; this is considered to normalize the audio signal for perceived loudness, i.e. for a type of “volume consistency”);
converting the normalized audio signal into a spectrogram (the frontend stacks the PCEN features horizontally into a spectrogram, [0159]);
extracting a set of classifying features from the spectrogram (features of the spectrogram are input to a neural network, i.e. extracted from the spectrogram as a series for the input layer, [0159]);
providing the set of classifying features to the ML module configured to generate a predictive model based on a neural network for producing at least one wake word parameter (these features are input to a CNN for keyword spotting, [0159], which produces a prediction of whether the audio contained “OK Google” based on the input features, [0151]; a probability distribution or score, i.e. a wake word parameter, for detection/not detection is inherent in this CNN architecture);
detecting a wake word based on the at least one wake word parameter (when the output of the CNN indicates the wake word is present, e.g. “OK Google”, [0151]); and
switching the voice command processing server node to an active listening mode for processing subsequent user audio commands through the at least one audio capture entity node (upon detecting the wake word, the device is “awakened” by putting it into a state where the device is ready to receive voice requests to the voice assistant service, [0061]).
Consider claim 2, Lyon discloses the machine-readable instructions that when executed by the processor, cause the processor to detect the wake word by applying a confidence threshold to the wake word parameter (a decision boundary confidence threshold is inherent in the trained binary decision classifier CNN, [0151]).
Consider claim 3, Lyon discloses the machine-readable instructions that when executed by the processor, cause the processor to produce a wake word detection verdict responsive to the wake word parameter exceeding the confidence threshold (a decision, i.e. verdict, based on comparing the CNN prediction to a decision boundary confidence threshold is inherent in the trained binary decision classifier CNN, [0151]).
Consider claim 6, Lyon discloses the machine-readable instructions that when executed by the processor, cause the processor to normalize a volume and energy levels of the audio signal by application of Per-Channel Energy Normalization (normalization module using PCEN normalizes the received audio, [0078], and performs range compression and gain control, [0132], [0138]; this is considered to normalize the audio signal for perceived loudness, i.e. for a type of “volume consistency”).
Consider claim 17, Lyon discloses producing a wake word detection verdict responsive to the wake word parameter exceeding a confidence threshold (a decision, i.e. verdict, based on comparing the CNN prediction to a decision boundary confidence threshold is inherent in the trained binary decision classifier CNN, [0151]).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Lyon et al. (US 20180197533) in view of Lakshmikanth et al. (“Noise Cancellation in Speech Signal Processing-A Review”. International Journal of Advanced Research in Computer and Communication Engineering Vol. 3, Issue 1, January 2014).
Consider claim 4, Lyon discloses the machine-readable instructions that when executed by the processor, cause the processor to implement a noise module (noise module 790 for noise mitigation, [0130]).
Lyon does not specifically mention removing background noise by application of Infinite Impulse Response (IIR) filter for white noise and Kalman filter for non-stationary noise.
Lakshmikanth discloses removing background noise by application of Infinite Impulse Response (IIR) filter for white noise and Kalman filter for non-stationary noise (removal of white noise with IIR filters, Sections 1C and 2, pages 5177-5178, and Kalman filters for non-stationary noise removal, Section VI., pages 5181-5182).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon by removing background noise by application of Infinite Impulse Response (IIR) filter for white noise and Kalman filter for non-stationary noise in order to reduce the degradation of speech processing systems due to noise, as suggested by Lakshmikanth (page 5176). Doing so would have led to predictable results of making the system more usable in noisy environments, as suggested by Lakshmikanth (page 5175). The references cited are analogous art in the same field of audio processing.
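For illustration only (not drawn from either reference; filter orders, gains, and noise variances are assumptions), the two denoising techniques combined above might be sketched as a first-order IIR low-pass for broadband (white) noise and a scalar Kalman filter for noise whose statistics vary:

```python
import numpy as np

def iir_lowpass(x, a=0.9):
    """First-order IIR low-pass: attenuates broadband (white) noise."""
    y = np.empty_like(x)
    acc = 0.0
    for i, sample in enumerate(x):
        acc = a * acc + (1.0 - a) * sample
        y[i] = acc
    return y

def kalman_denoise(x, process_var=1e-2, noise_var=0.25):
    """Scalar Kalman filter: tracks a slowly varying signal through
    measurement noise, adapting its gain via the error covariance."""
    est, p = 0.0, 1.0
    out = np.empty_like(x)
    for i, z in enumerate(x):
        p += process_var                 # predict step
        k = p / (p + noise_var)          # Kalman gain
        est += k * (z - est)             # update with measurement z
        p *= (1.0 - k)
        out[i] = est
    return out

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 500)
clean = np.sin(2 * np.pi * 3 * t)
noisy = clean + rng.normal(0, 0.5, size=t.shape)

iir_out = iir_lowpass(noisy)
kf_out = kalman_denoise(noisy)
```

Both filters reduce the residual error relative to the noisy input on this toy signal; a real system would tune the smoothing constant and variances to the noise environment.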
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Lyon et al. (US 20180197533) in view of Wojogbe et al. (US 20190179611).
Consider claim 5, Lyon does not, but Wojogbe discloses executing beamforming processing to focus on an audio signal from a direction of a speaker while ignoring other directions (beamforming to capture sound from directions where voice activity is detected, [0098-0099], Fig. 5A, which implicitly ignores directions where voice activity is not detected).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon by executing beamforming processing to focus on an audio signal from a direction of a speaker while ignoring other directions in order to assist in filtering background noise, as suggested by Wojogbe ([0096]). The references cited are analogous art in the same field of audio processing.
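For illustration only (not Wojogbe's implementation; the two-microphone geometry and integer sample delay are assumptions), the steer-toward-the-speaker behavior mapped above can be sketched as delay-and-sum beamforming:

```python
import numpy as np

def delay_and_sum(mics, delays_samples):
    """Align each microphone channel by its integer steering delay and
    average: signals from the steered direction add coherently, while
    uncorrelated sound from other directions is attenuated."""
    aligned = [np.roll(ch, -d) for ch, d in zip(mics, delays_samples)]
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(2)
n = 1000
speech = np.sin(2 * np.pi * np.arange(n) / 50)

# Two mics: the speaker's signal reaches mic 1 three samples late;
# each mic also picks up independent off-axis noise.
mic0 = speech + rng.normal(0, 0.8, n)
mic1 = np.roll(speech, 3) + rng.normal(0, 0.8, n)

beamformed = delay_and_sum([mic0, mic1], delays_samples=[0, 3])
```

Averaging the aligned channels keeps the speaker's signal at full amplitude while the independent noise on each microphone averages down, which is the "focus on one direction while ignoring others" effect recited in claim 5.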
Claims 7-9 are rejected under 35 U.S.C. 103 as being unpatentable over Lyon et al. (US 20180197533) in view of Sharifi et al. (US 20220180866).
Consider claim 7, Lyon does not, but Sharifi discloses streaming the audio signal from a DSP module to an Automatic Speech Recognition (ASR) module (audio data is streamed from smart speaker to speech recognition system, Fig. 1A, [0020], [0022]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon by streaming the audio signal from a DSP module to an Automatic Speech Recognition (ASR) module in order to reduce processing on a resource constrained device, as suggested by Sharifi ([0003]), predictably reducing expense. The references cited are analogous art in the same field of audio processing.
Consider claim 8, Lyon does not, but Sharifi discloses feeding the set of classifying features into a deep learning model comprising a sequence-to-sequence model to transcribe spoken words into text (sequence-to-sequence speech recognition model that generates a transcription from the features, [0047]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon by feeding the set of classifying features into a deep learning model comprising a sequence-to-sequence model to transcribe spoken words into text for reasons similar to those for claim 7.
Consider claim 9, Lyon does not, but Sharifi discloses balancing latency and accuracy by adjusting a window size of transcription (adjusting the window size for transcription, [0033], [0047]; this is considered to balance latency and accuracy by promptly submitting the audio for transcription without cutting off more audio the user might utter).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon by balancing latency and accuracy by adjusting a window size of transcription for reasons similar to those for claim 7.
Claims 10, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Lyon et al. (US 20180197533) in view of Rand et al. (US 20200150919).
Consider claim 10, Lyon discloses the machine-readable instructions that when executed by the processor, cause the processor to, responsive to the wake word detection, continuously monitor the audio signal (upon detecting the wake word, the device is “awakened” by putting it into a state where the device is ready, i.e. continuously monitoring, to receive voice requests to the voice assistant service, [0061]).
Lyon does not specifically mention converting the audio signal into a format suitable for VAD model.
Rand discloses converting the audio signal into a format suitable for VAD model (MFCCs for Gaussian Mixture model VAD, [0077]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon by converting the audio signal into a format suitable for VAD model in order to enhance accuracy, as suggested by Rand ([0029]), predictably resulting in augmented overall functionality, as suggested by Rand ([0029]). The references cited are analogous art in the same field of audio processing.
Consider claim 11, Lyon does not, but Rand discloses feeding the converted audio signal into the VAD model comprising Gaussian Mixture Model or Silero VAD (MFCCs for Gaussian Mixture model VAD, [0077]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon by feeding the converted audio signal into the VAD model comprising Gaussian Mixture Model or Silero VAD for reasons similar to those for claim 10.
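For illustration only (not Rand's implementation; raw frame energies stand in for the MFCC features Rand uses, and the EM schedule is an assumption), a Gaussian Mixture Model VAD of the kind mapped above can be sketched as a two-component 1-D mixture over log frame energies:

```python
import numpy as np

def gmm_vad(frame_energies, iters=50):
    """Two-component 1-D Gaussian mixture over log frame energies:
    the low-mean component models silence/noise, the high-mean
    component models speech. Returns a boolean speech mask."""
    x = np.log(frame_energies + 1e-9)
    mu = np.array([x.min(), x.max()])            # init means at the extremes
    var = np.array([x.var(), x.var()]) + 1e-6
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each frame
        like = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = like / like.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    speech_comp = int(np.argmax(mu))             # higher-mean component = speech
    return resp[:, speech_comp] > 0.5

rng = np.random.default_rng(3)
silence = rng.uniform(0.01, 0.05, 80)            # low-energy frames
speech = rng.uniform(1.0, 5.0, 40)               # high-energy frames
mask = gmm_vad(np.concatenate([silence, speech]))
```

The mixture's two components separate the low- and high-energy frames, and the per-frame posterior of the high-mean component serves as the voice-activity decision.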
Consider claim 18, Lyon discloses responsive to the wake word detection, continuously monitor the audio signal (upon detecting the wake word, the device is “awakened” by putting it into a state where the device is ready, i.e. continuously monitoring, to receive voice requests to the voice assistant service, [0061]).
Lyon does not specifically mention converting the audio signal into a format suitable for VAD model.
Rand discloses converting the audio signal into a format suitable for VAD model (MFCCs for Gaussian Mixture model VAD, [0077]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon by converting the audio signal into a format suitable for VAD model for reasons similar to those for claim 10.
Claims 12 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Lyon et al. (US 20180197533) in view of Rand et al. (US 20200150919), in further view of Rao (US 20220358913).
Consider claim 12, Lyon does not, but Rao discloses analyzing outputs of the VAD models to detect when the at least one audio capture entity node stops capturing the audio data and, responsive to the detection, stop recording and send the audio data for transcription (when the end of sentence is reached, switching mechanism sends a signal to the ASR to stop recording, [0064], [0065], Fig. 4).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon by analyzing outputs of the VAD models to detect when the at least one audio capture entity node stops capturing the audio data and, responsive to the detection, stop recording and send the audio data for transcription in order to reduce pre-processing and postprocessing overheads, as suggested by Rao ([0005]), predictably reducing delay in speech processing, as suggested by Rao ([0005]). The references cited are analogous art in the same field of audio processing.
Consider claim 19, Lyon does not, but Rao discloses analyzing outputs of the VAD models to detect when the at least one audio capture entity node stops capturing the audio data and, responsive to the detection, stop recording and send the audio data for transcription (when the end of sentence is reached, switching mechanism sends a signal to the ASR to stop recording, [0064], [0065], Fig. 4).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon by analyzing outputs of the VAD models to detect when the at least one audio capture entity node stops capturing the audio data and, responsive to the detection, stop recording and send the audio data for transcription for reasons similar to those for claim 12.
Claims 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over Lyon et al. (US 20180197533) in view of Rand et al. (US 20200150919), in further view of Sundararaman (US 20200050949).
Consider claim 13, Lyon discloses collecting a text output from the ASR module (extracting a user voice command, [0117], by a neural network speech recognizer, [0015]).
Lyon and Rand do not specifically mention performing text processing by tokenization, stemming, and lemmatization.
Sundararaman discloses performing text processing by tokenization, stemming, and lemmatization (tokenizing, stemming, lemmatizing, [0064]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon and Rand by performing text processing by tokenization, stemming, and lemmatization in order to improve flexibility in analyzing large amounts of data, as suggested by Sundararaman ([0017]), predictably improving the overall process of performing data transformations and analysis, as suggested by Sundararaman ([0017]). The references cited are analogous art in the same field of audio processing (Sundararaman discloses receiving the query from the user as audio data, [0061]).
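For illustration only (not Sundararaman's implementation; the regex tokenizer, suffix list, and lemma dictionary are all simplified assumptions, not a real Porter stemmer or WordNet lemmatizer), the three text-processing steps combined above might be sketched as:

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stem(token):
    """Crude suffix-stripping stemmer (illustrative only)."""
    for suffix in ("ing", "ies", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Tiny lemma dictionary standing in for a real lemmatizer.
LEMMAS = {"lights": "light", "turned": "turn", "playing": "play"}

def lemmatize(token):
    return LEMMAS.get(token, token)

command = "Turn off the kitchen lights"
tokens = tokenize(command)
stems = [stem(t) for t in tokens]
lemmas = [lemmatize(t) for t in tokens]
```

The output of these steps (normalized word units) is what downstream intent recognition consumes.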
Consider claim 14, Lyon and Rand do not, but Sundararaman discloses extracting features from the processed text and feeding the features into an intent recognition model configured to classify intent, wherein the intent recognition model comprises any of: a logistic regression model, a support vector machine, and a transformer-based model (intent classification using logistic regression or support vector machine based on text features, [0072-0075]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon and Rand by extracting features from the processed text and feeding the features into an intent recognition model configured to classify intent, wherein the intent recognition model comprises any of: a logistic regression model, a support vector machine, and a transformer-based model for reasons similar to those for claim 13.
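For illustration only (not Sundararaman's implementation; the toy training phrases, bag-of-words features, and gradient-descent schedule are assumptions), a logistic regression intent classifier of the kind mapped above might be sketched as:

```python
import numpy as np

# Tiny labeled set: intent 0 = lights control, intent 1 = media playback.
texts = ["turn on the light", "turn off the light", "dim the light",
         "play some music", "stop the music", "play a song"]
labels = np.array([0, 0, 0, 1, 1, 1])

vocab = sorted({w for t in texts for w in t.split()})

def featurize(text):
    """Bag-of-words count vector over the training vocabulary."""
    return np.array([text.split().count(w) for w in vocab], dtype=float)

X = np.stack([featurize(t) for t in texts])

# Logistic regression fit by plain gradient descent on the log loss.
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(intent = media)
    w -= 0.5 * (X.T @ (p - labels)) / len(labels)
    b -= 0.5 * np.mean(p - labels)

def classify_intent(text):
    p = 1.0 / (1.0 + np.exp(-(featurize(text) @ w + b)))
    return "media" if p > 0.5 else "lights"
```

The classified intent is then what claim 15 maps to a specific action on a target node (e.g. "lights" intent routed to a lighting device command).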
Consider claim 15, Lyon discloses: map an intent to a specific action on a target object associated with the at least one target node (determining the relevance of the command included in the voice input to a particular device, e.g. “stop music” should refer to the device playing music, [0035]); and send a command to the at least one target node to perfume the mapped specific action (sending a command to the device playing music to stop playing music, [0035], [0075]).
Lyon and Rand do not specifically mention an intent classified by the intent recognition model.
Sundararaman discloses an intent classified by the intent recognition model (intent classification using logistic regression or support vector machine based on text features, [0072-0075]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lyon and Rand by including an intent classified by the intent recognition model for reasons similar to those for claim 13.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20220309343 (Elkhatib) discloses a wake word processing system that uses a PCEN -> spectrogram acoustic front end (see [0025]).
US 20240062745 (Smyth) discloses low power detection of wake words.
US 12488789 (Huang) discloses efficient open vocabulary keyword spotting.
US 20230104431 (Smyth) discloses noise robust representations for keyword spotting systems.
US 10360926 (Mortensen) discloses low-complexity voice activity detection.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jesse Pullias whose telephone number is 571/270-5135. The examiner can normally be reached on M-F 8:00 AM - 4:30 PM. The examiner’s fax number is 571/270-6135.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders can be reached on 571/272-7516.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Jesse S Pullias/
Primary Examiner, Art Unit 2655 12/11/25