DETAILED ACTION
1. This action is responsive to Application No. 18/819,226, filed 8/29/2024. All claims have been examined and are currently pending.
Notice of Pre-AIA or AIA Status
2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
3. The information disclosure statement (IDS) submitted is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 103
4. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
5. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
6. Claims 1-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US 2022/0309344) in view of Ahn et al. (US 2021/0074270).
Regarding claim 1, Kim teaches A method for keyword spotting in a digital audio stream (0003: KWS; [0021] In some aspects, broadcasted residual learning is used to process and classify audio data and features (e.g., to perform KWS)), the method comprising:
processing the digital audio stream to extract a feature matrix (25: input tensor; audio data or audio features; two-dimensional, with frequency dimension and a temporal dimension);
applying a set of one-dimensional temporal convolutions to the feature matrix to obtain a first convolved feature matrix (31: feature maps are processed using a…convolution operation; depthwise-separable; 32: temporal depthwise convolution…for feature maps; 33: one-dimensional feature maps – applying temporal convolution (of given dimension) to extract temporal features);
transposing time and frequency dimensions of the feature matrix to obtain a transposed matrix (28-29 feature maps…dimension reduction; 33: converts the one-dimensional feature maps to multi-dimensional feature maps – adjusting dimensions of feature maps);
applying a set of one-dimensional frequency convolutions to the transposed matrix to obtain a second convolved feature matrix (27 convolution operation…depthwise; frequency depthwise convolution; 28: one-dimensional feature maps – applying frequency convolution to extract frequency features);
identifying a presence of a keyword based on further processing of a combination of the first and second convolved feature matrices (0003: KWS; [0021] In some aspects, broadcasted residual learning is used to process and classify audio data and features (e.g., to perform KWS); 52 – utilizing broadcasted residual learning (the application of multiple convolution operations for different domains) to perform KWS); and
Kim does not specifically teach, but Ahn teaches, performing a function in response to the presence of the keyword
(158; 159 real-time KWS on mobile devices; [0161] Keyword spotting (KWS) aims to detect pre-defined keywords in a stream of audio signals. It is widely used for hands-free control of mobile applications. Since its use is commonly concentrated on recognizing wake-up words (e.g., “Hey Siri” [1], “Alexa” [2, 3], and “Okay Google” [4]) or distinguishing common commands (e.g., “yes” or “no”) on mobile devices).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate performing the function for an improved system, allowing the KWS to be utilized in a specific manner in operating a device, while presenting a reasonable expectation of success. Kim already teaches KWS for use in mobile devices, and one could look to Ahn to further allow the KWS to perform a specific function on the mobile device once a keyword is recognized.
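For illustration of the claimed processing flow mapped above (temporal convolutions, transposition of the feature matrix, frequency convolutions, and combination), a minimal sketch follows. This is an editor-added illustration only, not Kim's or the applicant's actual implementation; the names `conv1d_rows` and `keyword_score` are hypothetical, and a real system would use learned kernels and a trained classifier rather than the toy reduction shown.

```python
import numpy as np

def conv1d_rows(x, kernel):
    """Apply a one-dimensional convolution along each row of a 2-D matrix."""
    return np.stack([np.convolve(row, kernel, mode="same") for row in x])

def keyword_score(features, temporal_kernel, frequency_kernel):
    """Sketch of the claimed flow; features is a (frequency x time) matrix."""
    # 1-D temporal convolutions on the feature matrix -> first convolved matrix
    first = conv1d_rows(features, temporal_kernel)
    # Transpose the time and frequency dimensions of the feature matrix
    transposed = features.T
    # 1-D frequency convolutions on the transposed matrix -> second convolved matrix
    second = conv1d_rows(transposed, frequency_kernel)
    # Combine the first and second convolved matrices and reduce to a score
    combined = first + second.T
    return float(combined.mean())
```

A detection decision would then compare such a (learned) score against a threshold, and the final claimed step, performing a function in response to the keyword, would be triggered by that decision.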
Regarding claim 2, Kim teaches The method of claim 1, wherein the feature matrix is square, with a number of time slots equal to a number of frequency channels (25; 26 the feature maps 115 have dimensionality H×W×c, where H and W are spatial dimensions (e.g., a temporal dimension and a frequency dimension, respectively) and c is the number of channels – representing a variety of numerical options for the domain dimensions, including identical dimensions, which result in a square matrix).
Regarding claim 3, Kim teaches The method of claim 1, further comprising concatenating frequency filters obtained using the first convolved feature matrix with temporal filters obtained using the second convolved matrix (fig 1; 25: broadcasted residual learning incorporating frequency (27) and temporal (32) convolution, with 32 resulting in feature maps 135).
Regarding claim 4, Kim teaches The method of claim 3, wherein identifying the presence of the keyword includes:
implementing frequency and temporal separable convolutions using depthwise separable convolutions on the concatenating frequency and temporal filters, respectively (fig 1; [0021] In some aspects, broadcasted residual learning is used to process and classify audio data and features (e.g., to perform KWS). Generally, the audio data and features can be represented using two-dimensional tensors (e.g., with a frequency dimension and a temporal dimension); 27 frequency depthwise; 31: depthwise-separable convolution; 32 temporal depthwise).
Regarding claim 5, Kim teaches The method of claim 4, wherein the depthwise separable convolutions are part of a deep residual network architecture comprising a plurality of residual blocks (31: depthwise-separable convolution; 34 the residual connection 150 reflects the residual nature of broadcasted residual learning; 36 residual block).
Regarding claim 6, Kim teaches The method of claim 5, wherein the plurality of residual blocks employ Swish activation functions positioned between depthwise separable convolution layers (36 residual block; 48: swish activation).
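As background on the claim 5-6 limitations: a depthwise separable convolution applies one kernel per channel (depthwise stage) followed by a 1x1 pointwise mixing of channels, and the Swish activation is x·sigmoid(x). The following is an editor-added illustrative sketch under those conventional definitions; the function names are hypothetical and this is not Kim's implementation.

```python
import numpy as np

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def depthwise_separable_conv1d(x, depthwise_kernels, pointwise_weights):
    """x: (channels, length); one 1-D kernel per channel, then a pointwise mix."""
    # Depthwise stage: each channel is convolved with its own kernel
    depthwise = np.stack([
        np.convolve(x[c], depthwise_kernels[c], mode="same")
        for c in range(x.shape[0])
    ])
    # Pointwise (1x1) stage: mix channels via matrix multiplication
    return pointwise_weights @ depthwise
```

In a residual block per claim 5, the output would be summed with the block's input; per claim 6, `swish` would be applied between the depthwise and pointwise layers.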
Regarding claim 7, Kim teaches The method of claim 1, wherein identifying the presence of the keyword further includes performing an average pooling operation (29 average pooling; 40; 45).
Regarding claim 8, Kim does not specifically teach, but Ahn teaches, The method of claim 1, wherein identifying the presence of the keyword further includes performing a classification using a fully connected layer followed by a softmax activation function ([0082] In step S150, a voice keyword is extracted by applying the output feature map to a trained machine learning model. The machine learning model may include a pooling layer, a full-connect layer, a softmax operation; 88-89; 145).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Ahn's fully connected layer and softmax for an improved system, allowing for proper network output and classification.
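For context on the claim 8 limitation, a fully connected layer computes weighted sums of its input features, and softmax normalizes the resulting logits into class probabilities. A minimal editor-added sketch (illustrative only; `classify` is a hypothetical name, not Ahn's implementation):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, biases):
    """Fully connected layer followed by softmax; returns class probabilities."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)
```

The keyword class with the highest probability would then determine whether the keyword is deemed present.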
Regarding claim 9, Kim teaches The method of claim 1, wherein the method is executed on a mobile device (3: mobile phones).
Regarding claim 10, Kim and Ahn teach An apparatus for keyword spotting in a digital audio stream (Kim fig 6; para 75-76), the apparatus comprising:
a memory storing instructions (Kim fig 6; 77); and
one or more processors configured to execute the instructions (Kim fig 6; para 77) to:
process the digital audio stream to extract a feature matrix;
apply a set of one-dimensional temporal convolutions to the feature matrix to obtain a first convolved feature matrix;
transpose time and frequency dimensions of the feature matrix to obtain a transposed matrix;
apply a set of one-dimensional frequency convolutions to the transposed matrix to obtain a second convolved feature matrix;
identify a presence of a keyword based on further processing of a combination of the first and second convolved feature matrices; and
perform a function in response to the presence of the keyword.
Claim 10 recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.
Claims 11-17 recite limitations similar to claims 2-8 and are rejected for similar rationale and reasoning.
Regarding claim 19, Kim teaches A method for enhancing keyword detection in a digital audio stream (3; 21), comprising:
executing a transformation of the digital audio stream into a feature matrix of Mel-frequency cepstral coefficients (MFCC) (25 tensor, audio data, mel spectrogram, audio features);
conducting one-dimensional depthwise separable convolutions on the feature matrix along temporal and frequency dimensions to obtain a convolved feature matrix (25; 27; 31-32 – temporal and frequency dimension; depthwise-separable convolution; frequency and temporal depthwise convolution);
integrating the convolved feature matrix using a deep learning model with Swish activation functions to output a keyword detection result (21 KWS; 34 broadcasted residual learning; 48 Swish activation); and
Kim does not specifically teach, but Ahn teaches, performing a function in response to the keyword detection result.
This limitation is rejected for similar rationale and reasoning as claim 1.
Kim does not specifically teach, but Ahn teaches, MFCC (190 MFCC).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate MFCC for a more compact and informative representation of the digital audio stream for better processing and overall KWS.
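As background on the MFCC limitation, MFCCs are conventionally obtained by taking a discrete cosine transform (DCT-II) of log mel-filterbank energies, which compacts the spectral envelope into a few coefficients. The sketch below is an editor-added illustration under that conventional definition; `mfcc_from_log_mel` is a hypothetical helper, not Ahn's implementation.

```python
import math

def mfcc_from_log_mel(log_mel_energies, num_coeffs):
    """DCT-II of log mel-filterbank energies yields MFCCs (unnormalized)."""
    n = len(log_mel_energies)
    return [
        # Coefficient k is the projection onto the k-th cosine basis function
        sum(e * math.cos(math.pi * k * (m + 0.5) / n)
            for m, e in enumerate(log_mel_energies))
        for k in range(num_coeffs)
    ]
```

Keeping only the first `num_coeffs` coefficients discards fine spectral detail, which is what makes the representation compact.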
Regarding claim 20, Kim teaches The method of claim 19, wherein the feature matrix is transposed to align the frequency dimensions with the temporal dimensions and the deep learning model comprises a set of concatenated residual blocks, each block configured to enhance feature discrimination for keyword detection (29; 33; 34; 36 – broadcasted residual learning with frequency and temporal operations, dimension adjustments, and residual blocks).
7. Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US 2022/0309344) in view of Ahn et al. (US 2021/0074270), in further view of Timofejevs (US 2021/0406662).
Regarding claim 18, Kim and Ahn do not specifically teach, but Timofejevs teaches, The apparatus of claim 10, wherein the instructions further cause the one or more processors to perform noise reduction on the digital audio stream before extracting the feature matrix (132 noise reduction; 561 noise suppression).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate noise reduction for an improved system, allowing for a cleaner signal for proper and more effective audio processing and KWS.
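As context for the noise-reduction limitation, one conventional pre-processing technique is spectral subtraction: an estimated noise floor is subtracted from the spectral magnitudes before feature extraction. The sketch below is an editor-added illustration of that general technique only; it is not Timofejevs's actual method, and `spectral_subtract` is a hypothetical name.

```python
def spectral_subtract(magnitudes, noise_estimate):
    """Subtract an estimated noise floor from spectral magnitudes, clamped at zero."""
    return [max(m - n, 0.0) for m, n in zip(magnitudes, noise_estimate)]
```

The cleaned magnitudes would then feed the feature-matrix extraction of claim 10.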
Conclusion
8. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: See PTO-892.
Drossos, Konstantinos, et al. "Sound event detection with depthwise separable and dilated convolutions." 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020.
Chollet, François. "Xception: Deep learning with depthwise separable convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541. The examiner can normally be reached Monday-Friday 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached on 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SHAUN ROBERTS/Primary Examiner, Art Unit 2655