Prosecution Insights
Last updated: April 19, 2026
Application No. 18/819,226

ELECTRONIC DEVICE AND METHOD FOR EFFICIENT KEYWORD SPOTTING

Status: Non-Final OA (§103)
Filed: Aug 29, 2024
Examiner: ROBERTS, SHAUN A
Art Unit: 2655
Tech Center: 2600 (Communications)
Assignee: Samsung Electronics Co., Ltd.
OA Round: 1 (Non-Final)

Grant Probability: 76% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 10m
Grant Probability With Interview: 86%

Examiner Intelligence

Career Allow Rate: 76% (above average; 491 granted / 647 resolved; +13.9% vs TC avg)
Interview Lift: +10.3% on resolved cases with an interview (moderate, roughly +10%)
Typical Timeline: 2y 10m average prosecution; 31 applications currently pending
Career History: 678 total applications across all art units

Statute-Specific Performance

§101: 7.6% (-32.4% vs TC avg)
§103: 49.2% (+9.2% vs TC avg)
§102: 29.5% (-10.5% vs TC avg)
§112: 3.5% (-36.5% vs TC avg)
Deltas are versus the Tech Center average estimate. Based on career data from 647 resolved cases.

Office Action (§103)
DETAILED ACTION

1. This action is responsive to Application No. 18/819,226, filed 8/29/2024. All claims have been examined and are currently pending.

Notice of Pre-AIA or AIA Status

2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

3. The information disclosure statement (IDS) submitted is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 103

4. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

5. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

6. Claims 1-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US 2022/0309344) in view of Ahn et al. (US 2021/0074270).
Regarding claim 1, Kim teaches a method for keyword spotting in a digital audio stream ([0003]: KWS; [0021]: broadcasted residual learning is used to process and classify audio data and features, e.g., to perform KWS), the method comprising:

processing the digital audio stream to extract a feature matrix ([0025]: input tensor; audio data or audio features; two-dimensional, with a frequency dimension and a temporal dimension);

applying a set of one-dimensional temporal convolutions to the feature matrix to obtain a first convolved feature matrix ([0031]: feature maps are processed using a depthwise-separable convolution operation; [0032]: temporal depthwise convolution for feature maps; [0033]: one-dimensional feature maps, applying a temporal convolution of a given dimension to extract temporal features);

transposing time and frequency dimensions of the feature matrix to obtain a transposed matrix ([0028]-[0029]: feature maps, dimension reduction; [0033]: converts the one-dimensional feature maps to multi-dimensional feature maps, adjusting dimensions of the feature maps);

applying a set of one-dimensional frequency convolutions to the transposed matrix to obtain a second convolved feature matrix ([0027]: depthwise convolution operation; frequency depthwise convolution; [0028]: one-dimensional feature maps, applying a frequency convolution to extract frequency features); and

identifying a presence of a keyword based on further processing of a combination of the first and second convolved feature matrices ([0003]: KWS; [0021]; [0052]: utilizing broadcasted residual learning, i.e., the application of multiple convolution operations for different domains, to perform KWS).

Kim does not specifically teach, but Ahn teaches, performing a function in response to the presence of the keyword ([0158]-[0159]: real-time KWS on mobile devices; [0161]: keyword spotting (KWS) aims to detect pre-defined keywords in a stream of audio signals; it is widely used for hands-free control of mobile applications, since its use is commonly concentrated on recognizing wake-up words (e.g., "Hey Siri," "Alexa," and "Okay Google") or distinguishing common commands (e.g., "yes" or "no") on mobile devices).

It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate performing the function for an improved system, allowing the KWS to be utilized in a specific manner in operating a device, while presenting a reasonable expectation of success. Kim already teaches KWS for use on mobile devices, and one could look to Ahn to further allow the KWS to perform a specific function on the mobile device once a keyword is recognized.

Regarding claim 2, Kim teaches the method of claim 1, wherein the feature matrix is square, with a number of time slots equal to a number of frequency channels ([0025]; [0026]: the feature maps 115 have dimensionality H×W×c, where H and W are spatial dimensions (e.g., a temporal dimension and a frequency dimension, respectively) and c is the number of channels, representing a variety of numerical options for the domain dimensions, including identical dimensions resulting in a square matrix).

Regarding claim 3, Kim teaches the method of claim 1, further comprising concatenating frequency filters obtained using the first convolved feature matrix with temporal filters obtained using the second convolved matrix (Fig. 1; [0025]: broadcasted residual learning incorporating frequency ([0027]) and temporal ([0032]) convolutions, resulting in feature maps 135).
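The claim-1 pipeline mapped above (1-D temporal convolutions on the feature matrix, a time/frequency transpose, 1-D frequency convolutions, then a combined detection step) can be sketched in a few lines. This is an illustrative NumPy toy, not Kim's broadcasted-residual implementation: the names `depthwise_conv1d` and `keyword_score`, the smoothing kernels, and the mean-based score are all invented for illustration.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Apply one 1-D kernel along the last axis of each row of x,
    with edge padding so the output keeps the input shape.
    A minimal stand-in for a depthwise temporal/frequency convolution."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)), mode="edge")
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[1]):
        out[:, i] = xp[:, i:i + k] @ kernel
    return out

def keyword_score(features, t_kernel, f_kernel):
    """Claim-1 style pipeline on a (frequency x time) feature matrix:
    temporal convolutions, then a transpose and frequency convolutions,
    then a combination of the two convolved matrices."""
    temporal = depthwise_conv1d(features, t_kernel)      # 1-D temporal convolutions
    frequency = depthwise_conv1d(features.T, f_kernel)   # transpose, then 1-D frequency convolutions
    combined = temporal + frequency.T                    # combine the two convolved feature matrices
    return float(combined.mean())                        # placeholder for the keyword-detection head
```

In a real model the combination would feed residual blocks and a classifier rather than a mean, but the data flow matches the claimed ordering of steps.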
Regarding claim 4, Kim teaches the method of claim 3, wherein identifying the presence of the keyword includes implementing frequency and temporal separable convolutions using depthwise separable convolutions on the concatenated frequency and temporal filters, respectively (Fig. 1; [0021]: broadcasted residual learning is used to process and classify audio data and features, e.g., to perform KWS; generally, the audio data and features can be represented using two-dimensional tensors, e.g., with a frequency dimension and a temporal dimension; [0027]: frequency depthwise; [0031]: depthwise-separable convolution; [0032]: temporal depthwise).

Regarding claim 5, Kim teaches the method of claim 4, wherein the depthwise separable convolutions are part of a deep residual network architecture comprising a plurality of residual blocks ([0031]: depthwise-separable convolution; [0034]: the residual connection 150 reflects the residual nature of broadcasted residual learning; [0036]: residual block).

Regarding claim 6, Kim teaches the method of claim 5, wherein the plurality of residual blocks employ Swish activation functions positioned between depthwise separable convolution layers ([0036]: residual block; [0048]: Swish activation).

Regarding claim 7, Kim teaches the method of claim 1, wherein identifying the presence of the keyword further includes performing an average pooling operation ([0029]: average pooling; [0040]; [0045]).

Regarding claim 8, Kim does not specifically teach, but Ahn teaches, the method of claim 1, wherein identifying the presence of the keyword further includes performing a classification using a fully connected layer followed by a softmax activation function ([0082]: in step S150, a voice keyword is extracted by applying the output feature map to a trained machine learning model; the machine learning model may include a pooling layer, a fully connected layer, and a softmax operation; [0088]-[0089]; [0145]). It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Ahn's fully connected layer and softmax for an improved system allowing for proper network output and classification.

Regarding claim 9, Kim teaches the method of claim 1, wherein the method is executed on a mobile device ([0003]: mobile phones).

Regarding claim 10, Kim and Ahn teach an apparatus for keyword spotting in a digital audio stream (Kim Fig. 6; [0075]-[0076]), the apparatus comprising: a memory storing instructions (Kim Fig. 6; [0077]); and one or more processors configured to execute the instructions (Kim Fig. 6; [0077]) to: process the digital audio stream to extract a feature matrix; apply a set of one-dimensional temporal convolutions to the feature matrix to obtain a first convolved feature matrix; transpose time and frequency dimensions of the feature matrix to obtain a transposed matrix; apply a set of one-dimensional frequency convolutions to the transposed matrix to obtain a second convolved feature matrix; identify a presence of a keyword based on further processing of a combination of the first and second convolved feature matrices; and perform a function in response to the presence of the keyword.
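The network head described for claims 6-8 (Swish activations, average pooling, a fully connected layer, then softmax over keyword classes) is a standard construction and can be sketched as below. This is a minimal NumPy illustration; `classify` and its `weights`/`bias` parameters are hypothetical, not drawn from Kim or Ahn.

```python
import numpy as np

def swish(x):
    """Swish activation: x * sigmoid(x), equivalently x / (1 + e^-x)."""
    return x / (1.0 + np.exp(-x))

def classify(feature_map, weights, bias):
    """Claims 6-8 style head on a (channels x time) feature map:
    Swish activation, average pooling over time, a fully connected
    layer, then a numerically stable softmax over keyword classes."""
    activated = swish(feature_map)          # Swish between conv layers (claim 6)
    pooled = activated.mean(axis=1)         # average pooling (claim 7)
    logits = weights @ pooled + bias        # fully connected layer (claim 8)
    exp = np.exp(logits - logits.max())     # subtract max for numerical stability
    return exp / exp.sum()                  # softmax: class probabilities summing to 1
```

With zero weights the head returns a uniform distribution over classes, which is a convenient sanity check before training.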
Claim 10 recites limitations similar to claim 1 and is rejected for similar rationale and reasoning. Claims 11-17 recite limitations similar to claims 2-8 and are rejected for similar rationale and reasoning.

Regarding claim 19, Kim teaches a method for enhancing keyword detection in a digital audio stream ([0003]; [0021]), comprising: executing a transformation of the digital audio stream into a feature matrix of Mel-frequency cepstral coefficients (MFCC) ([0025]: tensor, audio data, mel spectrogram, audio features); conducting one-dimensional depthwise separable convolutions on the feature matrix along temporal and frequency dimensions to obtain a convolved feature matrix ([0025]; [0027]; [0031]-[0032]: temporal and frequency dimensions; depthwise-separable convolution; frequency and temporal depthwise convolution); and integrating the convolved feature matrix using a deep learning model with Swish activation functions to output a keyword detection result ([0021]: KWS; [0034]: broadcasted residual learning; [0048]: Swish activation). Kim does not specifically teach, but Ahn teaches, performing a function in response to the keyword detection result; this limitation is rejected for similar rationale and reasoning as claim 1. Kim also does not specifically teach, but Ahn teaches, MFCC ([0190]: MFCC). It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate MFCC for a more compact and informative representation of the digital audio stream for better processing and overall KWS.

Regarding claim 20, Kim teaches the method of claim 19, wherein the feature matrix is transposed to align the frequency dimensions with the temporal dimensions and the deep learning model comprises a set of concatenated residual blocks, each block configured to enhance feature discrimination for keyword detection ([0029]; [0033]; [0034]; [0036]: broadcasted residual learning with frequency and temporal operations, dimension adjustments, and residual blocks).

7. Claim 18 is rejected under 35 U.S.C.
103 as being unpatentable over Kim et al. (US 2022/0309344) in view of Ahn et al. (US 2021/0074270), and further in view of Timofejevs (US 2021/0406662).

Regarding claim 18, Kim and Ahn do not specifically teach, but Timofejevs teaches, the apparatus of claim 10, wherein the instructions further cause the one or more processors to perform noise reduction on the digital audio stream before extracting the feature matrix ([0132]: noise reduction; [0561]: noise suppression). It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate noise reduction for an improved system, allowing for a cleaner signal and more effective audio processing and KWS.

Conclusion

8. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure (see PTO-892):

Drossos, Konstantinos, et al. "Sound event detection with depthwise separable and dilated convolutions." 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020.

Chollet, François. "Xception: Deep learning with depthwise separable convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS, whose telephone number is (571) 270-7541. The examiner can normally be reached Monday-Friday, 9-5 EST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/SHAUN ROBERTS/
Primary Examiner, Art Unit 2655
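Claim 18's noise-reduction preprocessing, cited against Timofejevs, can be conveyed with a crude spectral gate: estimate a noise floor from the quietest frame and zero spectral bins near it. This NumPy sketch is an assumption-laden illustration only; the `spectral_gate` name, frame size, and thresholding rule are invented here and are not the reference's method.

```python
import numpy as np

def spectral_gate(signal, frame=64):
    """Toy noise reduction applied before feature extraction: split the
    signal into frames, take each frame's magnitude spectrum, estimate a
    per-bin noise floor from the quietest frame, then zero bins that do
    not clearly exceed that floor before inverting back to audio."""
    n = len(signal) // frame * frame
    frames = signal[:n].reshape(-1, frame)
    spectra = np.fft.rfft(frames, axis=1)
    mags = np.abs(spectra)
    noise = mags[np.argmin(mags.sum(axis=1))]        # quietest frame ~ noise estimate
    gain = (mags > 2.0 * noise[None, :]).astype(float)  # keep bins well above the floor
    cleaned = np.fft.irfft(spectra * gain, n=frame, axis=1)
    return cleaned.reshape(-1)
```

Because the gain is binary (0 or 1), the gated output can never contain more energy than the input, which makes the behavior easy to verify.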

Prosecution Timeline

Aug 29, 2024
Application Filed
Mar 16, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586599: AUDIO SIGNAL PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM WITH MACHINE LEARNING AND FOR MICROPHONE MUTE STATE FEATURES IN A MULTI PERSON VOICE CALL (granted Mar 24, 2026; 2y 5m to grant)
Patent 12586568: SYNTHETICALLY GENERATING INNER SPEECH TRAINING DATA (granted Mar 24, 2026; 2y 5m to grant)
Patent 12573376: Dynamic Language and Command Recognition (granted Mar 10, 2026; 2y 5m to grant)
Patent 12562157: GENERATING TOPIC-SPECIFIC LANGUAGE MODELS (granted Feb 24, 2026; 2y 5m to grant)
Patent 12555562: VOICE SYNTHESIS FROM DIFFUSION GENERATED SPECTROGRAMS FOR ACCESSIBILITY (granted Feb 17, 2026; 2y 5m to grant)

Study what changed to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 76%
With Interview: 86% (+10.3%)
Median Time to Grant: 2y 10m
PTA Risk: Low
Based on 647 resolved cases by this examiner. Grant probability derived from career allow rate.
