DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 03/26/2026 has been entered.
Response to Arguments
1. Regarding the rejection under 35 U.S.C. § 101, Applicant’s arguments, see pgs. 7-12, filed 03/26/2026, have been fully considered and are persuasive. The rejection of claims 1-5, 7-16, 19-20, and 22 has been withdrawn.
2. Regarding the rejection under 35 U.S.C. § 103, Applicant’s arguments filed 03/26/2026 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
3. Claims 1 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Tomar & Myer (US 2022/0343895 A1, hereinafter Tomar) in view of Josh et al. (NPL Low-Power Low-Cost Audio Front-End for Keyword Spotting, hereinafter Josh) and further in view of Elders et al. (US 11,693,622 B1, hereinafter Elders).
Regarding claim 1, Tomar discloses A keyword spotting method based on a neural network (NN) acoustic model (Abstract; para. 0061), comprising following steps of:
registering a keyword for detection by the NN acoustic model (para. 0048 “The keyword spotting functionality 112 allows users to enroll personalized keywords or phrases with only having to provide a small number of samples of the personalized keywords/phrases examples…”; keyword detection performed via Siamese network 502: Fig. 5, output of 504; para. 0061 “Each of the keyword samples 204 is processed by a network vector encoder 502…First, the audio is processed by frequency domain features functionality 504 that generates acoustic features from the input keyword sample. The acoustic features are used as input to neural-network based encoder functionality 510, which outputs a vector encoding 512 of the speech content…”), comprising:
recording, via a microphone, a target keyword spoken by a user (para. 0048 “The keyword spotting functionality 112 allows users to enroll personalized keywords or phrases with only having to provide a small number of samples of the personalized keywords/phrases examples. In order to enroll the keyword, the user may speak their personalized phrase a few times while the device is in enrollment mode. While in the enrollment mode, keyword enrollment functionality 132 may receive an audio signal 134, or possibly a speech signal 136 from the VAD functionality 118, that comprises the keyword. The keyword enrollment functionality 132 may provide enrollment data 138 that is stored and used by the primary keyword detection functionality 122 as well as enrollment data 130 that is stored and used by the prototype vector keyword detection functionality 128…”; Fig. 1A, 110 ‘Microphone’);
generating, for the target keyword and based on the plurality of audio fragments, a plurality of template acoustic model sequences corresponding to the target keyword (Fig. 5, output of 504; para. 0061 “Each of the keyword samples 204 is processed by a network vector encoder 502…First, the audio is processed by frequency domain features functionality 504 that generates acoustic features from the input keyword sample. The acoustic features are used as input to neural-network based encoder functionality 510, which outputs a vector encoding 512 of the speech content…”);
storing…the plurality of template acoustic model sequences to be inputs into the NN acoustic model for the target keyword (sequence input to NN acoustic model (Fig. 5, components 510-514, and Fig. 6); para. 0060 “In the current system, the neural network which is duplicated in the Siamese model functions as a vector encoder, which represents the input features as a vector in a new feature space.”; para. 0061 “First, the audio is processed by frequency domain features functionality 504 that generates acoustic features from the input keyword sample. The acoustic features are used as input to neural-network based encoder functionality 510, which outputs a vector encoding 512 of the speech content.”); and
detecting the target keyword in real-time speech (para. 0057 “In addition to detecting whether the keyword is present, the system can also find the start and stop time of the keyword. This allows the system to accurately segment the audio when passing to a second stage detector, or when detecting a command following the keyword. After detecting the keyword, the system may continue calculating frame similarity scores and alignment lengths for a short duration afterwards, such as 50-100 ms. The system searches for the frame position with maximum similarity score in that period. This frame with the maximum similarity score may be assumed to be the end time of the keyword. To find the start time, the length of the alignment found at the end frame is subtracted from the end frame time. Following the keyword detection, there may be a timeout period, for example around 1 s, in which no keyword detection is performed, in order to prevent the system from detecting the same keyword multiple times.”; para. 0066 “When using the prototypical network for live keyword decoding, it must be able to detect the keyword, in the context where the user speaks a command immediately after the keyword.”), comprising:
detecting, by a voice activity detector in real-time, a speech input of the user (para. 0046 “The keyword spotter 112 may comprise voice activity detection (VAD) functionality 118 that receives the audio signal 114 and determines if human speech is present in the audio signal 114.”; this detection is performed in real time (see above mapping));
constructing, based on the speech input of the user, an acoustic model sequence of the speech input (Fig. 6, output of 504 generated for speech input 126; para. 0061 “First, the audio is processed by frequency domain features functionality 504 that generates acoustic features from the input keyword sample.”);
inputting the acoustic model sequence of the speech input (Fig. 6, output of 504) and the plurality of template acoustic model sequences for the target keyword (Fig. 5, outputs of 504) into the NN acoustic model (both sets of inputs input to Siamese model components 510; para. 0060 “In the current system, the neural network which is duplicated in the Siamese model functions as a vector encoder, which represents the input features as a vector in a new feature space.”; para. 0061 “First, the audio is processed by frequency domain features functionality 504 that generates acoustic features from the input keyword sample. The acoustic features are used as input to neural-network based encoder functionality 510, which outputs a vector encoding 512 of the speech content.”)…and
causing…a stored action assigned to the target keyword to be performed (para. 0047 “Additionally or alternatively, the detected keyword may cause the device to perform an action, such as turning on a light, placing a telephone call, performing other actions possible with the user device, or transmitting an audio sample to a different device for further processing.”; para. 0048 “Using this technology, the user may register multiple personalized keywords. These different keywords can then be used to trigger different actions without having to speak another command afterwards.”).
Tomar does not specifically disclose:
storing, in a microcontroller unit (MCU)… and
[causing], by the MCU, [a stored action assigned to the target keyword to be performed]
Josh discloses [storing,] in a microcontroller unit (MCU) (pg. 4, section V. “This paper presented a low-power audio multistage system for keyword spotting using microcontrollers…”; pg. 4, 2nd para. “Both microcontrollers (FRDM-K64F and MSP-EXP430FR5994) were placed into very low power modes to minimize the power consumption…”; Josh teaches storage capabilities for microcontroller: pg. 1st para. “This board was chosen due to the low power MCU and the large on board memory”) and [causing], by the MCU, [a stored action assigned to the target keyword to be performed] (pg. 3, section IV, 1st para. “The functionality of the system was tested with live data, specifically two commands “light on” and “light off.” When told to turn the light on, the output would become high turning on a light bulb by means of a relay and remain on until the system is reset or told to turn off.”).
Tomar and Josh are considered to be analogous to the claimed invention as they are both in the same field of keyword spotting. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Tomar to incorporate the teachings of Josh in order to perform the storing using a microcontroller unit (MCU) and the causing of the stored action to be performed using the MCU. Doing so would be beneficial, as implementing keyword spotting using a low-power microcontroller system can help provide keyword detection for mobile and wearable applications with low power consumption, and implementing voice detection helps to ensure that further power consumption can be saved while the microcontroller is in the waiting stage (Josh, pg. 1, section I, para. 1-3).
Tomar in view of Josh does not specifically disclose wherein the NN acoustic model is trained to output a probability based on a comparison of the acoustic model sequence of the speech input to multiple template acoustic model sequences of a keyword;
determining that a probability output by the NN acoustic model exceeds a threshold;
Elders teaches wherein the NN acoustic model (Col. 11 Lines 29-60 “Once speech is detected in the audio received by the device 110 (or separately from speech detection), the device 110 may use the keyword detection module 220 to perform keyword detection to determine when a user intends to speak a command to the device 110. The keyword detection module 220 may compare audio data to stored models or data associated with a keyword(s) to detect a keyword. … In another embodiment the keyword spotting system may be built on deep neural network (DNN)/recursive neural network (RNN) structures directly, without HMM involved. Such a system may estimate the posteriors of keywords with context information, either by stacking frames within a context window for DNN, or using RNN. Following-on posterior threshold tuning or smoothing is applied for decision making. Other techniques for keyword detection, such as those known in the art, may also be used.”) is trained (Col. 21 Lines 1-3 “The keyword detection module 220 may employ classifier(s) or other machine learning trained models to determine whether the audio signal includes the keyword.”) to output a probability (Col. 21 Lines 4-8 “The keyword detection module 220 may determine confidence levels or probabilities, indicating relative likelihoods that the wakeword has been detected in the corresponding audio signal(s). For example, a confidence level may be indicated as a percentage ranging from 0% to 100%.”) based on a comparison of the acoustic model sequence of the speech input to multiple template acoustic model sequences of a keyword (Col. 11 Lines 61-67 and Col. 12 Lines 1-3 “A keyword configuration module 210 may configure the system 100 to recognize a keyword. The keyword configuration module 210 may import models or data into keyword model storage 230. Each keyword may be associated with a plurality of models to allow the system to recognize the keyword in a number of different situations (loud, noisy, etc.) and with a number of different speakers. Thus the keyword model storage 230 may include models for each keyword the system is configured to recognize, such as keyword 1 model(s) 232-1, keyword 2 model(s) 232-2, etc.”; Col. 12 Lines 63-66 “As illustrated, each keyword may be associated with one or more models 232 such that the keyword detection module 220 may compare audio data to the model(s) 232 to detect a keyword.”);
determining that a probability output by the NN acoustic model exceeds a threshold (Col. 21 Lines 1-8 “The keyword detection module 220 may employ classifier(s) or other machine learning trained models to determine whether the audio signal includes the keyword. The keyword detection module 220 may determine confidence levels or probabilities, indicating relative likelihoods that the wakeword has been detected in the corresponding audio signal(s). For example, a confidence level may be indicated as a percentage ranging from 0% to 100%.”; Col. 15 Lines 25-31 “If a keyword is detected (426:Yes) the device 110 may send (428) a second indication of the detected keyword to the server 120. For example, if the first model (which may be an audio signature) matches the received audio with a sufficiently high confidence, the local device 110 may send the second indication to the server 120 indicating that the first keyword was detected.”).
Tomar, Josh, and Elders are considered to be analogous to the claimed invention as they are all in the same field of keyword spotting. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Tomar in view of Josh to incorporate the teachings of Elders in order to have the NN acoustic model be trained to output a probability based on a comparison of the acoustic model sequence of the speech input to multiple template acoustic model sequences of a keyword, and to determine that a probability output by the NN acoustic model exceeds a threshold. Doing so would be beneficial, as comparing the user input to multiple template acoustic model sequences of a keyword allows the model to recognize the keyword in a wider variety of situations, such as loud or noisy environments and with different speakers (Elders, Col. 11 Lines 61-67 and Col. 12 Lines 1-3).
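For illustration only, the thresholding limitation mapped above can be sketched as follows. This is a minimal sketch, not code from Tomar, Josh, or Elders; the function name `keyword_detected` and the percentage scale (following Elders' description of a confidence level ranging from 0% to 100%) are assumptions made for the example.

```python
def keyword_detected(confidence_pct: float, threshold_pct: float) -> bool:
    """Return True when the model's confidence strictly exceeds the threshold.

    Both values are percentages in [0, 100]. The name and signature are
    hypothetical and are not taken from any cited reference.
    """
    for value in (confidence_pct, threshold_pct):
        if not 0.0 <= value <= 100.0:
            raise ValueError("values must lie in [0, 100]")
    return confidence_pct > threshold_pct


if __name__ == "__main__":
    # A confidence above the threshold counts as a detection.
    print(keyword_detected(95.0, 90.0))
    # A confidence equal to the threshold does not "exceed" it.
    print(keyword_detected(90.0, 90.0))
```

The strict comparison reflects the claim wording "exceeds a threshold"; a system that treats equality as a detection would use `>=` instead.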
Regarding claim 12, claim 12 is a non-transitory computer readable medium claim with limitations similar to those in method claim 1, and thus is rejected under similar rationale.
Additionally, Tomar discloses A non-transitory computer readable medium storing instructions which, when processed by a microcontroller unit (MCU), performs steps comprising (para. 0071 “Each element in the embodiments of the present disclosure may be implemented as hardware, software/program, or any combination thereof. Software codes, either in its entirety or a part thereof, may be stored in a computer readable medium or memory (e.g., as a ROM, for example a non-volatile memory such as flash memory, CD ROM, DVD ROM, Blu-ray™, a semiconductor ROM, USB, or a magnetic recording medium, for example a hard disk). The program may be in the form of source code, object code, a code intermediate source and object code such as partially compiled form, or in any other form.”; para. 0045 “A user device 102 may provide a voice interface for interacting with, or controlling, the user device. The user device 102 comprises a processor 104 for executing instructions and a memory 106 for storing instructions and data… The processor 102, which may be provided by, for example a central processing unit (CPU), a microprocessor or micro-controller, a digital signal processor, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or other processing device, executes instructions stored in the memory 106, which when executed by the processor 102 configure the user device to provide various functionality including keyword spotting functionality 112.”).
4. Claims 2, 5, 10-11, 13, 16, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Tomar in view of Josh and Elders, and further in view of Sørensen et al. (NPL A depthwise separable convolutional neural network for keyword spotting on an embedded system, hereinafter Sørensen).
Regarding claim 2, Tomar in view of Josh and Elders does not specifically disclose wherein the NN acoustic model comprises at least one separable two-dimensional convolutional layer with a number of channels, the number of the channels corresponding to a number of inputs of the NN acoustic model.
Sørensen teaches wherein the NN acoustic model comprises at least one separable two-dimensional convolutional layer with a number of channels, the number of the channels corresponding to a number of inputs of the NN acoustic model (Fig. 3, “DS-Conv 1-N” layers, Fig. 4, N number of filters applied to N number of input layers; caption: “Overview of a single depthwise separable convolutional layer consisting of a depthwise convolution followed by a pointwise convolution. (1) The depthwise convolution separately applies a 2-dimensional filter to each of the channels in the input, extracting time-frequency patterns…”).
Tomar, Josh, Elders, and Sørensen are considered to be analogous to the claimed invention as they are all in the same field of keyword spotting. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Tomar in view of Josh and Elders to incorporate the teachings of Sørensen in order to have the NN acoustic model comprise at least one separable two-dimensional convolutional layer with a number of channels, the number of the channels corresponding to a number of inputs of the NN acoustic model. Doing so would be beneficial, as depthwise separable convolutional neural networks are an efficient alternative to standard CNNs that drastically reduce the number of required weights and computations and are found to perform well on embedded platforms (Sørensen, pg. 3, 1st para.).
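The weight reduction referred to in the rationale above can be illustrated with a short calculation. This is a hedged sketch using the generally known weight counts for a standard convolution and for a depthwise convolution followed by a pointwise (1x1) convolution; the function names and the example layer sizes are assumptions, not values taken from Sørensen, and biases are ignored.

```python
def standard_conv_weights(k_h: int, k_w: int, c_in: int, c_out: int) -> int:
    """Weights in a standard 2-D convolution (biases ignored)."""
    return k_h * k_w * c_in * c_out


def depthwise_separable_weights(k_h: int, k_w: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise convolution plus a pointwise convolution."""
    depthwise = k_h * k_w * c_in   # one 2-D filter per input channel
    pointwise = c_in * c_out       # 1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 64 input channels, 64 output channels.
    std = standard_conv_weights(3, 3, 64, 64)
    sep = depthwise_separable_weights(3, 3, 64, 64)
    print(std, sep, round(std / sep, 2))  # 36864 4672 7.89
```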
Regarding claim 5, Tomar in view of Josh and Elders does not specifically disclose wherein the NN acoustic model is trained by using an 8-bit quantization flow to represent weights and activations of the NN acoustic model.
Sørensen teaches wherein the NN acoustic model is trained by using an 8-bit quantization flow to represent weights and activations of the NN acoustic model (pg. 3, 2nd Column, 2nd para. “With the goal to reduce the memory footprint of the system, it was investigated how quantization of weights and activations affected performance by gradually lowering the bit widths using principles of mix and dynamic fixed point representations.”; pg. 7, 1st para. “The dynamic ranges of groups with weights and biases were fixed after training, while the ranges of activations were estimated by running inference on a large number of representative audio files from the dataset and generating statistical parameters for the activations of each layer…The precision of the weights and activations in the network was varied in experiment 3 between 32-bit floating point precision and low bit width fixed point formats ranging from 8 to 2 bit.”; pg. 7, section 3.8 “The deployed network use 8-bit weights and activations…”).
Tomar, Josh, Elders, and Sørensen are considered to be analogous to the claimed invention as they are all in the same field of keyword spotting. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Tomar in view of Josh and Elders to incorporate the teachings of Sørensen in order to have the NN acoustic model trained by using an 8-bit quantization flow to represent weights and activations of the NN acoustic model. Doing so would be beneficial, as quantization using the mixed and dynamic fixed point principles reduces memory footprint and computational requirements without lowering classification accuracy (Sørensen, Abstract).
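As a rough illustration of representing values in an 8-bit fixed-point format, the following sketch picks the number of fractional bits from the dynamic range of a group of values. It shows one common fixed-point scheme under stated assumptions and is not presented as Sørensen's procedure or as the claimed quantization flow; the names `quantize_q8` and `dequantize_q8` are hypothetical.

```python
import math


def quantize_q8(values):
    """Quantize floats to signed 8-bit integers with a shared fractional length.

    The fractional length is chosen so the largest magnitude in the group
    fits in 7 magnitude bits (assumed scheme, for illustration only).
    """
    max_abs = max(abs(v) for v in values)
    int_bits = max(0, math.ceil(math.log2(max_abs))) if max_abs > 0 else 0
    frac_bits = 7 - int_bits
    scale = 2 ** frac_bits
    # Round to the nearest step and clamp to the signed 8-bit range.
    quantized = [max(-128, min(127, round(v * scale))) for v in values]
    return quantized, frac_bits


def dequantize_q8(quantized, frac_bits):
    """Map 8-bit integers back to approximate float values."""
    scale = 2 ** frac_bits
    return [q / scale for q in quantized]


if __name__ == "__main__":
    q, frac = quantize_q8([0.5, -0.25, 0.75])
    print(q, frac)                 # [64, -32, 96] 7
    print(dequantize_q8(q, frac))  # [0.5, -0.25, 0.75]
```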
Regarding claim 10, Tomar in view of Josh and Elders and in further view of Sørensen discloses wherein the threshold is 90% (Sørensen, Fig. 9, “Detection threshold” values include 0.9 (corresponding to 90%) for Test set 1 and Test set 2).
Tomar, Josh, Elders, and Sørensen are considered to be analogous to the claimed invention as they are all in the same field of keyword spotting. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Tomar in view of Josh and Elders to incorporate the teachings of Sørensen in order to specifically set the pre-set threshold to 90%. Doing so would be beneficial, as a pre-set threshold value of 0.9 has a small false alarm rate, which increases the accuracy of the system (Sørensen, Fig. 9, 0.9 has smaller false alarm rate than choices of threshold which are smaller than 0.9).
Regarding claim 11, Tomar in view of Josh and Elders does not specifically disclose wherein the NN acoustic model is a depthwise separable convolutional neural network.
Sørensen teaches wherein the NN acoustic model is a depthwise separable convolutional neural network (Fig. 3, caption: “General architecture of the DS-CNN…The following DS-convolution layers 1-N each consist of a depthwise convolution, followed by batch-normalization and ReLU activation…”; see Fig. 4).
Tomar, Josh, Elders, and Sørensen are considered to be analogous to the claimed invention as they are all in the same field of keyword spotting. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Tomar in view of Josh and Elders to incorporate the teachings of Sørensen in order to specifically use a depthwise separable convolutional neural network as the NN acoustic model. Doing so would be beneficial, as depthwise separable convolutional neural networks are an efficient alternative to standard CNNs that drastically reduce the number of required weights and computations and are found to perform well on embedded platforms (Sørensen, pg. 3, 1st para.).
Regarding claim 13, claim 13 is rejected for analogous reasons to claim 2.
Regarding claim 16, claim 16 is rejected for analogous reasons to claim 5.
Regarding claim 22, claim 22 is rejected for analogous reasons to claim 11.
5. Claims 3 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Tomar in view of Josh, Elders, and Sørensen, and further in view of Tian et al. (US 2022/02362352 A1, hereinafter Tian).
Regarding claim 3, Tomar in view of Josh, Elders, and Sørensen discloses wherein voice frames of the speech input and the target keyword …input to the NN acoustic model as Mel-frequency cepstral coefficients (MFCCs) in a form of Mel spectrograms (Josh, pg. 2, section C 1st para. “The audio is sampled at 16 kHz and the Mel frequency cepstral coefficients are calculated to extract frequencies as input features to a neural network…”).
Tomar, Josh, Elders, and Sørensen are considered to be analogous to the claimed invention as they are all in the same field of keyword spotting. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Josh in order to specifically use Mel-frequency cepstral coefficients (MFCCs) as input to the NN acoustic model. Doing so would be beneficial, as MFCCs are a commonly used human-engineered speech feature in deep learning based speech-recognition applications (NPL Zhang, Hello Edge: Keyword Spotting on Microcontrollers, pg. 2, section 2.1, 1st para.).
Tomar in view of Josh, Elders, and Sørensen does not specifically disclose wherein voice frames of the speech input and the target keyword are marked with phonemes.
Tian teaches wherein voice frames of the speech input and the target keyword are marked with phonemes (frames of speech input marked with phonemes: para. 0041 “After the processing of FIG. 2, an Fbank feature vector or an MFCC feature vector can be extracted from the speech signal 210 for spotting whether a given keyword is included in the speech signal 210 or for training or optimizing an acoustic model.”; para. 0046 “The acoustic feature 305 of each frame extracted by the acoustic feature extraction module is input to the acoustic model 300, processed by L layers of LSTM (for example, LSTM 310, LSTM 320, . . . LSTM 330, etc.), and at last the classified phoneme probabilities 335 of the acoustic features of the frame are output. Phoneme probabilities 335 may be a probability vector that includes the probabilities of the frame on all phonemes.”; frames of keywords are marked with phonemes para. 0049-0051 “As shown in FIG. 4, at 410, the optimization data including a given keyword 405 is generated based on the given keyword 405…At 430, a phoneme-related label is assigned to the acoustic features of frame.”).
Tomar, Josh, Elders, Sørensen, and Tian are considered to be analogous to the claimed invention as they are all in the same field of keyword spotting. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Tian in order to have the voice frames of the speech input and the target keyword be marked with phonemes. Doing so would be beneficial, as phonemes can be used for systems with low power consumption requirements (Tian, para. 0043).
Regarding claim 14, claim 14 is rejected for analogous reasons to claim 3.
6. Claims 4 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Tomar in view of Josh and Elders and further in view of Tian.
Regarding claim 4, Tomar in view of Josh and Elders does not specifically disclose wherein the NN acoustic model is trained before use with a training dataset comprising phonemes marking a large amount of human speech.
Tian teaches wherein the NN acoustic model is trained before use with a training dataset comprising phonemes marking a large amount of human speech (para. 0029 “Acoustic model 120 may be a model, for example a seed acoustic model, pre-trained with a large amount of speech recognition data. The seed acoustic model may be trained for distinguishing between different phonemes, thereby achieving a mapping from acoustic features to phonemes.”).
Tomar, Josh, Elders, and Tian are considered to be analogous to the claimed invention as they are all in the same field of keyword spotting. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Tian in order to train the NN acoustic model before use with a training dataset comprising phonemes marking a large amount of human speech. Doing so would be beneficial, as this would enable the acoustic model to achieve mappings from acoustic features to phonemes (Tian, para. 0029).
Regarding claim 15, claim 15 is rejected for analogous reasons to claim 4.
7. Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Tomar in view of Josh and Elders and further in view of Tanaka & Shinozaki (NPL F-Measure Based End-to-End Optimization of Neural Network Keyword Detectors, hereinafter Tanaka).
Regarding claim 7, Tomar in view of Josh and Elders does not specifically disclose wherein one or more of the template acoustic model sequence is 3-5 seconds in size.
Tanaka teaches wherein one or more of the template acoustic model sequence is 3-5 seconds in size (pg. 4, 1st para. “To extract keyword templates, one utterance per a keyword was used. The utterance that provided the template was excluded from the segments used as the subject of the search of that keyword. The remaining utterances in sets A and B were concatenated for each original recording session, respectively, and then equally cut to make 5-second length segments for parallel processing. As a training-development set, 50 segements per a keyword from set A were used, where each segment contained one or more keywords…”; pg. 4, 2nd para. “As a baseline, we trained an acoustic embedding based keyword detector using LSTM with the weighted cross-entropy criterion…A keyword is input to an LSTM consisting of two layers having 128 units and its embedded vector is obtained as a hidden activation of the final time frame…”).
Tomar, Josh, Elders, and Tanaka are considered to be analogous to the claimed invention as they are all in the same field of keyword spotting. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Tomar in view of Josh and Elders to incorporate the teachings of Tanaka in order to have the template acoustic model sequence be 3-5 seconds in size. Doing so would be beneficial, as using equal 5-second-length segments would enable parallel processing (Tanaka, pg. 4, 1st para., line 7).
8. Claims 8 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Tomar in view of Josh and Elders and further in view of Chen (NPL Query-by-Example Keyword Spotting Using Long Short-Term Memory Networks, hereinafter Chen).
Regarding claim 8, Tomar in view of Josh and Elders discloses one or more voice frames of the speech input comprise an acoustic sequence (Tomar: para. 0040 “In accordance with the present disclosure, DTW uses feature vectors generated from frames of a speech sample.”) and the template acoustic model sequence stored in the MCU (Josh, see claim mapping for claim 1).
Tomar in view of Josh and Elders does not specifically disclose a size of the acoustic sequence depends on the template acoustic model sequence…
Chen teaches a size of the acoustic sequence depends on the template acoustic model sequence (Fig. 2, “LSTM Feature Extractor”; pg. 2, section 2.3 “More specifically, given an acoustic feature x with T frames, the hidden units from the second layer of the LSTM are given as h2…We create a fixed-length representation f by choosing the last k state vectors, as denoted by… (1)…the parameter k can be estimated from the enrollment templates. In our experiments, we choose k to be the averaged number of frames of all the templates as we want to encode as much information as possible. Zeros are padded in front of f if the segment length T is smaller than the desired template length k”).
Tomar, Josh, Elders, and Chen are considered to be analogous to the claimed invention as they are all in the same field of keyword spotting. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Tomar in view of Josh and Elders to incorporate the teachings of Chen in order to have the acoustic sequence of the speech input have a size dependent on the template acoustic model sequence. Doing so would be beneficial, as this would create a fixed-length representation for audio signals of varying time lengths, enabling similarity measurements utilizing vector distances (Chen, pg. 1, section 1, para. 5).
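The fixed-length representation described in the quoted passage of Chen (keep the last k state vectors, and pad zeros in front when the segment has fewer than k frames, with k estimated as the average template length) can be sketched as follows. This is an illustrative sketch only; the function names, the list-of-lists data layout, and the rounding of the average are assumptions rather than details confirmed by Chen.

```python
def average_template_length(templates):
    """Estimate k as the average number of frames over the enrollment templates."""
    return round(sum(len(t) for t in templates) / len(templates))


def fixed_length_representation(states, k):
    """Keep the last k state vectors; zero-pad in front if fewer than k exist."""
    dim = len(states[0]) if states else 0
    last_k = states[-k:] if k > 0 else []
    padding = [[0.0] * dim for _ in range(k - len(last_k))]
    return padding + last_k


if __name__ == "__main__":
    # Hypothetical 2-dimensional state vectors for a 3-frame segment.
    states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(fixed_length_representation(states, 2))  # [[3.0, 4.0], [5.0, 6.0]]
    print(fixed_length_representation(states, 4))  # [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```

Because every segment maps to exactly k vectors, the input and the templates can be compared with an ordinary vector distance regardless of their original durations.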
Regarding claim 19, claim 19 is rejected for analogous reasons to claim 8.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Gowayyed & Mohajer (US 2021/0335340 A1): comparing stored sound embedding with utterance audio using neural network acoustic model to generate phoneme probabilities (Fig. 4B, para. 0039)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CODY DOUGLAS HUTCHESON whose telephone number is (703)756-1601. The examiner can normally be reached M-F 8:00AM-5:00PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571)-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CODY DOUGLAS HUTCHESON/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659