DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 02/29/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
1. Claims 1, 4-6, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Garg et al. (US 2024/0221751 A1, hereinafter Garg) in view of Kimura et al. (NPL SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks, hereinafter Kimura).
Regarding claim 1, Garg discloses A method implemented using one or more processors, the method comprising: receiving a waveform that represents reflections of an ultrasound signal that capture a mouth motion over a time interval (para. 0024 “According to one embodiment, the wearable device comprises a sensor configured to detect a position of a tongue of the user and transmit to the processing component a signal indicative of the position of the tongue. According to one embodiment, the sensor configured to detect the position of the tongue is one of a laser doppler sensor, a mechanomyography sensor, a sonomyography sensor, an ultrasound sensor, an infrared sensor, a fNIRS sensor, optical sensor, or a capacitive sensor.”; para. 0123 “In such examples, the input data may be from multiple sensors on wearable device 110, for example 302B may receive as inputs data from the accelerometer and data from an ultrasound sensor.”), wherein the mouth motion of the user formulates a silent utterance (para. 0024 “According to one embodiment, the wearable device comprises a sensor configured to detect a position of a tongue of the user and transmit to the processing component a signal indicative of the position of the tongue. According to one embodiment, the sensor configured to detect the position of the tongue is one of a laser doppler sensor, a mechanomyography sensor, a sonomyography sensor, an ultrasound sensor, an infrared sensor, a fNIRS sensor, optical sensor, or a capacitive sensor.”)…processing…using a transformer-based machine learning model (para. 0123 “The input layers for a particular signal may include convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”; para. 0126 “The separate neural networks may be structured to have multiple sets of layers, for example a set of input layers which extract features from individual signal data and a second set of layers which determine one or more predicted words or phrases from concatenated feature data from the input layers. The separate neural networks may be structured to have multiple sets of layers, for example input layers which extract features from multiple types of signal data and a second set of layers which output one or more predicted words or phrases from concatenated feature data from the input layers. The layers of the separate neural networks may be structured as convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”), to generate a model output from which word content of the silent utterance is derived (para. 0126 “The separate neural networks may be structured to have multiple sets of layers, for example a set of input layers which extract features from individual signal data and a second set of layers which determine one or more predicted words or phrases from concatenated feature data from the input layers. The separate neural networks may be structured to have multiple sets of layers, for example input layers which extract features from multiple types of signal data and a second set of layers which output one or more predicted words or phrases from concatenated feature data from the input layers. The layers of the separate neural networks may be structured as convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”; para. 
0127 “The predicted words and phrases 312 from the separate neural networks may then be fed to a comparison module 313.”; para. 0128 “The neural network may output one or more predicted words or phrases 310.”); and causing a voice assistant to be controlled based on the derived word content of the silent utterance (para. 0033 “According to one embodiment, the external device is configured to provide a virtual assistant platform. According to one embodiment, the external device is configured to provide the one or more words or phrases to the virtual assistant platform; determine a response to the one or more words or phrases using the virtual assistant platform; and transmit the response to the wearable device.”).
Garg does not specifically disclose:
processing the received waveform to generate a sequence of time-aligned waterfall image chunks; [processing] the sequence of time-aligned waterfall image chunks.
Kimura teaches processing the received waveform to generate a sequence of time-aligned waterfall image chunks (pg. 3, section 3.1 “Ultrasonic Imaging Device”; pg. 6, section 5 “The real end-to-end silent voice to audio conversion was examined. In this case, a user is asked to mouth a speech command without
actually emitting a sound, and the oral cavity movement is record by an ultrasonic imaging probe. The obtained image sequence is subsequently translated to a voice by the proposed system.”; pgs. 4-5, section 3.2 “Because the frame rate of the ultrasonic images is 30 frames per second, the duration of K ultrasonic images is thus 400 ms. This time duration covers the static and motion features of the utterance. Samples of the ultrasonic images are shown in Figure 5. The K-size image sequence is prepared repeatedly such that one Mel-scale spectrum is created every 20 ms…”); [processing] the sequence of time-aligned waterfall image chunks (Fig. 2, “Neural Network 1“ processes ultrasound image sequence to generate a model output).
Garg and Kimura are considered to be analogous to the claimed invention as
they both are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg to incorporate the teachings of Kimura in order to process the received waveform to generate a sequence of time-aligned waterfall image chunks and to process the sequence of time-aligned waterfall image chunks to generate a model output. Doing so would be beneficial, as this would enable
silent voice interaction, which can provide inconspicuous and accurate silent speech detection (Kimura, pg. 2, para. 0004-0005).
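By way of illustration only, the chunking step mapped above can be sketched as follows; this is not code from Garg or Kimura, and the pulsed-transmission assumption, the function name, and all parameter values are illustrative assumptions:

```python
import numpy as np

def waveform_to_waterfall_chunks(waveform, samples_per_pulse, chunk_rows):
    """Stack per-pulse echo segments into a 2D waterfall image
    (rows = slow time / pulse index, cols = fast time / range),
    then divide it along the time axis into equal, time-aligned chunks."""
    n_pulses = len(waveform) // samples_per_pulse
    waterfall = waveform[: n_pulses * samples_per_pulse].reshape(
        n_pulses, samples_per_pulse
    )
    n_chunks = n_pulses // chunk_rows
    return [
        waterfall[i * chunk_rows : (i + 1) * chunk_rows]
        for i in range(n_chunks)
    ]

# Hypothetical example: 512 samples per pulse, chunks of 10 pulses each.
rx = np.random.randn(48000)  # stand-in for the received reflection waveform
chunks = waveform_to_waterfall_chunks(rx, samples_per_pulse=512, chunk_rows=10)
```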
Regarding claim 4, Garg in view of Kimura discloses causing the voice assistant to be invoked in response to the derived word content of the silent utterance including a hotword that invokes the voice assistant (Garg, para. 0142 “If a user desires to activate a personal assistant, the predicted speech module 122 of the external device 120 may identify as an output phrase “hey assistant” which will result in the natural language processor 405 to output an action to the application manager 406 which activates the personal assistant function of the device.”).
Regarding claim 5, Garg in view of Kimura discloses causing an assistant action to be performed via the voice assistant in response to the recognized word content of the silent utterance including one or more words identifying the assistant action (Garg, para. 0142 “The user may then silently speak one or more commands which are recorded by the wearable device and result in output words or phrases determined by the external device. The output words or phrases may be provided as an input to the personal assistant of the device as application inputs 407.”; para. 0140 “For example, the determined actions may be used to control applications of the external device. The external device may identify an output phrase “take picture” and perform natural language processing on the phrase. In response to the natural language processing output action, the external device may perform the process of capturing a picture in the camera application by providing information about the action to application inputs 407.”).
Regarding claim 6, Garg in view of Kimura discloses wherein the transformer-based machine learning model includes a classifier to classify the silent utterance (Garg, para. 0136 “The output words or phrases may be transmitted to the wearable device 110, where they are analyzed to determine if the output words or phrases match one or more specific commands as discussed herein. In some examples the external device may determine if the output words or phrases are associated with one or more specific commands, using one or more natural language processors 405.”).
Regarding claim 14, Garg in view of Kimura discloses wherein the voice assistant is controllable using an audible utterance (Garg, para. 0096 “The wearable silent speech device 110 may include one or more sensors 111 which are used to record input signals 101 form a user. The sensors 111 may include EMG electrodes for recording muscle activity associated with speech, a microphone for recording voiced and/or whispered speech…”; para. 0109 “For example, a user may say a particular word or phrase out loud, which is recorded by the microphone. The device activation logic 202 may recognize this word or phrase and in response will perform one or more actions. The one or more actions may include changing a mode of the device, activating one or more features of the device, and performing one or more actions.”).
2. Claims 2 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Garg in view of Kimura and further in view of Cheng et al. (NPL TwinkleTwinkle: Interacting with Your Smart Devices by Eye Blink, hereinafter Cheng).
Regarding claim 2, Garg in view of Kimura discloses performing waterfall reconstruction…to generate a waterfall image that encodes the mouth motion of the user (Kimura, pg. 5, section 3.5 “As for preparing the training data, two collaborators, (28-year old male and 24-year old male) were attached with an ultrasonic imaging probe under their jaws, and were instructed to utter various speech commands. Approximately 500 speech commands were collected from each collaborator (Table ??). For each command, as well as the voice utterance, a video of the ultrasonic images was recorded.”), and linearly dividing the waterfall image to generate the sequence of time-aligned waterfall image chunks (Kimura, pgs. 3-4 section 3.2 “Network 1 uses a series of 𝐾 ultrasonic images (size of 128 × 128, monochrome) as the input and generates an 𝑛-dimensional sound representation (Mel-scale spectrum) as the output. Currently, 𝐾 of 13 and 𝑛 of 64 are used. Because the frame rate of the ultrasonic images is 30 frames per second, the duration of 𝐾 ultrasonic images”).
Garg and Kimura are considered to be analogous to the claimed invention as
they both are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg to incorporate the teachings of Kimura in order to perform waterfall reconstruction to generate a waterfall image that encodes the mouth motion of the user and to linearly divide the waterfall image to generate the sequence of time-aligned waterfall image chunks. Doing so would be beneficial for the same rationale given in claim 1.
Garg in view of Kimura does not specifically disclose performing pulse compression on the reflections of the ultrasound signal that capture a silent utterance, to acquire a pulse compressed waveform.
Cheng teaches performing pulse compression on the reflections of the ultrasound signal that capture a silent utterance, to acquire a pulse compressed waveform (pg. 3 section 3.1 “TwinkleTwinkle leverages a pair of co-located speakers and microphones on mobile devices to transmit and receive inaudible ultrasonic signals carrying signal variations of eyelid motions and employs phase difference-related information in FMCW signals to profile fine-grained eye blink motion trajectories. As the FMCW signals shown in Fig. 2(a), the frequency of the FMCW signal starts from frequency f0 and increases linearly with time t over a chirp duration Tc by bandwidth B, which can be denoted as f(t) = f0 + (B/Tc)t. Thus, the transmitted FMCW signal can be presented as St(t) = A·cos(φt(t)) with its phase calculated as φt(t) = 2π(f0t + Bt²/(2Tc)), where A is the signal’s attenuation. Specifically, TwinkleTwinkle adopts the bandwidth B as 4 kHz in the frequency range from 18 to 22 kHz, which allows our system runs silently on COTS mobile phones without bringing any disturbance. The signal period T is set as a common value of 10.7 ms corresponding to 512 samples per signal period under the 48 kHz sampling rate, which consists of a chirp duration Tc with the first 1 to 480 samples and a blank guard time Tg with the rest 32 samples.”; Fig. 2).
Garg, Kimura, and Cheng are considered to be analogous to the claimed invention as they are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg in view of Kimura to incorporate the teachings of Cheng in order to perform pulse compression on the reflections of the ultrasound signal that capture a silent utterance, to acquire a pulse compressed waveform. Doing so would be beneficial, as performing pulse compression for ultrasound imaging would increase SNR, improving penetration with good resolution (NPL Ortiz et al., pg. 423 section 3.1.4. “Pulse Compression”).
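For context, pulse compression of the kind relied upon from Cheng can be sketched as a matched filter: the received echo is correlated against the transmitted chirp, concentrating the chirp’s energy into a narrow peak and raising SNR. A minimal sketch, assuming an 18-22 kHz linear chirp and a simulated single echo (the delay, noise level, and all parameter values are assumptions, not data from Cheng):

```python
import numpy as np
from scipy.signal import chirp, correlate

fs, Tc = 48000, 0.01                      # sample rate and chirp duration (assumed)
t = np.arange(int(fs * Tc)) / fs
tx = chirp(t, f0=18000, t1=Tc, f1=22000)  # linear up-chirp in the 18-22 kHz band

# Simulated echo: an attenuated, delayed copy of the chirp plus noise.
delay = 120                               # hypothetical round-trip delay in samples
rx = np.zeros(1024)
rx[delay : delay + len(tx)] += 0.3 * tx
rx += 0.05 * np.random.randn(len(rx))

# Matched filtering (pulse compression): correlating the echo with the
# transmitted chirp yields a compressed waveform whose peak marks the delay.
compressed = correlate(rx, tx, mode="valid")
print("estimated delay:", int(np.argmax(np.abs(compressed))))  # ~120
```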
Regarding claim 13, Garg in view of Kimura does not specifically disclose wherein the ultrasound signal has a frequency range of approximately 21-22 kHz.
Cheng teaches wherein the ultrasound signal has a frequency range of approximately 21-22kHz (pg. 8 section 3.1 “As the FMCW signals shown in Fig. 2(a), the frequency of the FMCW signal starts from frequency f0 and increases linearly with time t over a chirp duration Tc by bandwidth B…”; pg. 20, section 6.4.2 “Specifically, we opted for bandwidth B with 18-20kHz, 18-22kHz, and 17-23kHz…”).
Garg, Kimura, and Cheng are considered to be analogous to the claimed invention as they are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg in view of Kimura to incorporate the teachings of Cheng in order to specifically have the ultrasound signal have a frequency range of approximately 21-22 kHz. Doing so would be beneficial, as this range fits mobile phone constraints while also preventing audible sound waves which hinder system performance (pg. 20, section 6.4.2).
3. Claims 3, 15, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Garg in view of Kimura and further in view of Fu et al. (NPL SVoice: Enabling Voice Communication in Silence via Acoustic Sensing on Commodity Devices, hereinafter Fu).
Regarding claim 3, Garg in view of Kimura discloses microphones and speakers (Garg, para. 0106 “Shown are one or more EMG electrodes 111A, a microphone 111B, an accelerometer 111C and other sensors 111D.”; para. 0104 “The audio signal may be played using a speaker of the external device or may be sent to other devices by communication module 124.”), but does not specifically disclose wherein the reflections are received via a client device, and wherein the client device includes a speaker to emit the ultrasound signal and a microphone to receive the reflections.
Fu teaches wherein the reflections are received via a client device (pg. 5, Fig. 3, ultrasound reflections received by a mobile phone; see Fig. 8), and wherein the client device includes a speaker to emit the ultrasound signal and a microphone to receive the reflections (pg. 5, Fig. 3, see “Acoustic Sensing”, “Microphone” and “Speaker”).
Garg, Kimura, and Fu are considered to be analogous to the claimed invention as they are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg in view of Kimura to incorporate the teachings of Fu in order to specifically receive reflections via a client device, and to have the client device include a speaker to emit the ultrasound signal and a microphone to receive the reflections. Doing so would be beneficial, as this would leverage components already present in many devices, allowing for silent voice detection with smart devices (Condliffe, NPL There’s a Very Obvious Voice Assistant Hack: Ultrasound).
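To make the rationale concrete, a commodity device’s speaker and microphone can transmit the ultrasound and capture its reflections concurrently with full-duplex audio I/O. A minimal sketch, assuming the third-party sounddevice library; the band, pulse length, and repetition count are illustrative assumptions:

```python
import numpy as np
import sounddevice as sd
from scipy.signal import chirp

fs = 48000
t = np.arange(int(fs * 0.01)) / fs              # 10 ms pulse (assumed)
pulse = chirp(t, f0=18000, t1=t[-1], f1=22000)  # inaudible band, as in Cheng
tx = np.tile(pulse, 50).astype(np.float32)      # ~0.5 s pulse train

# Play through the device speaker while simultaneously recording the
# reflections on the microphone, using components already present on device.
rx = sd.playrec(tx, samplerate=fs, channels=1)
sd.wait()                                       # block until capture completes
```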
Regarding claim 15, Garg discloses A system (Fig. 1) comprising: …wherein the reflections capture a mouth motion of a user over a time interval (para. 0024 “According to one embodiment, the wearable device comprises a sensor configured to detect a position of a tongue of the user and transmit to the processing component a signal indicative of the position of the tongue. According to one embodiment, the sensor configured to detect the position of the tongue is one of a laser doppler sensor, a mechanomyography sensor, a sonomyography sensor, an ultrasound sensor, an infrared sensor, a fNIRS sensor, optical sensor, or a capacitive sensor.”; para. 0123 “In such examples, the input data may be from multiple sensors on wearable device 110, for example 302B may receive as inputs data from the accelerometer and data from an ultrasound sensor.”), and wherein the mouth motion formulates a silent utterance (para. 0024 “According to one embodiment, the wearable device comprises a sensor configured to detect a position of a tongue of the user and transmit to the processing component a signal indicative of the position of the tongue. According to one embodiment, the sensor configured to detect the position of the tongue is one of a laser doppler sensor, a mechanomyography sensor, a sonomyography sensor, an ultrasound sensor, an infrared sensor, a fNIRS sensor, optical sensor, or a capacitive sensor.”); one or more processors (Fig. 1, 113); and memory (Fig. 1, 115) storing instructions that, when executed, cause the one or more processors to (para. 0222 “As is apparent from the foregoing examples, a computer readable storage medium may retain information for a sufficient time to provide computer-executable instructions in a non-transitory form.”; para. 0223 “The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of processor-executable instructions that can be employed to program a computer or other processor to implement various aspects of the technology as described above.”): process the received reflections of the ultrasound signal that capture the mouth motion of the user…(para. 0123 “The input layers for a particular signal may include convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”; para. 0126 “The separate neural networks may be structured to have multiple sets of layers, for example a set of input layers which extract features from individual signal data and a second set of layers which determine one or more predicted words or phrases from concatenated feature data from the input layers. The separate neural networks may be structured to have multiple sets of layers, for example input layers which extract features from multiple types of signal data and a second set of layers which output one or more predicted words or phrases from concatenated feature data from the input layers. The layers of the separate neural networks may be structured as convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”); process…using a transformer-based machine learning model (para. 0123 “The input layers for a particular signal may include convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”; para.
0126 “The separate neural networks may be structured to have multiple sets of layers, for example a set of input layers which extract features from individual signal data and a second set of layers which determine one or more predicted words or phrases from concatenated feature data from the input layers. The separate neural networks may be structured to have multiple sets of layers, for example input layers which extract features from multiple types of signal data and a second set of layers which output one or more predicted words or phrases from concatenated feature data from the input layers. The layers of the separate neural networks may be structured as convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”), to generate a model output from which word content of the silent utterance is derived (para. 0126 “The separate neural networks may be structured to have multiple sets of layers, for example a set of input layers which extract features from individual signal data and a second set of layers which determine one or more predicted words or phrases from concatenated feature data from the input layers. The separate neural networks may be structured to have multiple sets of layers, for example input layers which extract features from multiple types of signal data and a second set of layers which output one or more predicted words or phrases from concatenated feature data from the input layers. The layers of the separate neural networks may be structured as convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”; para. 0127 “The predicted words and phrases 312 from the separate neural networks may then be fed to a comparison module 313.”; para. 0128 “The neural network may output one or more predicted words or phrases 310.”); and cause a voice assistant to be invoked or perform an assistant action based on the derived word content of the silent utterance (para. 0033 “According to one embodiment, the external device is configured to provide a virtual assistant platform. According to one embodiment, the external device is configured to provide the one or more words or phrases to the virtual assistant platform; determine a response to the one or more words or phrases using the virtual assistant platform; and transmit the response to the wearable device.”).
Garg does not specifically disclose to [process the received reflections…] to generate a sequence of time-aligned waterfall image chunks; process the sequence of time-aligned waterfall image chunks.
Kimura teaches [process the received reflections…] to generate a sequence of time-aligned waterfall image chunks (pg. 3, section 3.1 “Ultrasonic Imaging Device”; pg. 6, section 5 “The real end-to-end silent voice to audio conversion was examined. In this case, a user is asked to mouth a speech command without
actually emitting a sound, and the oral cavity movement is record by an ultrasonic imaging probe. The obtained image sequence is subsequently translated to a voice by the proposed system.”; pgs. 4-5, section 3.2 “Because the frame rate of the ultrasonic images is 30 frames per second, the duration of K ultrasonic images is thus 400 ms. This time duration covers the static and motion features of the utterance. Samples of the ultrasonic images are shown in Figure 5. The K-size image sequence is prepared repeatedly such that one Mel-scale spectrum is created every 20 ms…”); process the sequence of time-aligned waterfall image chunks (Fig. 2, “Neural Network 1“ processes ultrasound image sequence to generate a model output).
Garg and Kimura are considered to be analogous to the claimed invention as
they both are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg to incorporate the teachings of Kimura in order to process the received waveform to generate a sequence of time-aligned waterfall image chunks and to process the sequence of time-aligned waterfall image chunks to generate a model output. Doing so would be beneficial, as this would enable
silent voice interaction, which can provide inconspicuous and accurate silent speech detection (Kimura, pg. 2, para. 0004-0005).
Garg in view of Kimura does not specifically disclose [a speaker] that transmits an ultrasound signal; [a microphone] that receives reflections of the ultrasound signal.
Fu teaches [a speaker] that transmits an ultrasound signal and [a microphone] that receives reflections of the ultrasound signal (pg. 5, Fig. 3, see “Acoustic Sensing”, “Microphone” and “Speaker”).
Garg, Kimura, and Fu are considered to be analogous to the claimed invention as they are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg in view of Kimura to incorporate the teachings of Fu in order to specifically receive reflections via a client device, and to have the client device include a speaker to emit the ultrasound signal and a microphone to receive the reflections. Doing so would be beneficial, as this would leverage components already present in many devices, allowing for silent voice detection with smart devices (Condliffe, NPL There’s a Very Obvious Voice Assistant Hack: Ultrasound).
Regarding claim 20, Garg discloses A method using one or more processors (Abstract, Fig. 1, 113), the method comprising: receiving, via …a client device (Fig. 1, 110, 111), reflections of an ultrasound signal that capture a mouth motion of a user over a time interval (para. 0024 “According to one embodiment, the wearable device comprises a sensor configured to detect a position of a tongue of the user and transmit to the processing component a signal indicative of the position of the tongue. According to one embodiment, the sensor configured to detect the position of the tongue is one of a laser doppler sensor, a mechanomyography sensor, a sonomyography sensor, an ultrasound sensor, an infrared sensor, a fNIRS sensor, optical sensor, or a capacitive sensor.”; para. 0123 “In such examples, the input data may be from multiple sensors on wearable device 110, for example 302B may receive as inputs data from the accelerometer and data from an ultrasound sensor.”), wherein the mouth motion of the user formulates a silent utterance (para. 0024 “According to one embodiment, the wearable device comprises a sensor configured to detect a position of a tongue of the user and transmit to the processing component a signal indicative of the position of the tongue. According to one embodiment, the sensor configured to detect the position of the tongue is one of a laser doppler sensor, a mechanomyography sensor, a sonomyography sensor, an ultrasound sensor, an infrared sensor, a fNIRS sensor, optical sensor, or a capacitive sensor.”); transmitting a waveform of the received reflections of the ultrasound signal that capture the mouth motion of the user to a server device (Fig. 1, 130; para. 0095 “Specific modules are shown within each of the external device 120 and server 130, however these modules may be located within any of the wearable device 110, external device 120 and server 130.”; para. 0103 “The external device 120 may include a communication module 124 to retrieve and transmit signals to the wearable device 110 or the server 130”), wherein the waveform of the received reflections is processed…(para. 0123 “The input layers for a particular signal may include convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”; para. 0126 “The separate neural networks may be structured to have multiple sets of layers, for example a set of input layers which extract features from individual signal data and a second set of layers which determine one or more predicted words or phrases from concatenated feature data from the input layers. The separate neural networks may be structured to have multiple sets of layers, for example input layers which extract features from multiple types of signal data and a second set of layers which output one or more predicted words or phrases from concatenated feature data from the input layers. The layers of the separate neural networks may be structured as convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”), and…processed using a transformer-based machine learning model (para. 0123 “The input layers for a particular signal may include convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”; para.
0126 “The separate neural networks may be structured to have multiple sets of layers, for example a set of input layers which extract features from individual signal data and a second set of layers which determine one or more predicted words or phrases from concatenated feature data from the input layers. The separate neural networks may be structured to have multiple sets of layers, for example input layers which extract features from multiple types of signal data and a second set of layers which output one or more predicted words or phrases from concatenated feature data from the input layers. The layers of the separate neural networks may be structured as convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”), to generate a model output from which word content of the silent utterance is derived (para. 0126 “The separate neural networks may be structured to have multiple sets of layers, for example a set of input layers which extract features from individual signal data and a second set of layers which determine one or more predicted words or phrases from concatenated feature data from the input layers. The separate neural networks may be structured to have multiple sets of layers, for example input layers which extract features from multiple types of signal data and a second set of layers which output one or more predicted words or phrases from concatenated feature data from the input layers. The layers of the separate neural networks may be structured as convolutional layers, feedforward layers, transformer layers, conformer layers, and recurrent layers, among other types of neural network layers.”; para. 0127 “The predicted words and phrases 312 from the separate neural networks may then be fed to a comparison module 313.”; para. 0128 “The neural network may output one or more predicted words or phrases 310.”); and causing a voice assistant to be invoked or to perform an assistant action based on the derived word content of the silent utterance (para. 0033 “According to one embodiment, the external device is configured to provide a virtual assistant platform. According to one embodiment, the external device is configured to provide the one or more words or phrases to the virtual assistant platform; determine a response to the one or more words or phrases using the virtual assistant platform; and transmit the response to the wearable device.”).
Garg does not specifically disclose [processed] to generate a sequence of time-aligned waterfall image chunks and wherein the sequence of time-aligned waterfall image chunks is processed.
Kimura teaches [processed] to generate a sequence of time-aligned waterfall image chunks (pg. 3, section 3.1 “Ultrasonic Imaging Device”; pg. 6, section 5 “The real end-to-end silent voice to audio conversion was examined. In this case, a user is asked to mouth a speech command without
actually emitting a sound, and the oral cavity movement is record by an ultrasonic imaging probe. The obtained image sequence is subsequently translated to a voice by the proposed system.”; pgs. 4-5, section 3.2 “Because the frame rate of the ultrasonic images is 30 frames per second, the duration of K ultrasonic images is thus 400 ms. This time duration covers the static and motion features of the utterance. Samples of the ultrasonic images are shown in Figure 5. The K-size image sequence is prepared repeatedly such that one Mel-scale spectrum is created every 20 ms…”) and wherein the sequence of time-aligned waterfall image chunks is processed (Fig. 2, “Neural Network 1“ processes ultrasound image sequence to generate a model output).
Garg and Kimura are considered to be analogous to the claimed invention as
they both are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg to incorporate the teachings of Kimura in order to process the received waveform to generate a sequence of time-aligned waterfall image chunks and to process the sequence of time-aligned waterfall image chunks to generate a model output. Doing so would be beneficial, as this would enable
silent voice interaction, which can provide inconspicuous and accurate silent speech detection (Kimura, pg. 2, para. 0004-0005).
Garg in view of Kimura does not specifically disclose [receiving,] via one or more microphones of a client device, [reflections of an ultrasound signal].
Fu teaches receiving, via one or more microphones of a client device, reflections of an ultrasound signal (pg. 5, Fig. 3, see “Acoustic Sensing”, “Microphone” and “Speaker”).
Garg, Kimura, and Fu are considered to be analogous to the claimed invention as they are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg in view of Kimura to incorporate the teachings of Fu in order to specifically receive reflections via a client device, and to have the client device include a microphone to receive the reflections. Doing so would be beneficial, as this would leverage components already present in many devices, allowing for silent voice detection with smart devices (Condliffe, NPL There’s a Very Obvious Voice Assistant Hack: Ultrasound).
4. Claims 7-8 are rejected under 35 U.S.C. 103 as being unpatentable over Garg in view of Kimura and further in view of Zeng et al. (NPL mSilent: Towards General Corpus Silent Speech Recognition Using COTS mmWave Radar, hereinafter Zeng).
Regarding claim 7, Garg in view of Kimura does not specifically disclose wherein the transformer-based machine learning model includes a linear mapper that linearly projects the sequence of time-aligned waterfall image chunks into an embedding space by generating a sequence of image embeddings for the sequence of time-aligned waterfall image chunks in the embedding space.
Zeng teaches wherein the transformer-based machine learning model includes a linear mapper that linearly projects the sequence of time-aligned waterfall image chunks into an embedding space by generating a sequence of image embeddings for the sequence of time-aligned waterfall image chunks in the embedding space (pg. 15, Fig. 7 and section 5.1 “After obtaining the MSDS, we first design a network to extract high-level short-time features, i.e., what articulatory gestures the MSDS represents”; section 5.1.1 “We first consider a single network branch with a multi-channel spectrogram of shape 𝐹 ×𝑇 × 3 × 3, where 𝐹 ∈ {16, 32, 64} is the number of frequency bins and 𝑇 is the number of STFT segments. Due to the limitation of radar spatial resolution, the input spectrogram has only 3 × 3 spatial channels, so we directly flatten it to 𝐹 ×𝑇 × 9.”; pg. 15, section 5.1.2 “Through this network, the spectrogram of shape 𝐹 ×𝑇 × 9 is gradually transformed to a feature map of shape 4 ×𝑇 × 𝐶, where 𝐶 = 256 is the number of output channels. Finally, instead of using a fully connected (FC) layer, we use the global average pooling (GAP) to aggregate the output to a gesture sequence 𝑋 of shape 𝑇 × C”; ‘T’ images embedding into embedding space via convolutional layers to generate ‘T’ embeddings with frequency dimension ‘F’ and channel dimension ‘9’, which are then converted to ‘T’ number of gestures in gesture sequence ‘X’).
Garg, Kimura, and Zeng are considered to be analogous to the claimed invention as they are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg in view of Kimura to incorporate the teachings of Zeng in order to specifically have the transformer-based machine learning model include a linear mapper to generate a sequence of image embeddings for the sequence of time-aligned waterfall image chunks in the embedding space. Doing so would be beneficial, as this would enable prediction of articulatory gestures which are used for silent speech recognition (Zeng, pg. 3, 2nd para.; pg. 15, section 5.1.2).
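For illustration, a linear mapper of the kind recited can be sketched in PyTorch as a single learned projection applied to each flattened chunk (note that Zeng’s own front-end uses convolutional layers with global average pooling rather than one linear layer; all dimensions below are assumptions):

```python
import torch
import torch.nn as nn

class LinearMapper(nn.Module):
    """Linearly project each flattened waterfall image chunk into a
    d-dimensional embedding, yielding one embedding per chunk."""
    def __init__(self, chunk_height, chunk_width, embed_dim):
        super().__init__()
        self.proj = nn.Linear(chunk_height * chunk_width, embed_dim)

    def forward(self, chunks):               # (batch, T, H, W)
        b, t, h, w = chunks.shape
        flat = chunks.reshape(b, t, h * w)   # flatten each chunk
        return self.proj(flat)               # (batch, T, embed_dim)

mapper = LinearMapper(chunk_height=10, chunk_width=512, embed_dim=256)
emb = mapper(torch.randn(2, 8, 10, 512))     # -> torch.Size([2, 8, 256])
```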
Regarding claim 8, Garg in view of Kimura in further view of Zeng discloses wherein the transformer-based machine learning model includes a text decoder to transcribe the sequence of image embeddings into the word content of the silent utterance (pg. 17 section 5.2.2 “The sequence-to-sequence (Seq2Seq) structure contains two parts: encoder and decoder. The encoder models the contextual information of gesture sequence 𝑋 by the self-attention, and transforms it
into a hidden sequence as one of the inputs of the decoder. The decoder is auto-correlated, i.e., accepts its last output text 𝑌𝑙−1 = [𝑦1, · · · , 𝑦𝑙−1] as the other input at step 𝑙. The decoder calculates the masked self-attention of 𝑌 to learn an internal language model in the corpus, which makes the Seq2Seq structure perform better [43]. The decoder then models the correlations between the two inputs by cross-attention, and outputs the 𝑌𝑙 at step 𝑙, until 𝑦𝑙 is the end-of-sentence (EOS) symbol.”; Fig. 8, transformer-based Seq2Seq back-end uses decoder layers to generate output (the text)).
Garg, Kimura, and Zeng are considered to be analogous to the claimed invention as they are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg in view of Kimura to incorporate the teachings of Zeng in order to specifically have the transformer-based machine learning model include a text decoder to transcribe the sequence of image embeddings into the word content of the silent utterance. Doing so would be beneficial as this would predict characters corresponding to the detected words, enabling silent speech recognition (Zeng, pg. 17, section 5.2.2).
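A minimal sketch of the Seq2Seq text decoding described in Zeng, using PyTorch’s stock transformer decoder; the vocabulary size, model width, head count, and layer count are assumptions:

```python
import torch
import torch.nn as nn

vocab, d = 1000, 256
tok_emb = nn.Embedding(vocab, d)
layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=4)
lm_head = nn.Linear(d, vocab)

# Cross-attend over the sequence of image embeddings ("memory") while
# masked self-attention over previously emitted tokens supplies the
# internal language model, as in a standard Seq2Seq transformer.
memory = torch.randn(2, 8, d)             # embeddings from the mapper
tokens = torch.randint(0, vocab, (2, 5))  # previously emitted tokens
mask = nn.Transformer.generate_square_subsequent_mask(5)
hidden = decoder(tok_emb(tokens), memory, tgt_mask=mask)
logits = lm_head(hidden)                  # (2, 5, vocab) next-token scores
```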
5. Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Garg in view of Kimura and further in view of Cheng, and further in view of Maizels et al. (US 2023/0230575 A1, hereinafter Maizels).
Regarding claim 12, Garg in view of Kimura does not specifically disclose modifying a repetition rate of the ultrasound signal.
Cheng teaches modifying a repetition rate of the ultrasound signal (Cheng, pg. 3 section 3.1 “TwinkleTwinkle leverages a pair of co-located speakers and microphones on mobile devices to transmit and receive inaudible ultrasonic signals carrying signal variations of eyelid motions and employs phase difference-related information in FMCW signals to profile fine-grained eye blink motion trajectories. As the FMCW signals shown in Fig. 2(a), the frequency of the FMCW signal starts from frequency f0 and increases linearly with time t over a chirp duration Tc by bandwidth B, which can be denoted as f(t) = f0 + (B/Tc)t. Thus, the transmitted FMCW signal can be presented as St(t) = A·cos(φt(t)) with its phase calculated as φt(t) = 2π(f0t + Bt²/(2Tc)), where A is the signal’s attenuation. Specifically, TwinkleTwinkle adopts the bandwidth B as 4 kHz in the frequency range from 18 to 22 kHz, which allows our system runs silently on COTS mobile phones without bringing any disturbance. The signal period T is set as a common value of 10.7 ms corresponding to 512 samples per signal period under the 48 kHz sampling rate, which consists of a chirp duration Tc with the first 1 to 480 samples and a blank guard time Tg with the rest 32 samples.”; Fig. 2).
Garg, Kimura, and Cheng are considered to be analogous to the claimed invention as they are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg in view of Kimura to incorporate the teachings of Cheng in order to specifically modify a repetition rate of the ultrasound signal. Doing so would be beneficial, as this would increase SNR, improving penetration with good resolution (NPL Ortiz et al., pg. 423 section 3.1.4. “Pulse Compression”).
Garg in view of Kimura and Cheng does not specifically disclose modifying a repetition rate…based on a motion rate of the mouth motion of the user.
Maizels teaches modifying a repetition rate …based on a motion rate of the mouth motion of the user (para. 0062 “As long as user 24 is not speaking, sensing device 20 operates in a low-power idle mode in order to conserve power of its battery, at an idling step 410. This mode may use a low frame rate, for example twenty frames/sec…”; para. 0052 “Moreover, a processor or of device 20 can automatically switch from idle mode to high power consumption mode based on differing trigger types, such as a sensed input (e.g., eye blinks or mouth slightly open, or a pre-set sequence of motions like tongue movement). Also, the user may activate the device, using, for example a touch button on the device, or from an application in a mobile phone.”; para. 0062 “When such movement is detected, a processor of device 20 instructs to increase the frame rate, for example to the range of 100-200 frames/sec, to enable detection of changes in the secondary coherent light (e.g., speckle) patterns, that occur due to silent speech, at an active capture step 414.”).
Garg, Kimura, Cheng, and Maizels are considered to be analogous to the claimed invention as they are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg in view of Kimura and Cheng to incorporate the teachings of Maizels in order to specifically modify a repetition rate based on a motion rate of a mouth motion of a user. Doing so would be beneficial, as this would conserve battery power when the user is not currently mouthing words (Maizels, para. 0062).
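A minimal sketch of the motion-gated duty-cycling suggested by Maizels’ idle and active frame rates; the threshold and both rates are illustrative assumptions:

```python
def select_repetition_rate(motion_rate_hz, idle_rate_hz=20.0,
                           active_rate_hz=150.0, threshold_hz=0.5):
    """Return the pulse repetition rate for the next sensing window: stay at
    a low, battery-friendly rate while the mouth is essentially still, and
    switch to a high rate once mouth motion is detected."""
    return active_rate_hz if motion_rate_hz > threshold_hz else idle_rate_hz

print(select_repetition_rate(0.0))   # 20.0  -> idle, conserving battery
print(select_repetition_rate(3.2))   # 150.0 -> active silent-speech capture
```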
6. Claims 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Garg in view of Kimura and Fu and further in view of Zeng.
Regarding claim 16, claim 16 is rejected for analogous reasons to claim 7.
Regarding claim 17, claim 17 is rejected for analogous reasons to claim 8.
7. Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Garg in view of Kimura, Fu, Cheng, and Urban et al. (NPL Modulation of ultrasound to produce multifrequency radiation force, hereinafter Urban).
Regarding claim 19, Garg in view of Kimura and Fu does not specifically disclose wherein the ultrasound signal is a Tukey-tapered linear chirp…having a frequency range of approximately 21-22 kHz.
Cheng teaches wherein the ultrasound signal is a Tukey-tapered linear chirp…having a frequency range of approximately 21-22 kHz (pg. 8 section 3.1 “As the FMCW signals shown in Fig. 2(a), the frequency of the FMCW signal starts from frequency f0 and increases linearly with time t over a chirp duration Tc by bandwidth B, which can be denoted as f(t) = f0 + (B/Tc)t. Thus, the transmitted FMCW signal can be presented as St(t) = A·cos(φt(t)) with its phase calculated as φt(t) = 2π(f0t + Bt²/(2Tc)), where A is the signal’s attenuation. Specifically, TwinkleTwinkle adopts the bandwidth B as 4 kHz in the frequency range from 18 to 22 kHz, which allows our system runs silently on COTS mobile phones without bringing any disturbance. The signal period T is set as a common value of 10.7 ms corresponding to 512 samples per signal period under the 48 kHz sampling rate, which consists of a chirp duration Tc with the first 1 to 480 samples and a blank guard time Tg with the rest 32 samples. Such echoes reflected from different distances may cause fewer multipath interferences on the sweep, and echoes from two consecutive chirps have less overlap. Further, a Tukey window is applied to the transmitted chirp signal to eliminate the audible noises caused by spectral leakage due to frequency hopping between successive chirps with fewer samples altered [13, 14, 34, 57].”).
Garg, Kimura, Fu, and Cheng are considered to be analogous to the claimed invention as they are in the same field of silent speech detection. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg in view of Kimura and Fu to incorporate the teachings of Cheng in order to specifically have the ultrasound signal be a Tukey-tapered linear chirp having a frequency range of approximately 21-22 kHz. Utilizing this frequency range would be beneficial as this range fits mobile phone constraints while also preventing audible sound waves which hinder system performance (pg. 20, section 6.4.2). Furthermore, applying a Tukey window would be beneficial to eliminate the audible noises caused by spectral leakage (pg. 8, section 3.1).
Garg in view of Kimura, Fu, and Cheng does not specifically disclose the linear chirp having a repetition rate of approximately 20 Hz.
Urban teaches having a repetition rate of approximately 20 Hz (pg. 5, section III.A, 2nd para. “An AM implemented with a linear FM signal was used with a bandwidth of 1 µHz–5000 Hz, and the frequency sweep was performed in 50 ms”; Fig. 6).
Garg, Kimura, Fu, Cheng, and Urban are considered to be analogous to the claimed invention as they are in the same field of ultrasound imaging. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Garg in view of Kimura, Fu, and Cheng to incorporate the teachings of Urban in order to specifically have the ultrasound signal have a repetition rate of approximately 20 Hz. Doing this would be beneficial, as lower pulse repetition frequencies allow detection over greater ranges (NPL Parker Digital Signal Processing 101: Chapter 19 – Pulse Doppler Radar, section 19.5).
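Putting the Cheng and Urban teachings as applied to claim 19 together, a Tukey-tapered linear chirp sweeping approximately 21-22 kHz and repeating at approximately 20 Hz could be synthesized as sketched below; the sample rate, taper fraction, and one-chirp-per-period timing are assumptions drawn from the claim language, not implementation details from the references:

```python
import numpy as np
from scipy.signal import chirp
from scipy.signal.windows import tukey

fs = 48000
rep_rate = 20                       # pulses per second (approximately 20 Hz)
Tc = 1 / rep_rate                   # one chirp per repetition interval (assumed)
t = np.arange(int(fs * Tc)) / fs

# Linear chirp sweeping ~21-22 kHz, tapered by a Tukey window to suppress
# the audible clicks caused by spectral leakage at the chirp boundaries.
pulse = chirp(t, f0=21000, t1=Tc, f1=22000) * tukey(len(t), alpha=0.1)
signal = np.tile(pulse, rep_rate)   # one second of the repeated pulse train
```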
Allowable Subject Matter
Claims 9-11 and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CODY DOUGLAS HUTCHESON whose telephone number is (703)756-1601. The examiner can normally be reached M-F 8:00AM-5:00PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CODY DOUGLAS HUTCHESON/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659