DETAILED ACTION
This Office action is responsive to the application filed 04 January 2024.
Claims 1-15 are pending and considered below.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-6, 8-13 and 15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Gaddy et al. ("Digital voicing of silent speech." arXiv preprint arXiv:2010.02960 (2020)).
Claim 1:
Gaddy discloses a system for decoding speech of a user, the system comprising:
a speech input device configured to measure a signal indicative of the speech muscle activation patterns of the user while the user is speaking (“By using muscular sensor measurements of speech articulator movement, we aim to capture silent speech - utterances that have been articulated without producing sound”, Introduction, paragraph 1, see also “To capture information about articulator movement, we make use of surface electromyography (EMG). Surface EMG uses electrodes placed on top of the skin to measure electrical potentials caused by nearby muscle activity. By placing electrodes around the face and neck, we are able to capture signals from muscles in the speech articulators”, Introduction paragraph 3);
a trained machine learning model configured to decode the speech of the user based at least in part on the signal indicative of the speech muscle activation patterns of the user (“Our method is built around a recurrent neural transduction model from EMG features to time-aligned speech features”, section 3, paragraph 1, see also “converting EMG input signals to audio outputs”, section 3.1, paragraph 1), wherein:
the trained machine learning model is trained using training data obtained in at least a subset of sampling contexts of a plurality of sampling contexts (silent EMG, vocalized EMG, and vocalized audio) (“the dataset consists of parallel silent / vocalized data, where the same utterances are recorded using both speaking modes. These examples can be viewed as tuples (ES, EV, AV) of silent EMG, vocalized EMG, and vocalized audio, where EV and AV are time-aligned”, section 2, paragraph 1, see also “During training we use all three signals”, Description of Figure 2); and
at least one processor configured to output the decoded speech of the user (“generating synthetic speech to be transmitted or played back”, Introduction, paragraph 1).
Claim 2:
Gaddy discloses the system of claim 1, wherein: the plurality of sampling contexts comprise a plurality of vocalization levels (silent EMG, vocalized EMG, and vocalized audio) (“the dataset consists of parallel silent / vocalized data, where the same utterances are recorded using both speaking modes. These examples can be viewed as tuples (ES, EV, AV) of silent EMG, vocalized EMG, and vocalized audio, where EV and AV are time-aligned”, section 2, paragraph 1).
Claim 3:
Gaddy discloses the system of claim 2, wherein: the plurality of vocalization levels comprise a spectrum of vocalization levels from silent speech to vocalized speech (silent EMG, vocalized EMG, and vocalized audio) (“the dataset consists of parallel silent / vocalized data, where the same utterances are recorded using both speaking modes. These examples can be viewed as tuples (ES, EV, AV) of silent EMG, vocalized EMG, and vocalized audio, where EV and AV are time-aligned”, section 2, paragraph 1).
Claim 4:
Gaddy discloses the system of claim 3, wherein: the spectrum of vocalization levels from silent speech to vocalized speech comprises a discrete spectrum of vocalization levels (silent EMG, vocalized EMG, and vocalized audio) (“the dataset consists of parallel silent / vocalized data, where the same utterances are recorded using both speaking modes. These examples can be viewed as tuples (ES, EV, AV) of silent EMG, vocalized EMG, and vocalized audio, where EV and AV are time-aligned”, section 2, paragraph 1).
Claim 5:
Gaddy discloses the system of claim 3, wherein: the spectrum of vocalization levels from silent speech to vocalized speech comprises a continuous spectrum of vocalization levels (“We train the EMG-to-speech transducer model on various-sized fractions of the dataset, from 10% to 100%”, section 4.3.1, paragraph 1).
Claim 6:
Gaddy discloses the system of claim 2, wherein: the plurality of sampling contexts further comprises a plurality of activity-based sampling contexts (section 2, paragraph 1, producing silent speech and vocalized speech are activities).
Claim 8:
Gaddy discloses the system of claim 2, wherein: the plurality of sampling contexts further comprises a plurality of environmental-based sampling contexts (“electrical noise is removed with notch filters at 60 Hz and its harmonics. Forward-backward filters are used to avoid phase delay. Audio is recorded from a built-in laptop microphone at 16kHz. Background noise is reduced using a spectral gating algorithm, and volume is normalized across sessions based on peak root-mean square levels”, section 2.3, paragraphs 1 and 2).
Claim 9:
Gaddy discloses the system of claim 8, wherein: each of the sampling contexts of the plurality of environmental-based sampling contexts are based at least in part on a location and a noise level of the sampling context (“electrical noise is removed with notch filters at 60 Hz and its harmonics. Forward-backward filters are used to avoid phase delay. Audio is recorded from a built-in laptop microphone at 16kHz. Background noise is reduced using a spectral gating algorithm, and volume is normalized across sessions based on peak root-mean square levels”, section 2.3, paragraphs 1 and 2, see also “The electrode locations”, section 2.3).
Claim 10:
Gaddy discloses the system of claim 8, wherein: each of the sampling contexts of the plurality of environmental-based sampling contexts are based at least in part on the electrical properties of the sampling context (“electrical noise is removed with notch filters at 60 Hz and its harmonics”, section 2.3, paragraph 1).
Claim 11:
Gaddy discloses the system of claim 1, wherein: the trained machine learning model is associated with the user (“We collect a dataset of EMG signals and time aligned audio from a single speaker during both silent and vocalized speech”, section 2, paragraph 1).
Claim 12:
Gaddy discloses the system of claim 11, wherein: the trained machine learning model comprises a plurality of layers; and associating the trained machine learning model with the user comprises associating at least one layer of the plurality of layers with the user (“The LSTM model itself consists of 3 bidirectional LSTM layers with 1024 hidden units, followed by a linear projection to the speech feature dimension”, section 3.1, paragraph 2, note that all layers are associated with the user as all data used in training is collected from a single speaker).
Claim 13:
Gaddy discloses the system of claim 11, wherein: at least a subset of the training data is obtained from signals produced by the user; and associating the trained machine learning model with the user comprises training the machine learning model using the subset of the training data obtained from signals produced by the user (“We train the EMG-to-speech transducer model on various-sized fractions of the dataset, from 10% to 100%”, section 4.3.1, paragraph 1).
Claim 15:
Gaddy discloses the system of claim 1, wherein:
the speech input device is further configured to obtain voiced speech measurements when the user is speaking vocally (“During the vocalized speech we can also record audio AV”, Introduction, paragraph 3);
the trained machine learning model is a first trained machine learning model configured to associate a first signal indicative of the speech muscle activation patterns of the user when the user is speaking silently with a first voiced speech measurement when the user is speaking vocally (“Using a set of utterances recorded in both silent and vocalized speaking modes, we find alignments between the two recordings and use them to associate speech features from the vocalized instance (A’V) with the silent EMG E’S”, Abstract, see also section 3.2 paragraph 1); and
the system further comprises a second trained machine learning model configured to generate an audio and/or text output when the user is speaking silently based at least in part on the association of the first signal indicative of the speech muscle activation patterns of the user with the first voiced speech measurement (“Our method is built around a recurrent neural transduction model from EMG features to time-aligned speech features”, section 3, paragraph 1, see also “converting EMG input signals to audio outputs”, section 3.1, paragraph 1).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Gaddy et al. ("Digital voicing of silent speech." arXiv preprint arXiv:2010.02960 (2020)) in view of Keith, JR (US 2022/0027447).
Claim 7:
Gaddy discloses the system of claim 6, but does not explicitly disclose wherein: the plurality of activity-based sampling contexts comprise two or more of: walking, running, jumping, standing, or sitting.
In a system similarly using a plurality of activity-based sampling contexts for voice signal processing, Keith, JR discloses wherein the plurality of activity-based sampling contexts (situational information) comprise two or more of: walking, running, jumping, standing, or sitting (“Analyzing the situational information also includes determining relationships or correspondences between acquired situational information and the acquired voice information. The relationships/correspondences are able to be learned (e.g., when a user walks, his voice is similar to when the user is at rest (or slightly winded), but when the user runs, his voice sounds more winded, has more pauses and/or any other effects)”, [0310], see also “machine learning is continuously implemented on the device such that any time the user speaks, the device acquires, analyzes and learns from the information. Additionally, the device learns any contextual information such as situational, behavioral, environmental, and/or other information”, [0315]).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to combine the references to yield the predictable result of Gaddy’s plurality of activity-based sampling contexts comprising two or more of: walking, running, jumping, standing, or sitting, because a user’s speech production varies based on the particular physical activity in which the user is engaged, and training for the different scenarios improves voice signal processing (see Keith, JR, [0310]).
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Gaddy et al. ("Digital voicing of silent speech." arXiv preprint arXiv:2010.02960 (2020)) in view of Rameau et al. (US 2022/0208194).
Claim 14:
Gaddy discloses the system of claim 11, but does not explicitly disclose wherein: associating the trained machine learning model with the user comprises using as input to the trained machine learning model, a conditioning flag associated with the user.
In a silent speech detection system similarly associating a trained machine learning model with a user, Rameau discloses wherein associating the trained machine learning model with the user comprises using as input to the trained machine learning model, a conditioning flag associated with the user (“the general predictive model further trained using a relatively smaller patient-specific dataset based on a relatively smaller set of utterances. In certain embodiments, patient-specific utterances may be selected, for example, to represent sounds or words that exhibit more similarly in the patient than in the subjects used to train the general predictive model”, [0018], utterances need to be tagged to be selected as patient-specific).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to combine the references to yield the predictable result of using as input a conditioning flag associated with the user in Gaddy’s trained machine learning model in order to improve the silent speech recognition of a particular speaker by training the model using that particular speaker’s data (see Rameau, [0018]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Janke et al. ("EMG-to-speech: Direct generation of speech from facial electromyographic signals." IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.12 (2017): 2375-2385) discloses a silent speech interface based on facial surface electromyography (EMG), which is used to record the electrical signals that control muscle contraction during speech production. These signals are then converted directly to an audible speech waveform, retaining important paralinguistic speech cues for information such as speaker identity and mood.
Kapur et al. (US 2019/0074012) discloses a system which may detect silent, internal articulation of words by a human user, by measuring low-voltage electrical signals at electrodes positioned on a user's skin. The measured signals may have been generated by neural activation of speech articulator muscles during the internal articulation. The system may detect the content of internally articulated words even though the internal articulation may be silent, may occur even when the user is not exhaling, and may occur without muscle movement that is detectable by another person. The system may react in real-time to this detected content. In some cases, the system reacts by providing audio feedback to the user via an earphone or a bone conduction transducer.
McVicker et al. (US 10,621,973) discloses a sub-vocal speech recognition (SVSR) apparatus which includes a headset that is worn over an ear, with electromyography (EMG) electrodes and an Inertial Measurement Unit (IMU) in contact with the user's skin in positions over the neck, under the chin, and behind the ear. When a user speaks or mouths words, the EMG and IMU signals are recorded by sensors and amplified and filtered, before being divided into multi-millisecond time windows. These time windows are then transmitted to the interface computing device for Mel Frequency Cepstral Coefficients (MFCC) conversion into an aggregated vector representation (AVR). The AVR is the input to the SVSR system, which utilizes a neural network, CTC function, and language model to classify the phonemes. The phonemes are then combined into words and sent back to the interface computing device, where they are played either as audible output, such as from a speaker, or as non-audible output, such as text.
Maizels et al. (US 11,908,478) discloses a method for generating speech by uploading a reference set of features that were extracted from sensed movements of one or more target regions of skin on faces of one or more reference human subjects in response to words articulated by the subjects and without contacting the one or more target regions. A test set of features is extracted from the sensed movements of at least one of the target regions of skin on a face of a test subject in response to words articulated silently by the test subject and without contacting the one or more target regions. The extracted test set of features is compared to the reference set of features, and, based on the comparison, a speech output is generated that includes the articulated words of the test subject.
Ziv (US 2025/0085780) discloses systems and methods for performing operations comprising: detecting, by one or more electromyograph (EMG) electrodes of an EMG communication device, subthreshold muscle activation signals of one or more muscles associated with speech production, the subthreshold muscle activation signals being generated in response to inner speech of a user; applying a machine learning technique to the subthreshold muscle activation signals to estimate one or more speech features corresponding to the subthreshold muscle activation signals, the machine learning technique being trained to establish a relationship between a plurality of training subthreshold muscle activation signals and ground truth speech features; generating visual or audible output based on the one or more speech features; and causing the visual or audible output to be processed by a messaging application to engage a feature of the messaging application.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SAMUEL G NEWAY whose telephone number is (571)270-1058. The examiner can normally be reached Monday-Friday 9:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SAMUEL G NEWAY/Primary Examiner, Art Unit 2657