Last updated: May 29, 2026
Application No. 18/271,416
SPEECH-ANALYSIS BASED AUTOMATED PHYSIOLOGICAL AND PATHOLOGICAL ASSESSMENT

Non-Final OA §103
Filed: Jul 07, 2023
Priority: Jan 13, 2021 (EU 21151442.7, +2 more)
Examiner: TENGBUMROONG, NATHAN NARA
Art Unit: 2654
Tech Center: 2600 (Communications)
Assignee: UNIVERSITÄTSSPITAL BASEL
OA Round: 2 (Non-Final)
This examiner grants 47% of cases after interview (+26.7% interview lift). A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 19 resolved cases, 2023–2026
Examiner Intelligence

TENGBUMROONG, NATHAN NARA
Career Allowance Rate: grants 47% of resolved cases (9 granted / 19 resolved), -14.6% vs TC avg
Interview Lift: +26.7% (resolved cases with interview vs. without)
Avg Prosecution: 3y 0m (typical timeline); 21 currently pending
Statute-Specific Performance

§103: 98.3% (+58.3% vs TC avg)
§102: 1.7% (-38.3% vs TC avg)
Based on career data from 19 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) were submitted on 10/10/2025 and 1/06/2026. The submissions are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Response to Amendment
Claims 1-2, 4-8, 12-13, 16-18, and 21 are amended. Claims 1-14 and 16-21 are presented for examination.

Response to Arguments
Rejection under 35 U.S.C. 101
Applicant’s arguments have been fully considered and are persuasive.  Independent claims 1, 12, and 13 recite determining different metrics, such as correct word rate, from a word-reading test by computing Mel-frequency cepstral coefficients for multiple segments in a voice recording of the reading test to obtain vectors, clustering and labeling the vectors based on the words, predicting a sequence of words using the labels, performing sequence alignment between predicted words and words in the reading test using a selected label with the highest alignment score, and comparing the output metric with a reference value. Thus, the claims provide an improvement in assessing pathological and physiological conditions of patients using and analyzing word-reading tests. 

Rejection under 35 U.S.C. 103
Applicant’s arguments have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 9, and 11-14 are rejected under 35 U.S.C. 103 as being unpatentable over Shallom (US 20200294531 A1) in view of Voss et al. (US 20220199071 A1; hereinafter referred to as Voss), Bellegarde (US 20110004475 A1), and Le Roux et al. (US 20190318725 A1; hereinafter referred to as Le Roux).
Regarding claim 1, Shallom teaches: a method of assessing a pathological and/or physiological state of a subject ([0139] FIG. 1, which is a schematic illustration of a system 20 for evaluating the physiological state of a subject 22, in accordance with some embodiments of the present invention), the method comprising: obtaining a voice recording from a word-reading test from the subject ([0059] the test speech sample includes a predetermined utterance that includes at least one of the identified speech units), wherein the voice recording is from a word-reading test comprising reading a sequence of words drawn from a set of n words ([0125] the subject's mobile phone may prompt the subject to produce the reference samples by repeating one or more designated sentences, words, or syllables, which may contain any number of designated phonemes, diphones, triphones, and/or other acoustic phonetic units (APUs). As the subject produces the reference samples, a microphone belonging to the mobile phone may record the samples. Subsequently, a processor belonging to the mobile phone or to a remote server may construct, from the samples, a model that represents the particular utterance. Subsequently, to acquire the test sample, the system prompts the subject to repeat the utterance);
and analysing the voice recording, or a portion thereof, by: identifying a plurality of segments of the voice recording… ([0128] the system may prompt the subject to produce, for the test sample, any particular utterance that includes the speech units for which the speech-unit models were constructed. The system may then identify these speech units in the test sample) that correspond to single words or syllables ([0125] the subject's mobile phone may prompt the subject to produce the reference samples by repeating one or more designated sentences, words, or syllables, which may contain any number of designated phonemes, diphones, triphones, and/or other acoustic phonetic units (APUs));
and comparing the value of the one or more metrics with one or more respective reference values ([0130] directly compares the test speech sample to each of the individual reference samples that were previously acquired. For example, to acquire a reference sample, the system may prompt the subject to utter a particular utterance. Subsequently, to acquire the test sample, the system may prompt the subject to utter the same utterance, and the two samples may then be compared to one another).
Shallom does not explicitly, but Voss discloses: determining a value of one or more metrics selected from a breathing %, unvoicing/voicing ratio, voice pitch and correct word rate at least in part based on the identified segments ([0057] Process 100 generates (120) output based on the determined alignment. In some embodiments, outputs can include scores for the audio data, where determined alignments can be used to generate scores for the audio data associated with the target data. Scoring in accordance with certain embodiments of the invention can measure the performance of a reader along various measures, such as (but not limited to) accuracy, completeness, speed, etc. For example, processes in accordance with some embodiments of the invention can include scores that indicate performance on reading for individual phonemes, for subcomponents of words, for entire words, for a sequence of words, etc.), wherein determining the value of the one or more metrics comprises determining the correct word rate associated with the recording ([0057] Scoring in accordance with certain embodiments of the invention can measure the performance of a reader along various measures, such as (but not limited to) accuracy, completeness, speed, etc.), wherein determining the correct word rate comprises: computing one or more Mel-frequency cepstral coefficients (MFCCs) for each of the identified segments to obtain a plurality of vectors of values, each vector being associated with a segment… ([0064] Process 200 encodes (210) audio data. In some embodiments, audio data can be encoded to various formats, such as (but not limited to) vectors of acoustic features derived from audio data, mel-frequency cepstral coefficients (MFCC), spectrogram data, neural embeddings, etc.);
and performing a sequence alignment between the predicted sequence of words and the sequence of words used in the word reading test… ([0054] processes can predict text based on the audio data, and alignments can be determined based on similarities between the predicted text and the target text. Target text can be a word reading test.);
wherein matches in the alignment correspond to correctly read words in the voice recording ([0057] Process 100 generates (120) output based on the determined alignment. In some embodiments, outputs can include scores for the audio data, where determined alignments can be used to generate scores for the audio data associated with the target data. Scoring in accordance with certain embodiments of the invention can measure the performance of a reader along various measures, such as (but not limited to) accuracy, completeness, speed, etc. For example, processes in accordance with some embodiments of the invention can include scores that indicate performance on reading for individual phonemes, for subcomponents of words, for entire words, for a sequence of words, etc.), wherein performing a sequence alignment comprises obtaining an alignment score and the best alignment is the alignment with a highest alignment score… ([0093] A particularly stable estimate can be achieved in accordance with several embodiments of the invention by defining the optimal alignment to be the alignment which corresponds to highest score sum for the largest possible j that brings the score through j (normalized by j itself the number of mapped targets) above a certain threshold).
Shallom and Voss are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom to combine the teachings of Voss because doing so would allow for target sequences, such as the word-reading test, to be encoded, which would improve performance in determining a sequence alignment between predicted and target words (Voss [0063] performance may be improved by computing alignments over phonetic representations rather than character representations of language. In such cases, target data may be encoded by looking up a representation of an input target in a phoneme dictionary and/or using a phonetic decoder to transform characters into phonemes or other graphemes (e.g., combinations of phonemes). To improve performance, some of these encoding steps over the targets may be performed in advance so that the already-encoded targets are passed into the process so that less or no further encoding is required at runtime).
The combination of Shallom and Voss does not explicitly, but Bellegarde teaches: clustering the plurality of vectors of values into n clusters, wherein each cluster has n possible labels corresponding to each of the n words ([0084] As shown in FIG. 11, each cluster label represents a set of frames 1107 of the acoustic signal. Each frame of the acoustic signal, e.g., frame 1109, can be described by a set of feature vector parameters, e.g., MFCCs);
for each of the n! permutations of labels, predicting a sequence of words in the voice recording using the labels associated with the clustered vectors of values… ([0073] determination of the likelihood 611 of the recognized word sequence 613 based on phoneme sequence 609 associated with the recovered cluster label sequence 605).
Shallom, Voss, and Bellegarde are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom and Voss to combine the teachings of Bellegarde because doing so would allow for MFCC vector values to be clustered and labeled to represent the frames of an input audio, improving the prediction of word sequences using the labeled clusters (Bellegarde [0077] a word sequence "did you" may be realized with a "D" followed by a "y", or it may be realized with a "D" followed by a "J". That may result into two slightly different cluster label sequences distorted from a cluster label sequence associated with the observed acoustic evidence. The detailed distortion model for the cluster labels can be applied to the observed acoustic evidence to determine the recovered cluster label sequence, as described in further detail below).
The combination of Shallom, Voss, and Bellegarde does not explicitly, but Le Roux teaches: and selecting the labels ([0193] The label permutation minimizing the CTC loss between Y.sub.ctc and R is selected, and the decoder network generates output label sequences using the permuted reference labels for teacher forcing) that result in a best alignment… ([0190] the forward-backward algorithm of CTC enforces monotonic alignment between the input speech and the output label sequences during training and decoding. The CTC loss can be calculated from the output of the encoder network). 
Shallom, Voss, Bellegarde, and Le Roux are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, and Bellegarde to combine the teachings of Le Roux because doing so would allow the use of attention-based models to learn and determine the most desirable alignment using labels during training, leading to improved sequence alignment (Le Roux [0167-0168] Attention-based models make predictions conditioned on all the previous predictions, and thus can learn language-model-like output contexts. However, without strict monotonicity constraints, these attention-based decoder models can be too flexible and may learn sub-optimal alignments or converge more slowly to desirable alignments. In the hybrid system, the BLSTM encoder is shared by both the CTC and attention decoder networks. Unlike the attention model, the forward-backward algorithm of CTC enforces monotonic alignment between speech and label sequences during training).
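For readers following the claim 1 limitations, the label-permutation and alignment steps can be illustrated with a minimal sketch. It is not taken from the application or from any cited reference: the Needleman-Wunsch scoring values, the function names `alignment_score` and `best_labelling`, the three-word set, and the cluster indices are all assumptions chosen for illustration. Each of the n! assignments of words to clusters yields a predicted word sequence, which is aligned against the displayed sequence; the assignment with the highest alignment score is kept.

```python
from itertools import permutations


def alignment_score(predicted, reference, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score between two word lists.

    The scoring values are illustrative assumptions.
    """
    prev = [j * gap for j in range(len(reference) + 1)]
    for i in range(1, len(predicted) + 1):
        curr = [i * gap]
        for j in range(1, len(reference) + 1):
            same = predicted[i - 1] == reference[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_labelling(cluster_ids, words, reference):
    """Try all n! word-to-cluster assignments and keep the best-scoring one."""
    best = None
    for perm in permutations(words):
        # perm[c] is the word assigned to cluster c under this permutation.
        predicted = [perm[c] for c in cluster_ids]
        score = alignment_score(predicted, reference)
        if best is None or score > best[0]:
            best = (score, perm, predicted)
    return best


# Hypothetical data: one cluster index per identified segment.
words = ["red", "green", "blue"]
reference = ["red", "green", "blue", "green", "red"]
cluster_ids = [2, 0, 1, 0, 2]
score, labels, predicted = best_labelling(cluster_ids, words, reference)
```

Matches in the winning alignment would then be counted as correctly read words. Exhaustive enumeration is only practical for small n, since the number of permutations grows factorially.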

Regarding claim 9, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. Shallom further teaches: wherein the n words: (i) are monosyllabic or disyllabic ([0125] the subject's mobile phone may prompt the subject to produce the reference samples by repeating one or more designated sentences, words, or syllables, which may contain any number of designated phonemes, diphones, triphones, and/or other acoustic phonetic units (APUs). The syllables can be monosyllabic, disyllabic, or any other combination.), and/or (ii) each include one or more vowels that are internal to the respective word ([0125] other acoustic phonetic units (APUs). This can include vowels.); and/or (iii) each include a single emphasized syllable ([0125] any number of designated phonemes, diphones, triphones. This can include a designated phoneme.); and/or (iv) are color words, optionally wherein the words are displayed in a single color in the word reading test, or wherein the words are displayed in a color independently chosen from a set of m colors in the word reading test.

Regarding claim 11, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. Shallom further teaches: wherein the sequence of words comprises a predetermined number of words ([0125] the subject's mobile phone may prompt the subject to produce the reference samples by repeating one or more designated sentences, words, or syllables, which may contain any number of designated phonemes, diphones, triphones, and/or other acoustic phonetic units (APUs). A designated sentence has a predetermined number of words.), optionally at least 20, at least 30 or about 40 words, and/or wherein obtaining a voice recording comprises receiving a word recording from a computing device associated with the subject, optionally wherein obtaining a voice recording further comprises causing a computing device associated with the subject to display the sequence of words ([0180] the verbal content of the utterance may be displayed on the screen of the device, and the subject may be requested to read the verbal content aloud), and/or to record a voice recording and/or to emit a fixed length tone, then to record a voice recording.

Regarding claim 12, Shallom teaches: a method of monitoring a subject with heart failure, or diagnosing a subject as having worsening of heart failure or decompensated heart failure ([0117] by analyzing the subject's speech, the system may identify an onset of, or a deterioration with respect to, a physiological condition such as congestive heart failure (CHF), coronary heart disease, atrial fibrillation or any other type of arrhythmia). The rest of the claim recites the same limitations as claim 1 and is rejected similarly.

Regarding claim 13, Shallom teaches: a method of assessing a level of dyspnea and/or fatigue in a subject or monitoring a subject that has been diagnosed as having or being at risk of having a condition associated with dyspnea and/or fatigue ([0117] by analyzing the subject's speech, the system may identify an onset of, or a deterioration with respect to, a physiological condition such as congestive heart failure (CHF), coronary heart disease, atrial fibrillation or any other type of arrhythmia, chronic obstructive pulmonary disease (COPD), asthma, interstitial lung disease, pulmonary edema, pleural effusion, Parkinson's disease, or depression. Dyspnea correlates with both asthma and ILD.). The rest of the claim recites the same limitations as claim 1 and is rejected similarly.

Regarding claim 14, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 13. Shallom further teaches: wherein the method is for assessing the level of dyspnea and/or fatigue in the subject… ([0117] by analyzing the subject's speech, the system may identify an onset of, or a deterioration with respect to, a physiological condition such as congestive heart failure (CHF), coronary heart disease, atrial fibrillation or any other type of arrhythmia, chronic obstructive pulmonary disease (COPD), asthma, interstitial lung disease).
Voss further teaches: and wherein the one or more metrics include the correct word rate ([0057] Scoring in accordance with certain embodiments of the invention can measure the performance of a reader along various measures, such as (but not limited to) accuracy, completeness, speed, etc.).

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Shallom in view of Voss, Bellegarde, and Le Roux, as applied to claims 1, 9, and 11-14 above, and further in view of Visser et al. (US 20110264447 A1; hereinafter referred to as Visser).
Regarding claim 2, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly, but Visser discloses: wherein identifying segments of the voice recording that correspond to single words or syllables comprises: obtaining a power Mel-spectrogram of the voice recording ([0077] FIGS. 1A and 1B show an example of the first-order derivative of spectrogram power of a segment of recorded speech over time);
computing a maximum intensity projection of the power Mel spectrogram along a frequency axis ([0085] Task T200 calculates a value of the energy E(k,n) (also called "power" or "intensity") for each frequency component k of segment n over a desired frequency range. FIG. 2B shows a flowchart for an application of method M100 in which the audio signal is provided in the frequency domain);
and defining a segment boundary as a time point where the maximum intensity projection of the Mel spectrogram along the frequency axis crosses a threshold ([0115] A gain based VAD operation may be configured to indicate voice detection, for example, when the ratio of the energies of two channels exceeds a threshold value (indicating that the signal is arriving from a near-field source and from a desired one of the axis directions of the microphone array). Such a detector may be configured to operate on the signal in the frequency domain (e.g., over one or more particular frequency ranges) or in the time domain).
Shallom, Voss, Bellegarde, Le Roux, and Visser are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, and Le Roux to combine the teachings of Visser because doing so would allow for more accurate determinations of words and syllables in a voice recording, leading to improved sequence alignment (Visser [0111] Indication of speech onsets and/or offsets (or a combined onset/offset score) as produced by an implementation of method M100 as described herein may be used to improve the accuracy of a VAD stage and/or to quickly track energy changes in time. For example, a VAD stage may be configured to combine an indication of presence or absence of a transition in voice activity state, as produced by an implementation of method M100, with an indication as produced by one or more other VAD techniques (e.g., using AND or OR logic) to produce a voice activity detection signal).
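The segmentation recited in claim 2 can be sketched as follows. This is an illustration under stated assumptions, not the applicant's or Visser's implementation: the Mel spectrogram is assumed to be already computed and passed in as a list of frames (one list of per-band powers per frame), and the function name `segment_boundaries`, the threshold, and the sample values are hypothetical. The maximum intensity projection along the frequency axis is simply the largest band power in each frame, and a boundary is placed wherever that projection crosses the threshold.

```python
def segment_boundaries(mel_power, frame_times, threshold):
    """Return (start, end) times where the per-frame maximum crosses threshold.

    mel_power: list of frames, each a list of per-band power values.
    frame_times: time stamp of each frame, same length as mel_power.
    """
    # Maximum intensity projection along the frequency axis.
    projection = [max(frame) for frame in mel_power]
    segments, start = [], None
    for t, value in zip(frame_times, projection):
        if value >= threshold and start is None:
            start = t  # upward crossing opens a segment
        elif value < threshold and start is not None:
            segments.append((start, t))  # downward crossing closes it
            start = None
    if start is not None:
        segments.append((start, frame_times[-1]))
    return segments


# Hypothetical two-band spectrogram with six frames.
mel_power = [[0.1, 0.2], [0.9, 0.3], [0.8, 0.1], [0.1, 0.1], [0.2, 0.7], [0.1, 0.0]]
frame_times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
segments = segment_boundaries(mel_power, frame_times, threshold=0.5)
```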

Claims 3-4 and 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Shallom in view of Voss, Bellegarde, and Le Roux, as applied to claims 1, 9, and 11-14 above, and further in view of Patel et al. (US 20230329630 A1; hereinafter referred to as Patel).
Regarding claim 3, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly, but Patel teaches: wherein determining the value of one or more metrics comprises determining a breathing percentage associated with the recording as the percentage of time in the voice recording that is between the identified segments ([0111] some aspects of the disclosure include determining a percentage, proportion, or ratio of frames of a phonation recording that are voiced. Alternatively, this feature may be determined using unvoiced frames. In some instances of determining voiced (or unvoiced) frames, a predetermined pitch threshold may be applied so that the percentage of voiced or unvoiced frames is being termed for frames that have suspected speech. The unvoiced frames can be the time that is between segments.).
Shallom, Voss, Bellegarde, Le Roux, and Patel are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, and Le Roux to combine the teachings of Patel because doing so would allow for a better physiological assessment of a subject by analyzing respiratory conditions using breathing/voicing metrics to determine the condition of patients (Patel [0009] utilizing acoustic features from voice recordings to monitor respiratory condition enable improved accuracy in treating individuals with respiratory conditions. For example, a potential respiratory condition of the individual may be tracked at home in accordance with this disclosure utilizing the voice recordings to more precisely determine when treatment, such as an antibiotic, is needed rather than prescribing treatment to an individual prematurely and/or for too long a time period).

Regarding claim 4, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly, but Patel teaches: wherein determining the value of one or more metrics comprises determining a unvoicing/voicing ratio associated with the recording as a ratio of a time between the identified segments in the recording and a time within identified segments in the recording ([0115] Pause length may refer to pauses in a user's speech that are at least a predetermined minimum duration, such as 200 milliseconds. In some aspects, pauses used to determine an average pause length and/or pause count may be determined by utilizing an automated speech-to-text algorithm to generate text from user's voice sample, determine timestamps for when a user starts a word and when a user finishes a word, and, using the timestamps, determining the durations between words. The global SNR may be the signal-to-noise ratio over the recording that includes nonspoken time).

Regarding claim 20, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly, but Patel teaches: wherein identifying segments of the voice recording that correspond to single words or syllables further comprises: excluding segments that represent erroneous detections ([0092] Sample recording auditor 2608 may further perform trimming, cutting, or filtering to remove unnecessary and/or un-useable portions of a voice sample recording. In some embodiments, sample recording auditor 2608 may work with signal preparation processor 2606 to perform such actions. For example, sample recording auditor 2608 may trim a beginning portion and an end portion (e.g., 0.25 seconds) from each recording. Usable portions of a voice sample may include voice-related data that is sufficient for further processing to determine phoneme or feature information) by removing segments shorter than a predetermined threshold and/or with mean relative energy below a predetermined threshold ([0101] phoneme segmenter 2610 may perform automated segmentation by applying thresholds to detected intensity levels in the voice samples. For example, acoustic intensity throughout a recording may be computed, and a threshold for separating background noise from more energetic events in the sample (representing speech events) may be applied).
Shallom, Voss, Bellegarde, Le Roux, and Patel are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, Le Roux, and Patel to further combine the teachings of Patel because doing so would allow for higher quality processing of voice recording segments by removing segments that do not meet a criterion (Patel [0036] In some embodiments, pre-processing or signal condition operations may be performed to facilitate detecting phonemes and/or determining phoneme features. These operations may include, for example, trimming the audio sample data, frequency filtering, normalization, removing background noise, intermittent spikes, other acoustic artifacts, or other operations as described herein).
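The exclusion step of claim 20 can be pictured with a short sketch. The tuple layout, the function name `filter_segments`, and both threshold values are assumptions made for illustration, and the sketch applies both criteria together although the claim recites them as "and/or".

```python
def filter_segments(segments, min_duration, min_rel_energy):
    """Drop segments that look like erroneous detections.

    segments: list of (start, end, mean_relative_energy) tuples.
    A segment is kept only if it is long enough and energetic enough.
    """
    return [
        (start, end, energy)
        for start, end, energy in segments
        if (end - start) >= min_duration and energy >= min_rel_energy
    ]


# Hypothetical detections: one plausible word, one click, one faint noise.
detected = [(0.0, 0.30, 0.8), (0.40, 0.42, 0.9), (0.5, 0.9, 0.05)]
kept = filter_segments(detected, min_duration=0.1, min_rel_energy=0.2)
```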

Regarding claim 21, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly, but Patel teaches: wherein determining the value of the one or more metrics comprises determining a breathing percentage associated with the recording as a ratio of a time between the identified segments in the recording and a sum of a time between the identified segments and within identified segments in the recording ([0115] Pause length may refer to pauses in a user's speech that are at least a predetermined minimum duration, such as 200 milliseconds. In some aspects, pauses used to determine an average pause length and/or pause count may be determined by utilizing an automated speech-to-text algorithm to generate text from user's voice sample, determine timestamps for when a user starts a word and when a user finishes a word, and, using the timestamps, determining the durations between words. The global SNR may be the signal-to-noise ratio over the recording that includes nonspoken time).
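The timing metrics recited in claims 4 and 21 reduce to simple arithmetic on segment boundaries. The sketch below is illustrative only; the function name `timing_metrics`, the dictionary keys, and the sample segments are assumptions. Time within segments is treated as voiced, time between consecutive segments as unvoiced, and time before the first or after the last segment is ignored.

```python
def timing_metrics(segments):
    """Compute breathing percentage and unvoicing/voicing ratio.

    segments: list of (start, end) times of identified words or syllables.
    """
    segments = sorted(segments)
    voiced = sum(end - start for start, end in segments)
    unvoiced = sum(nxt[0] - cur[1] for cur, nxt in zip(segments, segments[1:]))
    return {
        # Claim 21: between-segment time over (between + within) time.
        "breathing_pct": 100.0 * unvoiced / (voiced + unvoiced),
        # Claim 4: between-segment time over within-segment time.
        "unvoiced_voiced_ratio": unvoiced / voiced,
    }


# Hypothetical recording: three 1.0 s words separated by 0.5 s pauses.
metrics = timing_metrics([(0.0, 1.0), (1.5, 2.5), (3.0, 4.0)])
```

With these sample values the pauses total 1.0 s and the words 3.0 s, so the breathing percentage is 25% and the unvoicing/voicing ratio is one third.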

Claims 5 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Shallom in view of Voss, Bellegarde, and Le Roux, as applied to claims 1, 9, and 11-14 above, and further in view of Laaksonen et al. (US 20140019125 A1; hereinafter referred to as Laaksonen).
Regarding claim 5, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly, but Laaksonen teaches: wherein determining the value of the one or more metrics comprises determining a voice pitch ([0103] The speech decoder 201 can furthermore generate or recover a fundamental frequency f.sub.0 value or pitch estimate based on a pitch period estimate performed in the associated encoder and passed along with the encoded narrowband signal) associated with the recording by obtaining one or more estimates of a fundamental frequency for each of the identified segments ([0136] The fundamental frequency f.sub.0 estimate from the audio signal can in some embodiments be determined for each input frame).
Shallom, Voss, Bellegarde, Le Roux, and Laaksonen are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, and Le Roux to combine the teachings of Laaksonen because doing so would allow for an improvement in speech quality by postprocessing a speech signal, leading to a better determination of voice pitch (Laaksonen [0008] embodiments of the application attempt to improve the perceived quality and intelligibility of the narrowband telephone speech by post-processing the speech signal received or recovered and by artificially widening the low frequency content below the telephone band, based solely on information extracted from the received speech signal when the sound reproduction system is capable of reproducing low frequencies).

Regarding claim 16, the combination of Shallom, Voss, Bellegarde, Le Roux, and Laaksonen teaches: the method of claim 5. Laaksonen further teaches: wherein determining the value of the voice pitch comprises obtaining a plurality of estimates of the fundamental frequency for each of the identified segments ([0136] The fundamental frequency f.sub.0 estimate from the audio signal can in some embodiments be determined for each input frame), and applying a filter to the plurality of estimates to obtain a filtered plurality of estimates ([0175] uses a first order recursive filter to smooth the fundamental frequency estimates for consecutive frames and thus reduce a rapid variation of sine wave amplitudes), and/or wherein determining the value of the voice pitch comprises obtaining a summarised voice pitch estimate for a plurality of segments, and/or wherein determining the value of the voice pitch comprises obtaining a mean, median or mode of the (optionally filtered) plurality of estimates for the plurality of segments.
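The pitch summarisation recited in claims 5 and 16 can be sketched as a filter followed by an aggregate. The choices below are assumptions for illustration only: a sliding median filter stands in for "a filter" (Laaksonen describes a first-order recursive filter instead), the per-segment summary is a mean, the cross-segment summary is a median, and the fundamental-frequency estimates are assumed to come from some upstream pitch tracker.

```python
from statistics import median


def summarise_pitch(f0_by_segment, window=3):
    """Summarise voice pitch from per-segment fundamental-frequency estimates.

    f0_by_segment: list of segments, each a list of f0 estimates in Hz.
    window: width of the sliding median filter (an illustrative choice).
    """
    half = window // 2
    per_segment = []
    for estimates in f0_by_segment:
        # Sliding median suppresses isolated outliers such as octave errors.
        smoothed = [
            median(estimates[max(0, i - half): i + half + 1])
            for i in range(len(estimates))
        ]
        per_segment.append(sum(smoothed) / len(smoothed))
    # One summarised value across all segments.
    return median(per_segment)


# Hypothetical estimates: the 300 Hz value is an outlier in the first segment.
pitch = summarise_pitch([[100, 100, 300, 100, 100], [120, 120, 120]])
```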

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Shallom in view of Voss, Bellegarde, and Le Roux, as applied to claims 1, 9, and 11-14 above, and further in view of Medan (US 20210104174 A1).
Regarding claim 6, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly, but Medan teaches: wherein determining the value of one or more metrics comprises determining the correct word rate associated with the voice recording by computing the ratio of a number of identified segments corresponding to correctly read words divided by a time duration between a start of a first identified segment and an end of a last identified segment ([0017] Analyzing the user's speech quality may include determining, evaluating and/or measuring reaction time, number of attempts, order of words, stuttering, omission of words, mispronunciation of words/syllables, length of response time, rate of speech, "swallowing" of words, ratio between mispronounced and correctly pronounced words, speech fluency, use of correct word types, grammar correctness, use of key words (i.e., given a certain prompt by the VA, the user is expected to say certain words, reflecting the richness of their vocabulary), number of correct attempts, length of utterance, pitch of speech, intensity of speech or any combination thereof. Rate of speech, length of utterance, and ratio between mispronounced and correctly pronounced words can be used together.).
Shallom, Voss, Bellegarde, Le Roux, and Medan are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, and Le Roux to combine the teachings of Medan because doing so would allow for user speech to be closely analyzed in order to determine a user's speech struggles for better physiological assessment using relevant metrics such as correct word rate (Medan [0035] Once the personal practice protocol was uploaded, the user receives from the system requests to perform a task, which includes saying one or more words associated with their specific speech/lingual pathology (step 106). For example, if the user has a problem with the pronunciation of the letter "s", the protocol may include tasks which require the user to say word(s)/sentence(s) with the letter "s". If the user has difficulties with grammar, the tasks will involve tasks that relate to the user's grammar problems. If the user is struggling with stuttering, the system may provide a task that will challenge the speech fluency, etc.).
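For illustration only, the correct word rate recited in claim 6 (correctly read segments divided by the time from the start of the first segment to the end of the last) can be sketched as follows. The tuple layout `(start_s, end_s, is_correct)` and the function name are assumptions made for this sketch.

```python
def correct_word_rate(segments):
    """Correctly read words per second over the span of the identified segments.

    segments: time-ordered list of (start_s, end_s, is_correct) tuples.
    This layout is hypothetical; the record does not specify a data structure.
    """
    if not segments:
        raise ValueError("at least one segment is required")
    # Duration runs from the start of the first segment to the end of the last.
    duration = segments[-1][1] - segments[0][0]
    if duration <= 0:
        raise ValueError("segments must span a positive duration")
    n_correct = sum(1 for _, _, ok in segments if ok)
    return n_correct / duration


segments = [(0.0, 0.5, True), (1.0, 1.5, False), (2.0, 2.5, True), (3.0, 4.0, True)]
print(correct_word_rate(segments))  # 0.75 (3 correct words over 4.0 seconds)
```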

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Shallom in view of Voss, Bellegarde, and Le Roux, as applied to claims 1, 9, and 11-14 above, and further in view of Deng et al. (US 20040019483 A1; hereinafter referred to as Deng) and Faifkov et al. (US 20100324900 A1; hereinafter referred to as Faifkov).
Regarding claim 7, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. Le Roux further teaches: and/or wherein clustering the plurality of vectors of values into n clusters is performed using k-means… ([0077] For example, the separation encoding estimation from embeddings module 1163 can use a clustering algorithm such as the k-means algorithm to cluster the embedding vectors into C groups).
The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly teach, but Deng teaches: computing one or more MFCCs to obtain a vector of values for a segment comprises: computing a set of i MFCCs for each frame of the segment for each i ([0111] The frames of data created by frame constructor 707 are provided to feature extractor 708, which extracts a feature from each frame. Examples of feature extraction modules include… Mel-Frequency Cepstrum Coefficients (MFCC) feature extraction) and obtaining a set of j values for the segment by interpolation ([0010] identifies an articulatory dynamics value by performing a linear interpolation between a production-related dynamics value at a previous time and a target using a time-dependent interpolation weight. The production-related dynamics value is then used to form a predicted acoustic feature value that is compared to an observed acoustic feature value), preferably linear interpolation, to obtain a vector of ixj values for the segment… ([0036] The mapping model predicts a sequence of acoustic observation vectors given the sequence of trajectory values, the phone sequence and the phone boundaries).
Shallom, Voss, Bellegarde, Le Roux, and Deng are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, and Le Roux to combine the teachings of Deng because doing so would allow for the use of linear interpolation to improve prediction of a sequence of words (Deng [0010] performing a linear interpolation between a production-related dynamics value at a previous time and a target using a time-dependent interpolation weight. The production-related dynamics value is then used to form a predicted acoustic feature value that is compared to an observed acoustic feature value to determine the likelihood that the observed acoustic feature value was produced by a given phonological unit).
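For illustration only, one plausible reading of the claim 7 feature step is sketched below: each of the i MFCC tracks of a segment is linearly interpolated onto j evenly spaced points, giving a fixed-length vector of i x j values regardless of segment duration. That reading, and all names in the code, are assumptions; the MFCC values themselves are taken as given input rather than computed.

```python
def resample_linear(values, j):
    """Linearly interpolate a sequence onto j evenly spaced points."""
    n = len(values)
    if n == 1:
        return [values[0]] * j
    if j == 1:
        return [values[0]]
    out = []
    for k in range(j):
        pos = k * (n - 1) / (j - 1)  # fractional index into the original track
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def segment_vector(mfcc_frames, j):
    """mfcc_frames: list of frames, each holding i MFCC values. Returns i*j values."""
    n_coeffs = len(mfcc_frames[0])
    vector = []
    for c in range(n_coeffs):
        track = [frame[c] for frame in mfcc_frames]  # one coefficient over time
        vector.extend(resample_linear(track, j))
    return vector


frames = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]  # 3 frames, i = 2 coefficients
print(segment_vector(frames, 5))
# [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 10.0, 10.0, 10.0]
```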
The combination of Shallom, Voss, Bellegarde, Le Roux, and Deng does not explicitly teach, but Faifkov teaches: and/or wherein the sequence alignment step is performed using a local sequence alignment algorithm, preferably a Smith-Waterman algorithm ([0046] Start and finish times (t.sub.i,t.sub.f) are input to a sequence alignment algorithm preferably based on a dynamic program algorithm such as the well-known Smith-Waterman algorithm for sequence alignment in which one sequence is target phonemes 36 and the second sequence is a portion of frames 115A).
Shallom, Voss, Bellegarde, Le Roux, Deng, and Faifkov are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, Le Roux, and Deng to combine the teachings of Faifkov because doing so would allow for use of specific algorithms, such as the Smith-Waterman algorithm, to be used for improving sequence alignment (Faifkov [0046] The sequence alignment algorithm may be further refined by limiting alignment for phonemes which repeat in successive frames to the same phoneme in a single frame).
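For illustration only, the Smith-Waterman local alignment named in claim 7 can be sketched over word labels, comparing a predicted word sequence with the sequence shown in the reading test. The scoring values (match 2, mismatch -1, gap -1) are assumptions chosen for the sketch and do not come from the application or from Faifkov.

```python
def smith_waterman_score(predicted, expected, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two word sequences.

    Scoring parameters are hypothetical defaults.
    """
    rows, cols = len(predicted) + 1, len(expected) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if predicted[i - 1] == expected[j - 1] else mismatch
            score[i][j] = max(
                0,                          # local alignment may restart anywhere
                score[i - 1][j - 1] + step,  # align the two words
                score[i - 1][j] + gap,       # skip a predicted word
                score[i][j - 1] + gap,       # skip an expected word
            )
            best = max(best, score[i][j])
    return best


expected = ["red", "green", "blue", "red"]
print(smith_waterman_score(["red", "green", "blue", "red"], expected))  # 8
print(smith_waterman_score(["red", "blue", "red"], expected))           # 5
```

In the claimed method the alignment is run for candidate cluster-to-word labelings and the labeling with the highest score is kept; the sketch shows only the scoring of one candidate.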

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Shallom in view of Voss, Bellegarde, Le Roux, and Visser, as applied to claim 2 above, and further in view of Paraskevopoulos et al. (US 20200335086 A1; hereinafter referred to as Paraskevopoulos).
Regarding claim 8, the combination of Shallom, Voss, Bellegarde, Le Roux, and Visser teaches: the method of claim 2. The combination of Shallom, Voss, Bellegarde, Le Roux, and Visser does not explicitly teach, but Paraskevopoulos teaches: wherein identifying segments of the voice recording that correspond to single words or syllables further comprises (i) normalising the power Mel-spectrogram of the voice recording, preferably against the frame that has the highest energy in the recording ([0035] The spectrograms are normalized in the [-1, 1] range applying min-max normalization, so we use tanh activation at the decoder output. In both modules, batch normalization, dropout with p=0.2 and leaky ReLU activations are added after each (de)convolutional layer).
Shallom, Voss, Bellegarde, Le Roux, Visser, and Paraskevopoulos are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, Le Roux, and Visser to combine the teachings of Paraskevopoulos because doing so would allow for a spectrogram of a voice recording to be normalized, leading to an improvement in speech analysis and sequence alignment (Paraskevopoulos [0028] an output layer of a spectrogram generator uses a transposed convolution layer, which improves the "realism" of synthesized spectrograms).
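For illustration only, the claim 8 normalisation "against the frame that has the highest energy" can be contrasted with the min-max normalisation cited from Paraskevopoulos. The sketch below assumes that frame energy is the sum of the Mel band powers in a frame and that normalising means dividing every value by the peak frame energy; both are assumptions, since the record quoted here does not define either.

```python
def normalise_by_peak_frame(mel_power):
    """Scale a power Mel-spectrogram by the energy of its most energetic frame.

    mel_power: list of frames, each a list of non-negative band powers.
    """
    energies = [sum(frame) for frame in mel_power]
    peak = max(energies)
    if peak <= 0:
        raise ValueError("spectrogram has no energy")
    return [[value / peak for value in frame] for frame in mel_power]


def normalise_min_max(mel_power):
    """Min-max scaling into [-1, 1], the approach described in the cited passage."""
    flat = [value for frame in mel_power for value in frame]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        raise ValueError("spectrogram is constant")
    return [[2 * (value - lo) / (hi - lo) - 1 for value in frame] for frame in mel_power]


spec = [[1.0, 1.0], [3.0, 1.0]]
print(normalise_by_peak_frame(spec))  # [[0.25, 0.25], [0.75, 0.25]]
print(normalise_min_max(spec))        # [[-1.0, -1.0], [1.0, -1.0]]
```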

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Shallom in view of Voss, Bellegarde, and Le Roux, as applied to claims 1, 9, and 11-14 above, and further in view of Scherlen (US 20180228366 A1).
Regarding claim 10, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly teach, but Scherlen teaches: wherein obtaining a voice recording from a word-reading test from the subject comprises obtaining a voice recording from a first word-reading test, and a voice recording from a second word-reading test ([0034] the mobile digital terminal is a touch tablet able to make the reading medium appear and to record the voice of the individual), wherein the word-reading tests comprise reading a sequence of words drawn from a set of n words that are color words ([0026] a prior step in which the individual reads a text having a meaning and using commonplace words, the conditions under which the step in which the specific test is taken depending on reading errors noted in said prior step. Specifically, if the individual has already made reading errors in this preliminary reading test, the individual will possibly be asked to read directly a particular zone of the text with defects, corresponding to a particular size of letter and/or defects), wherein the words are displayed in a single color in the first word reading test, and in a color independently chosen from a set of m colors in the second word reading test ([0012] All these constituent elements of the text may be presented with variable and/or constant sizes and/or with a particular arrangement in rows and/or in columns, and/or with varied colors. Colors can be chosen for the first and second test.), optionally wherein the sequence of words in the second word reading test is the same as the sequence of words in the first word reading test.
Shallom, Voss, Bellegarde, Le Roux, and Scherlen are considered analogous in the field of speech analysis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, and Le Roux to combine the teachings of Scherlen because doing so would provide a way to obtain multiple voice recordings for processing and analysis using word-reading tests to better assess a patient (Scherlen [0024] a determining method according to the invention comprises a second test in which a medium presenting at least one color is read. Specifically, in order to refine the determination of the visual anomaly, simple and complementary visualization and/or reading tests may be carried out).
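For illustration only, the two word-reading tests recited in claim 10 can be sketched as data: a sequence of color words shown in a single color, and the same sequence shown with an independently chosen display color per word. The word set, the color set, the default single color "black", and the seed are all assumptions for this sketch.

```python
import random


def build_reading_tests(words, colors, length, single_color="black", seed=0):
    """Return (first_test, second_test) as lists of (word, display_color) pairs.

    All parameter values are hypothetical; the record does not fix n, m, or the length.
    """
    rng = random.Random(seed)
    sequence = [rng.choice(words) for _ in range(length)]
    # First test: every word in one color.
    first_test = [(word, single_color) for word in sequence]
    # Second test: same word sequence, display color drawn independently per word.
    second_test = [(word, rng.choice(colors)) for word in sequence]
    return first_test, second_test


color_words = ["red", "green", "blue"]
first, second = build_reading_tests(color_words, color_words, length=10)
print(first[:3])
print(second[:3])
```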

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Shallom in view of Voss, Bellegarde, and Le Roux, as applied to claims 1, 9, and 11-14 above, and further in view of Scodary et al. (US 20210020165 A1; hereinafter referred to as Scodary).
Regarding claim 17, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly teach, but Scodary teaches: wherein determining the value of the one or more metrics comprises determining the correct word rate associated with the voice recording by computing a cumulative sum of a number of identified segments corresponding to correctly read words in the voice recording over time ([0108] machine learning models (e.g., deep neural networks) may be utilized to predict metrics directly as classifiers, either per-utterance (a segment of an audio call) or over the full call. If computed per utterance, it is then summed and a maximum, minimum, mean, average, or some other descriptive statistic is computed. The metric can be correctly read words.), and computing a slope of a linear regression ([0109] One technique utilizes linear regression for a given metric against a different metric of call quality (ground truth sources such as human labelers, CSAT, NPS, or a custom QA score, or some combination of several ground truth sources). The linear regression produces an indication of how much each model should be weighted) model fitted to the cumulative sum data ([0108] computing the slope of the best fit curve of emotional valence (itself a model output). Statistical natural language processing techniques may also be utilized. For example, precomputed weights for different words and phrases may be implemented in a lookup table, and a word-trie data structure generated to efficiently count occurrences of words and phrases, weighted by configured coefficients).
Shallom, Voss, Bellegarde, Le Roux, and Scodary are considered analogous in the field of speech analysis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, and Le Roux to combine the teachings of Scodary because doing so would allow for a correct word rate to be determined by analyzing recordings using adaptive feedback, improving accuracy of the correct word rate (Scodary [0050] system provides adaptive feedback responsive to more and more frequent inputs than do conventional communication systems, so that corrective action may be applied for exceptional situations and so that processing agents and components operative in the system receive a continuous adaptive feedback control that enables more rapid correction and improvement of call processing. The system may provide more stable metric controls to more accurately compare performance between system agents, components, and/or groups and combinations thereof).
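For illustration only, the claim 17 computation (a cumulative count of correctly read words over time, then the slope of a straight line fitted to that cumulative sum) can be sketched with an ordinary least-squares fit. Using each segment's end time as the x value is an assumption, as are the function names.

```python
def cumulative_correct(correct_flags):
    """Running count of correctly read words, one entry per identified segment."""
    total, out = 0, []
    for ok in correct_flags:
        total += 1 if ok else 0
        out.append(total)
    return out


def regression_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


end_times_s = [1.0, 2.0, 3.0, 4.0]          # hypothetical segment end times
flags = [True, True, True, True]            # every word read correctly
counts = cumulative_correct(flags)          # [1, 2, 3, 4]
print(regression_slope(end_times_s, counts))  # 1.0 correct word per second
```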

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Shallom in view of Voss, Bellegarde, and Le Roux, as applied to claims 1, 9, and 11-14 above, and further in view of Tiron et al. (US 20220007965 A1; hereinafter referred to as Tiron).
Regarding claim 18, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly teach, but Tiron teaches: wherein identifying segments of the voice recording that correspond to single words or syllables further comprises: performing onset detection for at least one of the segments by computing a spectral flux function over a Mel-spectrogram of the segment ([0399] the audio processing may include applying short time Fourier transform (STFT) analysis of a sampled audio waveform obtained from an audio sensor, and estimating normalized sub-band power levels thereof. Mel-frequency cepstral coefficients (MFCCs) may be evaluated to distinguish snoring and/or coughing from speech. Spectral flux (analyzing changes between power spectral estimates) may be determined and evaluated to detect the onset of snoring), and defining a further boundary whenever an onset is detected within a segment, thereby forming two new segments ([0474] The summary metric provides shape for the audio signal in the plot of FIG. 36. These metric outputs may be evaluated by the offline classification module to form the additional outputs of the snore detector (See, e.g., snore related output of the user interface of FIGS. 34 to 36). For example, a total snore time may be determined and "snore snippets" may be displayed in the user interface example of FIG. 36 from the classification. As shown, each of a plurality of the audio signal snippets or sub-segments of the larger audio session, which may be chosen for display based on the classification, may be plotted in proximity to a play button (See play arrow symbols of FIG. 36) on the user interface to permit playback of the associated audio signal with a speaker(s) of the apparatus that presents the user interface).
Shallom, Voss, Bellegarde, Le Roux, and Tiron are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, and Le Roux to combine the teachings of Tiron because doing so would allow for better detection of speech and respiratory sounds, leading to more accurate word and syllable segments for determining sequence alignment (Tiron [0428] the processing device may employ a generic VAD to merely reject low background noise, and retain both speech and respiratory related sounds).
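For illustration only, the claim 18 step (spectral flux over a segment's Mel-spectrogram, with a new boundary at each detected onset) can be sketched as follows. Spectral flux is taken here as the sum of positive band-wise increases between consecutive frames, and an onset as a local flux peak above a threshold; the peak rule and the threshold value are assumptions for this sketch.

```python
def spectral_flux(mel_frames):
    """Sum of positive band-wise increases between consecutive frames."""
    flux = [0.0]  # no predecessor for the first frame
    for prev, cur in zip(mel_frames, mel_frames[1:]):
        flux.append(sum(max(c - p, 0.0) for p, c in zip(prev, cur)))
    return flux


def split_at_onsets(mel_frames, threshold):
    """Return (start, end) frame ranges, end exclusive, split at detected onsets."""
    flux = spectral_flux(mel_frames)
    boundaries = [0]
    for k in range(1, len(flux) - 1):
        is_peak = flux[k] > flux[k - 1] and flux[k] >= flux[k + 1]
        if is_peak and flux[k] > threshold:
            boundaries.append(k)  # each onset closes one sub-segment and opens another
    boundaries.append(len(mel_frames))
    return list(zip(boundaries, boundaries[1:]))


# Two words run together: energy dips at frame 2 and rises sharply at frame 3.
segment = [[1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0], [2.0, 2.0]]
print(spectral_flux(segment))         # [0.0, 0.0, 0.0, 4.0, 0.0]
print(split_at_onsets(segment, 1.0))  # [(0, 3), (3, 5)]
```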

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Shallom in view of Voss, Bellegarde, and Le Roux, as applied to claims 1, 9, and 11-14 above, and further in view of Harada (US 20100010813 A1).
Regarding claim 19, the combination of Shallom, Voss, Bellegarde, and Le Roux teaches: the method of claim 1. The combination of Shallom, Voss, Bellegarde, and Le Roux does not explicitly teach, but Harada teaches: wherein identifying segments of the voice recording that correspond to single words or syllables further comprises: excluding segments that represent erroneous detections by computing one or more Mel-frequency cepstral coefficients (MFCCs) for the segments to obtain a plurality of vectors of values ([0039] The voice analyzing unit (extraction unit) 10a acoustically analyzes voice data, and extracts, for example, MFCC parameters (feature parameters, amount of feature) from the voice data), each vector being associated with a segment, and applying an outlier detection method to the plurality of vectors of values ([0075] reject words are registered for each of recognition words registered in the word dictionary 13c. Therefore, in the case where voice data to be subjected to the voice recognition process is recognized as a word (reject word) that is similar to the recognition word that is not desired to be obtained as the result of recognition, the recognition word relating to this reject word is excluded from the result of recognition. Thus, it becomes possible to prevent erroneous recognition, and consequently to improve the precision of the voice recognition).
Shallom, Voss, Bellegarde, Le Roux, and Harada are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Shallom, Voss, Bellegarde, and Le Roux to further combine the teachings of Harada because doing so would allow for erroneous voice segments to be detected and removed, improving speech quality and sequence alignment (Harada [0059] even in the case where, as a result of a voice recognition of the voice data of the similar word, the corresponding word is recognized as a word similar to a recognition word that is desired to be obtained as the result of recognition, since the word is a reject word, the recognition word in association with this reject word is not used as the result of recognition, thereby making it possible to prevent erroneous recognition).
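For illustration only, the claim 19 step of applying an outlier detection method to per-segment MFCC vectors can be sketched with a simple robust distance rule. The claim language quoted here does not name a specific method, and Harada's reject-word dictionary works differently, so the rule below (distance from the component-wise median, flagged when above `k` times the median distance) is a stand-in chosen for this sketch, with `k` hypothetical.

```python
from math import dist
from statistics import median


def inlier_mask(vectors, k=3.0):
    """True for segments to keep, False for segments treated as erroneous detections.

    The rule and the factor k are assumptions, not the method of the application.
    """
    center = [median(column) for column in zip(*vectors)]
    distances = [dist(vector, center) for vector in vectors]
    scale = median(distances)
    # If more than half the vectors coincide with the center, scale is 0 and
    # every vector away from the center is flagged.
    return [d <= k * scale for d in distances]


mfcc_vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5], [10.0, 10.0]]
print(inlier_mask(mfcc_vectors))  # [True, True, True, True, True, False]
```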

	

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Nathan Tengbumroong whose telephone number is (703)756-1725. The examiner can normally be reached Monday - Friday, 11:30 am - 8:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan, can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NATHAN TENGBUMROONG/Examiner, Art Unit 2654     

/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654
Prosecution Timeline

Jul 07, 2023
Application Filed
Aug 08, 2025
Non-Final Rejection mailed — §103
Oct 10, 2025
Response Filed
Jan 16, 2026
Final Rejection mailed — §103
Mar 16, 2026
Response after Non-Final Action
Apr 07, 2026
Request for Continued Examination
Apr 13, 2026
Response after Non-Final Action
Precedent Cases

Applications granted by this same examiner with similar technology

18/195,121
Patent 12640161
METHOD AND APPARATUS FOR PROCESSING AUDIO FOR SCENE CLASSIFICATION
3y 0m to grant Granted May 26, 2026
18/173,495
Patent 12530536
Mixture-Of-Expert Approach to Reinforcement Learning-Based Dialogue Management
2y 11m to grant Granted Jan 20, 2026
17/876,156
Patent 12451142
NON-WAKE WORD INVOCATION OF AN AUTOMATED ASSISTANT FROM CERTAIN UTTERANCES RELATED TO DISPLAY CONTENT
3y 2m to grant Granted Oct 21, 2025
17/883,265
Patent 12412050
MULTI-PLATFORM VOICE ANALYSIS AND TRANSLATION
3y 1m to grant Granted Sep 09, 2025
Study what changed to get past this examiner. Based on 4 most recent grants.
Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 47%
With Interview (+26.7%): 74%
Median Time to Grant: 3y 0m (~1m remaining)
PTA Risk: Moderate
Based on 19 resolved cases by this examiner. Grant probability derived from career allowance rate.