Prosecution Insights
Last updated: April 19, 2026
Application No. 18/463,690

MACHINE LEARNING METHOD FOR DETERMINING PATIENT BEHAVIOR USING AUDIO ANALYTICS

Non-Final OA: §102, §112
Filed: Sep 08, 2023
Examiner: BLACK-CHILDRESS, RAJSHEED O
Art Unit: 2685
Tech Center: 2600 (Communications)
Assignee: Insight Direct USA Inc.
OA Round: 1 (Non-Final)
Grant Probability: 62% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 2y 9m
Grant Probability With Interview: 86%

Examiner Intelligence

Career Allow Rate: 62% (279 granted / 448 resolved), at TC average
Interview Lift: +23.9% (strong), comparing resolved cases with and without an examiner interview
Typical Timeline: 2y 9m average prosecution; 39 applications currently pending
Career History: 487 total applications across all art units

Statute-Specific Performance

§101: 2.0% (-38.0% vs TC avg)
§103: 52.5% (+12.5% vs TC avg)
§102: 17.0% (-23.0% vs TC avg)
§112: 21.7% (-18.3% vs TC avg)
Tech Center averages are estimates. Based on career data from 448 resolved cases.

Office Action

§102 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections

Claim 1 is objected to because of the following informalities: it recites “… a computer-implemented machine-learning model …” in lines 14-15; however, it should recite “… the computer-implemented machine-learning model …”. Appropriate correction is required. Claim 14 is objected to because of the following informalities: it recites “… a computer-implemented machine-learning model …” in lines 18-19; however, it should recite “… the computer-implemented machine-learning model …”. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 1 recites the limitation "the computer-implemented machine-learning engine" in line 6. There is insufficient antecedent basis for this limitation in the claim. Claims 2-13 are rejected on the same basis because they depend from claim 1. Claim 14 recites the limitation "the computer-implemented machine-learning engine" in line 10. There is insufficient antecedent basis for this limitation in the claim. Claims 15-20 are rejected on the same basis because they depend from claim 14.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) and/or 102(a)(2) as being anticipated by Shriberg et al. (US 20220165371 A1).

Regarding claim 1, Shriberg discloses a method for automatically invoking an alert based upon a behavior of a patient (Abstract: “The present disclosure provides systems and methods for assessing a mental state of a subject in a single session or over multiple different sessions … using one or more individual, joint, or fused models.” [0005] teaches that “the method may further comprise processing the data using one or more individual, joint, or fused models comprising a natural language processing (NLP) model, an acoustic model, and/or a visual model.
The method may further comprise generating … one or more assessments of the mental state associated with the subject.” [0012] further explains that “the method can further comprise transmitting the assessment to a healthcare provider to be used in evaluating the mental state of the subject. The transmitting can be performed in real-time during the assessment, just-in-time, or after the assessment has been completed.” [0041] adds that “the method can further comprise determining that the confidence level does not satisfy a predetermined criterion, in real time and based at least in part on the at least one response, generating at least one additional query, and using the at least one additional query, repeating steps (a)–(d) until the confidence level satisfies the predetermined criterion.” [0529]–[0543] and Figs. 52–54 show processors, memory, and network interfaces executing the logic that performs audio/semantic analysis and alert generation. Thus, all steps are implemented by a computing system as claimed. These passages collectively describe an automated computer-implemented system that monitors a patient’s speech in real time, evaluates behavioral or mental cues, and provides an alert or transmission to a healthcare provider when a state requiring attention is detected.), the method comprising: extracting audio data and semantic text data from an audio stream configured to capture the patient (Shriberg teaches extracting audio and semantic text data from patient speech captured by microphones or smart speakers ([0160], [0277], [0383], [0443], [0530]). [0005] states that “the method may further comprise processing the data using one or more individual, joint, or fused models comprising a natural language processing (NLP) model, an acoustic model, and/or a visual model.” [0492] discloses that “the system may perform an audio analysis or a semantic analysis on audio snippets it collects from the prospective patient.” [0285] explains that the NLP model 1806 “stores text metadata and modeling dynamics and shares that data with acoustic model 1808, visual model 1810, and descriptive model and analytics 1812.” The paragraph lists metadata types such as “parts of speech (syntactic analysis), sentiment analysis, semantic analysis, topic analysis, etc.” — confirming that semantic text data are extracted, structured, and cross-linked with acoustic data. These passages expressly disclose extracting both audio data and semantic text data from an audio stream of the patient.); analyzing the audio data to identify a first feature set of audio features identified by the computer-implemented machine-learning engine as being indicative of at least one of the set of alerting behaviors corresponding to a patient classification of the patient ([0492]: “the system may also pick up non-lexical or non-linguistic cues such as pauses, gasps, sighs, and slurred or mumbled speech.” [0211] and [0213]: metrics may include … duration of responsive utterances … an amount of acoustic energy … and “voice activation” in the responses, such as the presence or absence of speech activity.
These disclosures identify acoustic features extracted from audio data—pauses, gasps, duration, energy, and speech activity.); analyzing the semantic text data to identify a second feature set of semantic text features identified by the computer-implemented machine-learning engine as being indicative of at least one alerting behavior of the set of alerting behaviors ([0492]: “the system may determine relative frequencies of words or phrases associated with depression…such as, phrases indicating suicidal thoughts, self-harm instincts, phrases indicating a poor body image or self-image, and feelings of anxiety, isolation, or loneliness.” [0430]: “the system may also use semantic analysis to find context clues to determine the identities of missing or unintelligible words.” Thus, Shriberg discloses analysis of semantic text features such as words and phrases conveying emotional or behavioral meaning.); determining a patient behavior of the patient based on the first and/or second features sets, wherein the patient's behavior is determined using a computer-implemented machine-learning model generated by the computer-implemented machine-learning engine ([0423] teaches that “the regression model may determine a fit from training examples that are expressed as sums of weighted variables. The fit may be used to extrapolate a score from a patient with known weights.” [0005] likewise provides that the method may further comprise processing the data using one or more individual, joint, or fused models to generate one or more assessments of the mental state associated with the subject. These portions describe using a trained machine-learning model to determine the patient’s behavioral or mental state based on the extracted features.); comparing the patient's behavior with each alerting behavior of the set of alerting behaviors ([0041]: “the method can further comprise: determining that the confidence level does not satisfy a predetermined criterion … and using the at least one additional query, repeating steps (a)–(d) until the confidence level satisfies the predetermined criterion.” [0035]: “the one or more individual, joint, or fused models can be interdependent such that each … is conditioned on an output of at least one other of the … models.” These disclosures show that the system continuously compares current behavioral outputs to expected or target patterns to verify whether the detected behavior matches a known class requiring attention.); and automatically invoking the alert when the patient's behavior is determined to be included in the set of alerting behaviors ([0012]: “The method can further comprise transmitting the assessment to a healthcare provider to be used in evaluating the mental state of the subject. The transmitting can be performed in real-time during the assessment, just-in-time, or after the assessment has been completed.” [0039] also teaches that “the method can further comprise transmitting the one or more scores with a clinical value to a healthcare provider for use in evaluating and/or providing care for a mental health of the subject. The transmitting can comprise transmitting the one or more scores with a clinical value to the healthcare provider during the screening, monitoring, or diagnosing.” These passages expressly disclose automatic notification or alert transmission to a provider once the system detects a qualifying behavioral or mental state. 
[0464]: “predetermined alert rules are analogous to work-flows 4810…if-then-else logic elements that specify logical states and corresponding actions to take in such states.” [0465]: “a very high priority may be assigned to the call and taken ahead of less urgent callers.” [0465]: “may trigger immediate notification of law enforcement and school personnel”).

Regarding claim 2, Shriberg discloses the method of claim 1, wherein the computer-implemented machine-learning model has been trained to determine the patient's behavior (Shriberg discloses that “the system may use a classification algorithm … such as a neural network or an ensemble method” ([0425]) and that “the models described herein may use confidence measures” ([0428]) and “machine learning algorithms” ([0154]) to determine a patient’s behavioral or mental state. [0423] further teaches that “the scaled score may also be determined using a regression model … from training examples that are expressed as sums of weighted variables,” thereby describing a trained, computer-implemented model used to determine patient behavior. Accordingly, Shriberg teaches that the computer-implemented model has been trained to determine the patient’s behavior.), training of the computer-implemented machine-learning model includes: extracting training audio data and training semantic text data from a plurality of training audio streams of a corresponding plurality of training patients (Shriberg explicitly discloses that “the regression model may determine a fit from training examples that are expressed as sums of weighted variables” ([0423]) and that “the models may be continually active or passive … a machine learning algorithm may be employed on a large amount of unlabeled audio samples” ([0494]). [0285] explains that the NLP model 1806 “stores text metadata and modeling dynamics and shares that data with acoustic model 1808, visual model 1810, and descriptive model and analytics 1812.” The paragraph lists metadata types such as “parts of speech (syntactic analysis), sentiment analysis, semantic analysis, topic analysis, etc.” [0500] further states that “the system may collect recordings of clinical encounters for physical complaints … the system may record, with patient permission, conversation patients have with health care providers during appointments.” Together, these passages show that the system extracts training audio data and corresponding textual (semantic) data from multiple patients to train its models.); analyzing the training audio data to identify a first training feature set of audio features (Shriberg discloses that “voice-based biomarkers may be associated with lab values or physiological measurements” ([0505]) and that “the system may perform an audio analysis … to determine relative frequencies of words or phrases associated with depression … and pick up non-lexical or non-linguistic cues such as pauses, gasps, sighs, and slurred or mumbled speech” ([0492]).
[0211] further explains that “quantification logic 1110 retrieves all responses to a given question … and uses response data … to determine an amount of energy present in each response … quantifies the quality of the question as a statistical measure of the measured acoustic energy.”These passages disclose the analysis of training audio data to identify an acoustic feature set used as training features.); analyzing the training semantic text data to identify a second training feature set of semantic text features (Shriberg teaches that “semantic analysis … determines relative frequencies of words or phrases associated with depression” ([0492]) and that “semantic analysis to find context clues to determine the identities of missing or unintelligible words” ([0430]) is part of model training and refinement. [0150] further discloses that the models perform “semantic analysis and sentiment determination of textual content,” describing semantic text features used in training. Accordingly, Shriberg discloses analyzing training semantic data to identify a semantic feature set.); receiving a plurality of known training patient behaviors corresponding to of the plurality of training patients captured in the plurality of training audio streams (Shriberg teaches that “the regression model may determine a fit from training examples that are expressed as sums of weighted variables” ([0423]), where the “training examples” include known labels representing behavioral or clinical outcomes. [0429] explains that “the confidence level may also be based on the quality of the labels of the training data used to train the models that analyze the patient’s speech,” confirming that the system uses labeled training patient behaviors corresponding to the training data. Additionally, [0483] discloses that “human computer engineers configure this deep learning machine logic … to analyze patient data … in the context of labels specified by users, e.g., labels related to costs of the care of each patient.” Thus, Shriberg explicitly discloses receiving a plurality of known training patient behaviors (labels) for the training data.); and determining general model coefficients of the computer-implemented machine-learning model, such general model coefficients determined so as to improve a correlation between the plurality of known training patient behaviors and a plurality of training patient behaviors as determined by the computer-implemented machine-learning model (Shriberg teaches that “the regression model may determine a fit from training examples that are expressed as sums of weighted variables” ([0423]) and that “the fit may be used to extrapolate a score from a patient with known weights. The weights may be based in part on features … and in part derived from patient information, such as patient demographics. Weights used to predict a final score or an intermediate score may be taken from previous intermediate scores.” Furthermore, [0485] describes “identifying deep learning machine (DLM) nodes of the decision tree that have the most influence over the predetermined goals” and [0487]–[0488] explain that these nodes and their conditions are used to form model logic and actions, i.e., determining coefficients or weights to optimize predictive correlation. Accordingly, Shriberg discloses determining model coefficients (weights) that improve correlation between known patient behaviors and model outputs.). 
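The claim 2 mapping above turns on fitting model coefficients from labeled training examples expressed as sums of weighted variables ([0423]). Below is a minimal sketch of that kind of least-squares fit over combined audio and semantic text features; the feature names and data are hypothetical, and this is not the applicant's or Shriberg's implementation.

```python
# Illustrative only: a least-squares fit of "general model coefficients" over combined
# audio and semantic text features, in the spirit of the weighted-variable regression
# quoted from Shriberg [0423]. Feature names and data are hypothetical.
import numpy as np

# Each row: [pause_count, mean_acoustic_energy, negative_phrase_freq, speech_activity]
X_train = np.array([
    [12, 0.20, 0.31, 0.40],
    [ 3, 0.55, 0.05, 0.85],
    [ 9, 0.25, 0.22, 0.50],
    [ 1, 0.60, 0.02, 0.90],
], dtype=float)

# Known training patient behaviors, encoded as a numeric label
# (1.0 = alerting behavior observed, 0.0 = not observed).
y_train = np.array([1.0, 0.0, 1.0, 0.0])

# Add an intercept column and solve for coefficients that minimize squared error,
# i.e., coefficients chosen to improve agreement between known and predicted behaviors.
X_aug = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
coeffs, *_ = np.linalg.lstsq(X_aug, y_train, rcond=None)

def predict_behavior_score(features):
    """Extrapolate a score for a new patient as a sum of weighted variables."""
    return float(np.dot(np.append(features, 1.0), coeffs))

print(predict_behavior_score([10, 0.22, 0.28, 0.45]))  # higher score -> closer to the alerting class
```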
Regarding claim 3, Shriberg discloses the method of claim 2, wherein training of the computer-implemented machine-learning model further includes: selecting model features from the first and second training feature sets, the model features selected as being indicative of the known patient behaviors corresponding to of the plurality of the training patients captured in the plurality of the training audio streams (Shriberg discloses that “the regression model may determine a fit from training examples that are expressed as sums of weighted variables. The fit may be used to extrapolate a score from a patient with known weights. The weights may be based in part on features, which may be in part derived from the audiovisual signal (e.g., voice-based biomarkers) and in part derived from patient information, such as patient demographics” ([0423]). [0429] explains that “the confidence level may also be based on the quality of the labels of the training data used to train the models that analyze the patient’s speech,” confirming the model’s reliance on feature-label relationships. [0485] further states that “automatic work-flow generation logic 4804 identifies deep learning machine (DLM) nodes of the decision tree that have the most influence over the predetermined goals,” and [0487]–[0488] describe forming conditions and actions from those influential nodes. [0505] discloses that “voice-based biomarkers may be associated with lab values or physiological measurements. Voice-based biomarkers may be associated with mental-health-related measurements,” illustrating the selection of audio features that correlate with known behaviors. [0492] adds that “the system may determine relative frequencies of words or phrases associated with depression … [and] pick up non-lexical or non-linguistic cues such as pauses, gasps, sighs, and slurred or mumbled speech,” showing the semantic and audio features chosen as indicative of behavioral states. These passages together disclose selecting model features—both audio and semantic—from the training feature sets that are indicative of known patient behaviors in the training population.).

Regarding claim 4, Shriberg discloses the method of claim 2, wherein the audio stream of the patient is added to the plurality of training audio stream along with the known patient behaviors of the patient (Shriberg discloses that “training of the models may be continuous, so that the model is continuously running while audio data is collected. Voice-based biomarkers may be continually added to the system and used for training during multiple epochs. Models may be updated using the data as it is collected” ([0507]). Shriberg also teaches that “the system may be able to make a particular number of profiles to account for different types of individuals … A person’s baseline mood or mental state, and thus the person’s voice-based biomarkers, may change over time. The model may be continually trained to account for this” ([0518]–[0522]). [0310]-[0311] teach that model trainers 2039 produce acoustic, NLP, video, and meta data models which are passed to a personalizer 2038 that leverages metadata to tailor the models specifically for a given client 260a-n – this supports individualized adaptation and tuning of model parameters based on patient data. Furthermore, [0494] explains that “the models may be continually active or passive … A machine learning algorithm may be employed on a large amount of unlabeled audio samples,” indicating ongoing ingestion of patient audio for learning.
Finally, [0500] specifies that “the system may collect recordings of clinical encounters … The system may record, with patient permission, conversation patients have with health care providers during appointments,” showing that patient audio data are gathered and incorporated in the model environment. These disclosures collectively teach that a patient’s own audio stream and corresponding behavioral data are incorporated into ongoing model training, satisfying the limitation of adding the patient’s audio to the plurality of training streams along with the known behaviors.). Regarding claim 5, Shriberg discloses the method of claim 2, wherein the computer-implemented machine-learning model is a general patient-behavior model (Shriberg teaches a generalized behavioral model trained on multiple patient populations and refined to specific individuals. [0518] states that “the system may be able to establish a baseline profile for each individual patient.” [0521] further discloses that “a person’s baseline mood or mental state, and thus the person’s voice-based biomarkers, may change over time. The model may be continually trained to account for this.” [0522] adds that “the system may be able to make a particular number of profiles to account for different types of individuals. These profiles may be related to individuals’ genders, ages, ethnicities, languages spoken, and occupations.” Together these passages show that Shriberg employs a general model that is individualized for specific patients using behavior-known data segments.), the method further comprises: identifying a set of behavior-known audio portions of the patient, each of the behavior-known audio stream portions capturing features indicative of known patient-specific behaviors ([0500] teaches that “the system may collect recordings of clinical encounters for physical complaints … The system may record, with patient permission, conversation patients have with health care providers during appointments.” [0520] discloses that “some people may be more hesitant to speak than others … Some people may have experienced trauma and abuse. Some people may be in denial about their feelings,” which describes patient-specific behavioral expressions captured in audio portions. [0527] states that “the voice-based biomarkers may be able to be used to determine whether somebody is depressed, even if the person is holding back information or attempting to outsmart testing methods.” These sections identify behavior-known audio portions linked to known patient-specific behaviors.); extracting patient-specific audio data and patient-specific semantic text data from the set of behavior-known audio portions of the patient ([0492] teaches that “the system may perform an audio analysis or a semantic analysis on audio snippets it collects from the prospective patient.” [0430] further provides that “the system may also use semantic analysis to find context clues to determine the identities of missing or unintelligible words.” [0521] confirms that such extraction adapts to each patient’s evolving voice-based biomarkers. Thus, Shriberg discloses extracting patient-specific audio and semantic text data from known portions.); analyzing the patient-specific audio data to identify a first patient-specific feature set of audio features ([0521] explains that “a person’s baseline mood or mental state, and thus the person’s voice-based biomarkers, may change over time. 
The model may be continually trained to account for this.” [0505] teaches that “voice-based biomarkers may be associated with lab values or physiological measurements.” [0523] specifies that “older people may have thinner, breathier voices than younger people … Their weaker voices may make it more difficult for microphones to pick up specific biomarkers.” These paragraphs show the analysis of patient-specific audio features distinctive to each individual.); analyzing the patient-specific semantic text data to identify a second patient-specific feature set of semantic text features ([0492] states that “the system may determine relative frequencies of words or phrases associated with depression … such as phrases indicating suicidal thoughts, self-harm instincts, phrases indicating a poor body image or self-image, and feelings of anxiety, isolation, or loneliness.” [0524] adds that “men and women may express negative emotions more aggressively or violently, while women may be better able to articulate their emotions.” [0525] continues that “people from different cultures may have different methods of dealing with or expressing emotions, or may feel guilt and shame when expressing negative emotions.” These passages disclose individualized semantic text feature analysis that reflects each patient’s linguistic and cultural expressions.); and receiving the known patient-specific behaviors corresponding to the patient captured in the audio stream ([0429] discloses that “the confidence level may also be based on the quality of the labels of the training data used to train the models that analyze the patient’s speech.” [0527] shows that known behavioral cues—even involuntary ones—are recognized and labeled: “many of the voice-based biomarkers may be involuntary utterances. For example, the patient may equivocate or the patient’s voice may quaver.” [0519] similarly references the recognition of “particular voice-based biomarkers indicating emotions, such as happiness, sadness, anger, and grief.” These passages teach the receipt of known, labeled patient-specific behaviors linked to corresponding audio.); and determining patient-specific model coefficients of a patient-specific patient-behavior model, such patient-specific model coefficients determined so as to improve a correlation between the known patient-specific behaviors and patient behaviors as determined by the patient-specific patient-behavior model ([0310]-[0315]: describe that personalizer 2038 leverages metadata to tailor acoustic, NLP, and video models for each client 260a-n, adjusting model coefficients for individualized behavior prediction. [0423] states that “the regression model may determine a fit from training examples that are expressed as sums of weighted variables. The fit may be used to extrapolate a score from a patient with known weights. The weights may be based in part on features … and in part derived from patient information, such as patient demographics. Weights used to predict a final score or an intermediate score may be taken from previous intermediate scores.” [0521] further emphasizes that “the model may be continually trained to account for this,” describing ongoing adjustment of coefficients to improve correlation between predicted and known patient-specific behaviors. Accordingly, Shriberg teaches determining patient-specific model coefficients for individualized behavior prediction.). 
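The claim 5 mapping above describes deriving patient-specific coefficients that adapt a general patient-behavior model to one patient's behavior-known segments. One plausible way to realize that, shown only as an assumption-laden sketch (not Shriberg's personalizer), is a regularized refit that penalizes departure from the general coefficients.

```python
# Illustrative only: deriving patient-specific coefficients by refitting the general model
# on a patient's behavior-known segments while penalizing departure from the general
# coefficients. One plausible reading of "personalization"; not Shriberg's code.
import numpy as np

def personalize(theta_general, X_patient, y_patient, lam=1.0):
    """Solve min_theta ||X theta - y||^2 + lam * ||theta - theta_general||^2.

    Closed form: (X^T X + lam*I) theta = X^T y + lam * theta_general.
    A larger lam keeps the patient-specific model closer to the general model,
    which matters when only a few behavior-known audio portions exist.
    """
    d = X_patient.shape[1]
    A = X_patient.T @ X_patient + lam * np.eye(d)
    b = X_patient.T @ y_patient + lam * theta_general
    return np.linalg.solve(A, b)

# Hypothetical example: three behavior-known segments, four features each.
theta_general = np.array([0.05, -0.8, 1.2, -0.4])
X_patient = np.array([[8, 0.30, 0.25, 0.50],
                      [2, 0.50, 0.04, 0.80],
                      [7, 0.28, 0.20, 0.55]], dtype=float)
y_patient = np.array([1.0, 0.0, 1.0])

theta_patient = personalize(theta_general, X_patient, y_patient, lam=2.0)
print(theta_patient)
```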
Regarding claim 6, Shriberg discloses the method of claim 1, wherein the first feature set includes metrics related to: a number of times a first audio feature occurs; a frequency of occurrences of the first audio feature; a time period between occurrences of the first audio feature; and/or a time period between occurrences of the first audio feature and a second audio feature (Shriberg discloses that “the system may perform an audio analysis or a semantic analysis on audio snippets it collects from the prospective patient” ([0492]). [0492] continues, stating that “the system may determine relative frequencies of words or phrases associated with depression (e.g., phrases indicating suicidal thoughts, self-harm instincts, phrases indicating a poor body image or self-image, and feelings of anxiety, isolation, or loneliness).” The same paragraph further specifies that “the system may also pick up non-lexical or non-linguistic cues such as pauses, gasps, sighs, and slurred or mumbled speech.” [0211]-[0213] describes quantitative measures of acoustic events by explaining that metrics may include: (i) the length of elicited responses, (ii) duration of responsive utterances, (iii) weighted word score, (iv) an amount of acoustic energy, and (v) ‘voice activation’ in the responses, such as the presence or absence of speech activity. [0214] adds that quantification logic 1110 retrieves all responses to a given question and uses response data to determine an amount of energy present in each response, and quantifies the quality of the question as a statistical measure of the measured acoustic energy (e.g., a statistical mean thereof). [0210]-[0211] and [0216] discloses additional statistical analysis of acoustic data, noting that question quality logic 1102 may then average the metrics of the same question over multiple subjects and multiple questions to determine the overall question quality score. [0521] indicates that “a person’s baseline mood or mental state, and thus the person’s voice-based biomarkers, may change over time. The model may be continually trained to account for this,” which inherently requires temporal tracking of feature occurrences and spacing between events. These disclosures collectively teach that Shriberg’s system quantifies acoustic events such as pauses, sighs, or speech bursts, determines their frequency and occurrence patterns, and measures durations and time intervals between audio features—all corresponding directly to the metrics recited in claim 6.). Regarding claim 7, Shriberg discloses the method of claim 1, wherein the second feature set includes metrics related to: a number of times a first semantic text feature occurs; a frequency of occurrences of the first semantic text feature; a time period between occurrences of the first semantic text feature; and/or a time period between occurrences of the first semantic text feature and a second semantic text feature (Shriberg teaches that “the system may determine relative frequencies of words or phrases associated with depression (e.g., phrases indicating suicidal thoughts, self-harm instincts, phrases indicating a poor body image or self-image, and feelings of anxiety, isolation, or loneliness)” ([0492]). 
[0492] also states that “the system may perform an audio analysis or a semantic analysis on audio snippets it collects from the prospective patient.” [0430] explains that “the system may also use semantic analysis to find context clues to determine the identities of missing or unintelligible words.” [0216] further discloses statistical temporal aggregation of textual and acoustic metrics, stating that “question quality logic 1102 may then average the metrics of the same question over multiple subjects and multiple questions … to determine the overall question quality score.” [0215] confirms that “quantification logic 1110 retrieves all responses to a given question and uses response data … to determine an amount of energy present in each response, and quantifies the quality of the question as a statistical measure … over multiple time intervals.” Finally, [0521] indicates that “the model may be continually trained to account for this,” referring to evolving linguistic and semantic patterns across time. Together, these disclosures demonstrate that Shriberg’s system analyzes semantic text features such as words and phrases, measures their relative frequencies and temporal relationships, and aggregates those metrics across time windows—corresponding directly to the claimed “number,” “frequency,” and “time-period” relationships between semantic features.).

Regarding claim 8, Shriberg discloses the method of claim 1, further comprising: generating a fourth feature set that includes feature combinations of an audio feature and a semantic text feature (Shriberg discloses that “the data can be processed using one or more individual, joint, or fused models comprising a natural language processing (NLP) model, an acoustic model, and/or a visual model” ([0005]). [0035] states that “the one or more individual, joint, or fused models can be interdependent such that each of the one or more individual, joint, or fused models is conditioned on an output of at least one other of the one or more individual, joint, or fused models.” [0036] explicitly teaches cross-modality fusion: “generating the score can comprise fusing the NLP output, the acoustic output, and the visual output.” [0037] further provides that “generating the confidence level of the score can comprise fusing (i) a confidence level of the NLP output with (ii) a confidence level of the acoustic output.” [0235] reiterates the combined modeling: “the acoustic model and the natural language processing model may be fused or otherwise combined to jointly assess the patient’s mental state.” [0285] explains that the NLP model 1806 “stores text metadata and modeling dynamics and shares that data with acoustic model 1808, visual model 1810, and descriptive model and analytics 1812.” The paragraph lists metadata types such as “parts of speech (syntactic analysis), sentiment analysis, semantic analysis, topic analysis, etc.” [0521] indicates that “the model may be continually trained to account for this,” referring to adaptive weighting and interaction between feature types across sessions. These portions show that Shriberg’s system generates joint or fused model outputs that integrate audio (acoustic) features and semantic (NLP/text) features, thereby forming a combined feature set corresponding to the claimed “fourth feature set.”).
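The claim 8 mapping above concerns feature combinations that pair an audio feature with a semantic text feature, plus fused per-modality outputs ([0036]-[0037]). Below is a minimal sketch of both ideas with hypothetical feature names and weights; it is illustrative only, not the fused models described in Shriberg.

```python
# Illustrative only: building "feature combinations" that pair an audio feature with a
# semantic text feature, plus a simple weighted fusion of per-modality scores.
# Hypothetical names; not the fused models actually implemented in Shriberg.
from itertools import product

audio_features = {"long_pause": 1, "sigh": 0, "low_energy": 1}    # presence flags
text_features = {"self_harm_phrase": 1, "isolation_phrase": 0}    # presence flags

# Fourth feature set: every (audio, text) pair, marked present only when both occur.
feature_combinations = {
    (a, t): audio_features[a] * text_features[t]
    for a, t in product(audio_features, text_features)
}

def fuse_scores(acoustic_score, nlp_score, w_acoustic=0.5, w_nlp=0.5):
    """Late fusion: combine per-modality model outputs into one behavior score."""
    return w_acoustic * acoustic_score + w_nlp * nlp_score

print(feature_combinations[("long_pause", "self_harm_phrase")])  # 1 -> combination observed
print(fuse_scores(acoustic_score=0.7, nlp_score=0.9))            # 0.8
```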
Regarding claim 9, Shriberg discloses the method of claim 8, wherein the fourth feature set includes metrics related to: a number of times feature combination occurs; a frequency of occurrences of the feature combination; a time period between occurrences of the feature combination; and/or a time period between occurrences of a first feature combination and a second feature combination (Shriberg teaches multimodal feature fusion where “the data can be processed using one or more individual, joint, or fused models comprising a natural language processing (NLP) model, an acoustic model, and/or a visual model” ([0005]). [0035] explains that “the one or more individual, joint, or fused models can be interdependent such that each of the one or more individual, joint, or fused models is conditioned on an output of at least one other of the one or more individual, joint, or fused models.” [0036] further teaches that “generating the score can comprise fusing the NLP output, the acoustic output, and the visual output.” [0037] adds that “generating the confidence level of the score can comprise fusing (i) a confidence level of the NLP output with (ii) a confidence level of the acoustic output.” [0492] discloses that “the system may determine relative frequencies of words or phrases associated with depression … and may also pick up non-lexical or non-linguistic cues such as pauses, gasps, sighs, and slurred or mumbled speech.” [0213] specifies computation of quantitative metrics including “duration of responsive utterances … an amount of acoustic energy … and ‘voice activation’ in the responses, such as the presence or absence of speech activity.” [0216] further states that “question quality logic 1102 may then average the metrics … over multiple subjects and multiple questions … to determine the overall question quality score.” Collectively, these teachings demonstrate that Shriberg’s fused modeling of acoustic and semantic outputs inherently produces feature combinations (joint audio–text features) and quantifies them through frequency, occurrence, and time-based metrics over multiple time intervals, directly corresponding to the recited limitations of claim 9.). Regarding claim 10, Shriberg discloses the method of claim 1, wherein the set of alerting behaviors includes a mental state of the patient, the mental state is based on the first feature set, the second feature set, the third feature set, and a multidimensional mental-state model, wherein: the multidimensional mental-state model includes a first dimension, a second dimension, and a third dimension; the first dimension corresponds to a first aspect of mental state; the second dimension corresponds to a second aspect of mental state; and the third dimension corresponds to a third aspect of mental state (Shriberg’s Abstract states that “the present disclosure provides systems and methods for assessing a mental state of a subject in a single session or over multiple different sessions … using one or more individual, joint, or fused models.” [0005] similarly teaches that “the method may further comprise processing the data using one or more individual, joint, or fused models comprising a natural language processing (NLP) model, an acoustic model, and/or a visual model. 
The method may further comprise generating … one or more assessments of the mental state associated with the subject.” [0029] discloses “the one or more individual, joint, or fused models can comprise a composite model that can be an aggregate of two or more different models.” [0035] explains that “the one or more individual, joint, or fused models can be interdependent such that each … is conditioned on an output of at least one other of the … models.” [0036] adds that “generating the score can comprise fusing the NLP output, the acoustic output, and the visual output.” [0521] teaches that “a person’s baseline mood or mental state, and thus the person’s voice-based biomarkers, may change over time. The model may be continually trained to account for this.” [0523] provides that “older people may have thinner, breathier voices … Their weaker voices may make it more difficult for microphones to pick up specific biomarkers.” [0524] states that “men and women may express negative emotions more aggressively or violently, while women may be better able to articulate their emotions.” [0525] further discloses that “people from different cultures may have different methods of dealing with or expressing emotions, or may feel guilt and shame when expressing negative emotions.” Taken together, these passages show that Shriberg determines a mental state using multiple feature sets—acoustic, semantic/NLP, and visual—within a composite (multidimensional) model whose distinct dimensions correspond to separate aspects of mental state such as mood, affect, and expression.). Regarding claim 11, Shriberg discloses the method of claim 1, wherein the set of alerting behaviors includes physical actions of the patient, the physical action based on the first feature set, the physical actions included in the set of alerting behaviors includes one or more of the following: the patient lying on back; the patient lying on left side; the patient lying on right side; the patient lying on front; the patient moving legs. the patient waving arms; the patient shaking; the patient shivering; the patient perspiring; the patient sitting up in a bed; the patient sitting on a side of the bed; the patient getting out of the bed; the patient standing; and the patient falling (Shriberg’s Abstract teaches that “the present disclosure provides systems and methods for assessing a mental state of a subject in a single session or over multiple different sessions … using one or more individual, joint, or fused models.” [0005] states that “the method may further comprise processing the data using one or more individual, joint, or fused models comprising a natural language processing (NLP) model, an acoustic model, and/or a visual model. The method may further comprise generating … one or more assessments of the mental state associated with the subject.” [0017] explicitly adds: “In some embodiments, the data can further comprise images or video of the subject. The data can be further processed using the visual model to generate the assessment of the mental state of the subject. 
In some embodiments, the visual model can be selected from the group consisting of a facial cue model, a body movement/motion model, and an eye activity model.” [0047] further specifies that “the video model can comprise one or more of a facial cue model, a body movement/motion model, and a gaze model.” [0523] states that “older people may have thinner, breathier voices than younger people … Their weaker voices may make it more difficult for microphones to pick up specific biomarkers.” — indicating the inclusion of physical manifestations such as tremors or weakness in patient monitoring. [0524] adds that “men and women may express negative emotions more aggressively or violently, while women may be better able to articulate their emotions.” [0525] teaches that “people from different cultures may have different methods of dealing with or expressing emotions, or may feel guilt and shame when expressing negative emotions.” These disclosures collectively teach that Shriberg’s system uses visual and motion-based models to capture body movement and physical behaviors of a patient—such as gestures, shaking, or positional changes—based on the same multimodal feature framework that includes audio and semantic features.).

Regarding claim 12, Shriberg discloses the method of claim 1, wherein the set of alerting behaviors includes verbal statements, the verbal statements based on the third feature set, the verbal statements included in the set of alerting behaviors include one or more of the following: a request for assistance; an expressed lament; an expressed concern; an expression of worry; an expression of sorrow; and a statement of pain (Shriberg’s Abstract teaches that “the present disclosure provides systems and methods for assessing a mental state of a subject … using … a natural language processing (NLP) model, an acoustic model, and/or a visual model.” [0005] similarly states that “the method may further comprise processing … data using … a natural language processing (NLP) model … and generating … one or more assessments of the mental state associated with the subject.” [0492] explicitly teaches semantic analysis of verbal content: “the system may determine relative frequencies of words or phrases associated with depression (e.g., phrases indicating suicidal thoughts, self-harm instincts, phrases indicating a poor body image or self-image, and feelings of anxiety, isolation, or loneliness). The system may also pick up non-lexical or non-linguistic cues such as pauses, gasps, sighs, and slurred or mumbled speech.” [0524] adds that “men may express negative emotions more aggressively or violently, while women may be better able to articulate their emotions.” [0525] further discloses that “people from different cultures may have different methods of dealing with or expressing emotions, or may feel guilt and shame when expressing negative emotions.” [0527] notes that “many of the voice-based biomarkers may be involuntary utterances. For example, the patient may equivocate or the patient’s voice may quaver.” [0519] teaches that “particular voice-based biomarkers indicating emotions, such as happiness, sadness, anger, and grief” are detected. These passages show that Shriberg’s NLP model processes verbal statements conveying requests, concerns, sorrow, or emotional distress, by identifying phrases and emotional expressions in speech indicative of depression, anxiety, or pain.).
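Claims 6, 7, and 12 as mapped above involve counting alerting features, their frequency, and the time between occurrences. Below is a small sketch of those metrics computed over a timestamped transcript; the phrases, timestamps, and window are hypothetical and purely illustrative.

```python
# Illustrative only: computing occurrence-count, frequency, and inter-occurrence-interval
# metrics for an alerting semantic text feature (e.g., a request for assistance) from a
# timestamped transcript. Phrases and timestamps are hypothetical.
transcript = [
    (12.0, "can someone help me"),
    (95.0, "i feel fine"),
    (140.0, "please help"),
    (300.0, "it hurts so much"),
]

ALERTING_PHRASES = ("help", "hurts")

def phrase_metrics(segments, phrase, window_seconds):
    times = [t for t, text in segments if phrase in text]
    count = len(times)
    frequency = count / (window_seconds / 60.0)          # occurrences per minute
    gaps = [b - a for a, b in zip(times, times[1:])]     # seconds between occurrences
    return {"count": count, "per_minute": frequency, "gaps_s": gaps}

for phrase in ALERTING_PHRASES:
    print(phrase, phrase_metrics(transcript, phrase, window_seconds=360))
# e.g., "help": count=2, per_minute=0.33, gaps_s=[128.0]
```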
Regarding claim 13, Shriberg discloses the method of claim 1, wherein the set of alerting behaviors includes non-textual sounds made by the patient, the alerting behaviors based on the second feature set, the non-textual sounds included in the set of alerting behaviors include one or more of the following: crying; groaning; moaning; whimpering; and sounds of breathing difficulty (Shriberg’s Abstract states that “the present disclosure provides systems and methods for assessing a mental state of a subject … using … a natural language processing (NLP) model, an acoustic model, and/or a visual model.” [0005] explains that “the method may further comprise processing the data using one or more individual, joint, or fused models comprising a natural language processing (NLP) model, an acoustic model, and/or a visual model.” [0492] explicitly discloses detection of non-lexical and paralinguistic cues, stating that “the system may also pick up non-lexical or non-linguistic cues such as pauses, gasps, sighs, and slurred or mumbled speech.” [0505] further teaches that “voice-based biomarkers may be associated with lab values or physiological measurements,” linking acoustic properties to underlying physiological or distress indicators. [0521] describes that “a person’s baseline mood or mental state, and thus the person’s voice-based biomarkers, may change over time. The model may be continually trained to account for this.” [0527] reinforces that “many of the voice-based biomarkers may be involuntary utterances. For example, the patient may equivocate or the patient’s voice may quaver.” These disclosures demonstrate that Shriberg’s system explicitly detects non-textual, involuntary acoustic events—including sighs, gasps, quavers, and other non-speech sounds—which correspond directly to crying, groaning, moaning, whimpering, or breathing difficulty as recited in the claim.). Claims 14-20 are being rejected similarly to the rejection of claims 1-7 above for being directed to an apparatus having operations/functions corresponding to the steps of claims 1-7 above whereby the scope and contents of the recited limitations are substantially the same. Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to RAJSHEED O BLACK-CHILDRESS whose telephone number is (571)270-7838. The examiner can normally be reached M to F, 10am to 5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Quan-Zhen Wang can be reached at (571) 272-3114. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /RAJSHEED O BLACK-CHILDRESS/Examiner, Art Unit 2685
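For orientation, the claim 1 method as characterized in the rejection above follows an extract, classify, compare, and alert flow, with alert rules analogous to the if-then-else work-flow logic quoted from [0464]. The skeleton below is a minimal sketch of that flow under assumed names, thresholds, and rules; it is not the applicant's claimed implementation or Shriberg's system.

```python
# Illustrative only: an end-to-end skeleton mirroring the claim 1 steps as characterized
# above -- extract features, classify the behavior, compare against alerting behaviors,
# and invoke an alert via simple if-then rules. All names and thresholds are hypothetical.
ALERTING_BEHAVIORS = {"distress", "request_for_assistance", "breathing_difficulty"}

def extract_features(audio_stream):
    # Placeholder for real audio/ASR feature extraction.
    return {"audio": {"sigh_count": 4, "mean_energy": 0.2},
            "text": {"help_phrase_count": 2, "pain_phrase_count": 1}}

def determine_behavior(features):
    # Placeholder for a trained model; a toy rule stands in for model inference here.
    if features["text"]["help_phrase_count"] > 0 and features["audio"]["mean_energy"] < 0.3:
        return "request_for_assistance"
    return "no_alerting_behavior"

def invoke_alert(behavior):
    # If-then rule in the spirit of the work-flow logic quoted from [0464]:
    # escalate assistance requests to on-call staff, route others to routine review.
    if behavior == "request_for_assistance":
        print("ALERT: notify on-call caregiver (high priority)")
    else:
        print(f"ALERT: {behavior} logged for provider review")

behavior = determine_behavior(extract_features(audio_stream=None))
if behavior in ALERTING_BEHAVIORS:
    invoke_alert(behavior)
```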

Prosecution Timeline

Sep 08, 2023: Application Filed
Oct 18, 2025: Non-Final Rejection, §102 and §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602983
SYSTEM, METHOD AND STORAGE MEDIUM FOR VEHICLE INSPECTION AUTHORIZATION INFORMATION MANAGEMENT
2y 5m to grant; granted Apr 14, 2026
Patent 12597901
RECONFIGURABLE INTELLIGENT SURFACE REALIZED WITH INTEGRATED CHIP TILING
2y 5m to grant; granted Apr 07, 2026
Patent 12592145
FIRE DETECTION SYSTEM TESTING
2y 5m to grant; granted Mar 31, 2026
Patent 12580074
METHODS, DEVICES AND SYSTEMS FOR MEDICAL CODE EVENT INFORMATION TRACKING
2y 5m to grant; granted Mar 17, 2026
Patent 12573273
Audio Assisted File Sharing
2y 5m to grant; granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 62%
With Interview: 86% (+23.9%)
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 448 resolved cases by this examiner. Grant probability derived from career allow rate.
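For readers checking the arithmetic, the note above says the grant probability is derived from the career allow rate. A quick calculation, assuming the with-interview figure simply adds the observed lift in percentage points:

```python
# How the headline figures appear to be derived from the stats shown on this page
# (per the note above: grant probability is taken from the career allow rate).
granted, resolved = 279, 448
career_allow_rate = granted / resolved                # 0.6228... -> displayed as 62%
interview_lift = 0.239                                # +23.9 percentage points
with_interview = career_allow_rate + interview_lift   # 0.8618... -> displayed as 86%
print(round(career_allow_rate * 100), round(with_interview * 100))  # 62 86
```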
