Prosecution Insights
Last updated: April 19, 2026
Application No. 18/358,300

SYSTEM AND METHOD FOR USING GESTURES AND EXPRESSIONS FOR CONTROLLING SPEECH APPLICATIONS

Non-Final OA: §102, §103
Filed: Jul 25, 2023
Examiner: LEVEL, BARBARA HENRY
Art Unit: 2142
Tech Center: 2100 — Computer Architecture & Software
Assignee: Wispr AI Inc.
OA Round: 1 (Non-Final)
Grant Probability: 72% (Favorable)
OA Rounds: 1-2
To Grant: 2y 8m
With Interview: 98%

Examiner Intelligence

Career Allow Rate: 72% (236 granted / 330 resolved), +16.5% vs TC avg (above average)
Interview Lift: +26.9% on resolved cases with an interview (a strong lift)
Typical Timeline: 2y 8m average prosecution; 16 applications currently pending
Career History: 346 total applications across all art units
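
The headline numbers above are simple ratios and can be sanity-checked directly. A minimal sketch in Python, assuming the dashboard rounds a 236/330 allow rate to 72% and treats the interview lift as an absolute percentage-point adjustment (the page does not show the vendor's exact methodology):

```python
# Sanity check of the examiner stats above (methodology assumed, not vendor-confirmed).
granted, resolved = 236, 330

career_allow_rate = granted / resolved  # 0.7151... -> displayed as 72%
print(f"Career allow rate: {career_allow_rate:.1%}")

# "+26.9% interview lift" read as an absolute (percentage-point) lift:
interview_lift_pp = 26.9
with_interview = 100 * career_allow_rate + interview_lift_pp  # ~98.4
print(f"With interview: {min(with_interview, 100.0):.0f}%")   # -> 98%, matching the page
```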

Statute-Specific Performance

§101: 17.2% (-22.8% vs TC avg)
§103: 42.5% (+2.5% vs TC avg)
§102: 10.4% (-29.6% vs TC avg)
§112: 20.7% (-19.3% vs TC avg)
Tech Center averages are estimates • Based on career data from 330 resolved cases
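
A quick check on the deltas in this table: assuming each "vs TC avg" figure is the examiner's rate minus the Tech Center average, every statute implies exactly the same 40.0% baseline, which suggests the "Tech Center average estimate" is a single flat figure rather than a per-statute average:

```python
# Recover the implied Tech Center baseline from each row (interpretation assumed:
# delta = examiner rate - TC average).
rows = {"§101": (17.2, -22.8), "§103": (42.5, 2.5),
        "§102": (10.4, -29.6), "§112": (20.7, -19.3)}

for statute, (rate, delta) in rows.items():
    print(f"{statute}: implied TC average = {rate - delta:.1f}%")  # 40.0 in every case
```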

Office Action

§102 §103
DETAILED ACTION

This correspondence is responsive to the application filed on July 25, 2023. Claims 1-20 are pending in the case, with claims 1 and 17 in independent form.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Summary of Detailed Action

Claims 1-2, 4-6, 11-12, 17-18 and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Choudhary et al. Claims 3 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Choudhary, and further in view of Kienzle et al. Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Choudhary, and further in view of Xu et al. Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Choudhary and Xu, and further in view of YANHUI GUO et al., Deep Multi-modality Soft-decoding of Very Low Bit-rate Face Videos. Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Choudhary, and further in view of Li et al. Claims 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over Choudhary and Li, and further in view of Liu et al. Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Choudhary, Li and Liu, and further in view of Reinspach et al. Claim 9 is objected to.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim(s) 1-2, 4-6, 11-12, 17-18 and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Choudhary et al. (Pub. No. US 2021/0012770 A1, published January 14, 2021), hereinafter Choudhary.

Regarding claim 1, Choudhary teaches: A method for training a model, the method comprising (i.e., Each of the embedding networks and the classifier can be updated (e.g., trained) by individual users to improve recognition of user commands that are received via various modalities (A method for training a model). For example, if a spoken user command is received that cannot be interpreted with high confidence, the user interface can query the user as to the meaning of the spoken command, and the user can input the meaning using a different modality, such as by performing a gesture input that is recognized by the user interface.
Choudhary, para 24, 22-26, 55.): receiving an output of a model (i.e., The multi-modal recognition engine 130 is configured to receive data from one or more of the input devices 112-116 and to process the received data to generate an output. For example, the output can include a command that most closely matches the received input and a confidence (or likelihood) indicator associated with the command (output of a model). …The combined embedding vector is processed to determine an output, such as by using a classifier trained to map combined embedding vectors to commands. An illustrative example of components that can be implemented in the multi-modal recognition engine 130 is described with reference to FIG. 2. The feedback message generator 132 is configured to generate feedback message data to be output to the user 102 via the output device 120. For example, the feedback message generator 132 can send a feedback message 144 to the output device 120 to instruct the user 102 to repeat a user input that was not adequately recognized, such as predicted to be a particular command with a confidence level below a threshold (receiving an output of a model, here an indication that a user input was not adequately recognized). Choudhary, Figs 1-3, 12A, 12B, 13, para 39-40, 37-38, 42-43, 22-26, 55, 57, 80, 90, 107.);

receiving an input signal from a speech input device wearable on a user, wherein the input signal is captured when the user is making a facial expression or gesture or speaking in response to the output (i.e., FIG. 12A is a diagram of a virtual reality or augmented reality headset operable to process multi-modal user input (speech input device wearable on a user), …. FIG. 12B is a diagram of a wearable electronic device operable to process multi-modal user input (speech input device wearable on a user). Choudhary, Figs 3, 12A-12B, para 19-20, 26, 32-36, 64, 68-69. Each of the embedding networks and the classifier can be updated (e.g., trained) by individual users to improve recognition of user commands that are received via various modalities. For example, if a spoken user command is received that cannot be interpreted with high confidence, the user interface can query the user as to the meaning of the spoken command, and the user can input the meaning using a different modality, such as by performing a gesture input that is recognized by the user interface (receiving an input signal from a speech input device wearable on a user (headset, smart watch; see Figures 3, 12A, 12B, para 26, 32-35, 42), wherein the input signal is captured when the user is making a facial expression or gesture or speaking (performing gesture, facial expression para 34, speaking para 42) in response to the output). Choudhary, Figs 1-3, 12A-12B, 13, paragraphs 24, 22-26, 68-72, 71, 80, 19-20, 32-36, 42, 64, 113-115, 4-7);

determining a feedback signal based on the input signal; and using the feedback signal at least in part to retrain the model (i.e., The data adjustor 292 is configured to determine adjustments of one or more of the embedding networks 202, 204, 206, or 220, adjustments of one or more of the weights W1-W3, or a combination thereof, to update embedding network data and weight data to represent changes that are determined to not be based on temporary conditions.
In some implementations, the data adjustor 292 is configured to perform update training to one or more of the embedded networks 202, 204, 206, or 220 to indicate updated mappings of user inputs to specific commands, such as in response to receiving disambiguation feedback from a user that helps the multi-modal recognition engine 130 to more accurately recognize a user input (e.g., to adapt to differences between the user's pronunciation of spoken command and a default speech recognition model) (determining a feedback signal (determining a disambiguation feedback) based on the input signal; and using the feedback signal (disambiguation feedback) at least in part to retrain (update) the model) or in response to user input indicating a custom mapping of an input to a particular command (e.g., the user inputs a “thumbs-up” gesture with both hands as a previously-unknown video input and indicates that this video input should cause the device 110 to turn off an alarm). Choudhary, Figs 1-2, 9-10, para 55.).

Regarding claim 2, which depends from claim 1 and recites: wherein the input signal is at least one of a group comprising, an EMG signal, a microphone input signal, an inertial measurement unit, a camera, and a biosensor (i.e., FIG. 12A depicts an example of the multi-modal recognition engine 130 and the feedback message generator 132 integrated into a headset 1202, such as a virtual reality, augmented reality, or mixed reality headset. A visual interface device, such as a display 1220, can correspond to the output device 120 and is positioned in front of the user's eyes to enable display of augmented reality or virtual reality images or scenes to the user while the headset 1202 is worn. Sensors 1250 can include one or more microphones, cameras, or other sensors, and can correspond to the input devices 112-116 of FIG. 1 (input signal is at least one of a group comprising, an EMG signal, a microphone input signal, an inertial measurement unit, a camera, and a biosensor). Although illustrated in a single location, in other implementations one or more of the sensors 1250 can be positioned at other locations of the headset 1202, such as an array of one or more microphones and one or more cameras distributed around the headset 1202 to detect multi-modal inputs. FIG. 12B depicts an example of the multi-modal recognition engine 130 and the feedback message generator 132 integrated into a wearable electronic device 1204, illustrated as a “smart watch,” that includes the display 1220 and the sensors 1250. The sensors 1250 enable detection, for example, of user input based on modalities such as video, speech, and gesture (input signal is at least one of a group comprising, an EMG signal, a microphone input signal, an inertial measurement unit, a camera, and a biosensor). Also, although illustrated in a single location, in other implementations one or more of the sensors 1250 can be positioned at other locations of the wearable electronic device 1204. Choudhary, Figs 1-2, 3, 12A, 12B, 13, paragraphs 113-114, 120, 33, 34, 42, 69, 71.).

Regarding claim 4, which depends from claim 1 and recites: wherein the model is a speech recognition model (i.e., Devices and methods are described to enable user interaction using multiple input modalities. Many user interfaces are based on automatic speech recognition (ASR) (model is a speech recognition model) and natural language processing (NLP) and are trained over many different commands, accents, and languages to be useful over a large customer base.
Choudhary, Figs 1-3, 12A, 12B, 13, para 22, 23-26, 37-43, 55, 57, 77-80).

Regarding claim 5, which depends from claim 1 and recites: wherein the model is associated with a digital assistant (i.e., FIG. 13 depicts a block diagram of a particular illustrative implementation of a device 1300 that includes the multi-modal recognition engine 130, such as in a wireless communication device implementation (e.g., a smartphone) or a digital assistant device implementation (model (multi-modal recognition model) is associated with a digital assistant). In various implementations, the device 1300 may have more or fewer components than illustrated in FIG. 13. In an illustrative implementation, the device 1300 may correspond to the device 110. In an illustrative implementation, the device 1300 may perform one or more operations described with reference to FIGS. 1-12B. Choudhary, Figs 1-3, 13, para 115, 31, 68, 122, 22-26, 37-43, 57, 77-80).

Regarding claim 6, which depends from claim 1 and recites: further comprising determining a dataset comprising a plurality of feedback signals including the feedback signal and retraining the model based on the dataset (i.e., The first history data 258 includes historical data associated with the first user and enables the processor 108 to update the first embedding network data 252, the first weight data 254, or both, based on historical trends corresponding to multi-modal inputs of the first user processed by the multi-modal recognition engine 130 (a dataset (history data dataset) comprising a plurality of feedback signals including the feedback signal (a plurality of multi-modal inputs processed including the determined disambiguation feedback response multi-modal input) and retraining (updating) the model based on the dataset). Choudhary, Figs 1-3, para 50.).

Regarding claim 11, which depends from claim 1 and recites: further comprising determining content of words spoken by the user based on the feedback signal (i.e., In some implementations, the first input 140 is a command, and the feedback message 144 instructs the user 102 to provide the second input 148 to disambiguate the first input 140. The multi-modal recognition engine 130 may send the feedback message 144 in response to a confidence level associated with recognition processing of the first input 140 failing to satisfy a confidence threshold, indicating uncertainty in an output (e.g., uncertainty of whether a spoken input indicates “up” or “off”). The user 102 may provide the second input 148 (e.g., pointing upward), and based on second data 150 that indicates the second input 148, the multi-modal recognition engine 130 can update a mapping of the first input 140 (e.g., the speech “up”) (determining content of words spoken by the user based on the feedback signal) to an action (e.g., increase a music volume) that is associated with the second input 148, such as described in further detail in FIG. 2. Choudhary, Figs 1-3, 12A, 12B, 13, para 43.).

Regarding claim 12, which depends from claim 1 and recites: wherein the output of the model is provided at least in part by a knowledge system configured to interact with the user (i.e., The output device 120 is configured to output information for the user 102, such as via generation of an audible output using a loudspeaker, visual output using a display, via one or more other output modalities (e.g., haptic), or any combination thereof.
For example, the output device 120 can receive message data (e.g., a feedback message 144) from the control unit 104 and can generate an output (e.g., an instruction 146) to the user 102, as described further below (output of the model (model output feedback) is provided at least in part by a knowledge system (embedding networks, classifiers, Figures 1-2, para 37) configured to interact with the user). In a particular example, the output device 120 includes a display configured to represent a graphical user interface, one or more loudspeakers configured to render or direct the feedback message 144 to the user 102, or a combination thereof. Choudhary, Figs 1-3, 12A, 12B, 13, para 36-37, 38-44, 55. The control unit 104 is configured to receive data corresponding to user inputs from the input devices 112-116 and to generate feedback messages to be provided to the user 102 via the output device 120. The control unit 104 includes a memory 106 coupled to one or more processors, referred to as processor 108. As described further with reference to FIG. 2, the memory 106 can include data representing one or more embedding networks, data representing one or more transformations of embedding vectors to a combined embedding space, and data representing one or more classifiers (output of the model is provided at least in part by a knowledge system configured to interact with the user), accessible for use by the processor 108. The memory 106 can also include instructions executable by the processor 108 to implement a multi-modal recognition engine 130, a feedback message generator 132, or both. Choudhary, Figs 1-3, 12A, 12B, 13, para 36-37, 38-44, 55.).

Claims 17-18 and 20 recite non-transitory computer-readable media that parallel the method of claims 1-2 and 4. Therefore, the analysis discussed above with respect to claims 1-2 and 4 also applies to claims 17-18 and 20, respectively. Accordingly, claims 17-18 and 20 are rejected based on substantially the same rationale as set forth above with respect to claims 1-2 and 4, respectively. More specifically regarding A non-transitory computer-readable medium containing instructions that, when executed, cause at least one computer hardware processor to perform (i.e., Choudhary, para 7, 129).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 3 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Choudhary as applied to claims 1 and 17 above, and further in view of Kienzle et al. (Pub. No. US 2018/0046851 A1, published February 15, 2018), hereinafter Kienzle. The Examiner notes that Kienzle is cited on Applicant’s Information Disclosure Statement filed May 31, 2024.

Regarding claim 3, which depends from claim 1 and recites: wherein the feedback signal indicates a frown and/or a head gesture. Choudhary teaches the method of claim 1, including the feedback signal indicating a gesture.
Choudhary, Figs 1-3, 12A-12B, 13, paragraphs 24, 22-26, 68-72, 71, 80, 19-20, 32-36, 42, 64, 113-115, 4-7. Choudhary does not specifically disclose a frown and/or a head gesture. However, Kienzle teaches in the field related to systems designed to detect and respond to natural human movements and conversational queries, and more specifically to systems designed to identify and act upon entities of interest to an individual using potentially imprecise cues obtained from a combination of several types of signals such as gestures and gaze directions. Kienzle, para 2. Kienzle, which is analogous to the claimed invention because Kienzle is directed to command processing using multimodal signal analysis, teaches that, The method may also comprise obtaining a second set of signals corresponding to a different signal modality, such as hand pointing gestures or head movements such as nods (frown and/or a head gesture). Kienzle, para 5, 26, 27-28, 8-9, 48, 52, 63. As such, in system 100, in addition to detectors for gaze, gesture and speech/voice tokens, one or more detectors 156 for other modalities such as facial expressions (including smiles, frowns, etc.) (frown and/or a head gesture), head orientation or movement (including nods, head shakes etc.) (frown and/or a head gesture), torso orientation or movement, gestures made using body parts other than hands (such as shoulder shrugs), and/or involuntary physiological responses/behaviors such as changes to heart rate, breathing rate, skin conductance and the like may also or instead be used. Kienzle, para 26, 27-28, 5, 8-9, 48, 52, 63.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement the method for training a model and multi-modal user interfaces of Choudhary using the frown and/or a head gesture of Kienzle, with a reasonable expectation of success, in order to enable users to immerse themselves in selected environments in which naturalistic human behaviors can be used. Kienzle, para 4, 5, 8-9, 26-28, 48, 52, 63. This would have provided the user with the advantages of additional gesture and facial expression recognition and feedback.

Claim 19 recites a non-transitory computer-readable medium that parallels the method of claim 3. Therefore, the analysis discussed above with respect to claim 3 also applies to claim 19. Accordingly, claim 19 is rejected based on substantially the same rationale as set forth above with respect to claim 3.

Claim(s) 7 is rejected under 35 U.S.C. 103 as being unpatentable over Choudhary as applied to claim 1, and further in view of Xu et al. (Pub. No. US 2023/0260536 A1, provisional application filed January 24, 2022), hereinafter Xu.

Regarding claim 7, which depends from claim 1 and recites: further comprising converting the feedback signal to a scalar value. Choudhary teaches the method of claim 1, including the feedback signal. Choudhary does not specifically disclose converting the feedback signal to a scalar value. However, Xu teaches in the field related to an interactive artificial intelligence analytical system. Xu, title, abstract, para 2, 5.
Xu, which is analogous to the claimed invention because Xu is directed to receiving, analyzing and transforming input communications from a user, teaches that, A facial analysis module may be used to detect faces and corresponding facial landmark points in the video frames and to extract facial features, which may be used to yield analytical outputs such as gaze direction and micro-expressions. “Micro-expression” refers to an involuntary facial display of emotion that lasts for a fraction of a second, sometimes as little as 1/25th of a second. The person who has expressed a micro-expression may not be aware that they have displayed an emotion through the micro-expression and may even wish to conceal the emotion. When combined later into category scores using combinatorial logic 510, emotion analysis, such as anger, hesitation, passion, nervousness/confidence, and energy level may be predicted. “Category score” refers to a value resulting from a transformation of a feature vector into a scalar based on category-specific combinatorial logic (converting the feedback signal to a scalar value). Those skilled in the art will appreciate that the human morphology features may include any other trainee features detectable with a camera, such as iris dilation, dressing etiquette, and so on. The visual features 524 or morphology features from the converted video signal may serve as inputs to a transformational module 508. Xu, Abstract, Fig 5, para 52, 50-53, 4-5, 41.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement the method for training a model and multi-modal user interfaces of Choudhary using the feature for converting the feedback signal to a scalar value of Xu, with a reasonable expectation of success, in order to transform, combine, and analyze visual and/or auditory data of the actual communicator and provide feedback to the communicator and for refining learning models to improve results over time and so that emotion analysis may be predicted. Xu, para 4-5, 52, 50, 41. This would have provided the user with the advantages of transforming and analyzing input communications for category-specific scores.

Claim(s) 8 is rejected under 35 U.S.C. 103 as being unpatentable over Choudhary and Xu as applied to claim 7 above, and further in view of YANHUI GUO et al., Deep Multi-modality Soft-decoding of Very Low Bit-rate Face Videos, MM '20: The 28th ACM International Conference on Multimedia, October 12-16, 2020, Seattle, WA, USA, https://dl.acm.org/doi/10.1145/3394171.3413709, published October 12, 2020, hereinafter Guo.

Regarding claim 8, which depends from claim 7 and recites: further comprising using the scalar value representing the feedback signal to retrain the model. Choudhary in view of Xu teaches the method of claim 7. Choudhary in view of Xu does not specifically disclose using the scalar value representing the feedback signal to retrain the model. However, Guo teaches in the field related to a multi-modality neural network. Guo, Abstract, page 3947, Introduction page 3948. Guo, which is analogous to the claimed invention because Guo is directed to multi-modality video, audio, and emotion recognition, teaches that, We adopt the conditional GAN (cGAN) technique [16, 19] to incorporate the prior knowledge of emotion state 𝑠 into statistical inference.
Specifically in our implementation, the added cGAN subnet forces the restored image ˆ𝐼 of MMSD-Net to pass the test of being the original face image 𝐼𝑡 in the given emotion state 𝑠 without compression or down sampling. The test is done by examining whether ˆ𝐼 and 𝐼 obey the same conditional probability distribution of face images in the given emotion state 𝑠. The discriminator 𝐷, with the same architecture as [19], takes the restored image ˆ𝐼 and given emotion state 𝑠 as input and outputs a single scalar 𝐷(ˆ𝐼, 𝑠) representing the probability that ˆ𝐼 came from real images with given emotion state 𝑠. The MMSD-Net is trained to maximize 𝐷(ˆ𝐼, 𝑠) and the discriminator is trained to minimize 𝐷(ˆ𝐼, 𝑠) (using the scalar value representing the feedback signal to retrain the model), simultaneously. The adversarial loss function is defined as:

𝐿𝑎𝑑𝑣 = −E𝑥 log 𝐷(𝐺(ˇ𝐼, 𝑠), 𝑠)   (6)

where 𝐺 represents the proposed MMSD neural network and 𝐷 is the discriminator. Combining all the loss terms introduced above, the overall objective function for optimizing the MMSD-Net is

𝐿 = ∥ˆ𝐼 − 𝐼∥1 + 𝜆1𝐿𝑎𝑑𝑣 + 𝜆2𝐿𝐸   (7)

where 𝜆1 and 𝜆2 are hyper-parameters. Recall from the discussions around Eq(5) that the restored frame ˆ𝐼 is an inference based on the three-modality features 𝒇𝑉,𝐴,𝐸. Guo, Sections 3.5 and 3.4, page 3951.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement the method for training a model and multi-modal user interfaces of Choudhary using the feature for converting the feedback signal to a scalar value of Xu and using the scalar value representing the feedback signal to retrain the model of Guo, with a reasonable expectation of success, in order to transform, combine, and analyze visual and/or auditory data of the actual communicator and provide feedback to the communicator and for refining learning models to improve results over time and so that emotion analysis may be predicted and to accurately predict emotional states. Xu, para 4-5, 52, 50, 41. Guo, Section 3.4, page 3951. This would have provided the user with the advantages of transforming and analyzing input communications for category-specific scores and accurately predicting and recognizing emotional states.

Claim(s) 10 is rejected under 35 U.S.C. 103 as being unpatentable over Choudhary as applied to claim 1 above, and further in view of Li et al. (Pub. No. US 2023/0128422 A1, filed October 26, 2022), hereinafter Li.

Regarding claim 10, which depends from claim 1 and recites: wherein the method used to at least in part retrain the model is based on reinforcement learning. Choudhary teaches the method of claim 1, including the method to at least in part retrain the model. Choudhary does not specifically disclose based on reinforcement learning. However, Li teaches in the field related to hardware and software for augmented reality systems and virtual reality systems. Li, para 2. Li, which is analogous to the claimed invention because Li is directed to automated speech recognition, natural language understanding, gesture input detection, teaches that, The dialog manager 216 may additionally store previous conversations between the user and the assistant system 140. In particular embodiments, the dialog manager 216 may conduct dialog optimization. Dialog optimization relates to the challenge of understanding and identifying the most likely branching options in a dialog with a user.
As an example and not by way of limitation, the assistant system 140 may implement dialog optimization techniques to obviate the need to confirm who a user wants to call because the assistant system 140 may determine a high confidence that a person inferred based on context and available data is the intended recipient. In particular embodiments, the dialog manager 216 may implement reinforcement learning frameworks to improve the dialog optimization (retrain (optimization and retraining) based on reinforcement learning). The dialog manager 216 may comprise dialog intent resolution 356, the dialog state tracker 218, and the action selector 222. In particular embodiments, the dialog manager 216 may execute the selected actions and then call the dialog state tracker 218 again until the action selected requires a user response, or there are no more actions to execute. Each action selected may depend on the execution result from previous actions. In particular embodiments, the dialog intent resolution 356 may resolve the user intent associated with the current dialog session based on dialog history between the user and the assistant system 140. The dialog intent resolution 356 may map intents determined by the NLU module 210 to different dialog intents. The dialog intent resolution 356 may further rank dialog intents based on signals from the NLU module 210, the entity resolution module 212, and dialog history between the user and the assistant system 140. Li, para 117.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement the method for training a model and multi-modal user interfaces of Choudhary using the retrain based on reinforcement learning of Li, with a reasonable expectation of success, in order to provide an assistant system that enables the user to interact with the assistant system via user inputs of various modalities (e.g., audio, voice, text, image, video, gesture, motion, location, orientation) in stateful and multi-turn conversations to receive assistance from the assistant system and to optimize and improve dialog. Li, para 7, 117. This would have provided the user with the advantages of optimizing and improving dialog between the user and an assistant system.

Claim(s) 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over Choudhary and Li as applied to claim 10 above, and further in view of Liu et al. (Patent No. US 11,715,042 B2, filed April 19, 2019), hereinafter Liu.

Regarding claim 13, which depends from claim 10 and further recites: receiving an input speech signal; converting the input speech signal to a text output; providing the text output to the knowledge system as a prompt; receiving, from the knowledge system, an output to the user, the output being generated by the knowledge system responsive to the provided prompt; and collecting a feedback signal from the user responsive to the output generated by the knowledge system. Choudhary in view of Li teaches the method of claim 10 from which claim 13 depends. As similarly discussed above with respect to claim 1, Choudhary teaches that, Each of the embedding networks and the classifier can be updated (e.g., trained) by individual users to improve recognition of user commands that are received via various modalities.
For example, if a spoken user command (receiving an input speech signal) is received that cannot be interpreted with high confidence, the user interface can query the user as to the meaning of the spoken command, and the user can input the meaning using a different modality, such as by performing a gesture input that is recognized by the user interface (receiving an input signal from a speech input device wearable on a user (collecting a feedback signal from the user responsive to the output generated by the knowledge system), wherein the input signal is captured when the user is making a facial expression or gesture or speaking (performing gesture, facial expression para 34, speaking para 42) in response to the output). Choudhary, Figs 1-3, 12A-12B, 13, paragraphs 24, 22-26, 68-72, 71, 80, 19-20, 32-36, 42, 64, 113-115, 4-7.

Choudhary in view of Li does not specifically disclose converting the input speech signal to a text output; providing the text output to the knowledge system as a prompt; receiving, from the knowledge system, an output to the user, the output being generated by the knowledge system responsive to the provided prompt. However, Liu teaches in the field related to dialog management based on machine-learning techniques within network environments, and in particular relates to hardware and software for smart assistant systems. Liu, col 1:15-20. Liu, which is analogous to the claimed invention because Liu is directed to reinforcement learning, assistant systems and user input, teaches that, (33) In particular embodiments, the assistant system 140 may receive a user input from the assistant application 136 in the client system 130 associated with the user. ... If the user input is based on an audio modality (e.g., the user may speak to the assistant application 136 or send a video including speech to the assistant application 136), the assistant system 140 may process it using an automatic speech recognition (ASR) module 210 to convert the user input into text (converting the input speech signal to a text output). …The output of the messaging platform 205 or the ASR module 210 may be received at an assistant xbot 215. More information on handling user input based on different modalities may be found in U.S. patent application Ser. No. 16/053,600, filed 2 Aug. 2018, which is incorporated by reference. Liu, Fig 2, col 11:23-43. The assistant application 136 may then communicate the request to the assistant system 140. The assistant system 140 may accordingly generate the result and send it back to the assistant application 136. The assistant application 136 may further present the result to the user in text (providing the text output to the knowledge system as a prompt; receiving, from the knowledge system, an output to the user, the output being generated by the knowledge system responsive to the provided prompt). Liu, col 7:2-21.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement the method for training a model and multi-modal user interfaces of Choudhary using the retrain based on reinforcement learning of Li and the providing the text output to the knowledge system as a prompt, receiving, from the knowledge system, an output to the user, the output being generated by the knowledge system responsive to the provided prompt of Liu, with a reasonable expectation of success, in order to provide an assistant system that enables the user to interact with the assistant system via user inputs of various modalities (e.g., audio, voice, text, image, video, gesture, motion, location, orientation) in stateful and multi-turn conversations to receive assistance from the assistant system and to optimize and improve dialog and in order to provide an assistant system that may assist a user to obtain information or services and may enable the user to interact with it with multi-modal user input (such as voice, text, image, video) in stateful and multi-turn conversations to get assistance. Li, para 7, 117. Liu, col 2:5-11. This would have provided the user with the advantages of optimizing and improving dialog between the user and an assistant system.

Regarding claim 14, which depends from claim 13 and recites: wherein the knowledge system comprises a machine learning foundation model. Choudhary in view of Li and Liu teaches the method of claim 13 from which claim 14 depends. Choudhary in view of Li does not specifically disclose wherein the knowledge system comprises a machine learning foundation model. However, Liu teaches that, (72) … The reinforcement-learning model may be further based on a deep Q-network model 445. The embodiments disclosed herein evaluate a Gradient-Boosted Decision Tree (GBDT) based supervised learning model and a Deep Q-Network (DQN) 445 based reinforcement learning model (wherein the knowledge system comprises a machine learning foundation model (knowledge system comprises a Deep Q-Network machine learning foundation model)). The DQN model 445 consists of a neural network value function estimator that is trained on offline logs of conversations that were collected in the presence of a rule-based suggestion triggering policy. The neural network is trained to predict the discounted future rewards of showing a suggestion at each step of the conversation 400. The embodiments disclosed herein use a reward value of +2.5 for a suggestion which is clicked, and −0.1 for a suggestion that is shown but not clicked. Liu, col 26:62-col 27:13.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement the method for training a model and multi-modal user interfaces of Choudhary using the retrain based on reinforcement learning of Li and the providing the text output to the knowledge system as a prompt, receiving, from the knowledge system, an output to the user, the output being generated by the knowledge system responsive to the provided prompt, wherein the knowledge system comprises a machine learning foundation model of Liu, with a reasonable expectation of success, in order to provide an assistant system that enables the user to interact with the assistant system via user inputs of various modalities (e.g., audio, voice, text, image, video, gesture, motion, location, orientation) in stateful and multi-turn conversations to receive assistance from the assistant system and to optimize and improve dialog and in order to provide an assistant system that may assist a user to obtain information or services and may enable the user to interact with it with multi-modal user input (such as voice, text, image, video) in stateful and multi-turn conversations to get assistance. Li, para 7, 117. Liu, col 2:5-11. This would have provided the user with the advantages of optimizing and improving dialog between the user and an assistant system.

Regarding claim 15, which depends from claim 14 and recites: wherein the machine learning foundation model is retrained at least in part to be personalized to the user based on the user feedback. Choudhary in view of Li and Liu teaches the method of claim 14 from which claim 15 depends. Choudhary teaches that, By enabling multi-modal user interaction, along with the ability to personalize interpretation of user commands, techniques described herein enable multi-modal user interfaces to be trained for use by particular users (model is retrained at least in part to be personalized to the user based on the user feedback), reducing or eliminating the extensive training for broad applicability of conventional user interfaces. Choudhary, Figs 1-2, 9-10, para 23, 22-25, 39, 45, 47, 51, 52, 55. Choudhary does not specifically disclose the machine learning foundation model. However, as discussed above, Choudhary in view of Li and Liu teaches the machine learning foundation model. Liu, col 26:62-col 27:13.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement the method for training a model and multi-modal user interfaces and where the model is retrained at least in part to be personalized to the user based on the user feedback of Choudhary using the retrain based on reinforcement learning of Li and the providing the text output to the knowledge system as a prompt, receiving, from the knowledge system, an output to the user, the output being generated by the knowledge system responsive to the provided prompt, wherein the knowledge system comprises a machine learning foundation model of Liu, with a reasonable expectation of success, in order to provide an assistant system that enables the user to interact with the assistant system via user inputs of various modalities (e.g., audio, voice, text, image, video, gesture, motion, location, orientation) in stateful and multi-turn conversations to receive assistance from the assistant system and to optimize and improve dialog and in order to provide an assistant system that may assist a user to obtain information or services and may enable the user to interact with it with multi-modal user input (such as voice, text, image, video) in stateful and multi-turn conversations to get assistance. Li, para 7, 117. Liu, col 2:5-11. This would have provided the user with the advantages of optimizing and improving dialog between the user and an assistant system personalized for the user.

Claim(s) 16 is rejected under 35 U.S.C. 103 as being unpatentable over Choudhary, Li and Liu as applied to claim 15 above, and further in view of Reinspach et al. (Pub. No. US 2021/0183366 A1, published June 17, 2021), hereinafter Reinspach.

Regarding claim 16, which depends from claim 15 and recites: wherein the machine learning foundation model is updated based on aggregated feedback signals collected across a plurality of users. Choudhary in view of Li and Liu teaches the method of claim 15 from which claim 16 depends, including the machine learning foundation model. Choudhary in view of Li and Liu does not specifically disclose the model updated based on aggregated feedback signals collected across a plurality of users. However, Reinspach teaches in the field related to speech recognition and digital assistants. Reinspach, which is analogous to the claimed invention because Reinspach is directed to speech recognition, digital assistants, model training and disambiguation feedback, teaches that, The additional user input may be associated with one or more known words. For example, the user input may be a selection between two products. The name of the selected product is known and includes the one or more keywords. Thus, the one or more keywords are associated with the original utterance that could not be properly understood or caused ambiguity. The association between the utterance and the one or more keywords is then logged and used to update a user-specific speech recognition key. The user-specific speech recognition key is specific to a particular user account, as it contains information on how to interpret the way that user speaks and may not apply for other users. In some embodiments, the association between the utterance and the one or more keywords is also used to train a general speech recognition model which is used to interpret utterances for a plurality of users (model updated based on aggregated feedback signals collected across a plurality of users).
The user-specific speech recognition key may be updated upon receiving the additional user input such that the next time the user makes the same utterance, the user-specific speech recognition key is referenced, and the utterance is interpreted as the one or more keywords associated with the utterance. The general speech recognition model may be a statistical model that is updated or trained in intervals and includes many entries from different users (model updated based on aggregated feedback signals collected across a plurality of users). Reinspach, Abstract, para 13, 26, 29, 32, 35.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to implement the method for training a model and multi-modal user interfaces and where the model is retrained at least in part to be personalized to the user based on the user feedback of Choudhary using the retrain based on reinforcement learning of Li and the providing the text output to the knowledge system as a prompt, receiving, from the knowledge system, an output to the user, the output being generated by the knowledge system responsive to the provided prompt, wherein the knowledge system comprises a machine learning foundation model of Liu and the model updated based on aggregated feedback signals collected across a plurality of users of Reinspach, with a reasonable expectation of success, in order to provide an assistant system that enables the user to interact with the assistant system via user inputs of various modalities (e.g., audio, voice, text, image, video, gesture, motion, location, orientation) in stateful and multi-turn conversations to receive assistance from the assistant system and to optimize and improve dialog and in order to provide an assistant system that may assist a user to obtain information or services and may enable the user to interact with it with multi-modal user input (such as voice, text, image, video) in stateful and multi-turn conversations to get assistance and to collect feedback to train models to better understand utterances and translate diverse user pronunciations into the intended words. Li, para 7, 117. Liu, col 2:5-11. Reinspach, para 11-13. This would have provided the user with the advantages of optimizing and improving dialog between the user and an assistant system personalized for the user and a general assistant system for many different users.

Allowable Subject Matter

Claim 9 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BARBARA LEVEL whose telephone number is (303) 297-4748. The examiner can normally be reached Monday through Friday 8:00 AM - 5:00 PM MT. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Mariela Reyes, can be reached at (571) 270-1006. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/BARBARA M LEVEL/
Examiner, Art Unit 2142
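
For orientation, independent claim 1, as the Office Action maps it, is a feedback-driven retraining loop: receive a model output, capture a wearable-device signal while the user reacts, derive a feedback signal, and retrain. The sketch below restates that loop in Python. It is an illustration of the claim language only; every name, the stub classes, and the toy update rule are hypothetical, not code from the application or from the cited Choudhary reference.

```python
import random
from dataclasses import dataclass

# Illustration of the claim-1 loop only; all names and logic are hypothetical.

class WearableDevice:
    """Stand-in for a speech input device wearable on the user (claim 1).
    Claim 2 lists EMG, microphone, IMU, camera, and biosensor signals."""
    def capture_reaction(self) -> list[float]:
        # Pretend to sample a short window of sensor data while the user reacts.
        return [random.uniform(-1.0, 1.0) for _ in range(8)]

@dataclass
class Assistant:
    """Stand-in for the model (claims 4-5: speech recognition model / digital assistant)."""
    bias: float = 0.0
    def respond(self, utterance: str) -> str:
        return f"answer to: {utterance}"
    def update(self, rewards: list[float]) -> None:
        # Toy "retraining" on the accumulated feedback dataset (claims 1, 6, 8).
        self.bias += 0.1 * (sum(rewards) / len(rewards))

def derive_feedback(signal: list[float]) -> float:
    """Convert the raw reaction signal to a scalar feedback value (claim 7).
    E.g., a detected frown or head shake would map to a negative reward (claim 3)."""
    return -1.0 if sum(signal) < 0 else 1.0

# Claim-1 loop: model output -> wearable reaction -> feedback signal -> retrain.
device, model, rewards = WearableDevice(), Assistant(), []
output = model.respond("set a timer")     # receiving an output of a model
signal = device.capture_reaction()       # input signal captured during the user's reaction
rewards.append(derive_feedback(signal))  # determining a feedback signal from the input signal
model.update(rewards)                    # using the feedback signal at least in part to retrain
```

Claims 10 and 13-16 layer reinforcement learning, a prompt/response knowledge system, and cross-user aggregation onto this same loop, which is why the rejection chains Li, Liu, and Reinspach onto Choudhary.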

Prosecution Timeline

Jul 25, 2023
Application Filed
Mar 05, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602907
DATA SENSITIVITY ESTIMATION
2y 5m to grant • Granted Apr 14, 2026
Patent 12596963
Machine-Learning Based Record Processing Systems
2y 5m to grant • Granted Apr 07, 2026
Patent 12579467
DECENTRALIZED CROSS-NODE LEARNING FOR AUDIENCE PROPENSITY PREDICTION
2y 5m to grant • Granted Mar 17, 2026
Patent 12567000
SYSTEMS AND METHODS FOR SUBSCRIBER-BASED ADAPTATION OF PRODUCTION-IMPLEMENTED MACHINE LEARNING MODELS OF A SERVICE PROVIDER USING A TRAINING APPLICATION
2y 5m to grant • Granted Mar 03, 2026
Patent 12561399
COMPUTER SYSTEM AND DATA ANALYSIS METHOD
2y 5m to grant • Granted Feb 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 72%
With Interview: 98% (+26.9%)
Median Time to Grant: 2y 8m
PTA Risk: Low
Based on 330 resolved cases by this examiner. Grant probability derived from career allow rate.
