Prosecution Insights
Last updated: April 18, 2026

Application No. 17/166,410
METHOD AND APPARATUS FOR CONTROLLING A VOICE ASSISTANT, AND COMPUTER-READABLE STORAGE MEDIUM

Status: Final Rejection (§103)
Filed: Feb 03, 2021
Examiner: CHAVEZ, RODRIGO A
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: Beijing Xiaomi Pinecone Electronics Co., Ltd.
OA Round: 8 (Final)

Grant Probability: 50% (Moderate)
Expected OA Rounds: 9-10
Median Time to Grant: 3y 5m
Grant Probability With Interview: 88%

Examiner Intelligence

Grants 50% of resolved cases.

Career Allow Rate: 50% (115 granted / 228 resolved; -11.6% vs TC avg)
Interview Lift: +37.3% (strong; grant rate in resolved cases with vs. without an interview)
Typical Timeline: 3y 5m average prosecution; 22 applications currently pending
Career History: 250 total applications across all art units
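These card values follow from simple arithmetic on the raw counts. A minimal sketch in Python, where the Tech Center average and the without-interview rate are back-solved assumptions inferred from the displayed deltas rather than figures the page reports directly:

```python
# Sketch: reproducing the Examiner Intelligence card from its raw counts.
# granted/resolved come from the card; tc_average and the without-interview
# rate are assumptions back-solved from the displayed -11.6% and +37.3%.

granted, resolved = 115, 228
allow_rate = granted / resolved            # 0.504 -> shown as "50%"

tc_average = 0.620                         # assumed so allow_rate - tc_average = -11.6 pts
print(f"career allow rate: {allow_rate:.1%}")                # 50.4%
print(f"vs TC average:     {allow_rate - tc_average:+.1%}")  # -11.6%

with_interview, lift = 0.88, 0.373         # "88%" and "+37.3%" from the cards
print(f"without interview: {with_interview - lift:.1%}")     # ~50.7%
```

Read this way, the "+37.3% interview lift" is a percentage-point gap between resolved cases with and without an interview, not a relative increase.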

Statute-Specific Performance

§101: 16.4% (-23.6% vs TC avg)
§103: 53.1% (+13.1% vs TC avg)
§102: 20.9% (-19.1% vs TC avg)
§112: 5.6% (-34.4% vs TC avg)

Tech Center averages are estimates. Based on career data from 228 resolved cases.
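The per-statute deltas can be checked the same way. A small sketch, assuming (since the page does not say) that each delta is simply the examiner's rate minus the Tech Center average:

```python
# Sketch: back-solving the implied Tech Center baseline from each statute's
# displayed rate and delta. The meaning of the rate itself (e.g., how often
# a rejection under that statute is overcome) is not stated on the page.

rates = {"§101": (0.164, -0.236), "§103": (0.531, +0.131),
         "§102": (0.209, -0.191), "§112": (0.056, -0.344)}

for statute, (rate, delta) in rates.items():
    print(f"{statute}: examiner {rate:.1%}, implied TC avg {rate - delta:.1%}")
```

All four rows back-solve to the same ~40.0% baseline, which suggests the original chart's black line sat at a single Tech Center estimate rather than a per-statute one.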

Office Action

Final Rejection under §103, mailed Mar 30, 2026
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments with respect to claims 1, 3, 5-10, 12, 14, 15 and 17-23 have been considered but are moot because of the new ground of rejection in view of Lemay, Kim and Zhou. Although the applicant states in the Remarks filed 12/25/2025 that "Kim, Zhou and Andersen also fail to disclose, teach or suggest at least the above underlined features as recited in amended claim 1 of the present application, and therefore fail to cure the above deficiencies of Lemay," the applicant only argues that Lemay fails to teach the newly recited limitation. In turn, the examiner argues that Zhou does teach the newly recited limitation, as noted in the rejection presented below.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 5-8, 10, 12, 14, 15, 17 and 19-23 are rejected under 35 U.S.C. 103 as being unpatentable over Lemay (US PG Pub 20140040748) in view of Kim (US PG Pub 20160077794) and further in view of Zhou (US PG Pub 20210407507).

As per claims 1, 10 and 20, Lemay discloses: A method, apparatus and non-transitory computer-readable storage medium for speech assistant control, comprising: a processor (Lemay; Fig 28, item 62; p. 0075); and memory configured to store instructions executable by the processor (Lemay; Fig 28, item 65; p. 0079), wherein the processor is configured to:

displaying an interface corresponding to a speech assistant after waking up the speech assistant with a wake-up word, and switching to, according to a control instruction corresponding to received speech data, a target interface corresponding to the control instruction from the interface corresponding to the speech assistant (Lemay; p. 0402 - the digital assistant object is displayed in an object region 1254 (FIGS. 12-15 and 33); Fig. 34, item 3402; p. 0408 - after the speech input is received, the digital assistant object is displayed (3402) in an object region of the video display screen; see also Fig. 34 & p. 0407-0416 - At any time, e.g., either before receiving a speech input or after the speech input is received, the digital assistant object is displayed (3402) in an object region of the video display screen (waking up speech assistant and displaying speech assistant interface)… The user then provides a speech input, which is received (3404) by the computing device and digital assistant… "find me a nearby restaurant."… Upon determining that the at least one information item can be displayed in its entirety in the display region of the video display screen (3410--Yes), the at least one information item is displayed (3416) in its entirety in the display region (jumping to target interface));

wherein the target interface is different from the interface corresponding to the speech assistant (Lemay; p. 0188 - Application context can also help identify the meaning of the user's intent across applications. Referring now to FIG. 21, there is shown an example in which the user has invoked virtual assistant 1002 in the context of viewing an email message (such as email message 1751), but the user's command 2150 says "Send him a text . . . ". Command 2150 is interpreted by virtual assistant 1002 as indicating that a text message, rather than an email, should be sent. However, the use of the word "him" indicates that the same recipient (John Appleseed) is intended. Virtual assistant 1002 thus recognizes that the communication should go to this recipient but on a different channel (a text message to the person's phone number, obtained from contact information stored on the device). Accordingly, virtual assistant 1002 composes text message 2152 for the user to approve and send… Fig 21 shows that the response to John Appleseed is displayed, where item 2152 corresponds to the display of the "target interface", which is different than the speech assistant interface);

determining whether a target control instruction to be executed is included in received second speech data based on the second speech data received in a displaying process of the target interface (Lemay; p. 0187 - In FIG. 20, the user has provided a command 2050: "Reply let's get this to marketing right away". Context information, including information about email message 1751 and the email application in which it displayed, is used to interpret command 2050. This context can be used to determine the meaning of the words "reply" and "this" in command 2050, and to resolve how to set up an email composition transaction to a particular recipient on a particular message thread; also see p. 0409); and

displaying an interface corresponding to the target control instruction in response to the target control instruction being included in the second speech data (Lemay; Fig. 20; p. 0187 - In FIG. 20, the user has provided a command 2050: "Reply let's get this to marketing right away". Context information, including information about email message 1751 and the email application in which it displayed, is used to interpret command 2050. This context can be used to determine the meaning of the words "reply" and "this" in command 2050, and to resolve how to set up an email composition transaction to a particular recipient on a particular message thread. In this case, virtual assistant 1002 is able to access context information to determine that "marketing" refers to a recipient named John Applecore and is able to determine an email address to use for the recipient. Accordingly, virtual assistant 1002 composes email 2052 for the user to approve and send. In this manner, virtual assistant 1002 is able to operationalize a task (composing an email message) based on user input together with context information describing the state of the current application; Figs. 34, 13 and 14; p. 0410-0412 - Upon determining that the at least one information item can be displayed in its entirety in the display region of the video display screen (3410--Yes), the at least one information item is displayed (3416) in its entirety in the display region).

Lemay, however, fails to disclose without waking up the speech assistant again with the wake-up word, displaying a speech reception identifier in the target interface and controlling to continuously receive speech data.

Kim does teach without waking up the speech assistant again with the wake-up word, displaying a speech reception identifier in the target interface and controlling to continuously receive speech data (Kim; p. 0053-0054 - While interface 422 is displayed, the threshold for triggering the virtual assistant to again receive a command can be lowered to readily detect follow-on interactions. For example, while interface 422 is displayed, a user can utter "Launch Photos," which can be recognized as a trigger phrase in some examples. The lowered threshold can make it more likely that the utterance will trigger and cause the virtual assistant to receive the command and execute the associated user intent… (without waking up the speech assistant again with the wake-up word)).

Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Lemay to include without waking up the speech assistant again with the wake-up word, displaying a speech reception identifier in the target interface and controlling to continuously receive speech data, as taught by Kim, because speech triggers… can be missed due to background noise, interference, variations in user speech, and a variety of other factors. For example, a user may utter a speech trigger softly, in a noisy environment, in a unique tone of voice, or the like, and a virtual assistant may not be triggered. Users in such instances may repeat the speech trigger louder or more clearly, or they may resort to manually initiating a virtual assistant session (Kim; p. 0005).

Furthermore, Lemay in view of Kim fail to disclose that an application corresponding to the target interface is running in a foreground of a terminal device where the speech assistant is located after switching to the target interface corresponding to the control instruction.

Zhou does teach an application corresponding to the target interface is running in a foreground of a terminal device where the speech assistant is located after switching to the target interface corresponding to the control instruction (Zhou; p. 0113 - For example, as shown in FIG. 7A or FIG. 8, the mobile phone may display, in a form of a floating menu, an identifier 701 of the speech app running in the background on the interface of the foreground application. The user may drag the identifier 701 to any location on the current interface. In addition, when the mobile phone displays the identifier 701 of the speech app, the user may still interact with the interface of the foreground application. For example, as shown in FIG. 7A, the user may click a control such as a playback button 602 on an interface 601 of the video app).

Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Lemay to include an application corresponding to the target interface is running in a foreground of a terminal device where the speech assistant is located after switching to the target interface corresponding to the control instruction, as taught by Zhou, in order to improve speech control efficiency of a speech app in the electronic device and user experience (Zhou; p. 0005).

As per claims 3 and 12, Lemay in view of Kim and Zhou discloses: The method and apparatus according to claims 1 and 10, wherein the method further comprises: closing the window interface in response to a display duration of the window interface reaching a target duration (Lemay; p. 0298 - different instances and/or embodiments of method 10 may be initiated at one or more different time intervals (e.g., during a specific time interval (target duration), at regular periodic intervals, at irregular periodic intervals, upon demand, and the like); also see p. 0311 - If, after viewing the response, the user is done 790, the method ends).

And further, Zhou does teach wherein the displaying an interface corresponding to the target control instruction comprises: displaying a window interface in the target interface in response to that there is the window interface corresponding to the target control instruction, wherein the window interface is located on an upper layer of the target interface, and a size of the window interface is smaller than a size of the target interface (Zhou; p. 0113 - For example, as shown in FIG. 7A or FIG. 8, the mobile phone may display, in a form of a floating menu (window interface is located on an upper layer…), an identifier 701 of the speech app running in the background on the interface of the foreground application. The user may drag the identifier 701 to any location on the current interface. In addition, when the mobile phone displays the identifier 701 of the speech app, the user may still interact with the interface of the foreground application. For example, as shown in FIG. 7A, the user may click a control such as a playback button 602 on an interface 601 of the video app).

Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Lemay to include wherein the displaying an interface corresponding to the target control instruction comprises: displaying a window interface in the target interface in response to that there is the window interface corresponding to the target control instruction, wherein the window interface is located on an upper layer of the target interface, and a size of the window interface is smaller than a size of the target interface, as taught by Zhou, in order to improve speech control efficiency of a speech app in the electronic device and user experience (Zhou; p. 0005).

As per claims 5 and 14, Lemay in view of Kim and Zhou discloses: The method and apparatus according to claims 1 and 10, wherein the determining whether the target control instruction to be executed is included in the received second speech data based on the second speech data comprises: performing speech recognition on the second speech data to obtain text information corresponding to the second speech data; matching the text information with instructions in an instruction library; and in response to a target instruction matched with the text information being determined and the text information meeting an instruction execution condition, determining that the target control instruction is included in the speech data (Lemay; p. 0187 - In FIG. 20, the user has provided a command 2050: "Reply let's get this to marketing right away". Context information, including information about email message 1751 and the email application in which it displayed, is used to interpret command 2050. This context can be used to determine the meaning of the words "reply" and "this" in command 2050, and to resolve how to set up an email composition transaction to a particular recipient on a particular message thread; also see p. 0409; also see p. 0314 - Referring now to FIG. 3, there is shown a flow diagram depicting a method for using context in speech elicitation and interpretation 100, so as to improve speech recognition according to one embodiment. Context 1000 can be used, for example, for disambiguation in speech recognition to guide the generation, ranking, and filtering of candidate hypotheses that match phonemes to words; also see p. 0337 - The method begins 200. Input text 202 is received. In one embodiment, input text 202 is matched 210 against words and phrases using pattern recognizers 2760, vocabulary databases 2758, ontologies and other models 1050, so as to identify associations between user input and concepts. Step 210 yields a set of candidate syntactic parses 212, which are matched for semantic relevance 220 producing candidate semantic parses 222. Candidate parses are then processed to remove ambiguous alternatives at 230, filtered and sorted by relevance 232, and returned; p. 0340);

wherein the instruction execution condition further comprises: voiceprint features corresponding to the text information are the same as voiceprint features of last speech data, wherein the last speech data is speech data corresponding to a last control instruction executed by the speech assistant; voiceprint features corresponding to the text information are voiceprint features of a target user; or semantic features between the text information and text information corresponding to last speech data are continuous (Lemay; p. 0203-217 - Another source of context data is the user's dialog history 1052 with virtual assistant 1002. Such history may include, for example, references to domains, people, places, and so forth. Referring now to FIG. 15, there is shown an example in which virtual assistant 1002 uses dialog context to infer the location for a command, according to one embodiment. In screen 1551, the user first asks "What's the time in New York"; virtual assistant 1002 responds 1552 by providing the current time in New York City. The user then asks "What's the weather". Virtual assistant 1002 uses the previous dialog history to infer that the location intended for the weather query is the last location mentioned in the dialog history. Therefore its response 1553 provides weather information for New York City).

As per claims 6 and 15, Lemay in view of Kim and Zhou discloses: The method and apparatus according to claims 1 and 10, further comprising: in response to the target control instruction being included in the second speech data, displaying text information corresponding to the second speech data at a position corresponding to the speech reception identifier (Lemay; Fig. 20 - "Reply let's get this to marketing right away"; p. 0187 - In FIG. 20, the user has provided a command 2050: "Reply let's get this to marketing right away". Context information, including information about email message 1751 and the email application in which it displayed, is used to interpret command 2050. This context can be used to determine the meaning of the words "reply" and "this" in command 2050, and to resolve how to set up an email composition transaction to a particular recipient on a particular message thread. In this case, virtual assistant 1002 is able to access context information to determine that "marketing" refers to a recipient named John Applecore and is able to determine an email address to use for the recipient. Accordingly, virtual assistant 1002 composes email 2052 for the user to approve and send. In this manner, virtual assistant 1002 is able to operationalize a task (composing an email message) based on user input together with context information describing the state of the current application; Figs. 34, 13 and 14; p. 0410-0412 - Upon determining that the at least one information item can be displayed in its entirety in the display region of the video display screen (3410--Yes), the at least one information item is displayed (3416) in its entirety in the display region).

As per claims 7 and 17, Lemay in view of Kim discloses: The method and apparatus according to claims 1 and 10, upon which claims 7 and 17 depend.

And further, Kim does teach wherein the processor is further configured to: display a speech waiting identifier in the target interface and monitor a wake-up word or a speech hot word in response to determining the speech assistant meeting a sleep state (Kim; p. 0054 - …whether the virtual assistant is active can be apparent from a display (e.g., such as interface 422 on touchscreen 246)…); display the speech reception identifier in the target interface in response to detecting the wake-up word (Kim; p. 0049 - a user can utter a trigger phrase immediately followed by a command or question (e.g., "Hey Siri (wake-up word), tell my mom I'm on my way.")… a brief prompt, silent prompt (e.g., text display, illuminated light, etc.), or other unobtrusive prompt can be issued quickly in between the trigger phrase and the command or question (speech reception identifier)); and execute a control instruction corresponding to the speech hot word in response to detecting the speech hot word (Kim; p. 0057 - In some examples, notifications can be used as pseudo triggers to allow users to interact with a virtual assistant without first uttering a speech trigger. For example, in response to receiving a notification (e.g., new email 434 shown in notification interface 432), the virtual assistant can be triggered to receive a user command from the audio input without detecting receipt of a spoken trigger. Should a user utter a command following the notification (e.g., "read that to me"), the user's intent can be determined and the command can be executed (e.g., the newly received email can be read out) without first requiring a particular speech trigger. Should a user not utter a command within a certain time, a user intent is unlikely to be determined from the input, and no virtual assistant action may take place. The time to wait for a spoken command can vary based on the notification, such as listening for commands for a longer duration when multiple notifications are received or notifications are complex (e.g., calendar invitations, messages, etc.) than when a single, simple notification is received (e.g., a reminder, alert, etc.). In some examples, the virtual assistant can thus be triggered and listen for commands in response to receiving a notification without first requiring a particular speech trigger).

Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Lemay to include wherein the processor is further configured to: display a speech waiting identifier in the target interface and monitor a wake-up word or a speech hot word in response to determining the speech assistant meeting a sleep state; display the speech reception identifier in the target interface in response to detecting the wake-up word; and execute a control instruction corresponding to the speech hot word in response to detecting the speech hot word, as taught by Kim, because speech triggers… can be missed due to background noise, interference, variations in user speech, and a variety of other factors. For example, a user may utter a speech trigger softly, in a noisy environment, in a unique tone of voice, or the like, and a virtual assistant may not be triggered. Users in such instances may repeat the speech trigger louder or more clearly, or they may resort to manually initiating a virtual assistant session (Kim; p. 0005).

Furthermore, Lemay in view of Kim fail to disclose wherein the determining that the speech assistant meets the sleep state is based on at least one of following situations: the target control instruction is not included in speech data received in a first preset time period; and no speech data is received in a second preset time period, a duration of the second preset time period being longer than that of the first preset time period.

Zhou does teach wherein the determining that the speech assistant meets the sleep state is based on at least one of following situations: the target control instruction is not included in speech data received in a first preset time period (Zhou; p. 0153 - after enabling the speech app, if the mobile phone collects, within a specific time (for example, two seconds) (first preset time period), no speech control signal entered by the user, it indicates that the user may not know how to use the speech app in this case…); and no speech data is received in a second preset time period, a duration of the second preset time period being longer than that of the first preset time period (Zhou; p. 0154 - …After displaying the speech input prompt on the first interface for a period of time (for example, three seconds) (second preset time period), the mobile phone may automatically hide the speech input prompt).

Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Lemay to include wherein the determining that the speech assistant meets the sleep state is based on at least one of following situations: the target control instruction is not included in speech data received in a first preset time period; and no speech data is received in a second preset time period, a duration of the second preset time period being longer than that of the first preset time period, as taught by Zhou, in order to improve speech control efficiency of a speech app in the electronic device and user experience (Zhou; p. 0005).

As per claim 8, Lemay in view of Kim and Zhou discloses: The method according to claim 1, wherein prior to the determining whether a target control instruction to be executed is included in received second speech data based on the second speech data, the method further comprises: acquiring detection information of a terminal, the detection information being configured for determining whether a user sends speech to the terminal; determining whether the received second speech data is speech data sent by the user to the terminal based on the detection information (Lemay; p. 0401 - the digital assistant object is used to show the status of the digital assistant. For example, if the digital assistant is waiting to be invoked it may display a first icon (e.g., a microphone icon), when the digital assistant is "listening" to the user (i.e., recording user speech input), the digital assistant display a second icon (e.g., a colorized icon showing the fluctuations in recorded speech amplitude); and when the digital assistant is processing the user's input it may display a third icon (e.g., a microphone icon with a light source swirling around the perimeter of the microphone icon)); and in response to determining that the second speech data is speech data sent by the user to the terminal, determining whether the target control instruction to be executed is included in the second speech data based on the received second speech data (Lemay; p. 0187 - In FIG. 20, the user has provided a command 2050: "Reply let's get this to marketing right away". Context information, including information about email message 1751 and the email application in which it displayed, is used to interpret command 2050. This context can be used to determine the meaning of the words "reply" and "this" in command 2050, and to resolve how to set up an email composition transaction to a particular recipient on a particular message thread; also see p. 0409).

As per claim 19, Lemay in view of Kim and Zhou discloses: A mobile terminal comprising the apparatus of claim 10, further comprising a microphone, a speaker, and a display screen (Lemay; p. 0081 - Computing device 60 includes processor(s) 63 which run software for implementing virtual assistant 1002. Input device 1206 can be of any type suitable for receiving user input, including for example a keyboard, touchscreen, microphone (for example, for voice input), mouse, touchpad, trackball, five-way switch, joystick, and/or any combination thereof. Output device 1207 can be a screen, speaker, printer, and/or any combination thereof), wherein the display screen is configured to display interfaces of other applications during user interaction with the speech assistant, and the speech assistant is configured to continuously receive speech data while the display screen displaying the interfaces of the other applications, such that operations corresponding to the continuously received speech data are capable of being executed in the interfaces of the other applications through the speech assistant, without repeated waking-up operations from the user (Lemay; p. 0408 - An example of the digital assistant object is the microphone icon 1252 shown in FIG. 12. An exemplary object region 1254 is shown in FIGS. 12-15 and described above. As described above, the digital assistant object may be used to invoke the digital assistant service and/or show its status; also see Fig 19-20 & p. 0186-0187 - In FIG. 19, the user has activated virtual assistant 1002 while viewing email message 1751 from within the email application. In one embodiment, the display of email message 1751 moves upward on the screen to make room for prompt 150 from virtual assistant 1002. This display reinforces the notion that virtual assistant 1002 is offering assistance in the context of the currently viewed email message 1751. Accordingly, the user's input to virtual assistant 1002 will be interpreted in the current context wherein email message 1751 is being viewed).

As per claims 21-23, Lemay in view of Kim and Zhou disclose wherein transparency and size of the speech reception identifier are adjustable (Lemay; p. 0420 - when the at least one information item is partially displayed (3416 of FIG. 34) in the display region, the transparency of at least a portion of the information region, and/or the object region, nearest the display region are adjusted so that at least a portion of the information item(s) are displayed under the information region and/or the object region).

Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Lemay in view of Kim and Zhou and further in view of Andersen (US PG Pub 20190138268).

As per claim 9, Lemay in view of Kim and Zhou discloses: The method according to claim 8, of determining whether the received second speech data is speech data sent by the user to the terminal based on the detection information.

Lemay in view of Kim, however, fails to disclose when the detection information is rotation angle information of the terminal, determining that the second speech data is speech data sent by the user to the terminal in response to determining that a distance between a microphone array of the terminal and a speech data source is reduced based on the rotation angle information of the terminal; and when the detection information is face image information, performing gaze estimation based on the face image information, and determining that the second speech data is speech data sent by the user to the terminal in response to determining that a gaze point corresponding to the face image information is at the terminal based on the gaze estimation.

Andersen does teach when the detection information is rotation angle information of the terminal, determining that the second speech data is speech data sent by the user to the terminal in response to determining that a distance between a microphone array of the terminal and a speech data source is reduced based on the rotation angle information of the terminal (Andersen; p. 0089 - In yet another example, the proximity sensor 148 may provide data as to the proximity of other objects within the monitored environment 150 to the smart speaker device 140, including the source of the speech component of the captured audio sample. If the human speaker is within a certain predetermined distance or proximity of the smart speaker device, then this is more indicative of that the human speaker is directing the speech towards the smart speaker device 140. If the human speaker is outside this predetermined distance or proximity, then it is more likely that the human speaker is not directing their speech towards the smart speaker device 140); and when the detection information is face image information, performing gaze estimation based on the face image information, and determining that the second speech data is speech data sent by the user to the terminal in response to determining that a gaze point corresponding to the face image information is at the terminal based on the gaze estimation (Andersen; p. 0088 - As another example, the image and video data from the sensor(s) 146 may be analyzed to identify gaze detection information (eye contact), head nod or other gesture detection indicative of the direction of attention towards the smart speaker device 140. The particular gaze detection and other video or image analysis may be directed to portions of images/video that correlate with the source of the speech component of the captured audio sample. As noted above, this may be done via correlation mechanisms that correlate a determined source location within the monitored environment 150 with elements of the image/video. The correlation may include tracking the eye position of a source of the captured audio sample and movement over several frames in the image/video data. This may include performing image recognition for facial recognition with checking of the eye location and determining the angle of the eye to understand that the eyes are looking to the camera or computer (a range of angle), when the camera or computer is part of the smart speaker device, for example. Facial recognition is not needed initially, but it may assist to lower the computation to find the eyes initially. After face recognition, eye position and time looking at the computer or camera may be used to classify the eye movement as part of gaze detected. If the human speaker is looking at the smart speaker device 140 at the time that the audio sample is captured and maintains such eye contact, or gaze, for a predetermined period of time, this is more indicative that the human speaker is directing the speech towards the smart speaker device 140; otherwise, is it more likely that the human speaker is not directing the speech towards the smart speaker device 140).

Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Lemay in view of Kim to include when the detection information is rotation angle information of the terminal, determining that the second speech data is speech data sent by the user to the terminal in response to determining that a distance between a microphone array of the terminal and a speech data source is reduced based on the rotation angle information of the terminal; and when the detection information is face image information, performing gaze estimation based on the face image information, and determining that the second speech data is speech data sent by the user to the terminal in response to determining that a gaze point corresponding to the face image information is at the terminal based on the gaze estimation, as taught by Andersen, in order to more accurately determine, by the fusion sensor service, whether the user input is specifically directed to the HCI device based on the captured sensor data (Andersen; p. 0004).

As per claim 18, the claim is directed to an apparatus that recites language similar to the combination of the limitations in claims 8 and 9. Thus, the claim is rejected similarly.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. It includes: Kudurshian (US PG Pub 20170358305), which discloses systems and processes for operating a digital assistant. In one example, a method includes receiving a first speech input from a user. The method further includes identifying context information and determining a user intent based on the first speech input and the context information. The method further includes determining whether the user intent is to perform a task using a searching process or an object managing process. The searching process is configured to search data, and the object managing process is configured to manage objects. The method further includes, in accordance with a determination the user intent is to perform the task using the searching process, performing the task using the searching process; and in accordance with the determination that the user intent is to perform the task using the object managing process, performing the task using the object managing process (Kudurshian; Abstract).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Rodrigo A Chavez, whose telephone number is (571) 270-0139. The examiner can normally be reached Monday - Friday, 9-6 ET. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Richemond Dorvil, can be reached at 571-272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RODRIGO A CHAVEZ/
Examiner, Art Unit 2658

/RICHEMOND DORVIL/
Supervisory Patent Examiner, Art Unit 2658
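The reply-period mechanics in the action's closing paragraphs reduce to date arithmetic. A minimal sketch, assuming the Mar 30, 2026 mailing date shown in the timeline below and leaving out the two-month/advisory-action adjustment described above:

```python
# Sketch: reply deadlines for this final action under the THREE-MONTH
# shortened statutory period, extendable month-by-month under 37 CFR 1.136(a)
# up to the SIX-MONTH statutory maximum. Mailing date taken from the
# timeline; the advisory-action wrinkle described in the action is omitted.

from datetime import date
from dateutil.relativedelta import relativedelta

mailed = date(2026, 3, 30)  # Final Rejection mailing date (per the timeline)

for months in range(3, 7):
    due = mailed + relativedelta(months=months)
    fee = "no fee" if months == 3 else f"{months - 3}-month extension fee"
    print(f"reply by {due:%b %d, %Y} ({fee})")
```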

Prosecution Timeline

Feb 03, 2021: Application Filed
Dec 13, 2022: Non-Final Rejection — §103
Mar 16, 2023: Response Filed
Mar 23, 2023: Final Rejection — §103
May 22, 2023: Response after Non-Final Action
Jun 23, 2023: Examiner Interview (Telephonic)
Jun 23, 2023: Response after Non-Final Action
Jun 30, 2023: Request for Continued Examination
Jul 05, 2023: Response after Non-Final Action
Sep 22, 2023: Non-Final Rejection — §103
Dec 26, 2023: Response Filed
Apr 02, 2024: Final Rejection — §103
May 28, 2024: Response after Non-Final Action
Jun 27, 2024: Examiner Interview (Telephonic)
Jun 27, 2024: Response after Non-Final Action
Jul 09, 2024: Request for Continued Examination
Jul 15, 2024: Response after Non-Final Action
Sep 21, 2024: Non-Final Rejection — §103
Dec 24, 2024: Response Filed
Mar 31, 2025: Final Rejection — §103
Jun 07, 2025: Response after Non-Final Action
Sep 04, 2025: Request for Continued Examination
Sep 08, 2025: Response after Non-Final Action
Sep 29, 2025: Non-Final Rejection — §103
Dec 25, 2025: Response Filed
Mar 30, 2026: Final Rejection — §103 (current)
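The headline round count and pendency follow directly from this event list. A short sketch, using the dates above and the page's April 18, 2026 "last updated" date as the as-of point:

```python
# Sketch: deriving "OA Round 8 (Final)" and current pendency from the
# timeline. Dates are copied verbatim from the entries above.

from datetime import date

rejections = [date(2022, 12, 13), date(2023, 3, 23), date(2023, 9, 22),
              date(2024, 4, 2), date(2024, 9, 21), date(2025, 3, 31),
              date(2025, 9, 29), date(2026, 3, 30)]
filed = date(2021, 2, 3)

print(f"OA rounds: {len(rejections)}")                 # 8
years_pending = (date(2026, 4, 18) - filed).days / 365.25
print(f"pendency to date: {years_pending:.1f} years")  # ~5.2
```

At roughly 5.2 years, the case is already well past the examiner's 3y 5m median prosecution time, consistent with the High PTA-risk flag in the projections below.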

Precedent Cases

Applications granted by this same examiner with similar technology (5 most recent grants). Study what changed in these cases to get past this examiner:

Patent 12597430: MULTI-CHANNEL SIGNAL GENERATOR, AUDIO ENCODER AND RELATED METHODS RELYING ON A MIXING NOISE SIGNAL (granted Apr 07, 2026; 2y 5m to grant)
Patent 12579984: DATA AUGMENTATION SYSTEM AND METHOD FOR MULTI-MICROPHONE SYSTEMS (granted Mar 17, 2026; 2y 5m to grant)
Patent 12541653: ENTERPRISE COGNITIVE SOLUTIONS LOCK-IN AVOIDANCE (granted Feb 03, 2026; 2y 5m to grant)
Patent 12542136: DYNAMICALLY CONFIGURING A WARM WORD BUTTON WITH ASSISTANT COMMANDS (granted Feb 03, 2026; 2y 5m to grant)
Patent 12531077: METHOD AND APPARATUS IN AUDIO PROCESSING (granted Jan 20, 2026; 2y 5m to grant)


Prosecution Projections

Expected OA Rounds: 9-10
Grant Probability: 50% (88% with interview, +37.3%)
Median Time to Grant: 3y 5m
PTA Risk: High

Based on 228 resolved cases by this examiner. Grant probability derived from career allow rate.
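Taking the card's note at face value, the projection appears to compose the career allow rate with the interview lift. A sketch of that reading; the additive model and the cap at 100% are assumptions, not the page's documented formula:

```python
# Sketch: one plausible composition of the projection card's numbers. The
# additive interview lift and the 100% cap are assumptions.

base = 115 / 228                          # career allow rate -> "50%"
lift = 0.373                              # "+37.3%" interview lift

print(f"grant probability: {base:.0%}")
print(f"with interview:    {min(base + lift, 1.0):.0%}")   # -> "88%"
```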

Free tier: 3 strategy analyses per month