Prosecution Insights
Last updated: April 19, 2026
Application No. 18/790,869

CONVERSATIONAL CONTROL OF AN APPLIANCE

Non-Final OA: §101, §102, §103
Filed: Jul 31, 2024
Examiner: TENGBUMROONG, NATHAN NARA
Art Unit: 2654
Tech Center: 2600 — Communications
Assignee: Haier US Appliance Solutions Inc.
OA Round: 1 (Non-Final)
Grant Probability: 43% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 3y 0m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 43% (6 granted / 14 resolved; -19.1% vs TC avg)
Interview Lift: +75.0% (strong; resolved cases with interview)
Avg Prosecution: 3y 0m (typical timeline; 34 currently pending)
Total Applications: 48 (career history, across all art units)
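One plausible way to read these figures, assuming the allow rate is a simple granted-to-resolved ratio and the lift compares allow rates between interviewed and non-interviewed cohorts (the cohort rates below are illustrative, not reported on this page):

```python
# Illustrative only: the page reports 6 granted / 14 resolved and a +75.0%
# interview lift; the per-cohort allow rates behind the lift are assumed.

def allow_rate(granted: int, resolved: int) -> float:
    """Share of resolved cases that ended in allowance."""
    return granted / resolved

def interview_lift(rate_with: float, rate_without: float) -> float:
    """Relative lift of interviewed cases over non-interviewed ones."""
    return (rate_with - rate_without) / rate_without

print(f"Career allow rate: {allow_rate(6, 14):.0%}")         # -> 43%
# Hypothetical cohort rates chosen to reproduce the reported +75.0% lift:
print(f"Interview lift: {interview_lift(0.70, 0.40):+.1%}")  # -> +75.0%
```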

Statute-Specific Performance

§101: 27.2% (-12.8% vs TC avg)
§103: 54.3% (+14.3% vs TC avg)
§102: 14.8% (-25.2% vs TC avg)
§112: 3.2% (-36.8% vs TC avg)
Tech Center averages are estimates. Based on career data from 14 resolved cases.

Office Action

§101, §102, §103
DETAILED ACTION

This office action is in response to Applicant’s submission filed on 7/31/2024. Claims 1-20 are pending in the application. As such, claims 1-20 have been examined.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) was submitted on 7/31/2024. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1, the claim recites “(a) analyzing the sound signal to identify a voice input”, “(b) identifying, based at least in part on the voice input, a presence of a conversational state trigger”, “(c) entering a conversational state of operation, wherein the conversational state of operation analyzes the voice input using a conversational state response criteria”, “(d) determining that a responsive action to the voice input is needed using the conversational state response criteria”, and “(e) implementing the responsive action.” Limitations (a) – (e) recite mental processes that may be practically performed in the mind using pen and paper. For example, limitation (a) can be done by a person listening to a sound signal and determining a voice input. Limitation (b) can be done by a person identifying a trigger word in a voice input. Limitation (c) can be done by a person analyzing a voice input using specific criteria. Limitation (d) can be done by a person determining a responsive action based on a criterion. Limitation (e) can be done by a person performing a responsive action. Under its broadest reasonable interpretation when read in light of the specification, the actions of “analyzing,” “identifying,” “entering,” “determining,” and “implementing” encompass mental processes practically performed in the human mind by evaluation and judgment using pen and paper. Accordingly, the claim recites an abstract idea (Step 2A, Prong One).

The judicial exception is not integrated into a practical application. In particular, the claim recites additional elements of “(f) appliance comprising a microphone” and “(g) obtaining a sound signal using the microphone.” The limitations (f) and (g) are mere data gathering recited at a high level of generality, and thus are insignificant extra-solution activity. In addition, all uses of the recited judicial exception require such data gathering, and, as such, these limitations do not impose any meaningful limits on the claim. These limitations amount to necessary data gathering. Further, limitations (a) - (e) and (g) are recited as being performed by an appliance, which can be considered a generic computer. In limitation (g), the appliance is used as a tool to perform the generic computer function of receiving data. In limitations (a) - (e), the appliance is used to perform an abstract idea, as discussed above in Step 2A, Prong One, such that it amounts to no more than mere instructions to apply the exception using a generic computer. Even when viewed in combination, these additional elements do not integrate the recited judicial exception into a practical application (Step 2A, Prong Two: NO), and the claim is directed to an abstract idea (Step 2A: YES).

The claim does not include additional elements that are sufficient to amount to more than the judicial exception. As discussed above, the recitation of an appliance to perform limitations (a) - (e) and (g) amounts to no more than mere instructions to apply the exception using a generic computer component. Also as discussed above, limitations (f) and (g) are recited at a high level of generality. These elements amount to receiving sound signal data using a microphone, which is well understood, routine, conventional activity. Even when considered in combination, these additional elements represent mere instructions to implement an abstract idea or other exception on a computer and insignificant extra-solution activity, which do not provide an inventive concept (Step 2B).

Regarding claim 17, the claim recites “(a) analyze the sound signal to identify a voice input”, “(b) obtain an image of a user that provided the voice input”, “(c) identify, based at least in part on the voice input and the image of the user, a presence of a conversational state trigger”, “(d) enter a conversational state of operation, wherein the conversational state of operation analyzes the voice input using a conversational state response criteria”, “(e) determine that a responsive action to the voice input is needed using the conversational state response criteria,” and “(f) implement the responsive action.” Limitations (a) – (f) recite mental processes that may be practically performed in the mind using pen and paper. For example, limitation (a) can be done by a person listening to a sound signal and determining a voice input. Limitation (b) can be done by a person observing a user providing a voice input to obtain an image. Limitation (c) can be done by a person identifying a trigger word in a voice input. Limitation (d) can be done by a person analyzing a voice input using specific criteria. Limitation (e) can be done by a person determining a responsive action based on a criterion. Limitation (f) can be done by a person performing a responsive action. Under its broadest reasonable interpretation when read in light of the specification, the actions to “analyze,” “obtain,” “identify,” “enter,” “determine,” and “implement” encompass mental processes practically performed in the human mind by evaluation and judgment using pen and paper. Accordingly, the claim recites an abstract idea (Step 2A, Prong One).

The judicial exception is not integrated into a practical application. In particular, the claim recites additional elements of “(g) a cabinet,” “(h) a microphone mounted to the cabinet,” “(i) a camera mounted to the cabinet,” “(j) a controller in operative communication with the microphone and the camera,” and “(k) obtain a sound signal using the microphone.” The limitations (h), (i), and (k) are mere data gathering recited at a high level of generality, and thus are insignificant extra-solution activity. In addition, all uses of the recited judicial exception require such data gathering, and, as such, these limitations do not impose any meaningful limits on the claim. These limitations amount to necessary data gathering. Further, limitations (a) - (f) and (k) are recited as being performed by an appliance, which can be considered a generic computer. In limitation (k), the appliance is used as a tool to perform the generic computer function of receiving data. In limitations (a) - (f), the appliance is used to perform an abstract idea, as discussed above in Step 2A, Prong One, such that it amounts to no more than mere instructions to apply the exception using a generic computer. The limitation (j) provides nothing more than mere instructions to implement an abstract idea on a generic computer. The controller recited in limitation (j) is used to perform limitations (a) – (f) and (k) without placing any limits on how the controller functions. Rather, this controller only recites the outcomes and does not include any details on how the outcomes are accomplished. Additionally, limitation (g) merely recites a well-known structure to house the controller in limitation (j) and is not significantly tied into the claim as a whole. Even when viewed in combination, these additional elements do not integrate the recited judicial exception into a practical application (Step 2A, Prong Two: NO), and the claim is directed to an abstract idea (Step 2A: YES).

The claim does not include additional elements that are sufficient to amount to more than the judicial exception. As discussed above, the recitation of an appliance to perform limitations (a) - (f) and (j) - (k) amounts to no more than mere instructions to apply the exception using a generic computer component. Also as discussed above, limitations (h), (i), and (k) are recited at a high level of generality. These elements amount to receiving sound data and image data using a camera and microphone, which is well understood, routine, conventional activity. Further, limitation (g) recites a well-known structure and does not provide an inventive concept. Even when considered in combination, these additional elements represent mere instructions to implement an abstract idea or other exception on a computer and insignificant extra-solution activity, which do not provide an inventive concept (Step 2B).

Similarly, dependent claims 2-16 and 18-20 include additional steps that are considered abstract ideas because they fail to provide meaningful significance that goes beyond generally linking the use of an abstract idea to a particular technological environment and using the computer to perform an abstract idea. Claims 2 and 18 read on a person determining a specific word in a voice input. Claim 3 reads on a person determining a request in a voice input. Claim 4 recites common appliance functions and links the abstract idea to the particular technological environment of refrigerators. Claim 5 reads on a person observing a user and determining that the user is interacting with an appliance. Claim 6 reads on a person analyzing an image to determine a position of a user. Claim 7 reads on a person determining an absence of a specific word in a voice input and deciding not to respond to a user query. Claim 8 recites using a machine learning model, which is a generic computer component. Claims 9 and 10 read on a person determining a specific amount of time has passed since a user voice input and deciding to respond to user input differently. Claim 11 reads on a person providing a response to a user. Claim 12 reads on a person using a generic computer to record a sound after determining the sound is above a certain volume and not recording when the sound is below the specified volume, then analyzing the recorded sound. Claim 13 reads on a person analyzing a voice recording to determine if there is human voice and transcribing the audio. Claim 14 recites using a generic ML model to determine a response to a user. Claim 15 reads on a person transcribing a voice input and analyzing the transcription. Claim 16 reads on a person notifying a user by reading a message. Claim 19 reads on a person analyzing an image to determine a body position of a user. Claim 20 reads on a person analyzing an image and audio of a user to determine a specific trigger, and responding to the user differently depending on the identification of the trigger.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-3, 8, 11, and 14-16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Duong et al. (US 20210065709 A1; hereinafter referred to as Duong).

Regarding claim 1, Duong teaches: a method of operating an appliance, the appliance comprising a microphone, the method comprising: obtaining a sound signal using the microphone ([0048] the speech input 104 may be spoken by a user and received at the dialog system 100 by way of a speech input component 105, such as a microphone); analyzing the sound signal to identify a voice input ([0024] The dialog system 100 is configured to receive speech inputs 104, also referred to as voice inputs, from a user 102); identifying, based at least in part on the voice input, a presence of a conversational state trigger ([0027] The wake-word detection (WD) subsystem 106 is configured to listen for and monitor a stream of audio input for input corresponding to a special sound or word or set of words); entering a conversational state of operation, wherein the conversational state of operation analyzes the voice input using a conversational state response criteria ([0027] The wake-word detection (WD) subsystem 106 is configured to listen for and monitor a stream of audio input for input corresponding to a special sound or word or set of words, referred to as a wake-word. Upon detecting the wake-word for the dialog system 100, the WD subsystem 106 is configured to activate the ASR subsystem 108); determining that a responsive action to the voice input is needed using the conversational state response criteria ([0034] the DM subsystem 150 is configured to interpret the intents identified in the logical forms received from the NLU subsystem 110. Based on the interpretations, the DM subsystem 150 may initiate one or more actions that it interprets as being requested by the speech inputs 104 provided by the user); and implementing the responsive action ([0024] The dialog system 100 may maintain a dialog with a user 102 and may possibly perform or cause one or more actions to be performed based upon interpretations of the speech inputs 104).
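The claim 1 flow that this mapping walks through can be sketched compactly. A minimal illustration follows; every identifier and the toy trigger/criteria logic are assumptions for readability, not claim language or Duong's disclosure:

```python
# Hypothetical sketch of claim 1 steps (a)-(e) as characterized above.
# All names, the wake word, and the toy criteria are illustrative only.

WAKE_WORD = "appliance"  # stand-in conversational state trigger

def transcribe(sound_signal: str) -> str:
    """(a) Analyze the sound signal to identify a voice input (stubbed)."""
    return sound_signal.lower().strip()

def conversational_criteria(voice_input: str) -> str | None:
    """(d) Decide whether a responsive action is needed (stubbed)."""
    return "set_timer" if "timer" in voice_input else None

def handle_sound(sound_signal: str) -> str:
    voice_input = transcribe(sound_signal)             # (a)
    if WAKE_WORD in voice_input:                       # (b) trigger present?
        state = "conversational"                       # (c) enter state
        action = conversational_criteria(voice_input)  # (d) decide
        if action:
            return f"performing {action} in {state} state"  # (e) implement
    return "no action"

print(handle_sound("Appliance, start a timer for pasta"))
```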
Regarding claim 2, Duong teaches: the method of claim 1, wherein identifying the presence of the conversational state trigger comprises: determining that the voice input contains a wake word for the appliance ([0027] The wake-word detection (WD) subsystem 106 is configured to listen for and monitor a stream of audio input for input corresponding to a special sound or word or set of words, referred to as a wake-word).

Regarding claim 3, Duong teaches: the method of claim 1, wherein identifying the presence of the conversational state trigger comprises: determining that the voice input contains a request ([0060] the user may provide to the dialog system 100 a first speech input 104 asking to move a workout alarm forward by one hour, and the dialog system 100 may thus translate this speech input into an input logical form 330 and execute the task of moving the workout alarm as requested. In this case, the output logical form 340 may indicate that the workout alarm was moved as requested) to perform a common appliance function ([0045] The dialog system 100 described herein may be suitable for implementation in standalone computing device 300, such as the computing device 300 shown. This computing device 300 may be, for example, a smartphone, a tablet, a notebook computer, an embedded device, a smart appliance or other smart device).

Regarding claim 8, Duong teaches: the method of claim 1, wherein determining that the responsive action to the voice input is needed using the conversational state response criteria comprises: analyzing the voice input using a machine learning model ([0032] For example, for the speech input “I'd like to order a large pepperoni pizza with mushrooms and olives,” the NLU subsystem 110 can identify the intent order pizza. The NLU subsystem can also identify and fill slots, e.g., pizza_size (filled with large) and pizza_toppings (filled with mushrooms and olives). The NLU subsystem 110 may use machine learning based techniques, rules, which may be domain specific, or a combination of machine learning techniques and rules to generate the logical forms).

Regarding claim 11, Duong teaches: the method of claim 1, wherein the responsive action comprises at least one of performing an appliance function, providing an informative response to a user, or prompting the user for further information or clarification ([0019] The dialog system is configured to receive speech inputs, interpret the speech inputs, maintain a dialog, possibly perform or cause one or more actions to be performed based on interpretations of the speech inputs, prepare appropriate responses, and output the responses to the user using audio output).

Regarding claim 14, Duong teaches: the method of claim 1, wherein determining that the responsive action to the voice input is needed comprises: analyzing the voice input using a machine learning model to identify the responsive action requested from the appliance ([0032] for the speech input “I'd like to order a large pepperoni pizza with mushrooms and olives,” the NLU subsystem 110 can identify the intent order pizza. The NLU subsystem can also identify and fill slots, e.g., pizza_size (filled with large) and pizza_toppings (filled with mushrooms and olives). The NLU subsystem 110 may use machine learning based techniques, rules, which may be domain specific, or a combination of machine learning techniques and rules to generate the logical forms. The logical forms generated by the NLU subsystem 110 are then fed to the DM subsystem 150 for further processing).

Regarding claim 15, Duong teaches: the method of claim 14, wherein analyzing the voice input using the machine learning model to identify the responsive action requested from the appliance comprises: generating a textual record of the voice input using a speech-to-text algorithm ([0029] upon the detection of the wake-word in the speech input 104, or the wake-up signal may be received upon the activation of a button, and to convert the speech input 104 to text. As part of its processing, the ASR subsystem 108 performs speech-to-text conversion. The speech input 104 may be in a natural language form, and the ASR subsystem 108 is configured to generate the corresponding natural language text in the language of the speech input 104); and analyzing the textual record to determine the responsive action ([0055] semantic parser 114 of the NLU subsystem 110 determines a logical form, specifically the input logical form 330, corresponding to an utterance 320, which may correspond to the speech input 104. As mentioned above, the semantic parser 114 may be a neural network such as a seq2seq model. In some embodiments, such a semantic parser 114 is trained to map utterances 320 to logical forms that are TAVLFs or some other representations suitable for embodiments of the dialog system).

Regarding claim 16, Duong teaches: the method of claim 1, further comprising: providing a user notification regarding performance of the responsive action ([0132] Cloud infrastructure system 902 may send a response or notification 944 to the requesting customer to indicate when the requested service is now ready for use), wherein providing the user notification comprises converting a textual response to a verbal response using a text-to-speech algorithm ([0068] In some embodiments, as in the example of FIG. 3, the response text 360 output by the NLG subsystem 118 may then be translated into audio data by the TTS subsystem 120, and that audio data may be output as speech output 122).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Duong in view of Tran et al. (US 20190362716 A1; hereinafter referred to as Tran).

Regarding claim 4, Duong teaches: the method of claim 3. Duong does not explicitly, but Tran discloses: wherein the appliance is a refrigerator appliance ([0201] In a smart refrigerator embodiment, the smart containers can self identify upon query, so a user can ask containers with a particular expiration date range to blink an LED associated with the smart container. One embodiment uses a natural language interface coupled with speech) and the common appliance function is at least one of temperature setting changes/queries, shopping list manipulation/queries, inventory information/queries, dispenser operations, unit conversions, or other cooking related questions ([0201] a speech enabled refrigerator can parse the verbal command “Refrigerator, identify expiring items” and issue a command to all containers inside the refrigerator that meet the expiration limit. The refrigerator can display the result on a display outside the refrigerator. The refrigerator can also display an interior cam view of the frig, annotated with the location of matching items responsive to a verbal query. The refrigerator can also identify low inventory and suggests or adds to a shopping list).

Duong and Tran are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Duong to combine the teachings of Tran because doing so would allow for speech-enabled natural language interaction with different types of appliances, such as a refrigerator, to assist a user through conversation, leading to improved efficiency between users and appliances using natural language (Tran [0201] When a user wishes to use voice command, the user can say the wake word such as “Refrigerator”, “Oven” or “Alexa” (for Amazon Echo) and then a command. The appliance would provide context to improve understanding of the command. For example, a speech enabled refrigerator can parse the verbal command “Refrigerator, identify expiring items” and issue a command to all containers inside the refrigerator that meet the expiration limit. The refrigerator can display the result on a display outside the refrigerator).

Claims 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Duong in view of Yu et al. (US 20180285752 A1; hereinafter referred to as Yu).

Regarding claim 5, Duong teaches: the method of claim 1. Duong does not explicitly, but Yu teaches: wherein the appliance further comprises a camera ([0009] receiving an input including at least one of voice input through a microphone and an image of a user captured through a camera), the method further comprising: obtaining an image of a user that provided the voice input ([0046] the control module 130 may analyze the user's speech received through the microphone 111, or may analyze the image of the surroundings of the user, which is obtained through the camera 113), wherein identifying the presence of the conversational state trigger comprises determining that the user is interacting with the appliance based at least in part on the image ([0120] control module 513 may identify the location where the image is taken, by using the location of the body. When the emotion or location of the user 510 is identified, the control module 513 may more clearly determine the intent of the user 510. For example, even though the user 510 utters speech including the word “hungry”, the intent of the user 510 may be differently recognized depending on the emotion or location of the user 510. For example, in the case where the user 510 utters “I am hungry” while doing an exercise using fitness equipment, the control module 513 may determine that the utterance of the user 510 is intended to hear words of encouragement, but not to relieve hunger).

Duong and Yu are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Duong to combine the teachings of Yu because doing so would allow for an appliance to analyze images of a user to determine a user’s emotional state when interacting with the appliance, leading to improved responses to user voice inputs (Yu [0120] the control module 513 may perform an image analysis on the image information (e.g., extract or identify an object) and may determine an emotional state or a situation of the user 510 by using characteristics of an object (e.g., the user 510, a person with the user 510, an object, or the like) that is included in the image. At this time, the control module 513 may use the type of emotion according to an expression, the location of an object, or the like that is stored in advance in the information storage module).

Regarding claim 6, the combination of Duong and Yu teaches: the method of claim 5. Yu further teaches: wherein determining that the user is interacting with the appliance comprises: analyzing the image to detect at least one of a body position, posture, eye contact, body proximity, or approach angle of the user ([0049] control module 130 may detect an object included in the obtained image. For example, the control module 130 may extract feature points from the image and may detect a shape (e.g., an omega shape) indicated by adjacent feature points, among the feature points, as an object (e.g., a face). The feature points may be, for example, points that represent a feature of the image to detect, track, or recognize the object in the image and may include points that are easily distinguishable despite a change in the shape, size, or position of each object in the image).

Claims 7 and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Duong in view of Smith et al. (US 20200349935 A1; hereinafter referred to as Smith).

Regarding claim 7, Duong teaches: the method of claim 1. Duong does not explicitly, but Smith teaches: the method further comprising: identifying, based at least in part on the voice input, an absence of the conversational state trigger ([0118] In this inactive state, the NMD 503 remains in a standby mode, ready to transition to an active state if a wake-word is detected, but not yet transmitting any data based on detected sound via a network interface); entering a standard state of operation, wherein the standard state of operation analyzes the voice input using a strict response criteria, wherein the responsive action is less likely to be taken under the strict response criteria relative to the conversational state response criteria ([0133] in the inactive state, the NMD evaluates detected sound to identify a wake word (i.e., the occurrence of a wake-word event), but does not transmit sound data via a network interface to other devices for further processing); and determining that the responsive action to the voice input is not needed using the strict response criteria ([0133] In this inactive state, the NMD 503 remains in a standby mode, ready to transition to an active state if a wake-word is detected, but not yet transmitting sound data based on detected sound via the network interface 224 (FIG. 5)).

Duong and Smith are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Duong to combine the teachings of Smith because doing so would allow for an appliance to transition between different response states to more efficiently and quickly respond to a user query depending on a trigger, leading to an improved user experience and energy efficiency (Smith [0031] In some embodiments, some or all of the NMDs can transition from the active state back to the inactive state after a predetermined time, for example a predetermined period of time after the last response output from that particular NMD. Accordingly, as described in more detail below, multiple NMDs may coordinate responsibility for voice control interactions to deliver an improved user experience).

Regarding claim 9, Duong teaches: the method of claim 1. Duong does not explicitly, but Smith teaches: further comprising: determining that a predetermined amount of time has passed since obtaining of the voice input ([0155] each NMD 103 may transition from the active state back to the inactive state after expiry of a predetermined period following the last response output by that particular NMD, or following the last captured voice input); and entering a standard state of operation ([0157] an NMD can be transitioned from the active state back to the inactive state after expiry of a predetermined time), wherein the standard state of operation analyzes the voice input using a strict response criteria instead of the conversational state response criteria ([0133] in the inactive state, the NMD evaluates detected sound to identify a wake word (i.e., the occurrence of a wake-word event), but does not transmit sound data via a network interface to other devices for further processing).

Regarding claim 10, the combination of Duong and Smith teaches: the method of claim 9. Smith further teaches: wherein the predetermined amount of time is between 5 and 30 seconds ([0157] an NMD can be transitioned from the active state back to the inactive state after expiry of a predetermined time. For example, the predetermined time can be a length of time (e.g., 0.5 seconds, 1 second, 2 seconds, 5 seconds, 10 seconds, 30 seconds, 1 minute) from a particular event).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Duong in view of Marelus et al. (US 20250349294 A1; hereinafter referred to as Marelus).

Regarding claim 12, Duong teaches: the method of claim 1. Duong does not explicitly, but Marelus teaches: wherein analyzing the sound signal to identify the voice input comprises: determining that the sound signal exceeds a predetermined sound level ([0124] detecting voice activity in sound detected by the at least one microphone, based on at least one of: a time duration exceeding at least one predetermined corresponding threshold; and a speech detection level exceeding at least one predetermined corresponding threshold); commencing a recording of the sound signal ([0029] the term ‘detecting a voice utterance using the at least one microphone’ may be taken to relate to recording, registering, capturing, sensing, etc.); determining that the sound signal drops below the predetermined sound level ([0230] A voice activity model is used, and an activity threshold is set, for example to 30% (i.e. the threshold at which the system assumes the input is human voice). Additionally, audio input may be split into frames (e.g. set a sample rate and a frame size). Where this activity threshold is met, for a period of, for example, X number of frames, the system considers said activity as speech, and is to process it. Where there is, for example, less than X frames that rise above said threshold, the system is to consider said activity as non-speech and/or to delete and not process) for a predetermined amount of time ([0230] the system is to consider a silentframe threshold, i.e. a silent period of say Y frames, where, if no activity frames falls above the activity threshold of 30% above, the system is to consider that the user has ended the user's input); stopping the recording of the sound signal ([0230] Where there is, for example, less than X frames that rise above said threshold, the system is to consider said activity as non-speech and/or to delete and not process); and analyzing the recording of the sound signal to identify the voice input ([0409] one or more ML models may be used to analyze the voice of the user, compare the voice to past stored elements of voice of said user, and to derive insight thereof. Additionally, and for example, one or more ML models or other models can derive insights from analysis of transcripts of the user's conversation with the system).

Duong and Marelus are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Duong to combine the teachings of Marelus because doing so would allow for the appliance to respond to a user’s voice input without the need for a wake-word or trigger and instead rely on another condition such as sound level, leading to improved user experience and saving appliance energy (Marelus [0227] the system may continue in conversation with the user without requiring the user to utter the wake-word every time it communicates with the system during that conversation (unlike traditional voice assistance systems). The user and the system can then engage in multi-turn conversation without the user needing to utter the wake-word throughout the conversation, after the initial utterance of the wake-word. Additionally, however, the system needs to know when the conversation has ended, so that it for example stops actively listening (and e.g. transcribing etc.), as it may possibly otherwise pick up utterances that are not directed to the system yet think it is a conversation with the system).

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Duong in view of Marelus, as applied to claim 12 above, and further in view of Li et al. (US 20240395261 A1; hereinafter referred to as Li).

Regarding claim 13, the combination of Duong and Marelus teaches: the method of claim 12. Duong further teaches: and analyzing the textual record to determine if the responsive action is needed from the appliance ([0034] the DM subsystem 150 is configured to interpret the intents identified in the logical forms received from the NLU subsystem 110. Based on the interpretations, the DM subsystem 150 may initiate one or more actions that it interprets as being requested by the speech inputs 104 provided by the user).

The combination of Duong and Marelus does not explicitly, but Li teaches: wherein analyzing the sound signal to identify the voice input further comprises: analyzing the recording to determine that a human voice is present in the recording ([0055] the subsystem controller 310 may include a voice detection component 312 configured to detect human voices (or speech) in the received audio data 301. For example, the voice detection component may be a voice activity detector (VAD) that can separate human speech from background noise (or other sources of audio)); generating a textual record of the recording using a speech-to-text algorithm… ([0023] a virtual assistant may detect user speech and convert the speech into a sequence of input tokens (collectively referred to as a “prompt”) that can be processed by a natural language processor (NLP)).

Duong, Marelus, and Li are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Duong and Marelus to combine the teachings of Li because doing so would allow for the appliance to provide more personalized responses to a user’s input when detecting human speech, leading to improved user experience (Li [0027] the virtual assistant of the present implementations can provide an improved user experience, for example, by developing a personality that is well-suited to the preferences of the user. Unlike existing virtual assistants that respond to user queries in the same robotic manner, the virtual assistant of the present implements can grow and adapt its personality to the whims of the user).

Claims 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Mischel et al. (US 12474889 B1; hereinafter referred to as Mischel) in view of Marelus.

Regarding claim 17, Mischel teaches: an appliance comprising: a cabinet ([col 28, lines 10-12] FIG. 15 illustrates, generally at 1500, an interactive device configured as an interactive medicine cabinet, according to embodiments of the invention); a microphone mounted to the cabinet ([col 28, lines 55-60] A microphone (one or more of 1514a, 1516a, 1518a, 1514b, 1516b, 1518b, 1520, 1522, and 1524) is used to receive acoustic signals from a user during an interaction with the medicine cabinet 1502 as described above in conjunction with the previous figures); a camera mounted to the cabinet ([col 28-29, lines 65-2] A camera (one or more of 1514a, 1516a, 1518a, 1514b, 1516b, 1518b, 1520, 1522, and 1524) is used to capture images or video of a front side of the medicine cabinet 1502 (including in some embodiments of the user) during an interaction with the medicine cabinet 1502); and a controller in operative communication with the microphone and the camera ([col 31, lines 39-42] 1934 represents a controller for a device such as a physical phenomenon device, 1934 can represent any number of different controllers used with the devices described herein), the controller being configured to: obtain a sound signal using the microphone ([col 7, lines 10-13] When a user interacts with the system 100 through voice, the microphone 120 receives voice signals from the user. The voice signals are input to the first computing module 104 and are transformed into output data); analyze the sound signal to identify a voice input ([col 6, lines 63-67] a voice-to-text (VT) system resides on the intelligent device and is implemented with a commercially available software solution such as is available from SNOWBOY™. The resident VT system can be used to process a custom wake word configured for a particular application); obtain an image of a user that provided the voice input… ([col 28-29, lines 65-2] A camera (one or more of 1514a, 1516a, 1518a, 1514b, 1516b, 1518b, 1520, 1522, and 1524) is used to capture images or video of a front side of the medicine cabinet 1502 (including in some embodiments of the user) during an interaction with the medicine cabinet 1502).

Mischel does not explicitly, but Marelus teaches: identify, based at least in part on the voice input and the image of the user, a presence of a conversational state trigger ([0280] cause the system to be activated (whether the whole system, the listening function of the system, or some other function of the system) upon detecting a wake-word spoken by a person, and/or upon detecting a person via voice, image, or video, or other biometrics. The detected person may be identified by recognizing the person's voice pattern or image); enter a conversational state of operation, wherein the conversational state of operation analyzes the voice input using a conversational state response criteria ([0215] As the system's responses are not rigidly pre-programmed, unlike traditional voice assistance systems, the ML model can be prompted, instructed, trained or otherwise formed to take on a certain character, behave a certain way and the like. This means that the system may (i) adapt to the user, and (ii) the user may tweak the experience); determine that a responsive action to the voice input is needed using the conversational state response criteria ([0219] as part of the backend process, a “state decider” functionality can be provided. Such a functionality may be provided by a ML model, such as an LLM, or some other way, which may be prompted or otherwise trained or fine-tuned or instructed in a certain manner or for a certain purpose, and which may be configured to decide whether, in order to provide an adequate response to the user's input, the system ought to obtain the response solely from an LLM, whether to obtain additional data from a database, such as for example through a RAG system (which is well-documented to the skilled person), through a system configured for browsing the web and for retrieving information, through accessing a proprietary knowledge base); and implement the responsive action ([0218] the system may be configured to use function calling or other method (to connect the one or more ML models, and preferably an LLM, to another tool such as an application), to additionally provide the user with a better experience, or to undertake tasks on behalf of a user such as for example to switch a light on in response to a user requesting that a light be switched on).

Mischel and Marelus are considered analogous in the field of speech processing.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Mischel to combine the teachings of Marelus because doing so would allow for an appliance to analyze both voice and image data to better assist a user with a requested task, improving user satisfaction and experience (Marelus [0314] a camera can be used, with suitable operating instructions, that are adapted to analyze body movements, to, for example, detect if the exercises (as per the example above, for example) are being undertaken, whether they are being undertaken well, and to further guide the user. Similarly, for example, to detect, if any, what and what amount of nutrients or nutritions, amount and types of calories, the user is consuming, and, for example, when).

Regarding claim 18, the combination of Mischel and Marelus teaches: the appliance of claim 17. Mischel further teaches: wherein identifying the presence of the conversational state trigger comprises: determining that the voice input contains a wake word for the appliance or determining that the voice input contains a request to perform a common appliance function ([col 6, lines 63-67] In one or more embodiments, a voice-to-text (VT) system resides on the intelligent device and is implemented with a commercially available software solution such as is available from SNOWBOY™. The resident VT system can be used to process a custom wake word configured for a particular application).

Regarding claim 19, the combination of Mischel and Marelus teaches: the appliance of claim 18. Marelus further teaches: wherein identifying, based at least in part on the voice input and the image of the user, the presence of the conversational state trigger comprises: determining that the user is interacting with the appliance by analyzing the image to detect at least one of a body position, posture, eye contact ([0233] To mimic human conversations, the system can be so programmed so that no wake-word needs to be used. The system can use data from its optional camera, for example, to sense whether the user is looking at the system and can thus commence listening for user input (rather than listening all the time, as, even with a module to differentiate between noise and voice, the system should ideally know whether it is random voice (e.g. conversation between the user and a visitor, for example) or whether it is voice directed to the system)), body proximity, or approach angle of the user.

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Mischel in view of Marelus, as applied to claims 17-19 above, and further in view of Smith.

Regarding claim 20, the combination of Mischel and Marelus teaches: the appliance of claim 17. Marelus further teaches: wherein the controller is further configured to: identify, based at least in part on the voice input and the image of the user, an absence of the conversational state trigger… ([0231] the voice activity detection may also include a base-line threshold, which would consider any activity that falls between the activity threshold and the base-line threshold as non-speech, with the baseline threshold also functioning as the silentframe threshold. Further, where a wake-word is used, or for example if a button, motion detection, face-detection, etc., is used, that insinuates or otherwise demonstrates that user intends to speak, the system need not adhere to the above, for the first statement, as the system may assume voice activity, but may be so programmed to still adhere to a silentframe threshold, or some other method (e.g. the user stopped looking at the system for X number of frames, or the user pressed a button), so to know when the user stopped speaking).

The combination of Mischel and Marelus does not explicitly, but Smith teaches: enter a standard state of operation, wherein the standard state of operation analyzes the voice input using a strict response criteria, wherein the responsive action is less likely to be taken under the strict response criteria relative to the conversational state response criteria; and determine that the responsive action to the voice input is not needed using the strict response criteria ([0133] in the inactive state, the NMD evaluates detected sound to identify a wake word (i.e., the occurrence of a wake-word event), but does not transmit sound data via a network interface to other devices for further processing).

Mischel, Marelus, and Smith are considered analogous in the field of speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Mischel and Marelus to combine the teachings of Smith because doing so would allow for an appliance to transition between different response states to more efficiently and quickly respond to a user query depending on a trigger, leading to an improved user experience and energy efficiency (Smith [0031] In some embodiments, some or all of the NMDs can transition from the active state back to the inactive state after a predetermined time, for example a predetermined period of time after the last response output from that particular NMD. Accordingly, as described in more detail below, multiple NMDs may coordinate responsibility for voice control interactions to deliver an improved user experience).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Hanson et al. (US 20220199079 A1) – discloses a method of receiving a user input and using natural language processing to analyze and generate a response or action for a user in a variety of smart devices.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Nathan Tengbumroong whose telephone number is (703)756-1725. The examiner can normally be reached Monday - Friday, 11:30 am - 8:00 pm EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan, can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/NATHAN TENGBUMROONG/
Examiner, Art Unit 2654

/HAI PHAN/
Supervisory Patent Examiner, Art Unit 2654
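The frame-based voice activity gating that the claim 12 mapping attributes to Marelus can be sketched as follows. The 30% activity threshold comes from quoted paragraph [0230]; the frame counts stand in for that quote's unspecified X and Y values, and all other names are hypothetical:

```python
# Illustrative sketch of the recording logic described in the Marelus quotes
# for claim 12: record while activity exceeds a level, stop after a run of
# silent frames, and discard inputs with too few active frames.

ACTIVITY_THRESHOLD = 0.30  # speech-detection level treated as human voice
MIN_SPEECH_FRAMES = 10     # "X frames": fewer active frames -> non-speech
SILENT_FRAME_LIMIT = 25    # "Y frames": silence run that ends the input

def capture_voice_input(frames):
    """frames: iterable of (activity_score, audio_frame) pairs.
    Records while activity meets the threshold and stops once a run of
    silent frames suggests the user has finished speaking."""
    recording, active_frames, silent_run = [], 0, 0
    for score, frame in frames:
        if score >= ACTIVITY_THRESHOLD:
            recording.append(frame)   # sound above level: keep recording
            active_frames += 1
            silent_run = 0
        elif recording:
            silent_run += 1           # sound dropped below level
            if silent_run >= SILENT_FRAME_LIMIT:
                break                 # silent-frame threshold: stop recording
    # Too few active frames is treated as non-speech and discarded
    return recording if active_frames >= MIN_SPEECH_FRAMES else []

voiced = [(0.8, b"a")] * 12 + [(0.1, b"-")] * 30  # toy frame stream
print(len(capture_voice_input(voiced)))           # -> 12 recorded frames
```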

Prosecution Timeline

Jul 31, 2024
Application Filed
Feb 04, 2026
Non-Final Rejection — §101, §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12530536
Mixture-Of-Expert Approach to Reinforcement Learning-Based Dialogue Management
Granted Jan 20, 2026 (2y 5m to grant)
Patent 12451142
NON-WAKE WORD INVOCATION OF AN AUTOMATED ASSISTANT FROM CERTAIN UTTERANCES RELATED TO DISPLAY CONTENT
Granted Oct 21, 2025 (2y 5m to grant)
Patent 12412050
MULTI-PLATFORM VOICE ANALYSIS AND TRANSLATION
Granted Sep 09, 2025 (2y 5m to grant)
Study what changed to get past this examiner. Based on 3 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 43%
With Interview: 99% (+75.0% lift)
Median Time to Grant: 3y 0m
PTA Risk: Low
Based on 14 resolved cases by this examiner. Grant probability derived from career allow rate.
