Prosecution Insights
Last updated: April 19, 2026
Application No. 17/541,995

AUTOMATICALLY ADAPTING AUDIO DATA BASED ASSISTANT PROCESSING

Status: Final Rejection — §103
Filed: Dec 03, 2021
Examiner: CHAVEZ, RODRIGO A
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 4 (Final)
Grant Probability: 50% (Moderate)
OA Rounds: 5-6
To Grant: 3y 5m
With Interview: 88%

Examiner Intelligence

Career Allow Rate: 50% (115 granted / 228 resolved; -11.6% vs TC avg)
Interview Lift: +37.3% (resolved cases with vs. without an interview)
Avg Prosecution: 3y 5m (22 applications currently pending)
Total Applications: 250 (across all art units)

Statute-Specific Performance

§101: 16.4% (-23.6% vs TC avg)
§103: 53.1% (+13.1% vs TC avg)
§102: 20.9% (-19.1% vs TC avg)
§112: 5.6% (-34.4% vs TC avg)
Tech Center averages are estimates • Based on career data from 228 resolved cases

Office Action

Final Rejection — §103 (mailed Jan 23, 2026)
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant’s arguments with respect to the rejection of claim(s) 1-6, 8-11 and 16-20 under 35 U.S.C. § 102(a)(1) have been considered but are moot because of the new ground of rejection under 35 U.S.C. § 103 in view of Sharifi and Golikov. Furthermore, the examiner acknowledges the comments in the Remarks filed 09/17/2025, regarding the telephonic interview held on September 15, 2025. The examiner acknowledges the efforts to advance prosecution by providing the proposed amendments in the claim set filed 09/17/2025. The examiner, however, upon further consideration under obviousness, has found the recited amendments to be unpatentable under 35 U.S.C. 103 in view of Sharifi and Golikov.

The applicant, in the Remarks filed 09/17/2025, argues: “The Office Action fails to render obvious at least these amended features of claim 1. For example, the cited aspects of Sharifi fail to disclose "automatically adapting" to "Store Data In Sound Buffer" (the alleged "second state") from the "Active Listening State" (the alleged "third state"). As another example, the cited aspects of Sharifi fail to disclose "automatically adapting", independent of receiving any explicit user interface input that requests or confirms the automatically adapting...", from the "Inactive Listening State" (the alleged "first state") to the "Active Listening State" (the alleged "third state").”

Regarding applicant’s arguments, the examiner respectfully disagrees. The examiner contends that, under broadest reasonable interpretation, the automatically adapting from a first state to a second state, from a first state to a third state, and from a third state to a second state would be obvious because one of ordinary skill in the art would find this to be, broadly speaking, state transitioning based on contextual parameters. Based on how broadly the elements are recited, one of ordinary skill would understand that it does not matter how many states exist or how many ways exist to transition from one state to another, if those contextual parameters that condition each state transition are not specifically defined for each and every state transition. Although the claim broadly defines the dynamic contextual parameters, this definition fails to define how those dynamic contextual parameters uniquely affect each and every transition from one state to another. Thus, one of ordinary skill in the art would find it obvious to combine the state transitioning of Sharifi with the dynamic contextual parameters of Golikov to render the recited claim elements obvious and unpatentable.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 8-11 and 16-20 are rejected under 35 U.S.C. 
103 as being unpatentable over Sharifi (WIPO Publication WO2020/171809) in view of Golikov (US PG Pub 20210074285). As per claim 1, Sharifi discloses: A method implemented by one or more processors, the method comprising: processing, at a first time, first values, wherein the first values are for dynamic contextual parameters at the first time (Sharifi; Fig. 2; p. 0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice); in response to the processing at the first time satisfying one or more first conditions (Sharifi; Fig. 2; p. 0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice): automatically adapting particular audio data based assistant processing, performed locally at an assistant device, to a second state and from a first state that is active at the first time, the automatically adapting to the second state from the first state being independent of receiving any explicit user interface input that requests or confirms the automatically adapting to the second state (Sharifi; Fig. 2; p. 0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice (automatically adapting without any explicit user interface input); see also p. 0046 & 0056 - STT and natural language processing are performed locally); processing, at a second time, second values, wherein the second values are for the dynamic contextual parameters at the second time (Sharifi; p. 0077 - when a sound is detected while automated assistant 120 is in the inactive listening state, automated assistant may store the sound data in a buffer, such as memory buffer 144. If the sound stops and no sound-based event (e.g., hot word or phrase) is detected, then automated assistant 120 may transition back into the inactive listening state. However, if a sound-based event is detected, then automated assistant 120 may transition into the active listening state); in response to the processing at the second time satisfying one or more second conditions (Sharifi; p. 0077 - if a sound-based event is detected; see also p. 0091 - Next, second user 101B says, “Sleuth that,” which in this example operates as hot word(s) that constitute an event to invoke automated assistant 120; see also p. 0095 – “Sleuth that” in this example is a hot phrase that triggers automated assistant 120 to transition from the inactive listening state to the active listening state): automatically adapting the particular audio data based assistant processing, performed locally at the assistant device, to a third state (Sharifi; p. 0077 - However, if a sound- based event is detected, then automated assistant 120 may transition into the active listening state), the adaptation to the third state being from the first state, the first state being active at the second time, the automatically adapting to the third state, from the first state, being independent of receiving any explicit user interface input that requests or confirms the automatically adapting to the third state (Sharifi p. 0077; see also p. 0091-0094 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 
2 if the sound corresponds to a registered user's voice (automatically adapting without any explicit user interface input)); processing, at a third time, third values, wherein the third values are for the dynamic contextual parameters at the third time (Sharifi; p. 0077 - when a sound is detected while automated assistant 120 is in the inactive listening state, automated assistant may store the sound data in a buffer, such as memory buffer 144. If the sound stops and no sound-based event (e.g., hot word or phrase) is detected, then automated assistant 120 may transition back into the inactive listening state. However, if a sound-based event is detected, then automated assistant 120 may transition into the active listening state); in response to the processing at the third time satisfying one or more third conditions (Sharifi; p. 0077 - if a sound-based event is detected; see also p. 0091 - Next, second user 101B says, “Sleuth that,” which in this example operates as hot word(s) that constitute an event to invoke automated assistant 120; see also p. 0095 – “Sleuth that” in this example is a hot phrase that triggers automated assistant 120 to transition from the inactive listening state to the active listening state): automatically adapting the particular audio data based assistant processing, performed locally at the assistant device, to the second state, the adaptation to the second state being from the third state, the third state being active at the third time, the automatically adapting to the second state, from the third state, being independent of receiving any explicit user interface input that requests or confirms the automatically adapting to the second state (Sharifi; Fig. 2; p. 0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice (automatically adapting without any explicit user interface input)); wherein one of the first state, the second state, and the third state is an inactive state (Sharifi; p. 0077, 0093 & 0095 - inactive listening state) in which at least part of the particular audio data based assistant processing is suppressed for one or more registered users that are registered with the assistant device (Sharifi; p. 0093 – speaker recognition technology may be used to determine a speaker's identity. If the speaker's identity is successfully determined, and/or it is determined that the speaker has sufficient permissions, then certain hot word(s) may become activated; see also p. 0095 - “Sleuth that” in this example is a hot phrase that triggers automated assistant 120 to transition from the inactive listening state to the active listening state; see also p. 0097 – In Fig. 7, assume once again that first user 101A is speaker-recognizable by automated assistant 120 as registered user… first user 101A invokes assistant; the permissions policy permits the registered user to use certain hot words while others will not be recognized (suppressed)) and is also suppressed for any users that are not registered with the assistant device (Sharifi; p. 0093 - the sound may not be captured at all, or may be discarded from memory immediately after it is determined that the speaker is not recognized). 
Sharifi, however, fails to disclose wherein the dynamic contextual parameters include at least one of a current activity parameter indicating an activity occurring in an environment of the assistant device, or a temporal parameter indicating a current temporal condition. Golikov does teach wherein the dynamic contextual parameters include at least one of a current activity parameter indicating an activity occurring in an environment of the assistant device, or a temporal parameter indicating a current temporal condition (Golikov; p. 0068 - one or more component(s) and/or function(s) of the automated assistant client 170 can be initiated responsive to a detection of human presence based on output from presence sensor(s) 167 (activity occurring in an environment of the assistant device). For example, attention handler 115, visual capture engine 174, and/or speech capture engine 172 can optionally be activated only responsive to a detection of human presence. Also, for example, those and/or other component(s) (e.g., on-device speech recognition engine 120, on-device NLU engine 140, and/or on-device fulfillment engine 145) can optionally be deactivated responsive to no longer detecting human presence; see also p. 0010; see also p. 0116 - deactivating the on-device speech recognition may include deactivating the on-device speech recognition when it is determined to not activate the on-device natural language understanding and/or the fulfillment, and further based on at least a threshold duration of time passing without further voice activity detection and/or further recognized text). Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Sharifi to include wherein the dynamic contextual parameters include at least one of a current activity parameter indicating an activity occurring in an environment of the assistant device, or a temporal parameter indicating a current temporal condition, as taught by Golikov, because those implementations enable interaction of a user with an automated assistant to be initiated and/or guided without the user needing to preface such interaction with utterance of a hot-word and/or with other explicit invocation cue. This enables reduced user input to be provided by the user (at least due to omission of the hot-word or other explicit invocation cue), which directly lessens the duration of the interaction and thereby may reduce time-to-fulfillment and conserve various local processing resources that would otherwise be utilized in a prolonged interaction (Golikov; p. 0011). As per claim 2, Sharifi in view of Golikov discloses: The method of claim 1, wherein the first state is one of: (a) a fully active state in which the particular audio data based assistant processing is fully performed for one or more registered users that are registered with the assistant device and is also fully performed for any users that are not registered with the assistant device (Sharifi; p. 0094-0097 - in the example of figures 6-7, once the hot word has been recognized and the assistant transitions to the active listening mode, both utterances from the registered and unregistered user are considered); (b) a partially active state in which the particular audio data based assistant processing is fully performed for at least some of the one or more registered users, but at least part of the particular audio data based assistant processing is suppressed for any users that are not registered with the assistant device (Sharifi; p. 
0093 automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER’ state of FIG. 2 if the sound corresponds to a registered user's voice. Otherwise the sound may not be captured at all, or may be discarded from memory immediately after it is determined that the speaker is not recognized); and (c) the inactive state (Sharifi; p. 0093 – speaker recognition technology may be used to determine a speaker's identity. If the speaker's identity is successfully determined, and/or it is determined that the speaker has sufficient permissions, then certain hot word(s) may become activated; see also p. 0095 - “Sleuth that” in this example is a hot phrase that triggers automated assistant 120 to transition from the inactive listening state to the active listening state; see also p. 0097 – In Fig. 7, assume once again that first user 101A is speaker-recognizable by automated assistant 120 as registered user… first user 101A invokes assistant; the permissions policy permits the registered user to use certain hot words while others will not be recognized (suppressed)). As per claim 3, Sharifi in view of Golikov, discloses: The method of claim 2, wherein the second state is another of: (a) the fully active state, (b) the partially active state, and (c) the inactive state (Sharifi; p. 0077 & p. 0093-0097). As per claim 4, Sharifi in view of Golikov discloses: The method of claim 3, wherein the third state is the remaining of: (a) the fully active state, (b) the partially active state, and (c) the inactive state (Sharifi; p. 0077 & p. 0093-0097). As per claim 5, Sharifi in view of Golikov discloses: The method of claim 4, wherein the particular audio data based assistant processing is hot word processing (Sharifi; p. 0038 & p. 0048), the hot word processing comprising: processing a stream of audio data, using one or more local hot word models of the assistant device, to monitor for occurrence of a hot word (Sharifi; p. 0048 - one or more on-device invocation models 114 may be used by invocation module 113 to determine whether an utterance and/or visual cue(s) qualify as an invocation. Such an on-device invocation model 113 may be trained to detect variations of hot words/phrases and/or gestures), the stream of audio data being detected via at least one microphone of the assistant device (Sharifi; p. 0046 - Speech capture module 110 may be configured to capture a user's speech, e.g., via a microphone 109); and causing further assistant processing to be performed based on detecting occurrence of the hot word in the stream of audio data (Sharifi; p. 0077 - when a sound is detected while automated assistant 120 is in the inactive listening state, automated assistant may store the sound data in a buffer, such as memory buffer 144. If the sound stops and no sound-based event (e.g., hot word or phrase) is detected, then automated assistant 120 may transition back into the inactive listening state. However, if a sound-based event is detected, then automated assistant 120 may transition into the active listening state). As per claim 6, Sharifi in view of Golikov discloses: The method of claim 5, wherein, in the partially active state, the particular audio data based assistant processing is suppressed, for any users that are not registered with the assistant device, by causing the further assistant processing to be performed further based on: verifying that the hot word was uttered by one of the one or more registered users (Sharifi; p. 
0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice). As per claim 8, Sharifi in view of Golikov discloses: The method of claim 5, wherein the hot word is an assistant hot word for invoking an automated assistant (Sharifi; p. 0038 & p. 0048 - …A user may verbally provide (e.g., type, speak) a predetermined invocation phrase, such as "OK, Assistant," or "Hey, Assistant," to cause automated assistant 120 to begin actively listening or monitoring typed text…). As per claim 9, Sharifi in view of Golikov discloses: The method of claim 8, wherein the further assistant processing comprises: performing speech recognition, on audio data that captures a spoken utterance and that follows and/or precedes the hot word in the stream of audio data, to generate a recognition of the spoken utterance (Sharifi; p. 0040 - automated assistant 120 may utilize automatic speech recognition ("ASR," also referred to as "speech-to-text," or "STT") to convert utterances from users into text); performing natural language understanding, on the recognition, to generate natural language understanding data (Sharifi; p. 0056 - … intent matcher 135 may include a natural language processor 122…; see also p. 0058-0064); and/or causing one or more actions to be performed based on the natural language understanding data (Sharifi; p. 0064 - known command syntaxes may be used to determine fitness of spoken utterances for triggering responsive action by automated assistant 120). As per claim 10, Sharifi in view of Golikov discloses: The method of claim 5, wherein the hot word is an action hot word for directly invoking a particular action via the automated assistant, and wherein the further processing comprises causing, by the automated assistant, the particular action to be performed based on detecting occurrence of the hot word in the stream of audio data (Sharifi; p. 0070 - Additionally or alternatively, fulfillment module 124 may be configured to receive, e.g., from intent matcher 135, a user's intent and any slot values provided by the user or determined using other means (e.g., GPS coordinates of the user, user preferences, etc.) and trigger a responsive action. Responsive actions may include, for instance, ordering a good/service, starting a timer, setting a reminder, initiating a phone call, playing media, sending a message, etc. In some such implementations, fulfillment information may include slot values associated with the fulfillment, confirmation responses (which may be selected from predetermined responses in some cases), etc.). As per claim 11, Sharifi in view of Golikov discloses: The method of claim 5, wherein the hot word is a third-party assistant application hot word for directly invoking a particular third-party application via the automated assistant, and wherein the further processing comprises causing, by the automated assistant, the particular third-party assistant application to be invoked based on detecting occurrence of the hot word in the stream of audio data (Sharifi; p. 0067 - In some implementations, automated assistant 120 may serve as an intermediary between users and one or more third party computing services 130 (or "third party agents", or "agents")…). 
As per claim 16, Sharifi in view of Golikov discloses: The method of claim 1, wherein the first state is one of:(a) a first threshold state in which one or more first thresholds are utilized for the particular audio data based assistant processing (Sharifi; p. 0109 - if the confidence measure falls below a first threshold, then no action may be taken on data stored in buffer 144); (a) a second threshold state in which one or more second thresholds are utilized for the particular audio data based assistant processing (Sharifi; p. 0109 - If the confidence measure falls between the first threshold and a higher second threshold, then automated assistant 120 may seek permission from a user before taking responsive action, e.g., by confirming the user's request. If the confidence measure falls between the second threshold and a third higher threshold, automated assistant 120 may take responsive action if only local resources are required... If the confidence measure falls above the third threshold, automated assistant 120 may take responsive action without needing further input and without regard to whether local or cloud-based resources are required to fulfill the request); and (c) an inactive state in which the particular audio data processing is suppressed for the one or more registered users and is suppressed for any users that are not registered with the assistant device (Sharifi; p. 0093 – speaker recognition technology may be used to determine a speaker's identity. If the speaker's identity is successfully determined, and/or it is determined that the speaker has sufficient permissions, then certain hot word(s) may become activated; see also p. 0095 - “Sleuth that” in this example is a hot phrase that triggers automated assistant 120 to transition from the inactive listening state to the active listening state; see also p. 0097 – In Fig. 7, assume once again that first user 101A is speaker-recognizable by automated assistant 120 as registered user… first user 101A invokes assistant; the permissions policy permits the registered user to use certain hot words while others will not be recognized (suppressed); see also p. 0077). As per claim 17, Sharifi in view of Golikov discloses: The method of claim 16, wherein the second state is another of: (a) the first threshold state, (b) the second threshold state, and (c) the inactive state; and/or wherein the third state is the remaining of: (a) the first threshold state, (b) the second threshold state, and (c) the inactive state (Sharifi; p. 0077 & p. 0093-0097). As per claim 18, Sharifi in view of Golikov disclose: The method of claim 17, wherein the particular audio data based assistant processing is hot word processing (Sharifi; p. 0038 & p. 0048), the hot word processing comprising: processing a stream of audio data, using one or more local hot word models of the assistant device, to monitor for occurrence of a hot word (Sharifi; p. 0048 - one or more on-device invocation models 114 may be used by invocation module 113 to determine whether an utterance and/or visual cue(s) qualify as an invocation. Such an on-device invocation model 113 may be trained to detect variations of hot words/phrases and/or gestures), the stream of audio data being detected via at least one microphone of the assistant device (Sharifi; p. 
0046 - Speech capture module 110 may be configured to capture a user's speech, e.g., via a microphone 109); and causing further assistant processing to be performed based on detecting occurrence of the hot word in the stream of audio data (Sharifi; p. 0077 - when a sound is detected while automated assistant 120 is in the inactive listening state, automated assistant may store the sound data in a buffer, such as memory buffer 144. If the sound stops and no sound-based event (e.g., hot word or phrase) is detected, then automated assistant 120 may transition back into the inactive listening state. However, if a sound-based event is detected, then automated assistant 120 may transition into the active listening state). As per claim 19, Sharifi discloses: A method implemented by one or more processors, the method comprising: processing, at a first time, first values, for dynamic contextual parameters, at the first time (Sharifi; Fig. 2; p. 0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice); in response to the processing at the first time satisfying one or more first conditions (Sharifi; Fig. 2; p. 0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice); automatically adapting particular audio data based assistant processing, performed locally at an assistant device, to a second state and from a first state that is active at the first time, the automatically adapting to the second state from the first state being independent of receiving any explicit user interface input that requests or confirms the automatically adapting to the second state (Sharifi; Fig. 2; p. 0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice (automatically adapting without any explicit user interface input); see also p. 0046 & 0056 - STT and natural language processing are performed locally); processing, at a second time, second values, wherein the second values are for the dynamic contextual parameters at the second time (Sharifi; p. 0077 - when a sound is detected while automated assistant 120 is in the inactive listening state, automated assistant may store the sound data in a buffer, such as memory buffer 144. If the sound stops and no sound-based event (e.g., hot word or phrase) is detected, then automated assistant 120 may transition back into the inactive listening state. However, if a sound-based event is detected, then automated assistant 120 may transition into the active listening state); in response to the processing at the second time satisfying one or more second conditions (Sharifi; p. 0077 - if a sound-based event is detected; see also p. 0091 - Next, second user 101B says, “Sleuth that,” which in this example operates as hot word(s) that constitute an event to invoke automated assistant 120; see also p. 
0095 – “Sleuth that” in this example is a hot phrase that triggers automated assistant 120 to transition from the inactive listening state to the active listening state): automatically adapting the particular audio data based assistant processing, performed locally at the assistant device, to the first state and from the second state that is active at the second time (Sharifi; p. 0077 - However, if a sound- based event is detected, then automated assistant 120 may transition into the active listening state), the automatically adapting to the first state from the second state being independent of receiving any explicit user interface input that requests or confirms the automatically adapting to the first state (Sharifi p. 0077; see also p. 0091-0094 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice (automatically adapting without any explicit user interface input)); wherein the first state is one of: (a) a fully active state in which the particular audio data based assistant processing is fully performed for one or more registered users that are registered with the assistant device and is also fully performed for any users that are not registered with the assistant device (Sharifi; p. 0094-0097 - in the example of figures 6-7, once the hot ward has been recognized and the assistant transitions to the active listening mode, both utterances from the registered and unregistered user are considered); and (b) an inactive state in which at least part of the particular audio data based assistant processing is suppressed for one or more registered users that are registered with the assistant device and is also suppressed for any users that are not registered with the assistant device (Sharifi; p. 0093 – speaker recognition technology may be used to determine a speaker's identity. If the speaker's identity is successfully determined, and/or it is determined that the speaker has sufficient permissions, then certain hot word(s) may become activated; see also p. 0095 - “Sleuth that” in this example is a hot phrase that triggers automated assistant 120 to transition from the inactive listening state to the active listening state; see also p. 0097 – In Fig. 7, assume once again that first user 101A is speaker-recognizable by automated assistant 120 as registered user… first user 101A invokes assistant; the permissions policy permits the registered user to use certain hot words while others will not be recognized (suppressed); see also p. 0077); and wherein the second state is the other of (a) the fully active state and (b) the inactive state (Sharifi; p. 0093-0097). Sharifi, however, fails to disclose wherein the dynamic contextual parameters include at least one of: a current activity parameter indicating an activity occurring in an environment of the assistant device, wherein the current activity parameter is determined based on at least one of:(i) accessing a calendar entry of a registered user, or (ii) accessing activity data from an application running on the assistant device or another client device in the environment of the assistant device; or a temporal parameter indicating a current temporal condition. 
Golikov does teach wherein the dynamic contextual parameters include at least one of: a current activity parameter indicating an activity occurring in an environment of the assistant device, wherein the current activity parameter is determined based on at least one of:(i) accessing a calendar entry of a registered user, or (ii) accessing activity data from an application running on the assistant device or another client device in the environment of the assistant device; or a temporal parameter indicating a current temporal condition (Golikov; p. 0116 - deactivating the on-device speech recognition may include deactivating the on-device speech recognition when it is determined to not activate the on-device natural language understanding and/or the fulfillment, and further based on at least a threshold duration of time passing without further voice activity detection and/or further recognized text). Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Sharifi to include wherein the dynamic contextual parameters include at least one of: a current activity parameter indicating an activity occurring in an environment of the assistant device, wherein the current activity parameter is determined based on at least one of:(i) accessing a calendar entry of a registered user, or (ii) accessing activity data from an application running on the assistant device or another client device in the environment of the assistant device; or a temporal parameter indicating a current temporal condition, as taught by Golikov, because those implementations enable interaction of a user with an automated assistant to be initiated and/or guided without the user needing to preface such interaction with utterance of a hot-word and/or with other explicit invocation cue. This enables reduced user input to be provided by the user (at least due to omission of the hot-word or other explicit invocation cue), which directly lessens the duration of the interaction and thereby may reduce time-to-fulfillment and conserve various local processing resources that would otherwise be utilized in a prolonged interaction (Golikov; p. 0011). As per claim 20, Sharifi discloses: A method implemented by one or more processors, the method comprising: processing, at a first time, first values, for dynamic contextual parameters, at the first time (Sharifi; Fig. 2; p. 0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice); in response to the processing at the first time satisfying one or more first conditions (Sharifi; Fig. 2; p. 0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice); automatically adapting particular audio data based assistant processing, performed locally at an assistant device, to a second state and from a first state that is active at the first time, the automatically adapting to the second state from the first state being independent of receiving any explicit user interface input that requests or confirms the automatically adapting to the second state (Sharifi; Fig. 2; p. 0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 
2 if the sound corresponds to a registered user's voice (automatically adapting without any explicit user interface input); see also p. 0046 & 0056 - STT and natural language processing are performed locally); processing, at a second time, second values, wherein the second values are for the dynamic contextual parameters at the second time (Sharifi; p. 0077 - when a sound is detected while automated assistant 120 is in the inactive listening state, automated assistant may store the sound data in a buffer, such as memory buffer 144. If the sound stops and no sound-based event (e.g., hot word or phrase) is detected, then automated assistant 120 may transition back into the inactive listening state. However, if a sound-based event is detected, then automated assistant 120 may transition into the active listening state); in response to the processing at the second time satisfying one or more second conditions (Sharifi; p. 0077 - if a sound-based event is detected; see also p. 0091 - Next, second user 101B says, “Sleuth that,” which in this example operates as hot word(s) that constitute an event to invoke automated assistant 120; see also p. 0095 – “Sleuth that” in this example is a hot phrase that triggers automated assistant 120 to transition from the inactive listening state to the active listening state): automatically adapting the particular audio data based assistant processing, performed locally at the assistant device, to the first state and from the second state that is active at the second time (Sharifi; p. 0077 - However, if a sound- based event is detected, then automated assistant 120 may transition into the active listening state), the automatically adapting to the first state from the second state being independent of receiving any explicit user interface input that requests or confirms the automatically adapting to the first state (Sharifi p. 0077; see also p. 0091-0094 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice (automatically adapting without any explicit user interface input)); wherein the first state is: (a) a first threshold state in which one or more first less restrictive thresholds are utilized for the particular audio data based assistant processing (Sharifi; p. 0109 - if the confidence measure falls below a first threshold, then no action may be taken on data stored in buffer 144); wherein the second state is: (b) a second threshold state in which one or more second more restrictive thresholds are utilized for the particular audio data based assistant processing (Sharifi; p. 0109 - If the confidence measure falls between the first threshold and a higher second threshold, then automated assistant 120 may seek permission from a user before taking responsive action, e.g., by confirming the user's request. If the confidence measure falls between the second threshold and a third higher threshold, automated assistant 120 may take responsive action if only local resources are required... 
If the confidence measure falls above the third threshold, automated assistant 120 may take responsive action without needing further input and without regard to whether local or cloud-based resources are required to fulfill the request); wherein automatically adapting the particular audio data based assistant processing, to the second state and from the first state, causes the particular audio data based assistant processing to switch from utilizing the one or more first less restrictive thresholds to utilizing the one or more second more restrictive thresholds (Sharifi; p. 0109 - If the confidence measure falls between the first threshold and a higher second threshold, then automated assistant 120 may seek permission from a user before taking responsive action, e.g., by confirming the user's request. If the confidence measure falls between the second threshold and a third higher threshold, automated assistant 120 may take responsive action if only local resources are required... If the confidence measure falls above the third threshold, automated assistant 120 may take responsive action without needing further input and without regard to whether local or cloud-based resources are required to fulfill the request); and wherein utilizing the one or more first less restrictive thresholds in the first threshold state and utilizing the one or more second more restrictive thresholds in the second threshold state causes the audio data based assistant processing to be less restrictive in the first threshold state than in the second threshold state (Sharifi; p. 0093 – speaker recognition technology may be used to determine a speaker's identity. If the speaker's identity is successfully determined, and/or it is determined that the speaker has sufficient permissions, then certain hot word(s) may become activated; see also p. 0095 - “Sleuth that” in this example is a hot phrase that triggers automated assistant 120 to transition from the inactive listening state to the active listening state; see also p. 0097 – In Fig. 7, assume once again that first user 101A is speaker-recognizable by automated assistant 120 as registered user… first user 101A invokes assistant; the permissions policy permits the registered user to use certain hot words while others will not be recognized (less restrictive, because it permits the usage of more hot words)). Sharifi, however, fails to disclose wherein the dynamic contextual parameters include at least one of: a current activity parameter indicating an activity occurring in an environment of the assistant device, wherein the current activity parameter is determined based on at least one of:(i) accessing a calendar entry of a registered user, or (ii) accessing activity data from an application running on the assistant device or another client device in the environment of the assistant device; or a temporal parameter indicating a current temporal condition. Golikov does teach wherein the dynamic contextual parameters include at least one of: a current activity parameter indicating an activity occurring in an environment of the assistant device, wherein the current activity parameter is determined based on at least one of:(i) accessing a calendar entry of a registered user, or (ii) accessing activity data from an application running on the assistant device or another client device in the environment of the assistant device; or a temporal parameter indicating a current temporal condition (Golikov; p. 
0116 - deactivating the on-device speech recognition may include deactivating the on-device speech recognition when it is determined to not activate the on-device natural language understanding and/or the fulfillment, and further based on at least a threshold duration of time passing without further voice activity detection and/or further recognized text). Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Sharifi to include wherein the dynamic contextual parameters include at least one of: a current activity parameter indicating an activity occurring in an environment of the assistant device, wherein the current activity parameter is determined based on at least one of:(i) accessing a calendar entry of a registered user, or (ii) accessing activity data from an application running on the assistant device or another client device in the environment of the assistant device; or a temporal parameter indicating a current temporal condition, as taught by Golikov, because those implementations enable interaction of a user with an automated assistant to be initiated and/or guided without the user needing to preface such interaction with utterance of a hot-word and/or with other explicit invocation cue. This enables reduced user input to be provided by the user (at least due to omission of the hot-word or other explicit invocation cue), which directly lessens the duration of the interaction and thereby may reduce time-to-fulfillment and conserve various local processing resources that would otherwise be utilized in a prolonged interaction (Golikov; p. 0011). Claims 12, 13 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Sharifi in view of Golikov and further in view of Krishnan (US PG Pub 20220093101). As per claim 12, Sharifi in view of Golikov disclose: The method of claim 4, upon which claim 12 depends. Sharifi, however, fails to disclose wherein the particular audio data processing is invocation-free speech recognition processing, the invocation-free speech recognition processing comprising: performing speech recognition, on audio data that captures a spoken utterance and using one or more local speech recognition models of the assistant device, to generate a recognition of the spoken utterance; determining, based on processing the recognition, whether the spoken utterance is an assistant command; and causing further assistant processing to be performed based on the spoken utterance being determined to be an assistant command. Krishnan does teach wherein the particular audio data processing is invocation-free speech recognition processing, the invocation-free speech recognition processing comprising: performing speech recognition, on audio data that captures a spoken utterance and using one or more local speech recognition models of the assistant device, to generate a recognition of the spoken utterance; determining, based on processing the recognition, whether the spoken utterance is an assistant command; and causing further assistant processing to be performed based on the spoken utterance being determined to be an assistant command (Krishnan; p. 0045 - the multi-user dialog may allow all audio captured by the device 110 to be processed (at least minimally) by the system to determine if audio is system directed, without requiring the wakeword to be spoken). 
Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Sharifi in view of Golikov to include wherein the particular audio data processing is invocation-free speech recognition processing, the invocation-free speech recognition processing comprising: performing speech recognition, on audio data that captures a spoken utterance and using one or more local speech recognition models of the assistant device, to generate a recognition of the spoken utterance; determining, based on processing the recognition, whether the spoken utterance is an assistant command; and causing further assistant processing to be performed based on the spoken utterance being determined to be an assistant command, as taught by Krishnan, in order to restrict processing where appropriate and only process user information in a manner that ensures compliance with all appropriate laws, regulations, standards, and the like (Krishnan; p. 0034). As per claim 13, Sharifi in view of Golikov and Krishnan disclose: The method of claim 12, wherein, in the partially active state, the particular audio data based assistant processing is suppressed, for any users that are not registered with the assistant device, by causing the further assistant processing to be performed further based on: verifying that the spoken utterance was uttered by one of the one or more registered users (Sharifi; p. 0093 - automated assistant 120 may only transition from the inactive listening state of FIG. 2 to the “STORE SOUND DATA IN BUFFER?” state of FIG. 2 if the sound corresponds to a registered user's voice). As per claim 15, Sharifi in view of Golikov and Krishnan discloses: The method of claim 12, wherein the further assistant processing comprises: causing one or more actions to be performed based on the recognition (Sharifi; p. 0070 - Additionally or alternatively, fulfillment module 124 may be configured to receive, e.g., from intent matcher 135, a user's intent and any slot values provided by the user or determined using other means (e.g., GPS coordinates of the user, user preferences, etc.) and trigger a responsive action. Responsive actions may include, for instance, ordering a good/service, starting a timer, setting a reminder, initiating a phone call, playing media, sending a message, etc. In some such implementations, fulfillment information may include slot values associated with the fulfillment, confirmation responses (which may be selected from predetermined responses in some cases), etc.). Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Sharifi in view of Golikov and further in view of Huang (US PG Pub 20190043507). As per claim 7, Sharifi in view of Golikov discloses: The method of claim 6, upon which claim 7 depends. Sharifi in view of Golikov, however, fails to disclose wherein verifying that the hot word was uttered by one of the one or more registered users comprises: processing, using a text-dependent speaker identification (TDSID) model, at least a portion of the stream of audio data that captures the hot word; and verifying that output, generated using the TDSID model based on the processing, matches a stored TDSID embedding for the one of the one or more registered users. 
Huang does teach wherein verifying that the hot word was uttered by one of the one or more registered users comprises: processing, using a text-dependent speaker identification (TDSID) model, at least a portion of the stream of audio data that captures the hot word; and verifying that output, generated using the TDSID model based on the processing, matches a stored TDSID embedding for the one of the one or more registered users (Huang; p. 0029 - keyphrase detection and text-dependent speaker recognition is applied to a subsequent phrase in addition to waking keyphrase detection and speaker recognition of a waking keyphrase). Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Sharifi in view of Golikov to include wherein verifying that the hot word was uttered by one of the one or more registered users comprises: processing, using a text-dependent speaker identification (TDSID) model, at least a portion of the stream of audio data that captures the hot word; and verifying that output, generated using the TDSID model based on the processing, matches a stored TDSID embedding for the one of the one or more registered users, as taught by Huang, because this has the effect of providing more phonetically-constrained (text-dependent) data for more accurate speaker recognition, and it has been found that for higher security applications, users are more willing to tolerate a slightly longer interaction when provided the perception of greater security anyway (Huang; p. 0029). Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Sharifi in view of Golikov and Krishnan and further in view of Huang. As per claim 14, Sharifi in view of Golikov and Krishnan disclose: The method of claim 13, upon which claim 14 depends. And further, Huang discloses wherein verifying that the hot word was uttered by one of the one or more registered users comprises: processing, using a text-independent speaker identification (TISID) model, at least a portion of the audio data that captures the spoken utterance; and verifying that output, generated using the TISID model based on the processing, matches a stored TISID embedding for the one of the one or more registered users (Huang; p. 0025 - …The segmented speech is then passed to a text-independent speech recognition (TI-SR) unit to perform the TI-SR 110 to determine one or more speaker scores that indicate the likelihood that the command (or entire utterance) was spoken by an authorized user… (hybrid)). Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Sharifi in view of Golikov and Krishnan to include wherein verifying that the hot word was uttered by one of the one or more registered users comprises: processing, using a text-independent speaker identification (TISID) model, at least a portion of the audio data that captures the spoken utterance; and verifying that output, generated using the TISID model based on the processing, matches a stored TISID embedding for the one of the one or more registered users, as taught by Huang, because this has the effect of providing more phonetically-constrained (text-dependent) data for more accurate speaker recognition, and it has been found that for higher security applications, users are more willing to tolerate a slightly longer interaction when provided the perception of greater security anyway (Huang; p. 0029). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
The prior art made of record and not relied upon includes: Maury (US PG Pub 20220148592) discloses examples described herein relate to triggering voice assistant(s) on a network microphone device (NMD). An NMD is a networked computing device that typically includes an arrangement of microphones, such as a microphone array, that is configured to detect sound present in the NMD's environment. Once the voice assistant is triggered, the NMD may start recording voice input as a potential voice command. Within examples, the NMD may operate in a wakewordless mode if certain conditions are met. These conditions may involve detecting user proximity in one of multiple different ranges. For instance, an example NMD may monitor for user proximity in a first range from the playback device via at least one touch-sensitive sensor and/or user line-of-sight in a second range that is further from the playback device than the first range. When either user proximity or user line-of-sight is detected, the NMD may enable the wakewordless mode (Maury; Abstract). Smith (US PG Pub 20200395006) discloses a playback device includes at least one microphone configured to detect sound. The playback device detects sound via the one or more microphones and determines whether (i) the detected sound includes a voice input, (ii) the detected sound excludes background speech, and (iii) the voice input includes a command keyword. In response to the determining, the playback device performs a playback function corresponding to the command keyword (Smith; Abstract).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Rodrigo A Chavez whose telephone number is (571)270-0139. The examiner can normally be reached Monday - Friday 9-6 ET. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at 571-272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. 
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RODRIGO A CHAVEZ/
Examiner, Art Unit 2658

/RICHEMOND DORVIL/
Supervisory Patent Examiner, Art Unit 2658
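To make the disputed limitation easier to follow, the sketch below is a minimal, hypothetical Python illustration of the behavior both sides are describing: audio-processing state that adapts automatically among an inactive state, a "store sound data in buffer" style state, and an active listening state based on dynamic contextual parameters (speaker registration, hot-word events, environmental activity, temporal conditions) rather than explicit user interface input. It is not code from the application, Sharifi, or Golikov; all names (AssistantState, Context, next_state) and the specific transition rules are invented for illustration. The point of contention in the action above is whether conditioning each transition on such parameters is specified in the claims or is merely generic state transitioning; the sketch only shows the general pattern.

from dataclasses import dataclass
from enum import Enum, auto


class AssistantState(Enum):
    INACTIVE = auto()          # audio-based assistant processing largely suppressed
    BUFFERING = auto()         # "store sound data in buffer" style intermediate state
    ACTIVE_LISTENING = auto()  # full local speech processing

@dataclass
class Context:
    """Hypothetical 'dynamic contextual parameters' evaluated at each time step."""
    registered_speaker: bool       # speaker-recognition result
    hot_word_detected: bool        # sound-based event (hot word / phrase)
    activity_in_environment: bool  # e.g., presence sensor, calendar, or app activity
    quiet_hours: bool              # temporal condition

def next_state(state: AssistantState, ctx: Context) -> AssistantState:
    """Adapt the processing state automatically, with no explicit user confirmation."""
    if ctx.quiet_hours and not ctx.hot_word_detected:
        return AssistantState.INACTIVE
    if state is AssistantState.INACTIVE:
        # Only begin buffering audio if the sound matches a registered user's voice.
        if ctx.registered_speaker and ctx.activity_in_environment:
            return AssistantState.BUFFERING
        return AssistantState.INACTIVE
    if state is AssistantState.BUFFERING:
        if ctx.hot_word_detected:
            return AssistantState.ACTIVE_LISTENING
        return AssistantState.INACTIVE  # sound stopped, no event: fall back
    # ACTIVE_LISTENING: drop back once environmental activity subsides.
    if not ctx.activity_in_environment:
        return AssistantState.BUFFERING
    return AssistantState.ACTIVE_LISTENING

if __name__ == "__main__":
    state = AssistantState.INACTIVE
    for ctx in [
        Context(True, False, True, False),   # registered voice heard -> buffer
        Context(True, True, True, False),    # hot word detected -> active listening
        Context(True, False, False, False),  # activity ends -> back to buffering
    ]:
        state = next_state(state, ctx)
        print(state.name)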

Prosecution Timeline

Dec 03, 2021
Application Filed
Jun 06, 2024
Non-Final Rejection — §103
Sep 11, 2024
Response Filed
Dec 04, 2024
Final Rejection — §103
Mar 13, 2025
Request for Continued Examination
Mar 15, 2025
Response after Non-Final Action
Jun 12, 2025
Non-Final Rejection — §103
Sep 15, 2025
Examiner Interview Summary
Sep 15, 2025
Applicant Interview (Telephonic)
Sep 17, 2025
Response Filed
Jan 23, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597430
MULTI-CHANNEL SIGNAL GENERATOR, AUDIO ENCODER AND RELATED METHODS RELYING ON A MIXING NOISE SIGNAL
2y 5m to grant • Granted Apr 07, 2026
Patent 12579984
DATA AUGMENTATION SYSTEM AND METHOD FOR MULTI-MICROPHONE SYSTEMS
2y 5m to grant • Granted Mar 17, 2026
Patent 12541653
ENTERPRISE COGNITIVE SOLUTIONS LOCK-IN AVOIDANCE
2y 5m to grant • Granted Feb 03, 2026
Patent 12542136
DYNAMICALLY CONFIGURING A WARM WORD BUTTON WITH ASSISTANT COMMANDS
2y 5m to grant • Granted Feb 03, 2026
Patent 12531077
METHOD AND APPARATUS IN AUDIO PROCESSING
2y 5m to grant • Granted Jan 20, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 50%
Grant Probability With Interview: 88% (+37.3%)
Median Time to Grant: 3y 5m
PTA Risk: High
Based on 228 resolved cases by this examiner. Grant probability derived from career allow rate.
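
The note above says the grant probability is derived from the career allow rate. As a rough, assumed reconstruction of that arithmetic (not necessarily how this tool computes it), the headline figures are consistent with simple ratios over the examiner's 228 resolved cases, with the interview lift read as percentage points:

granted, resolved = 115, 228                  # from "Career Allow Rate" above
career_allow_rate = granted / resolved
print(f"Career allow rate: {career_allow_rate:.1%}")        # ~50.4%, shown as 50%

with_interview = 0.88                          # "88% With Interview"
interview_lift = 0.373                         # "+37.3% Interview Lift"
implied_without_interview = with_interview - interview_lift
print(f"Implied rate without interview: {implied_without_interview:.1%}")  # ~50.7%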
