Prosecution Insights
Last updated: April 19, 2026
Application No. 18/135,611

HOT-WORD FREE ADAPTATION OF AUTOMATED ASSISTANT FUNCTION(S)

Status: Non-Final OA (§103)
Filed: Apr 17, 2023
Examiner: SERRAGUARD, SEAN ERIN
Art Unit: 2657
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 3 (Non-Final)
Grant Probability: 69% (Favorable)
OA Rounds: 3-4
To Grant: 3y 2m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 69% (92 granted / 134 resolved), above average (+6.7% vs TC avg)
Interview Lift: +33.6% across resolved cases with interview (strong)
Typical Timeline: 3y 2m average prosecution; 43 applications currently pending
Career History: 177 total applications across all art units

Statute-Specific Performance

§101: 9.4% (-30.6% vs TC avg)
§103: 49.7% (+9.7% vs TC avg)
§102: 18.6% (-21.4% vs TC avg)
§112: 19.2% (-20.8% vs TC avg)
Tech Center average figures are estimates; statistics are based on career data from 134 resolved cases.
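For readers who want to check the arithmetic, the following is a minimal Python sketch (not part of the report's tooling) showing how the headline figures above relate to one another. It assumes the "vs TC avg" deltas are simple percentage-point differences; the with/without-interview allowance rates behind the lift are not published here, so the lift is left as a formula.

```python
# Minimal sketch of the arithmetic behind the examiner metrics shown above.
# Assumption: "vs TC avg" deltas are simple percentage-point differences.

granted, resolved = 92, 134                     # career counts from the report
career_allow_rate = granted / resolved          # 0.687 -> displayed as 69%

delta_vs_tc = 0.067                             # "+6.7% vs TC avg"
tc_average_estimate = career_allow_rate - delta_vs_tc   # roughly 62%

def interview_lift(rate_with: float, rate_without: float) -> float:
    """Allowance-rate difference between resolved cases with and without an
    examiner interview; the report shows only the difference (+33.6%), not the
    underlying with/without rates, so they are left as inputs here."""
    return rate_with - rate_without

print(f"career allow rate:    {career_allow_rate:.1%}")    # 68.7%
print(f"estimated TC average: {tc_average_estimate:.1%}")  # ~62.0%
```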

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. All objections/rejections not mentioned in this Office Action have been withdrawn by the Examiner.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12 November 2025 has been entered.

Response to Amendments

Applicant’s amendment filed on 04 December 2025 has been entered. In view of the amendment to the claim(s), the amendment of claim(s) 1, 9, 12, and 15-17 has been acknowledged and entered. After entry of the amendments, claims 1-2, 4-10, and 12-20 remain pending. In view of the amendment to claim(s) 12 and 15-16, the objection to claim(s) 12-16 is withdrawn. In view of the amendment to claim(s) 1, 9, 12, and 15-16, the rejection of claims 1-2, 4-10, and 12-16 under 35 U.S.C. §103 is withdrawn. In view of the amendment to claim(s) 17, the rejection of claims 17-20 under 35 U.S.C. §103 is maintained as to the cited art, as modified in response to the amendments. In light of the amended claims, new grounds for rejection under 35 U.S.C. §103 are provided in the action below.

Response to Arguments

Applicant’s arguments regarding the prior art rejections, see page(s) 11-15 of the Response to Final Office Action dated 11 September 2025, which was received on 04 December 2025 (hereinafter Response and Office Action, respectively), have been fully considered. With respect to the rejection(s) of claim(s) 1, and mutatis mutandis claim 9, under 35 U.S.C. §103 in light of Colmenarez (U.S. Pat. App. Pub. No. 2003/0144844, hereinafter Colmenarez) in view of White (U.S. Pat. App. Pub. No. 2019/0187787, hereinafter White), applicant asserts that Colmenarez and White fail to teach or suggest all limitations of the claims as amended. Applicant’s arguments in light of the amendments to the claims are persuasive. Therefore, the rejections of claims 1 and 9 are withdrawn. With respect to the rejection(s) of claim(s) 17 under 35 U.S.C. §103 in light of White in view of Klein (U.S. Pat. App. Pub. No. 2014/0207446, hereinafter Klein), applicant asserts that Klein fails to teach or suggest “determining that the gesture is assigned to a plurality of responsive actions, wherein a first action, of the plurality of responsive actions comprises altering a visual output of the client device when no audio or audiovisual content is being rendered and wherein a second action, of the plurality of responsive actions comprises altering an audible output of the client device when audio or audiovisual content is being rendered,” as recited in claim 17 as amended. This argument is not persuasive. As mapped in the response below, Klein teaches the above recited limitation. As explained in Klein, “indefinite quantitative inputs made via other modes… may be mapped to definite quantities,” where such inputs include “mid-air gesture inputs made via an image sensor, surface gesture inputs made via a touch sensor, non-gesture depth inputs (e.g. a posture input), non-speech audio inputs, inputs made via other devices (e.g. 
companion devices, remote control devices, other computing devices, etc.), and any other suitable input.” (Klein, ¶ [0027]) Thus, the indefinite quantitative inputs include indefinite quantitative gesture inputs, at least as described above and as would be understood by one having ordinary skill in the art. Klein then explains, at paragraph [0029], both speaking generally and with reference to an example, the performance of two separate actions in response to the input, where the input may be a gesture and the input changes based on the audio context or lack thereof as rendered on the client device. Specifically regarding the first action limitation, Klein teaches wherein a first action, of the plurality of responsive actions comprises altering a visual output of the client device when no audio or audiovisual content is being rendered. As explained in Klein, “different meanings maybe specified to such quantitative inputs in each context in a manner that allows users to habituate to the use of the commands in each context.” With reference to an example, which Klein provides as the first indefinite qualitative speech input of “a little bit” which may signify “50 pixels in a scrolling context.” (Klein, ¶ [0029]) As Klein makes clear, the indefinite qualitative speech input of “a little bit” is not intended to be limiting. As understood in light of the disclosed examples of “mid-air gesture inputs made via an image sensor” and “surface gesture inputs made via a touch sensor,” based on ordinary skill in the art regarding gestures, the phrase “a little bit” has numerous equivalent mid-air gestures (e.g., the “mid-air gesture” of holding one’s pointer finger and thumb apart to indicate “a little bit,” as is well understood in the U.S.) and surface gesture inputs (e.g., the movement of two fingers together or apart, commonly referred to as pinching or stretching, on a touch responsive screen to indicate a relative change of “a little bit”). As such, it is understood that the indefinite qualitative speech input of “a little bit” can be an indefinite qualitative gesture input, as are well known in the art. Further, the portion of the example relied on above from Klein is described in the context of scrolling, where scrolling is generally understood in the context of webpages, and webpages are not audio or audiovisual content in the context of the instant application (see, for example, para [0020] of the instant application, which explains that “the ‘scroll up’ action can be selected when no audio or audiovisual content is being rendered”). Regarding the second action limitation, Klein teaches wherein a second action, of the plurality of responsive actions comprises altering an audible output of the client device when audio or audiovisual content is being rendered. In the context of the same example, the same first indefinite qualitative [gesture] input “may signify 10% in a volume-setting context,” where volume-setting context is when audio or audiovisual content is being rendered, and changing a volume of a “volume-setting context” at a client device by 10% is an altering of an audible output of the client device. (Klein, ¶ [0029]). As such, the rejection is maintained as to the cited references, and the cited embodiments of said references are modified in response to the amendments. Applicant further argues that the rejection(s) of dependent claims 2, 4-8, 10, 12-16, and 18-20 should be withdrawn for at least the same reasons as independent claims 1, 9, and 17. 
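To make the disputed claim 17 mapping easier to follow, here is a minimal, purely illustrative sketch of the context-dependent interpretation the Examiner attributes to Klein ¶ [0029]. The function name, dictionary fields, and gesture label are hypothetical; only the 50-pixel scrolling value and the 10% volume value come from the example quoted above.

```python
# Illustrative sketch only: one indefinite gesture input mapped to two different
# definite actions depending on whether audio/audiovisual content is rendering,
# mirroring the Examiner's reading of Klein paragraph [0029]. Names are hypothetical.

def resolve_indefinite_gesture(gesture: str, content_is_rendering: bool) -> dict:
    """Resolve an indefinite quantitative gesture (e.g., a small pinch meaning
    'a little bit') into a definite, context-dependent responsive action."""
    if gesture != "a_little_bit":
        raise ValueError("unrecognized indefinite input")
    if content_is_rendering:
        # Second action: alter the audible output (volume-setting context, 10%).
        return {"action": "adjust_volume", "amount_percent": 10}
    # First action: alter the visual output (scrolling context, 50 pixels).
    return {"action": "scroll", "amount_pixels": 50}

# The same gesture scrolls when nothing is playing and changes the volume when
# audio or audiovisual content is being rendered.
print(resolve_indefinite_gesture("a_little_bit", content_is_rendering=False))
print(resolve_indefinite_gesture("a_little_bit", content_is_rendering=True))
```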
Regarding claims 2, 4-8, 10, and 12-16, applicant’s arguments in light of the amended claims are persuasive. As such, the rejections of claims 2, 4-8, 10, and 12-16 under 35 U.S.C. §103 are withdrawn. Regarding claims 18-20, applicant’s arguments in light of the amended claims are not persuasive for the same reasons as described above with reference to claim 17. As such, the rejections of claims 18-20 under 35 U.S.C. §103 are maintained, as modified in response to the amendments. However, upon further consideration, new ground(s) of rejection under 35 U.S.C. §103 are made in light of combinations of Colmenarez, White, Klein, and newly cited non-patent literature reference to Chung et al. (Chung, J.S. and Zisserman, A., 2016. Out of time: automated lip sync in the wild. In Asian Conference on Computer Vision (pp. 251-263). Cham: Springer International Publishing; hereinafter Chung). The Applicant has not provided any further statement; the Examiner therefore directs the Applicant to the rationale below.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-2, 4-10, and 12-16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Colmenarez in view of White and Chung.

Regarding claim 1, Colmenarez discloses A method that facilitates hot-word free interaction between a user and an automated assistant, the method implemented by one or more processors and comprising (The systems and methods are described with reference to the speech recognition system of FIG. 1, the speech recognition system including a “processing arrangement”; Colmenarez, ¶ [0015], [0018], [0019]; FIG. 1): receiving, at a client device: a stream of image frames that are based on output from one or more cameras of the client device (The speech recognition system includes “camera 12” which “derive[s] electrical signals that are replicas of the...optical energy incident” on the camera 12, where the “replicas of the… optical energy” is a stream of image frames based on the output from camera 12 {one or more cameras} of the speech recognition system {the client device}; Colmenarez, ¶ [0015]), and audio data detected by one or more microphones of the client device (The speech recognition system further includes a microphone 10, which “derive[s] electrical signals that are replicas of the acoustic... 
energy incident” on the microphone 10, where the “replicas of the acoustic… energy” is includes the audio data detected by microphone 10 {one or more microphones} of the speech recognition system {the client device}; Colmenarez, ¶ [0015]); processing, at the client device, using a locally stored machine learning model (“Acoustic energy detector 16” processes the output data of the microphone 10 to determine whether “acoustic energy above a predetermined threshold is incident on microphone 10” and the “lip motion detector 26” processes the output data of camera 12 to determine whether “lip motion of the speaker speaking into microphone 10” is present.; Colmenarez, ¶ [0016], [0019]) trained to distinguish between: voice activity that co-occurs with mouth movement and is the result of the mouth movement (“The output signal of AND gate 28 drives one shot circuits 30 and 32 in parallel” where “One shot circuits 30 and 32 do not derive any pulses if (1) acoustic energy detector 16 derives a true value while lip motion detector 26 derives a zero value” and where “Face recognizer 50 and speech recognizer 52 are trained” to activate the “speech recognizer 40... only if the face and speech are recognized as being for the same person.” Thus, by determining that lip movement and voice activity correspond with facial recognition and speech recognition, the system is detecting that the voice activity co-occurs with the mouth movement and is the result of the mouth movement, and therefore the “speech recognizer 40 is activated” based on the detection thereof.; Colmenarez, ¶ [0021], [0028]); and voice activity that co-occurs with the mouth movement but is not a result of the mouth movement (the “speech recognizer 40 is activated only if the face,” as detected by the face recognizer 50, “and speech,” as detected by the trained speech recognizer 52, “are recognized as being for the same person.” As such, in light of the acoustic energy detector 16, the lip motion detector 26, the trained face recognizer 50, and the trained speech recognizer 52, the system rejects any speech, even co-occurring speech, where the face providing the lip movement does not correspond to the voice recognition result based on the training {distinguish between... voice activity that is not from the mouth movement, but co-occurs with the mouth movement}.; Colmenarez, ¶ [0021], [0028]), the image frames and the audio data… (“Acoustic energy detector 16” processes the output data of the microphone 10 to determine whether “acoustic energy above a predetermined threshold is incident on microphone 10” and the “lip motion detector 26” processes the output data of camera 12 to determine whether “lip motion of the speaker speaking into microphone 10” is present, and further including the “Face recognizer 50” and the “speech recognizer 52,”.; Colmenarez, ¶ [0016], [0019], [0028]) [to determine] co-occurrence of: mouth movement of a user, captured by one or more of the image frames, and voice activity of the user (“To perform indexing of buffer 24 only in response to utterances by the speaker who is talking into microphone 10, the system illustrated in FIG. 
1 detects at least one facial characteristic associated with speech utterances of the speaker while acoustic energy is incident on microphone 10” and the facial characteristic includes “a signal indicative of lip motion {mouth movement...captured by one or more of the image frames} of the speaker speaking into microphone 10,” where detecting the lip motion of the speaker while “acoustic energy [of the user] is incident on the microphone 10” is determining the co-occurrence of mouth movement of the user and voice activity of the user.; Colmenarez, ¶ [0019], [0021]); determining, at the client device and based on determining that there is the co-occurrence of the mouth movement of the user and the voice activity of the user, to perform one or both of: certain processing of the audio data (“The bi-level output signals of acoustic energy detector 16 and motion detector 26 drive AND gate 28 which derives a bi-level signal having a true value only while the bi-level output signals of detector 16 and 26 both have true values. Thus, AND gate 28 derives a true value only while microphone 10 and camera 12 are responsive to speech utterances by the speaker; at all other times, the output of AND gate 28 has a zero, i.e., not true, value”; Colmenarez, ¶ [0020]); and initiating, at the client device, the certain processing of the audio data and/or the rendering of the at least one human perceptible cue, responsive to determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue (While the AND gate 28 “derives a true value” {...responsive to determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue}, the system performs various processing of the audio data, including storage of “a signal indicative of the speech content of the first speech segment... [and] a signal indicative of the speech content of the last speech segment of that utterance”, which are described with reference to RAM 22, comparison of “the speech segment that buffer 24 derives...[to] the first segment that RAM 22 supplies” and “speech segment that buffer 24 derives... [to] the last segment that RAM 22 supplies” at the comparison circuits 34 and 36, and the processing of “the first through the last speech segments” at the speech recognizer 40, where the RAM 22 {initiating, at the client device, the certain processing of the audio data and/or the rendering of the at least one human perceptible cue}, the comparison circuit 34 and 36, and the speech recognizer 40 are part of the speech recognition system {initiating, at the client device...}; Colmenarez, ¶ [0020]-[0026]). However, Colmenarez fails to expressly recite generate output of the stored machine learning model that indicates whether there is co-occurrence of: mouth movement of a user, captured by one or more of the image frames, and voice activity of the user; and determining, based on the output of the stored machine learning model, whether there is co-occurrence of the mouth movement of the user and the voice activity of the user, and rendering of at least one human perceptible cue via an output component of the client device. White teaches systems and methods for non-verbally engaging a virtual assistant. (White, ¶ [0023]). 
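As a reading aid for the Colmenarez mapping above, the following is a minimal illustrative sketch of the co-occurrence gating the Examiner describes (acoustic energy detector 16, lip motion detector 26, AND gate 28, and face/speech recognizers 50 and 52). The function and variable names are hypothetical, and the boolean summary is an editorial simplification of the Examiner's reading, not Colmenarez's circuit.

```python
# Illustrative summary of the Examiner's reading of Colmenarez: speech
# processing is gated on co-occurring acoustic energy and lip motion (AND gate
# 28) plus the face and voice resolving to the same person (recognizers 50/52).
from typing import Optional

def should_activate_speech_recognizer(acoustic_energy_detected: bool,
                                      lip_motion_detected: bool,
                                      face_identity: Optional[str],
                                      voice_identity: Optional[str]) -> bool:
    # AND gate 28: true only while both the microphone and the camera are
    # responsive to speech utterances by the speaker.
    co_occurrence = acoustic_energy_detected and lip_motion_detected
    # Recognizers 50/52: activate only if face and speech belong to one person.
    same_person = face_identity is not None and face_identity == voice_identity
    return co_occurrence and same_person

# Voice activity that co-occurs with mouth movement but comes from a different
# person (mismatched identities) does not activate the speech recognizer.
print(should_activate_speech_recognizer(True, True, "speaker_a", "speaker_a"))  # True
print(should_activate_speech_recognizer(True, True, "speaker_a", "speaker_b"))  # False
```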
Regarding claim 1, White teaches rendering of at least one human perceptible cue via an output component of the client device (Upon transition from a sleep mode, “the virtual assistant device may provide a response back to the user” such as “in the form of an indicator light or a sound, notifying the user that the virtual assistant device is prepared to engage with the user”; White, ¶ [0035]). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez to incorporate the teachings of White to include rendering of at least one human perceptible cue via an output component of the client device. The related non-verbal engagement systems of White allow the user to interact with local or remote systems using non-verbal and non-tactile attributes of eye-gaze, such as “gaze fixation data, facial recognition data, motion or gesture detection, gaze direction data, head-pose or head-position data, and the like” which can assist “users with certain motor disabilities… to communicate with” automated assistant devices having “eye gaze technology,” as recognized by White. (White, ¶ [0003], [0023]). However, Colmenarez and White fail to expressly recite generate output of the stored machine learning model that indicates whether there is co-occurrence of: mouth movement of a user, captured by one or more of the image frames, and voice activity of the user; and determining, based on the output of the stored machine learning model, whether there is co-occurrence of the mouth movement of the user and the voice activity of the user. Chung teaches systems and methods for determining the audio synchronization between mouth motion and speech. (Chung, ¶ Abstract). Regarding claim 1, Chung teaches processing, at the client device using a locally stored machine learning model (Discloses “a two-stream ConvNet architecture that enables the mapping between the sound and the mouth images to be trained end-to-end from unlabeled data” where the described implementation is “based on the MATLAB toolbox MatConvNet [26] and trained on a NVIDIA Titan X GPU with 12GB memory,” which is consumer grade hardware, and, as indicated “The data preparation pipeline and the network runs significantly faster than real-time on a mid-range laptop (Apple MacBook Pro with NVIDIA GeForce GT 750M graphics)” thus the system was locally trained and stored.; Chung, ¶ Abstract; pg. 255, lines 12-17; pg. 258, lines 17-20) trained to distinguish between (i) voice activity that co-occurs with mouth movement and is a result of the mouth movement and (ii) voice activity that co-occurs with the mouth movement but is not a result of the mouth movement (ConvNet is “a language independent and speaker independent solution to the lip-sync problem” and the described training objective for ConvNet is “that the output of the audio and the video networks are similar for genuine pairs, and different for false pairs” such that the system can be applied for “(i) determining the lip-sync error in videos” which is voice activity that co-occurs with the mouth movement but is not a result of the mouth movement, and “(ii) detecting the speaker in a scene with multiple faces” which is speaker diarization based on detecting voice activity that co-occurs with the mouth movement and is a result of the mouth movement; Chung, ¶ pg. 251, line 14 - pg. 252, line 4; pg. 
254, lines 1-7), the image frames and the audio data to generate output of the stored machine learning model (As shown at FIG. 2, the two-stream ConvNet architecture receives both the audio and video streams simultaneously, and each of these are used to determine the time offset, where “to find the time offset between the audio and the video... the distance is computed between one 5-frame video feature and all audio features in the ±1 s range.” Typical output for the system is shown at FIG. 8.; Chung, ¶ pg. 258, lines 1-7, FIG. 2 and 8) that indicates whether there is co-occurrence of: mouth movement of a user, captured by one or more of the image frames, and voice activity of the user (FIG. 8 depicts the typical output from the system, showing the output as indicating correlation between the audio and lip movement. As indicated above, the time offset, also referred to as the synchronization error, is the measure of co-occurrence of mouth movement of the user, as captured from the video stream, and the voice activity of the user.; Chung, ¶ FIG. 8 (including Figure description)); and determining, based on the output of the stored machine learning model, whether there is co-occurrence of the mouth movement of the user and the voice activity of the user (the output of the ConvNet system (shown, for example, in FIG. 8) can detect both correlated audio (Shown in FIG. 8 left), including shifted correlation (shown in FIG. 8 middle), and also “cases where the person is speaking, but is uncorrelated to the audio” (for example, FIG. 8 right) which is an independent determination for both speaker and correlation of the voice activity, regardless of speaker, to the mouth movement.; Chung, ¶ pg. 258, line 20 - pg. 259, line 10; FIG. 8). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez, as modified by the non-verbal engagement of White, to incorporate the teachings of Chung to include generate output of the stored machine learning model that indicates whether there is co-occurrence of: mouth movement of a user, captured by one or more of the image frames, and voice activity of the user; and determining, based on the output of the stored machine learning model, whether there is co-occurrence of the mouth movement of the user and the voice activity of the user. Chung discloses the use of deep learning for determining audio-video synchronization, for a wide variety of purposes including speaker diarization in the cocktail party scenario, to improve speech recognition and speaker detection, where in the case of “speaker detection and lip reading, our results exceed the state-of-the-art on public datasets” while maintaining a recognized extendibility of the approach to “any problem where it is useful to learn a similarity metric between correlated data in different domains,” which would provide a recognized improvement on and be readily extendable to the similarity based association of the different domains of “speech utterances and at least one facial characteristic associated with the speech utterances” described in Colmenarez, as recognized by Chung. (Chung, ¶ [0003], [0023]; Colmenarez, ¶ [0001]). Regarding claim 2, the rejection of claim 1 is incorporated. Colmenarez, White and Chung disclose all of the elements of the current invention as stated above. 
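As a reading aid for the Chung mapping above, the following is a minimal illustrative sketch of the audio-video offset search the Examiner cites (computing the distance between one video feature and audio features across a ±1 s range). The feature vectors are random stand-ins, not outputs of Chung's two-stream ConvNet, and the function name is hypothetical.

```python
# Illustrative sketch of the sync-offset search the Examiner cites from Chung:
# compare one video feature against audio features at candidate offsets within
# a +/-1 s window and take the offset with the smallest distance.
import numpy as np

def best_av_offset(video_feat, audio_feats_by_offset_ms):
    """Return (offset_ms, distance) minimizing the audio-video feature distance."""
    distances = {off: float(np.linalg.norm(video_feat - feat))
                 for off, feat in audio_feats_by_offset_ms.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]

# A small minimum distance near zero offset suggests the voice activity is a
# result of the visible mouth movement; a uniformly large distance suggests
# speech that merely co-occurs with the mouth movement (the claim 1 distinction).
rng = np.random.default_rng(0)
video = rng.normal(size=256)
audio = {off: video + rng.normal(scale=0.1 + abs(off) / 500, size=256)
         for off in range(-1000, 1001, 40)}   # candidate offsets in milliseconds
print(best_av_offset(video, audio))
```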
Colmenarez further discloses wherein the certain processing of the audio data is initiated responsive to determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue (“AND gate 28... derives a bi-level signal having a true value only while the bi-level output signals of [acoustic energy detector 16] and [motion detector 26] both have true values. Thus, AND gate 28 derives a true value only while microphone 10 and camera 12 are responsive to speech utterances by the speaker; at all other times, the output of AND gate 28 has a zero, i.e., not true, value” where AND gate 28 regulates further processing of the received audio signal {certain processing of the audio data and/or rendering of the at least one human perceptible cue}; Colmenarez, ¶ [0020]-[0021]), and wherein initiating the certain processing of the audio data comprises one or multiple of: initiating local automatic speech recognition of the audio data at the client device (The AND gate 28 “derives a true value” based on both {...responsive to determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue}, the system performs various processing of the audio data, including storage of “a signal indicative of the speech content of the first speech segment... [and] a signal indicative of the speech content of the last speech segment of that utterance”, which are described with reference to RAM 22, comparison of “the speech segment that buffer 24 derives...[to] the first segment that RAM 22 supplies” and “speech segment that buffer 24 derives... [to] the last segment that RAM 22 supplies” at the comparison circuits 34 and 36, and the processing of “the first through the last speech segments” at the speech recognizer 40, where the RAM 22 {initiating, at the client device, the certain processing of the audio data and/or the rendering of the at least one human perceptible cue}, the comparison circuit 34 and 36, and the speech recognizer 40 are part of the speech recognition system {initiating, at the client device...}; Colmenarez, ¶[0020]-[0026]). However, Colmenarez fails to expressly recite initiating transmission of the audio data to a remote server associated with the automated assistant, and initiating transmission of recognized text, from the local automatic speech recognition, to the remote server. The relevance of White is described above with relation to claim 1. Regarding claim 2, White teaches initiating transmission of the audio data to a remote server associated with the automated assistant (“the disclosed system may rely on... remote databases... to formulate an appropriate response. 
This may be accomplished by utilizing … remote databases stored on or associated with servers 118, 120, 122” where “the user input data and virtual assistant device 108 response data may be stored in a remote database (e.g., on servers 118, 120, 122).”; White, ¶ [0026], [0029]), and initiating transmission of recognized text, from the local automatic speech recognition, to the remote server (“In other example aspects, the device may receive previously stored data in the form of natural language (e.g., a textual transcript, audio file, and the like), which may represent a prior dialogue between a user and another virtual assistant device” and “the user input data and virtual assistant device 108 response data may be stored in a remote database (e.g., on servers 118, 120, 122).” As certain processing of the speech, including speech recognition for generating recognized text, in Colmenarez, is initiated in response to co-occurring lip movement and voice {e.g., the true value of AND gate 28} alongside face and voice recognition, the transmission of a text transcript generated from said processing is also initiated in response to said co-occurrence.; White, ¶ [0021], [0037]). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez, to incorporate the teachings of White to include initiating transmission of the audio data to a remote server associated with the automated assistant, and initiating transmission of recognized text, from the local automatic speech recognition, to the remote server. The related non-verbal engagement systems of White allow the user to interact with local or remote systems using non-verbal and non-tactile attributes of eye-gaze, such as “gaze fixation data, facial recognition data, motion or gesture detection, gaze direction data, head-pose or head-position data, and the like” which can assist “users with certain motor disabilities… to communicate with” automated assistant devices having “eye gaze technology,” as recognized by White. (White, ¶ [0003], [0023]). Regarding claim 4, the rejection of claim 1 is incorporated. Colmenarez, White and Chung disclose all of the elements of the current invention as stated above. However, Colmenarez fails to expressly recite further comprising: determining, at the client device, a distance of the user relative to the client device, wherein determining the distance of the user relative to the client device is based on one or both of: one or more of the image frames, and additional sensor data from an additional sensor of the client device; wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on the distance of the user relative to the client device. The relevance of White is described above with relation to claim 1. Regarding claim 4, White teaches further comprising: determining, at the client device, a distance of the user relative to the client device, wherein determining the distance of the user relative to the client device is based on one or both of: one or more of the image frames, and additional sensor data from an additional sensor of the client device (“virtual assistant device” may be equipped with “a high-resolution infrared camera {is based on... 
additional sensor data from an additional sensor of the client device} [and] a high resolution still camera device {is based on...one or more of image frames} to receive contextual data, including spatial topology, where “After entering the room where personal computer 106 is located, personal computer may receive input indicating that the user has entered the room (e.g., through infrared signals, changes in spatial topology, mobile device signals, and the like) {determining, at the client device, the distance of the user relative to the client device…}”; White, ¶ [0023], [0029], [0036]); wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on the distance of the user relative to the client device (“The indicator box 704A may represent a... proximity of the user to an engagement box 706A, which represents an outer boundary (or threshold) for engaging the virtual assistant” where positioning of the user within the engagement threshold may be determined based on “the head position and other spatial topological data” as relative to the virtual assistant position in the environment.; White, ¶ [0081]). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez, to incorporate the teachings of White to include further comprising: determining, at the client device, a distance of the user relative to the client device, wherein determining the distance of the user relative to the client device is based on one or both of: one or more of the image frames, and additional sensor data from an additional sensor of the client device; wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on the distance of the user relative to the client device. The related non-verbal engagement systems of White allow the user to interact with local or remote systems using non-verbal and non-tactile attributes of eye-gaze, such as “gaze fixation data, facial recognition data, motion or gesture detection, gaze direction data, head-pose or head-position data, and the like” which can assist “users with certain motor disabilities… to communicate with” automated assistant devices having “eye gaze technology,” as recognized by White. (White, ¶ [0003], [0023]). Regarding claim 5, the rejection of claim 4 is incorporated. Colmenarez, White and Chung disclose all of the elements of the current invention as stated above. However, Colmenarez fails to expressly recite wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible further based on the distance of the user relative to the client device comprises: determining that the distance of the user, relative to the client device satisfies a threshold. The relevance of White is described above with relation to claim 1. Regarding claim 5, White teaches wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible further based on the distance of the user relative to the client device comprises: determining that the distance of the user, relative to the client device satisfies a threshold (“The indicator box 704A may represent a... 
proximity of the user to an engagement box 706A, which represents an outer boundary (or threshold) for engaging the virtual assistant” where positioning of the user within the engagement threshold may be determined based on “the head position and other spatial topological data” as relative to the virtual assistant position in the environment.; White, ¶ [0081]). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez to incorporate the teachings of White to include wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible further based on the distance of the user relative to the client device comprises: determining that the distance of the user, relative to the client device satisfies a threshold. The related non-verbal engagement systems of White allow the user to interact with local or remote systems using non-verbal and non-tactile attributes of eye-gaze, such as “gaze fixation data, facial recognition data, motion or gesture detection, gaze direction data, head-pose or head-position data, and the like” which can assist “users with certain motor disabilities… to communicate with” automated assistant devices having “eye gaze technology,” as recognized by White. (White, ¶ [0003], [0023]). Regarding claim 6, the rejection of claim 4 is incorporated. Colmenarez, White and Chung disclose all of the elements of the current invention as stated above. However, Colmenarez fails to expressly recite wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible further based on the distance of the user relative to the client device comprises: determining that the distance of the user relative to the client device is closer, to the client device, than one or more previously determined distances of the user relative to the client device. The relevance of White is described above with relation to claim 1. Regarding claim 6, White teaches wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible further based on the distance of the user relative to the client device comprises: determining that the distance of the user relative to the client device is closer, to the client device, than one or more previously determined distances of the user relative to the client device (As illustrated in FIG. 7A and 7B, the distance of the user as compared to the device is determined both while outside the proximity threshold and once the user is within the proximity threshold, where “When alignment is achieved between indicator box 704B and engagement box 706B,” meaning that once the user has transitioned from being outside the proximity threshold as compared to the client device {one or more previously determined distances of the user relative to the client device} to being inside the proximity threshold as compared to the client device {the distance of the user relative to the client device is closer, to the client device, than...}, “the virtual assistant search bar 708B may illuminate” or “an indicator noise may play {an audible indication of the determination}”; White, ¶ [0080]-[0083], FIGS. 7A and 7B). 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez to incorporate the teachings of White to include wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible further based on the distance of the user relative to the client device comprises: determining that the distance of the user relative to the client device is closer, to the client device, than one or more previously determined distances of the user relative to the client device. The related non-verbal engagement systems of White allow the user to interact with local or remote systems using non-verbal and non-tactile attributes of eye-gaze, such as “gaze fixation data, facial recognition data, motion or gesture detection, gaze direction data, head-pose or head-position data, and the like” which can assist “users with certain motor disabilities… to communicate with” automated assistant devices having “eye gaze technology,” as recognized by White. (White, ¶ [0003], [0023]). Regarding claim 7, the rejection of claim 1 is incorporated. Colmenarez, White and Chung disclose all of the elements of the current invention as stated above. Colmenarez further discloses further comprising: determining, at the client device and based on one or more of the image frames, that a gaze of the user is directed to the client device (The system further discloses “face recognizer 50” being trained to recognize whether a first speaker “is looking directly into camera 12” or not.; Colmenarez, ¶ [0032]-[0033]); wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on determining that the gaze of the user is directed to the client device (“The speech recognition system of FIG. 1 can be modified by the arrangement illustrated in FIG. 2 so that the speech recognition system will not respond to speech utterances {determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue} when the speaker is not looking at camera 12 {based on determining that the gaze of the user is directed to the client device}”; Colmenarez, ¶ [0027]). Regarding claim 8, the rejection of claim 1 is incorporated. Colmenarez, White and Chung disclose all of the elements of the current invention as stated above. However, Colmenarez fails to expressly recite further comprising: determining, at the client device and based on one or more of the image frames, that a body pose of the user is directed to the client device; wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on determining that the body pose of the user is directed to the client device. The relevance of White is described above with relation to claim 1. 
Regarding claim 8, White teaches further comprising: determining, at the client device and based on one or more of the image frames, that a body pose of the user is directed to the client device (“disclosed system may receive non-verbal input data, including but not limited to, eye-gaze data [and] attributes of eye-gaze data” where “An ‘attribute’ of eye-gaze data may include but is not limited to an eye-gaze signal, a physical gesture, a body position, a head pose (or position), a facial feature, a facial expression, or any combination thereof.”; White, ¶ [0023], [0026]); wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on determining that the body pose of the user is directed to the client device (The “virtual assistant device 108 may receive the at least one eye-gaze attribute” where the at least one eye-gaze attribute includes body position, the virtual assistant device 108 then “determine[s] that a user desires to engage with the virtual assistant” based on the “input [being] directed to the virtual assistant device,”; White, ¶ [0023], [0028], [0044]). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez to incorporate the teachings of White to include further comprising: determining, at the client device and based on one or more of the image frames, that a body pose of the user is directed to the client device; wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on determining that the body pose of the user is directed to the client device. The related non-verbal engagement systems of White allow the user to interact with local or remote systems using non-verbal and non-tactile attributes of eye-gaze, such as “gaze fixation data, facial recognition data, motion or gesture detection, gaze direction data, head-pose or head-position data, and the like” which can assist “users with certain motor disabilities… to communicate with” automated assistant devices having “eye gaze technology,” as recognized by White. (White, ¶ [0003], [0023]). Regarding claim 9, Colmenarez discloses A client device comprising: at least one camera; at least one microphone; at least one display; one or more processors executing locally stored instructions to: (The systems and methods described with reference to the speech recognition system described with reference to FIG. 1, the speech recognition system including a “processing arrangement” for performing the functions of the speech recognition system, a “camera 12,” a “microphone 10,” and a “computer display”; Colmenarez, ¶ [0015], [0018], [0019], [0026]; FIG. 1) receive: a stream of image frames that are based on output from the at least one camera (The speech recognition system includes “camera 12” which “derive[s] electrical signals that are replicas of the...optical energy incident” on the camera 12, where the “replicas of the… optical energy” is a stream of image frames based on the output from camera 12 {one or more cameras} of the speech recognition system {the client device}; Colmenarez, ¶ [0015]), and audio data detected by the at least one microphone (The speech recognition system further includes a microphone 10, which “derive[s] electrical signals that are replicas of the acoustic... 
energy incident” on the microphone 10, where the “replicas of the acoustic… energy” is includes the audio data detected by microphone 10 {one or more microphones} of the speech recognition system {the client device}; Colmenarez, ¶ [0015]); process, using a locally stored machine learning model (The system, as modified by the arrangement in FIG. 2, includes “acoustic energy detector 16” processes the output data of the microphone 10 to determine whether “acoustic energy above a predetermined threshold is incident on microphone 10” and the “lip motion detector 26,” which function in conjunction with a “Face recognizer 50” and a “speech recognizer 52,” stored as part of the system {locally stored}, which “are trained {machine learned model} during at least one training period to recognize the face and speech of more than one person” using video output from the camera 12 and audio output from the microphone 10, respectively; Colmenarez, ¶ [0020], [0027], [0028], [0032], FIGS. 1 and 2) trained to distinguish between: voice activity that co-occurs with mouth movement and is the result of the mouth movement (“The output signal of AND gate 28 drives one shot circuits 30 and 32 in parallel” where “One shot circuits 30 and 32 do not derive any pulses if (1) acoustic energy detector 16 derives a true value while lip motion detector 26 derives a zero value” and where “Face recognizer 50 and speech recognizer 52 are trained” to activate the “speech recognizer 40... only if the face and speech are recognized as being for the same person.” Thus, by determining that lip movement and voice activity correspond with facial recognition and speech recognition, the system is detecting that the voice activity co-occurs with the mouth movement and is the result of the mouth movement, and therefore the “speech recognizer 40 is activated” based on the detection thereof.; Colmenarez, ¶ [0021], [0028]); and voice activity that co-occurs with the mouth movement but is not a result of the mouth movement (the “speech recognizer 40 is activated only if the face,” as detected by the face recognizer 50, “and speech,” as detected by the trained speech recognizer 52, “are recognized as being for the same person.” As such, in light of the acoustic energy detector 16, the lip motion detector 26, the trained face recognizer 50, and the trained speech recognizer 52, the system rejects any speech, even co-occurring speech, where the face providing the lip movement does not correspond to the voice recognition result based on the training {distinguish between... voice activity that is not from the mouth movement, but co-occurs with the mouth movement}.; Colmenarez, ¶ [0021], [0028]), the image frames and the audio data… (“Acoustic energy detector 16” processes the output data of the microphone 10 to determine whether “acoustic energy above a predetermined threshold is incident on microphone 10” and the “lip motion detector 26” processes the output data of camera 12 to determine whether “lip motion of the speaker speaking into microphone 10” is present, and further including the “Face recognizer 50” and the “speech recognizer 52,”.; Colmenarez, ¶ [0016], [0019], [0028]) [to determine] co-occurrence of: mouth movement of a user, captured by one or more of the image frames, and voice activity of the user (“To perform indexing of buffer 24 only in response to utterances by the speaker who is talking into microphone 10, the system illustrated in FIG. 
1 detects at least one facial characteristic associated with speech utterances of the speaker while acoustic energy is incident on microphone 10” and the facial characteristic includes “a signal indicative of lip motion {mouth movement...captured by one or more of the image frames} of the speaker speaking into microphone 10,” where detecting the lip motion of the speaker while “acoustic energy [of the user] is incident on the microphone 10” is determining the co-occurrence of mouth movement of the user and voice activity of the user.; Colmenarez, ¶ [0019], [0021]); determine, based on determining the co-occurrence of the mouth movement of the user and the voice activity of the user, to perform one or both of: certain processing of the audio data (“The bi-level output signals of acoustic energy detector 16 and motion detector 26 drive AND gate 28 which derives a bi-level signal having a true value only while the bi-level output signals of detector 16 and 26 both have true values. Thus, AND gate 28 derives a true value only while microphone 10 and camera 12 are responsive to speech utterances by the speaker; at all other times, the output of AND gate 28 has a zero, i.e., not true, value” which, as combined with recognition from both the face recognizer 50 and speech recognizer 52, result in certain processing of the audio; Colmenarez, ¶ [0020]-[0021]); and initiate the certain processing of the audio data and/or the rendering of the at least one human perceptible cue, responsive to determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue (In response to the AND gate 28 “derives a true value” {...responsive to determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue} and the face recognizer 50 and speech recognizer 52 produce corresponding outputs, the system performs various processing of the audio data, including storage of “a signal indicative of the speech content of the first speech segment... [and] a signal indicative of the speech content of the last speech segment of that utterance”, which are described with reference to RAM 22, comparison of “the speech segment that buffer 24 derives...[to] the first segment that RAM 22 supplies” and “speech segment that buffer 24 derives... [to] the last segment that RAM 22 supplies” at the comparison circuits 34 and 36, and the processing of “the first through the last speech segments” at the speech recognizer 40, where the RAM 22 {initiating, at the client device, the certain processing of the audio data and/or the rendering of the at least one human perceptible cue}, the comparison circuit 34 and 36, and the speech recognizer 40 are part of the speech recognition system {initiating, at the client device...}; Colmenarez, ¶ [0020]-[0026]). However, Colmenarez fails to expressly recite generate output of the stored machine learning model that indicates whether there is co-occurrence of: mouth movement of a user, captured by one or more of the image frames, and voice activity of the user; and determine, based on the output of the stored machine learning model, whether there is co-occurrence of the mouth movement of the user and the voice activity of the user, and rendering of at least one human perceptible cue via an output component of the client device. The relevance of White is described above with relation to claim 1. 
Regarding claim 9, White teaches rendering of at least one human perceptible cue via an output component of the client device (Upon transition from a sleep mode, “the virtual assistant device may provide a response back to the user” such as “in the form of an indicator light or a sound, notifying the user that the virtual assistant device is prepared to engage with the user”; White, ¶ [0035]). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez to incorporate the teachings of White to include rendering of at least one human perceptible cue via an output component of the client device. The related non-verbal engagement systems of White allow the user to interact with local or remote systems using non-verbal and non-tactile attributes of eye-gaze, such as “gaze fixation data, facial recognition data, motion or gesture detection, gaze direction data, head-pose or head-position data, and the like” which can assist “users with certain motor disabilities… to communicate with” automated assistant devices having “eye gaze technology,” as recognized by White. (White, ¶ [0003], [0023]). However, Colmenarez and White fail to expressly recite generate output of the stored machine learning model that indicates whether there is co-occurrence of: mouth movement of a user, captured by one or more of the image frames, and voice activity of the user; and determine, based on the output of the stored machine learning model, whether there is co-occurrence of the mouth movement of the user and the voice activity of the user. The relevance of Chung is described above with relation to claim 1. Regarding claim 9, Chung teaches process, using a locally stored machine learning model (Discloses “a two-stream ConvNet architecture that enables the mapping between the sound and the mouth images to be trained end-to-end from unlabeled data” where the described implementation is “based on the MATLAB toolbox MatConvNet [26] and trained on a NVIDIA Titan X GPU with 12GB memory,” which is consumer grade hardware, and, as indicated “The data preparation pipeline and the network runs significantly faster than real-time on a mid-range laptop (Apple MacBook Pro with NVIDIA GeForce GT 750M graphics)” thus the system was locally trained and stored.; Chung, ¶ Abstract; pg. 255, lines 12-17; pg. 258, lines 17-20) trained to distinguish between (i) voice activity that co-occurs with mouth movement and is a result of the mouth movement and (ii) voice activity that co-occurs with the mouth movement but is not a result of the mouth movement (ConvNet is “a language independent and speaker independent solution to the lip-sync problem” and the described training objective for ConvNet is “that the output of the audio and the video networks are similar for genuine pairs, and different for false pairs” such that the system can be applied for “(i) determining the lip-sync error in videos” which is voice activity that co-occurs with the mouth movement but is not a result of the mouth movement, and “(ii) detecting the speaker in a scene with multiple faces” which is speaker diarization based on detecting voice activity that co-occurs with the mouth movement and is a result of the mouth movement; Chung, ¶ pg. 251, line 14 - pg. 252, line 4; pg. 254, lines 1-7), the image frames and the audio data to generate output of the stored machine learning model (As shown at FIG. 
2, the two-stream ConvNet architecture receives both the audio and video streams simultaneously, and each of these are used to determine the time offset, where “to find the time offset between the audio and the video... the distance is computed between one 5-frame video feature and all audio features in the ±1 s range.” Typical output for the system is shown at FIG. 8.; Chung, ¶ pg. 258, lines 1-7, FIG. 2 and 8) that indicates whether there is co-occurrence of: mouth movement of a user, captured by one or more of the image frames, and voice activity of the user (FIG. 8 depicts the typical output from the system, showing the output as indicating correlation between the audio and lip movement. As indicated above, the time offset, also referred to as the synchronization error, is the measure of co-occurrence of mouth movement of the user, as captured from the video stream, and the voice activity of the user.; Chung, ¶ FIG. 8 (including Figure description)); and determining, based on the output of the stored machine learning model, whether there is co-occurrence of the mouth movement of the user and the voice activity of the user (the output of the ConvNet system (shown, for example, in FIG. 8) can detect both correlated audio (Shown in FIG. 8 left), including shifted correlation (shown in FIG. 8 middle), and also “cases where the person is speaking, but is uncorrelated to the audio” (for example, FIG. 8 right) which is an independent determination for both speaker and correlation of the voice activity, regardless of speaker, to the mouth movement.; Chung, ¶ pg. 258, line 20 - pg. 259, line 10; FIG. 8). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez, as modified by the non-verbal engagement of White, to incorporate the teachings of Chung to include generate output of the stored machine learning model that indicates whether there is co-occurrence of: mouth movement of a user, captured by one or more of the image frames, and voice activity of the user; and determine, based on the output of the stored machine learning model, whether there is co-occurrence of the mouth movement of the user and the voice activity of the user. Chung discloses the use of deep learning for determining audio-video synchronization, for a wide variety of purposes including speaker diarization in the cocktail party scenario, to improve speech recognition and speaker detection, where in the case of “speaker detection and lip reading, our results exceed the state-of-the-art on public datasets” while maintaining a recognized extendibility of the approach to “any problem where it is useful to learn a similarity metric between correlated data in different domains,” which would provide a recognized improvement on and be readily extendable to the similarity based association of the different domains of “speech utterances and at least one facial characteristic associated with the speech utterances” described in Colmenarez, as recognized by Chung. (Chung, ¶ [0003], [0023]; Colmenarez, ¶ [0001]). Regarding claim 10, the rejection of claim 9 is incorporated. Claim 10 is substantially the same as claim 2 and is therefore rejected under the same rationale as above. Regarding claim 12, the rejection of claim 9 is incorporated. Colmenarez, White and Chung disclose all of the elements of the current invention as stated above. 
However, Colmenarez fails to expressly recite wherein one or more of the processors, in executing the locally stored instructions, are further to: determine a distance of the user relative to the client device, wherein determining the distance of the user relative to the client device is based on one or both of: one or more of the image frames, and additional sensor data from an additional sensor of the client device wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on the distance of the user relative to the client device. The relevance of White is described above with relation to claim 1. Regarding claim 12, White teaches wherein one or more of the processors, in executing the locally stored instructions, are further to: determine a distance of the user relative to the client device, wherein determining the distance of the user relative to the client device is based on one or both of: one or more of the image frames, and additional sensor data from an additional sensor of the client device (“virtual assistant device” may be equipped with “a high-resolution infrared camera {is based on... additional sensor data from an additional sensor of the client device} [and] a high resolution still camera device {is based on...one or more of image frames} to receive contextual data, including spatial topology, where “After entering the room where personal computer 106 is located, personal computer may receive input indicating that the user has entered the room (e.g., through infrared signals, changes in spatial topology, mobile device signals, and the like) {determining, at the client device, the distance of the user relative to the client device…}”; White, ¶ [0023], [0029], [0036]) wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on the distance of the user relative to the client device. (“The indicator box 704A may represent a... proximity of the user to an engagement box 706A, which represents an outer boundary (or threshold) for engaging the virtual assistant” where positioning of the user within the engagement threshold may be determined based on “the head position and other spatial topological data” as relative to the virtual assistant position in the environment.; White, ¶ [0081]). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez to incorporate the teachings of White to include wherein one or more of the processors, in executing the locally stored instructions, are further to: determine a distance of the user relative to the client device, wherein determining the distance of the user relative to the client device is based on one or both of: one or more of the image frames, and additional sensor data from an additional sensor of the client device wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on the distance of the user relative to the client device. 
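As a minimal sketch of the distance-based gating mapped for claims 12 and 13 above (not White's or the application's implementation): an estimate derived from the image frames and/or an additional sensor is fused and compared to a threshold before the certain processing of audio data or the rendering of the cue is performed. The fusion rule and ENGAGEMENT_THRESHOLD_M are assumptions.

```python
# Hypothetical distance gate: values and fusion strategy are assumptions.
from typing import Optional

ENGAGEMENT_THRESHOLD_M = 2.0   # hypothetical engagement boundary, in meters

def fuse_distance(image_estimate_m: Optional[float],
                  sensor_estimate_m: Optional[float]) -> Optional[float]:
    """Combine a camera-derived estimate (e.g., apparent face size in the image frames)
    with an additional-sensor estimate (e.g., infrared / spatial topology), if present."""
    available = [d for d in (image_estimate_m, sensor_estimate_m) if d is not None]
    return min(available) if available else None

def should_adapt(image_estimate_m: Optional[float],
                 sensor_estimate_m: Optional[float]) -> bool:
    """True when the estimated distance satisfies the threshold condition."""
    d = fuse_distance(image_estimate_m, sensor_estimate_m)
    return d is not None and d <= ENGAGEMENT_THRESHOLD_M

# e.g. should_adapt(image_estimate_m=1.4, sensor_estimate_m=None) -> True
```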
The related non-verbal engagement systems of White allow the user to interact with local or remote systems using non-verbal and non-tactile attributes of eye-gaze, such as “gaze fixation data, facial recognition data, motion or gesture detection, gaze direction data, head-pose or head-position data, and the like” which can assist “users with certain motor disabilities… to communicate with” automated assistant devices having “eye gaze technology,” as recognized by White. (White, ¶ [0003], [0023]). Regarding claim 13, the rejection of claim 12 is incorporated. Colmenarez, White and Chung disclose all of the elements of the current invention as stated above. However, Colmenarez fails to expressly recite wherein in determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue further based on the distance of the user relative to the client device, one or more of the processors are further to: determine that the distance of the user, relative to the client device satisfies a threshold. The relevance of White is described above with relation to claim 1. Regarding claim 13, White teaches wherein in determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue further based on the distance of the user relative to the client device, one or more of the processors are further to: determine that the distance of the user, relative to the client device satisfies a threshold. (“The indicator box 704A may represent a... proximity of the user to an engagement box 706A, which represents an outer boundary (or threshold) for engaging the virtual assistant” where positioning of the user within the engagement threshold may be determined based on “the head position and other spatial topological data” as relative to the virtual assistant position in the environment.; White, ¶ [0081]). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez to incorporate the teachings of White to include wherein in determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue further based on the distance of the user relative to the client device, one or more of the processors are further to: determine that the distance of the user, relative to the client device satisfies a threshold. The related non-verbal engagement systems of White allow the user to interact with local or remote systems using non-verbal and non-tactile attributes of eye-gaze, such as “gaze fixation data, facial recognition data, motion or gesture detection, gaze direction data, head-pose or head-position data, and the like” which can assist “users with certain motor disabilities… to communicate with” automated assistant devices having “eye gaze technology,” as recognized by White. (White, ¶ [0003], [0023]). Regarding claim 14, the rejection of claim 12 is incorporated. Colmenarez, White and Chung disclose all of the elements of the current invention as stated above.
However, Colmenarez fails to expressly recite wherein in determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue further based on the distance of the user relative to the client device, one or more of the processors are configured to: determine that the distance of the user relative to the client device is closer, to the client device, than one or more previously determined distances of the user relative to the client device. The relevance of White is described above with relation to claim 1. Regarding claim 14, White teaches wherein in determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue further based on the distance of the user relative to the client device, one or more of the processors are configured to: determine that the distance of the user relative to the client device is closer, to the client device, than one or more previously determined distances of the user relative to the client device. (As illustrated in FIG. 7A and 7B, the distance of the user as compared to the device is determined both while outside the proximity threshold and once the user is within the proximity threshold, where “When alignment is achieved between indicator box 704B and engagement box 706B,” meaning that once the user has transitioned from being outside the proximity threshold as compared to the client device {one or more previously determined distances of the user relative to the client device} to being inside the proximity threshold as compared to the client device {the distance of the user relative to the client device is closer, to the client device, than...}, “the virtual assistant search bar 708B may illuminate” or “an indicator noise may play {an audible indication of the determination}”; White, ¶ [0080]-[0083], FIGS. 7A and 7B). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez to incorporate the teachings of White to include wherein in determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue further based on the distance of the user relative to the client device, one or more of the processors are configured to: determine that the distance of the user relative to the client device is closer, to the client device, than one or more previously determined distances of the user relative to the client device. The related non-verbal engagement systems of White allow the user to interact with local or remote systems using non-verbal and non-tactile attributes of eye-gaze, such as “gaze fixation data, facial recognition data, motion or gesture detection, gaze direction data, head-pose or head-position data, and the like” which can assist “users with certain motor disabilities… to communicate with” automated assistant devices having “eye gaze technology,” as recognized by White. (White, ¶ [0003], [0023]). Regarding claim 15, the rejection of claim 9 is incorporated. Colmenarez, White and Chung disclose all of the elements of the current invention as stated above.
Colmenarez further discloses wherein one or more of the processors, in executing the locally stored instructions, are further to: determine, based on one or more of the image frames, that a gaze of the user is directed to the client device (The system further discloses “face recognizer 50” being trained to recognize whether a first speaker “is looking directly into camera 12” or not.; Colmenarez, ¶ [0032]-[0033]); wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on determining that the gaze of the user is directed to the client device. (“The speech recognition system of FIG. 1 can be modified by the arrangement illustrated in FIG. 2 so that the speech recognition system will not respond to speech utterances {determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue} when the speaker is not looking at camera 12 {based on determining that the gaze of the user is directed to the client device}”; Colmenarez, ¶ [0027]). Regarding claim 16, the rejection of claim 9 is incorporated. Colmenarez, White and Chung disclose all of the elements of the current invention as stated above. However, Colmenarez fails to expressly recite wherein one or more of the processors, in executing the locally stored instructions, are further to: determine, based on one or more of the image frames, that a body pose of the user is directed to the client device; wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on determining that the body pose of the user is directed to the client device. The relevance of White is described above with relation to claim 1. Regarding claim 16, White teaches wherein one or more of the processors, in executing the locally stored instructions, are further to: determine, based on one or more of the image frames, that a body pose of the user is directed to the client device (“disclosed system may receive non-verbal input data, including but not limited to, eye-gaze data [and] attributes of eye-gaze data” where “An ‘attribute’ of eye-gaze data may include but is not limited to an eye-gaze signal, a physical gesture, a body position, a head pose (or position), a facial feature, a facial expression, or any combination thereof.”; White, ¶ [0023], [0026]); wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on determining that the body pose of the user is directed to the client device. (The “virtual assistant device 108 may receive the at least one eye-gaze attribute” where the at least one eye-gaze attribute includes body position, the virtual assistant device 108 then “determine[s] that a user desires to engage with the virtual assistant” based on the “input [being] directed to the virtual assistant device,”; White, ¶ [0023], [0028], [0044]). 
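The gaze- and body-pose-directed conditions mapped for claims 15 and 16 above can be illustrated with a simple pose-threshold check. This is a hypothetical sketch rather than anything recited in Colmenarez or White: the angle tolerances, and the idea of thresholding yaw/pitch from some upstream pose estimator, are assumptions.

```python
# Hypothetical "directed at the device" test from head and torso pose angles.
GAZE_YAW_TOL_DEG = 15.0      # assumed: how far the head may turn left/right
GAZE_PITCH_TOL_DEG = 10.0    # assumed: how far the head may tilt up/down
TORSO_YAW_TOL_DEG = 30.0     # assumed: body-orientation tolerance

def gaze_directed_at_device(head_yaw_deg: float, head_pitch_deg: float) -> bool:
    """Treat a near-frontal head pose, relative to the camera axis, as device-directed gaze."""
    return abs(head_yaw_deg) <= GAZE_YAW_TOL_DEG and abs(head_pitch_deg) <= GAZE_PITCH_TOL_DEG

def body_pose_directed_at_device(torso_yaw_deg: float) -> bool:
    """Treat a torso roughly facing the camera as a body pose directed to the device."""
    return abs(torso_yaw_deg) <= TORSO_YAW_TOL_DEG

def gate_processing(head_yaw_deg: float, head_pitch_deg: float, torso_yaw_deg: float) -> bool:
    """Perform the certain processing / render the cue only when either signal is present."""
    return (gaze_directed_at_device(head_yaw_deg, head_pitch_deg)
            or body_pose_directed_at_device(torso_yaw_deg))
```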
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speech recognition activation systems of Colmenarez to incorporate the teachings of White to include wherein one or more of the processors, in executing the locally stored instructions, are further to: determine, based on one or more of the image frames, that a body pose of the user is directed to the client device; wherein determining to perform the certain processing of the audio data and/or the rendering of the at least one human perceptible cue is further based on determining that the body pose of the user is directed to the client device. The related non-verbal engagement systems of White allow the user to interact with local or remote systems using non-verbal and non-tactile attributes of eye-gaze, such as “gaze fixation data, facial recognition data, motion or gesture detection, gaze direction data, head-pose or head-position data, and the like” which can assist “users with certain motor disabilities… to communicate with” automated assistant devices having “eye gaze technology,” as recognized by White. (White, ¶ [0003], [0023]). Claim(s) 17-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over White in view of Klein. Regarding claim 17, White discloses A method that facilitates hot-word free and touch-free gesture interaction between a user and an automated assistant, the method implemented by one or more processors and comprising (“a method for non-verbally engaging a virtual assistant” as performed by a virtual assistant device; White, ¶ [0032]): receiving, at a client device, a stream of image frames that are based on output from one or more cameras of the client device (“the input data received and/or the processing results may be stored locally {receiving}” at the client device where input data includes “currently captured data of classified color eye images {image frames}” from “a high-resolution still camera {based on the output from one or more cameras}” and where the camera “may be attached to, or in communication with, a virtual assistant device”; White, ¶ [0049], [0058]); processing, at the client device, the image frames of the stream using at least one trained machine learning model (“the at least one machine-learning algorithm associated with the engagement system may become more familiar with user-specific non-verbal inputs” where “the machine-learning algorithm may utilize data captured from a high-resolution still camera and compare previous data of classified color eye images with the currently captured data of classified color eye images.”; White, ¶ [0049]) stored locally on the client device (“processed by a machine-learning algorithm … operating within the engagement system,” where the “engagement system may be deployed locally” on the virtual assistant device {stored locally on the client device}”; White, ¶ [0043], [0030]) to detect occurrence of: a gaze of a user that is directed toward the client device (a virtual assistant device can use “gaze tracking (where the eye tracking hardware is able to detect specific locations focused on by eye gaze, such as a virtual assistant icon on a user interface)”; White, ¶ [0033], [0034]); determining, based on detecting the occurrence of the gaze of the user, to generate a response to a gesture of the user that is captured by one or more of the image frames of the stream (“the virtual assistant device may be in a ‘sleep’ (i.e., inactive) mode prior to 
receiving the eye-gaze signals... [where] upon receiving at least one eye-gaze signal” and the receipt of a gesture, such as a wave, the system “may engage a user based on the receipt of [the] gesture”; White, ¶ [0034]-[0036]); generating the response to the gesture of the user, generating the response comprising: determining the gesture of the user based on processing of the one or more of the image frames of the stream (The system can receive “the user’s physical gesture of pointing” as a non-verbal input based on “data captured” using a camera “and compare previous data” for the non-verbal inputs “with the currently captured data”; White, ¶ [0064], [0049]), and generating the response based on the gesture of the user and based on content being rendered by the client device at a time of the gesture, (In one example, the user may ask “What is that?” where “the user’s physical gesture of pointing to the screen and the screen contents (e.g., a series of screenshots may be captured and processed by the engagement system).” where the physical gesture and the screenshots are used by the system to resolve the ambiguity and prepare a response to the user.; White, ¶ [0064], [0049]) wherein generating the response based on the gesture of the user and based on the content being rendered by the client device at the time of the gestures comprises (The response generated in the example above is generated based on the pointing of the user and the contents displayed on the screen, as captured in the screenshot.; White, ¶ [0064], [0049]); selecting... a single responsive action, (The system selects the single responsive action of “responding accordingly” to the dialogue, in light of the gesture; White, ¶ [0005], [0064]) wherein selecting the single responsive action is based on the content being rendered by the client device at the time of the gesture (The single responsive action of “responding accordingly” to the user’s exclamation of “what is that” alongside the user pointing at the screen, with a response based on the video playing on the screen {based on the content being rendered by the client device…} which occurs based on the gesture, the dialogue, and the contemporaneous portion of the video {at the time of the gesture}; White, ¶ [0064]); generating the response to cause performance of the selected single responsive action, (“The response determination engine 312 may consider the processed input data results in determining how best to respond to the input {generating the response to cause the performance of...}” to “respond accordingly” based on “user’s verbal input,... the user’s physical gesture of pointing to the screen and the screen contents (e.g., a series of screenshots may be captured and processed by the engagement system)” thus a selected single responsive action; White, ¶ [0064]); and effectuating the response at the client device (The system “receive the dialogue and promptly activate[s], [and] respond[s] to the user accordingly,” thus, effectuating the response at the client.; White, ¶ [0064]). 
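The selection logic recited in amended claim 17, as mapped above and addressed via Klein below, can be sketched in a few lines. This is not White's code: the gesture and action names are invented, and only the "scroll up when no audio or audiovisual content is being rendered" pairing follows the instant application's para [0020] example cited in the rejection below.

```python
# Hypothetical dispatch: one gesture assigned to a plurality of responsive actions,
# with a single action selected based on the content being rendered at the time of
# the gesture.
from dataclasses import dataclass

@dataclass
class RenderState:
    audio_or_av_playing: bool    # True while audio or audiovisual content is rendered

# a gesture assigned to a plurality of responsive actions (one per rendering context)
GESTURE_ACTIONS = {
    "swipe_up": {
        "no_media": "scroll_up",   # first action: alters the visual output
        "media": "volume_up",      # second action: alters the audible output
    },
}

def select_single_action(gesture: str, state: RenderState) -> str:
    """Select, from the plurality of actions assigned to the gesture, the single
    responsive action matching the content being rendered at the time of the gesture."""
    context = "media" if state.audio_or_av_playing else "no_media"
    return GESTURE_ACTIONS[gesture][context]

# e.g. select_single_action("swipe_up", RenderState(audio_or_av_playing=True)) -> "volume_up"
#      with nothing playing it returns "scroll_up".
```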
However, White fails to expressly recite determining that the gesture is assigned to a plurality of responsive actions, wherein a first action, of the plurality of responsive actions comprises altering a visual output of the client device when no audio or audiovisual content is being rendered and wherein a second action, of the plurality of responsive actions comprises altering an audible output of the client device when audio or audiovisual content is being rendered; selecting, from the plurality of responsive actions assigned to the gesture, a single responsive action, wherein generating the response to cause performance of the selected single responsive action is a result of selecting the single responsive action from the plurality of responsive actions assigned to the gesture. Klein teaches systems and methods for clarifying indefinite quantitative inputs. (Klein, ¶ [0003]). Regarding claim 17, Klein teaches determining that the gesture is assigned to a plurality of responsive actions, (“The above-described embodiments may allow a wide range of quantitative actions to be addressed via a relatively small, consistent set of indefinite quantitative inputs, whether speech, gesture, or other,” which includes a set of “indefinite quantitative [gesture] inputs” for the “wide range of quantitative actions to be addressed”; Klein, ¶ [0029]) wherein a first action, of the plurality of responsive actions comprises altering a visual output of the client device when no audio or audiovisual content is being rendered (“different meanings may be specified to such quantitative inputs in each context in a manner that allows users to habituate to the use of the commands in each context,” where, in one example, a first indefinite quantitative gesture input may signify “50 pixels in a scrolling context,” where scrolling is generally understood in the context of webpages, and webpages are not audio or audiovisual content in the context of the instant application (see, for example, para [0020] of the instant application, which explains that “the ‘scroll up’ action can be selected when no audio or audiovisual content is being rendered”.); Klein, ¶ [0029]) and wherein a second action, of the plurality of responsive actions comprises altering an audible output of the client device when audio or audiovisual content is being rendered (In the context of the same example, the same first indefinite quantitative gesture input “may signify 10% in a volume-setting context,” where volume-setting context is when audio or audiovisual content is being rendered, and changing a volume of a “volume-setting context” at a client device by 10% is an altering of an audible output of the client device; Klein, ¶ [0029]); selecting, from the plurality of responsive actions assigned to the gesture, a single responsive action, (“For example, FIG. 2 shows a user making a compound input comprising both a speech input having an indefinite quantitative term (“this much”), and also an indefinite quantitative gesture in which the user modifies the spoken term ‘this much’ by holding his hands a distance apart,” where the indefinite quantitative gesture refers to the “wide range of quantitative actions {a plurality of responsive actions assigned to the gesture}” which may be deemed responsive to said gesture, where the system “determin[es] a mapping of the gesture input to a definite quantity {selecting...a single responsive action}”; Klein, ¶ [0018], [0029]) wherein generating the response to cause performance of the selected single responsive action is a result of selecting the single responsive action from the plurality of responsive actions assigned to the gesture (“the speech input system may disambiguate the speech input by detecting the reference to the contextual gesture, detecting the gesture input itself, and then determining a mapping of the gesture input to a definite quantity” such as, in relation to a user indicating quantity by holding their hands apart by a specific width, “by determining how far apart the user’s hands are relative to the total span of the user’s reach {selecting the single responsive action from the plurality of responsive actions assigned to the gesture}, and then using the relative distance to identify a definite quantity mapped to that difference {wherein generating the response to cause performance of the selected single responsive action is a result of}”; Klein, ¶ [0018]). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the non-verbal activation systems of White to incorporate the teachings of Klein to include determining that the gesture is assigned to a plurality of responsive actions, wherein a first action, of the plurality of responsive actions comprises altering a visual output of the client device when no audio or audiovisual content is being rendered and wherein a second action, of the plurality of responsive actions comprises altering an audible output of the client device when audio or audiovisual content is being rendered; selecting, from the plurality of responsive actions assigned to the gesture, a single responsive action, wherein generating the response to cause performance of the selected single responsive action is a result of selecting the single responsive action from the plurality of responsive actions assigned to the gesture. The related input disambiguation systems of Klein can clarify indefinite quantitative inputs, such as “indefinite speech input” or “one or more non-speech input modes, such as gesture inputs,” which enables the system to provide increased functionality with regards to ambiguous commands and more effectively provide a desired response in light of said ambiguities, as recognized by Klein. (Klein, ¶ [0002]-[0003], and [0015]). Regarding claim 18, the rejection of claim 17 is incorporated. White and Klein disclose all of the elements of the current invention as stated above.
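The indefinite-quantity mapping drawn from Klein above can be sketched as follows. This is a minimal sketch, not Klein's implementation: the per-context scales echo the quoted example values (pixels for a scrolling context, percent for a volume-setting context), but the specific scale factors, the half-reach calibration, and all names are assumptions.

```python
# Hypothetical resolution of an indefinite gesture magnitude (relative hand separation)
# to a definite quantity whose scale depends on the current context.
CONTEXT_SCALE = {
    "scrolling": (100.0, "pixels"),   # assumed: full reach ~ 100 px, so half reach ~ 50 px
    "volume":    (20.0, "percent"),   # assumed: full reach ~ 20 %, so half reach ~ 10 %
}

def resolve_indefinite_gesture(hand_separation_m: float,
                               reach_span_m: float,
                               context: str) -> tuple:
    """Map relative hand separation (0..1 of the user's total reach) to a definite
    quantity in the units appropriate to the current context."""
    relative = max(0.0, min(1.0, hand_separation_m / reach_span_m))
    scale, unit = CONTEXT_SCALE[context]
    return relative * scale, unit

# e.g. resolve_indefinite_gesture(0.8, 1.6, "scrolling") -> (50.0, "pixels")
#      resolve_indefinite_gesture(0.8, 1.6, "volume")    -> (10.0, "percent")
```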
White further discloses further comprising: determining, at the client device, a distance of the user relative to the client device, wherein determining the distance of the user relative to the client device is based on one or both of: one or more of the image frames, and additional sensor data from an additional sensor of the client device (“virtual assistant device” may be equipped with “a high-resolution infrared camera {is based on... additional sensor data from an additional sensor of the client device} [and] a high resolution still camera device {is based on...one or more of image frames} to receive contextual data, including spatial topology, where “After entering the room where personal computer 106 is located, personal computer may receive input indicating that the user has entered the room (e.g., through infrared signals, changes in spatial topology, mobile device signals, and the like) {determining, at the client device, the distance of the user relative to the client device…}”; White, ¶ [0023], [0029], [0036]); and wherein determining to generate the response to the gesture of the user is further based on a magnitude of the distance of the user relative to the client device (“The indicator box 704A may represent a... proximity of the user to an engagement box 706A, which represents an outer boundary (or threshold) for engaging the virtual assistant” where positioning of the user within the engagement threshold may be determined based on “the head position and other spatial topological data” as relative to the virtual assistant position in the environment.; White, ¶ [0081]). Regarding claim 19, the rejection of claim 17 is incorporated. White and Klein disclose all of the elements of the current invention as stated above. White further discloses further comprising: determining, based on processing of one or more of the image frames locally at the client device, that the user is a recognized user (“face recognition technology {...based on processing of one or more image frames...} may allow the virtual assistant engagement system to discern when a particular user {determining...that the user is a recognized user} desires to engage with the virtual assistant,” where the system may process the input “locally, remotely, or using a combination of both.”; White, ¶ [0078], [0026]); wherein determining to generate the response to the gesture of the user is further based on determining that the user is a recognized user (“ the virtual assistant engagement system may be receiving multiple different dialogues from various people within the room, but once the engagement system detects the face of the user (e.g., owner) of the virtual assistant device, the engagement system may focus on that user’s facial expressions in addition to any dialog received from the user.”; White, ¶ [0078]). Regarding claim 20, the rejection of claim 19 is incorporated. White and Klein disclose all of the elements of the current invention as stated above. However, White fails to expressly recite wherein determining to generate the response to the gesture of the user is further based on: determining that the user is the recognized user, and determining that the same recognized user initiated providing of the content being rendered by the client device. The relevance of Klein is described above with relation to claim 17. 
Regarding claim 20, Klein teaches wherein determining to generate the response to the gesture of the user is further based on: determining that the user is the recognized user (Discloses a first command such as “skip forward a little” which provides disambiguating information which is “stored as a part of a user profile 314 for each user” thus determining that the user is the recognized user, such that the appropriate user profile is affected by the disambiguating information.; Klein, ¶ [0020]), and determining that the same recognized user initiated providing of the content being rendered by the client device (In the above example, “a user trying to skip through a video content item may specify a series of commands, such as ‘skip forward a little,’ followed sometime later by ‘a little more.’ The first command specifies a definite action (skip forward) but an indefinite quantity (‘a little’), while the follow-up command specifies no action and an indefinite quantity (‘a little more’), but implicitly refers to the previous command specifying an action.” In the context of the above described hand gesture for holding the hands apart to indicate an amount, when used to clarify the phrase “a little more,” the system determines the user is the same user who provided the command which initiated providing the content being rendered by the client device, both for association with the phrase “skip forward a little” and for proper storage of the collected disambiguating information..; Klein, ¶ [0013], [0018], [0020]). It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the non-verbal activation systems of White to incorporate the teachings of Klein to include wherein determining to generate the response to the gesture of the user is further based on: determining that the user is the recognized user, and determining that the same recognized user initiated providing of the content being rendered by the client device. The related input disambiguation systems of Klein can clarify indefinite quantitative inputs, such as “indefinite speech input” or “one or more non-speech input modes, such as gesture inputs,” which enables the system to provide increased functionality with regards to ambiguous commands and more effectively provide a desired response in light of said ambiguities, as recognized by Klein. (Klein, ¶ [0002]-[0003], and [0015]). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Non-patent literature to Boutellaa et al. (Boutellaa, E., Boulkenafet, Z., Komulainen, J. and Hadid, A., 2016. Audiovisual synchrony assessment for replay attack detection in talking face biometrics. Multimedia Tools and Applications, 75(9), pp.5329-5343.) discloses systems and methods for audiovisual speech synchrony detection as a liveness check for talking face verification systems including the use of space-time auto-correlation of gradients (STACOG) for measuring the audiovisual synchrony. Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sean E. Serraguard whose telephone number is (313)446-6627. The examiner can normally be reached 07:00-17:00 M-F. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. 
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /Sean E Serraguard/ Patent Examiner, Art Unit 2657

Prosecution Timeline

Apr 17, 2023: Application Filed
Sep 27, 2024: Non-Final Rejection — §103
Dec 30, 2024: Response after Non-Final Action
Dec 30, 2024: Response Filed
May 30, 2025: Response Filed
Sep 08, 2025: Final Rejection — §103
Nov 12, 2025: Response after Non-Final Action
Dec 04, 2025: Request for Continued Examination
Dec 16, 2025: Response after Non-Final Action
Feb 07, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603095
Stereo Audio Signal Delay Estimation Method and Apparatus
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12598250
SYSTEMS AND METHODS FOR COHERENT AND TIERED VOICE ENROLLMENT
Granted Apr 07, 2026 (2y 5m to grant)
Patent 12597429
PACKET LOSS CONCEALMENT
Granted Apr 07, 2026 (2y 5m to grant)
Patent 12512093
Sensor-Processing Systems Including Neuromorphic Processing Modules and Methods Thereof
Granted Dec 30, 2025 (2y 5m to grant)
Patent 12505835
HOME APPLIANCE AND SERVER
Granted Dec 23, 2025 (2y 5m to grant)
Based on the examiner's 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 69%
With Interview (+33.6%): 99%
Median Time to Grant: 3y 2m
PTA Risk: High
Based on 134 resolved cases by this examiner. Grant probability derived from career allow rate.
