DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
This communication is responsive to the applicant’s amendment dated 9/16/2025. The applicant amended claims 1, 2, 18, and 23.
Response to Arguments
Applicant’s arguments (see Remarks, pg. 10, line 1 – pg. 12, line 16) with respect to claims 1-16 and 18-24 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
First, the applicant argues that Sztuk fails to teach “receiving, by an attention tracking component, a non-audio input from a user based at least in part on a gaze of the user directed at a visual focal point proximate to the user, wherein the gaze indicates an intention to interact with the conversational agent system, wherein the visual focal point is an avatar representing a conversational agent associated with the conversational agent system, wherein the conversational agent is an agent of the attention tracking component configured to continuously listen for a wake word, and wherein the avatar is integrated with an input device of the attention tracking component configured to detect the gaze of the user”. Given the amendments to the claims, a new ground of rejection is provided below.
Next, the applicant argues that “there would have been no motivation to combine Kienzle and Sztuk because they address fundamentally different problems and achieve different results”. The examiner respectfully disagrees. While Kienzle is directed to detecting and responding to natural human movements and conversational queries and Sztuk operates in an augmented/virtual reality (AR/VR) environment, both references track a user’s attention. Therefore, the examiner maintains that there is motivation to combine Kienzle and Sztuk. As a result, the 35 U.S.C. 103 rejection is maintained.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-6, 8-16, 18-20, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Kienzle et al. US 20180046851 A1 (hereinafter Kienzle) in view of Sztuk et al. US 20200209957 A1 (hereinafter Sztuk) and further in view of Lovitt et al. US 20180366118 A1 (hereinafter Lovitt).
Regarding claims 1, 18, and 23, Kienzle teaches a method for tracking user attention in a conversational agent system; a system for tracking user attention; and a non-transitory, computer-readable medium encoded with computer-executable instructions that, when executed on a computing device, cause the computing device to carry out a method for tracking user attention in a conversational agent system, the method comprising:
receiving, by an attention tracking component, a non-audio input from a user based at least in part on a gaze of the user directed at a visual focal point proximate to the user, wherein the gaze indicates an intention to interact with the conversational agent system (FIG. 1 [0027] “the individual whose head 140, eyes and hand 142 are captured by the gaze detectors 150 and gesture detectors 154…it may be possible using the collected signals to determine the time at which a particular gesture was made, and/or to arrange events such as a head or neck movement (a nod or shake of the head), a torso movement (such as a bend of the body towards or away from some object), a change of gaze direction, and a vocalized query in temporal order”, examiner interprets 140 and 142 as non-audio input from the user; [0028] “The gaze detectors 150 may capture information regarding the directions in which the individual's eyes are pointing at various points in time in the depicted embodiment. In some embodiments, the gaze detectors may also capture specific types of eye movements such as smooth pursuit (in which the eye follows a moving visual target), voluntary saccades (in which the eye rapidly moves between fixation points), and/or vergence (in which the angle between the orientation of the two eyes is changed to maintain single binocular vision with respect to a particular set of objects)”,
the examiner interprets the gaze detectors to capture visual information proximate to the user indicating an intention to interact with the system; [0032] “In the depicted embodiment, one or more command processing devices (CPDs) 185 may be responsible for analyzing the collected signals from the various sources to generate responses to the command/queries issued by the individual”, examiner interprets 185 as both the attention tracking and speech processing component);
receiving, by a speech processing component, audio input from the user at a first time, the first time being substantially a same time as when the non-audio input is received (FIG. 1 [0029] “The command/query detectors 152 may capture voiced communications emanating from the individual such as the depicted query “What was that?” 144 in the depicted embodiment”; FIG. 3, [0039] “signals collected over a rolling window 360 of the previous five seconds are buffered, and can be used to respond to queries/commands which may refer to objects or scenes encountered or viewed during the buffered signal window”, examiner interprets the buffered signal window to be a time window of 0-5 seconds between when non-audio and audio signals can be received. Examiner interprets T2 to be the first time and T1 to be when the non-audio signal is received, which is substantially the same time as T2);
determining, by a results processing component, that the attention tracking component received the non-audio input at the first time when the audio input was received indicating the intention to interact with the conversational agent system without the use of a wake word (FIG. 3, 310A, 310B, 360 [0039] “FIG. 3 illustrates an example timeline showing periods during which signals may be buffered in order to respond to queries directed to objects which may no longer be visible at the time that the queries are processed, according to at least some embodiments. Elapsed time increases from left to right along timeline 305. In the depicted example, signals collected over a rolling window 360 of the previous five seconds are buffered, and can be used to respond to queries/commands which may refer to objects or scenes encountered or viewed during the buffered signal window”; [0040] “Multimodal signal analysis may enable the command processor to determine that at time T1 (approximately one second after T0), the individual whose signals are being analyzed had a gaze direction D1 (which was in the direction of a llama), a physical position P1 (close to the llama), and had made a gesture G1 (e.g. a pointing gesture) towards the llama”, examiner interprets 310A as the non-audio input at T1 and 310B as the audio input received at T2; FIG. 2, 222B, [0037] “Respective sets of internal-facing cameras and microphones (IFCMs) 222, such as IFCN 222A-222D, may be configured to capture movements from the occupants”, [0006] “In some cases, the operation may simply comprise naming the selected object—e.g., if the command comprises the voiced query “What was that?””, examiner interprets the cited voice query as an intention to interact with the system without using a wake word);
and performing, by the results processing component, at least one actionable command identified within the audio input, based upon the determination that the attention tracking component received the non-audio input at substantially the same time as when the audio input was received and a determination that the received audio input includes the at least one actionable command (FIG. 9, 925, [0029] “The command/query detectors 152 may capture voiced communications emanating from the individual such as the depicted query “What was that?” 144 in the depicted embodiment. Command/query interfaces which are not voice-based may also or instead be used in some embodiments—e.g., a command may be issued via a touch-screen interface or the like. In much of the subsequent discussion, the term “command” may be considered to subsume the term “query” with respect to the interactions originating at the individual and directed to the components responsible for responding to the interaction”; [0074] “The particular object with respect to which the operation is performed may be selected based at least in part on the interest/relevance score assigned to it in some embodiments. Feedback regarding the performed operation(s)—e.g., whether the target object was selected correctly or not—may be collected in some embodiments and used to improve the functioning and/or performance of the system over time. In one embodiment, if and/when the individual whose command or query was processed indicates that the system chose an incorrect object as the target object of interest, one or more additional objects of interest may be identified (e.g., from the original list of candidates or from a newly-generated list of candidates), and the requested operation may be performed in order on the additional objects until the command/query response is acceptable or the command/query is abandoned by the individual or the system”, [0039] “signals collected over a rolling window 360 of the previous five seconds are buffered, and can be used to respond to queries/commands which may refer to objects or scenes encountered or viewed during the buffered signal window”, examiner interprets T1 of FIG. 3 to be when the non-audio signal is received).
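For clarity of the mapping above, the following is a minimal illustrative sketch of the buffered-window correlation described in Kienzle [0039]-[0040], in which a gaze event (non-audio input at T1) and an utterance (audio input at T2) received within the same rolling window are treated as an intention to interact without a wake word. The sketch is not drawn from Kienzle or from the claims; the class name, the helper function, and the five-second window value are illustrative assumptions only.

    from collections import deque

    WINDOW_SECONDS = 5.0  # illustrative rolling-buffer length, per the cited five-second window

    def parse_actionable_command(utterance: str):
        """Illustrative stand-in for command extraction: treats a question or a
        simple imperative (e.g., "What was that?") as an actionable command."""
        text = utterance.strip()
        return text if text.endswith("?") or text.lower().startswith(("show", "park")) else None

    class AttentionCorrelator:
        """Illustrative only: correlates a gaze-at-focal-point event (non-audio input
        at T1) with an utterance (audio input at T2) received at substantially the
        same time, so a command can be executed without a wake word."""

        def __init__(self):
            self.gaze_times = deque()  # timestamps of buffered non-audio (gaze) inputs

        def record_gaze(self, t1: float) -> None:
            # Buffer the non-audio input and drop entries older than the rolling window.
            self.gaze_times.append(t1)
            while self.gaze_times and t1 - self.gaze_times[0] > WINDOW_SECONDS:
                self.gaze_times.popleft()

        def on_audio(self, t2: float, utterance: str):
            # Act only if a buffered gaze event falls within the window around T2.
            if any(abs(t2 - t1) <= WINDOW_SECONDS for t1 in self.gaze_times):
                return parse_actionable_command(utterance)
            return None

    # Example: gaze at T1 = 1.0 s, query at T2 = 2.0 s; the command is executed without a wake word.
    correlator = AttentionCorrelator()
    correlator.record_gaze(1.0)
    print(correlator.on_audio(2.0, "What was that?"))  # -> "What was that?"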
Kienzle fails to teach wherein the visual focal point is an avatar representing a conversational agent associated with the conversational agent system, wherein the conversational agent is an agent of the attention tracking component configured to continuously listen for a wake word, and wherein the avatar is integrated with an input device of the attention tracking component configured to detect the gaze of the user.
However, Sztuk teaches wherein the visual focal point is an avatar representing a conversational agent associated with the conversational agent system (FIG. 1, 136, 170, 100, 112 [0026] “The VR generated image may represent multiple users and, in particular, may include an image 136 representing Bob. Herein, the image 136 representing Bob is referred to as Bob's avatar 136. Bob's avatar 136 may be a still image or a dynamic image, an icon, a graphic representation, an animated image, etc.”, examiner interprets 136 as the avatar that represents a conversational agent (112) in the visual focal point (170) in the conversational agent system (100)),
and wherein the avatar is integrated with an input device of the attention tracking component configured to detect the gaze of the user (FIG. 1, 128, 161, 142, [0028] “the first wearable display 128 of the AR/VR system 100 includes an eye tracking system 142, which collects data about the eyes of the first user, and provides the obtained data to an attention monitor 161”, examiner interprets 128 as the input device and the combination of 142 and 161 as the attention/gaze tracking component. [0030] “The attention monitor 161 synchronizes the information obtained by the eye tracking system 142 and the information related to the image currently displayed on the electronic display 121 to identify whether Ann looks at Bob, e.g. looks at Bob directly in AR applications, or looks at Bob's avatar 136 in VR applications. In FIG. 1, a line 170 indicates a particular direction of Ann's gaze when Ann looks at Bob's avatar 136.”).
Kienzle and Sztuk are considered analogous to the claimed invention because both are in the same field of tracking a user’s attention. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems designed to detect and respond to natural human movements and conversational queries of Kienzle with the technique of using an avatar as a visual focal point taught by Sztuk in order to notify a user about attention from another user in an augmented reality/virtual reality (AR/VR) system (see Sztuk [0006]).
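To illustrate the kind of gaze-to-avatar test performed by Sztuk's attention monitor (eye tracking system 142 and attention monitor 161), the sketch below checks whether a gaze direction points at a target, such as an avatar, within a small angular tolerance. This is not Sztuk's implementation; the function name and the five-degree threshold are assumptions for illustration only.

    import math

    def is_gazing_at(gaze_dir, eye_pos, target_pos, max_angle_deg=5.0):
        """Illustrative check (not taken from Sztuk): True if the gaze direction
        points at the target (e.g., an avatar) within max_angle_deg degrees."""
        to_target = tuple(t - e for t, e in zip(target_pos, eye_pos))
        dot = sum(g * v for g, v in zip(gaze_dir, to_target))
        norm = math.hypot(*gaze_dir) * math.hypot(*to_target)
        if norm == 0.0:
            return False
        angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
        return angle <= max_angle_deg

    # Example: the user's gaze is nearly aligned with the direction to the avatar.
    print(is_gazing_at((0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (0.02, 0.0, 2.0)))  # -> True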
Kienzle in view of Sztuk fails to teach wherein the conversational agent is an agent of the attention tracking component configured to continuously listen for a wake word.
However, Lovitt teaches wherein the conversational agent is an agent of the attention tracking component configured to continuously listen for a wake word ([0054], [0084] “in some implementations, participants can invoke the services of the virtual assistant during the virtual session 600, either with a trigger phrase or a directed gaze at a virtual assistant avatar. For example, an utterance may be spoken by the first participant (represented by first participant avatar 605a) “Hey Cortana, show us the sales for this quarter”, examiner interprets trigger phrases as wake words and the virtual assistant avatar as the conversational agent).
Kienzle in view of Sztuk and Lovitt are considered analogous to the claimed invention because all are in the same field of digital communication systems. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the attention tracking systems of Kienzle in view of Sztuk with the technique of using conversational agents to listen for wake words taught by Lovitt in order to integrate a virtual assistant into a spoken conversation session (see Lovitt [Abstract]).
Regarding claim 2, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 2 depends.
Additionally, Kienzle teaches wherein receiving the non-audio input comprises determining that the user gazed at the visual focal point while in physical proximity to the visual focal point ([0028] “The gaze detectors 150 may capture information regarding the directions in which the individual's eyes are pointing at various points in time in the depicted embodiment”; [0062] “the functionality of the command processor may be implemented using a distributed combination of local and remote (with respect to proximity to the individual(s) whose commands are being processed) computing resources in at least some embodiments…the gathering of the gaze and gesture signals and the query/command signals may be performed within a vehicle occupied by the individual”; [0040] “Multimodal signal analysis may enable the command processor to determine that at time T1 (approximately one second after T0), the individual whose signals are being analyzed had a gaze direction D1 (which was in the direction of a llama), a physical position P1 (close to the llama), and had made a gesture G1 (e.g. a pointing gesture) towards the llama”).
Additionally, Lovitt teaches wherein the visual focal point corresponds to a smart speaker (FIG. 5, 510, [0071] “spoken conversation session 500 in which multiple participants 504a, 504b, and 504c are together in a single location (for example, a conference room) in which they can speak directly to one another and interact with a virtual assistant via a virtual assistant interface device 510”, examiner interprets 510 as a smart speaker; [0001] “Virtual assistants, such as Siri™ Google Now™, Amazon Echo™, and Cortana™, are examples of a shift in human computer interaction”).
Regarding claim 3, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 3 depends.
Additionally, Kienzle teaches wherein receiving the non-audio input comprises receiving physical input to a user interface element of the attention tracking component ([0025] “Command and/or queries may be detected using signals other than voice/speech in some embodiments—e.g., sign language may be used for a command, or a touch screen interface may be used to indicate at least a portion of a command”; [0028] “A number of different types of gestures may be detected in the depicted embodiment, including hand or finger pointing gestures, head nods or turns, body bends, eyebrow or forehead movements, and so on”).
Regarding claim 4, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 4 depends.
Additionally, Kienzle teaches transmitting, by the attention tracking component, to the results processing component, an indication of the time when the attention tracking component received the non-audio input (FIG. 1, [0032] “In the depicted embodiment, one or more command processing devices (CPDs) 185 may be responsible for analyzing the collected signals from the various sources to generate responses to the command/queries issued by the individual”; FIG. 9, 913 [0027] “At least some of the signal detectors may store timestamps or other timing information as well as the raw signals themselves—e.g., it may be possible using the collected signals to determine the time at which a particular gesture was made, and/or to arrange events such as a head or neck movement (a nod or shake of the head), a torso movement (such as a bend of the body towards or away from some object), a change of gaze direction, and a vocalized query in temporal order”).
Regarding claim 5, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 5 depends.
Additionally, Kienzle teaches wherein determining that the attention tracking component received the non-audio input at the first time further comprises:
determining, by the results processing component, that the first time followed a time interval during which the speech processing component received a first portion of audio input that does not include the at least one actionable command that precedes the audio input that includes the at least one actionable command (FIG. 3, 310A, 310B, 360 [0039] “FIG. 3 illustrates an example timeline showing periods during which signals may be buffered in order to respond to queries directed to objects which may no longer be visible at the time that the queries are processed, according to at least some embodiments. Elapsed time increases from left to right along timeline 305. In the depicted example, signals collected over a rolling window 360 of the previous five seconds are buffered, and can be used to respond to queries/commands which may refer to objects or scenes encountered or viewed during the buffered signal window”; [0040] “Multimodal signal analysis may enable the command processor to determine that at time T1 (approximately one second after T0), the individual whose signals are being analyzed had a gaze direction D1 (which was in the direction of a llama), a physical position P1 (close to the llama), and had made a gesture G1 (e.g. a pointing gesture) towards the llama”, examiner interprets 310A as the non-audio input at T1 and 310B as the audio input received at T2; [0035] “If the response is not satisfactory, in at least some embodiments further rounds of interactions may occur between the individual and the components of the system. For example, the individual may say something like “No, I didn't mean the animal, I meant the building” or simply “No, I didn't mean the llama”. In such a scenario, the command processor(s) may attempt to find another candidate object of interest which meets the narrowed criterion indicated by the individual (e.g., either using the original list of candidates, or by generating a new list) and may cause a second operation to correct/replace the original response to the query 144. Several such iterations may be performed in various embodiments, e.g., until a satisfactory response (from the perspective of the command issuer) is provided or until further interactions are terminated/aborted by one of the parties (the individual or the command processors)”, examiner interprets further rounds of interactions to include clarification communication (does not include actionable command) followed by an actionable command);
and storing, by the results processing component, the first portion of the audio input (FIG. 9 [0092] “In some embodiments, main memory 9020 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described above for FIG. 1 through FIG. 11 for implementing embodiments of the corresponding methods and apparatus”).
Regarding claim 6, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 6 depends.
Additionally, Kienzle teaches wherein determining that the attention tracking component received the non-audio input at the first time further comprises:
determining, by the results processing component, that the first time preceded a time interval during which the speech processing component received a first portion of audio input that does not include the at least one actionable command that precedes the audio input that includes the at least one actionable command (FIG. 3, 310A, 310B, 360 [0039] “FIG. 3 illustrates an example timeline showing periods during which signals may be buffered in order to respond to queries directed to objects which may no longer be visible at the time that the queries are processed, according to at least some embodiments. Elapsed time increases from left to right along timeline 305. In the depicted example, signals collected over a rolling window 360 of the previous five seconds are buffered, and can be used to respond to queries/commands which may refer to objects or scenes encountered or viewed during the buffered signal window”; [0040] “Multimodal signal analysis may enable the command processor to determine that at time T1 (approximately one second after T0), the individual whose signals are being analyzed had a gaze direction D1 (which was in the direction of a llama), a physical position P1 (close to the llama), and had made a gesture G1 (e.g. a pointing gesture) towards the llama”, examiner interprets 310A as the non-audio input at T1 and 310B as the audio input received at T2; [0035] “If the response is not satisfactory, in at least some embodiments further rounds of interactions may occur between the individual and the components of the system. For example, the individual may say something like “No, I didn't mean the animal, I meant the building” or simply “No, I didn't mean the llama”. In such a scenario, the command processor(s) may attempt to find another candidate object of interest which meets the narrowed criterion indicated by the individual (e.g., either using the original list of candidates, or by generating a new list) and may cause a second operation to correct/replace the original response to the query 144. Several such iterations may be performed in various embodiments, e.g., until a satisfactory response (from the perspective of the command issuer) is provided or until further interactions are terminated/aborted by one of the parties (the individual or the command processors)”, examiner interprets further rounds of interactions to include clarification communication (does not include actionable command) followed by an actionable command);
and storing, by the speech processing component, the first portion of the audio input (FIG. 9 [0092] “In some embodiments, main memory 9020 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described above for FIG. 1 through FIG. 11 for implementing embodiments of the corresponding methods and apparatus”).
Regarding claim 8, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 8 depends.
Additionally, Kienzle teaches wherein performing the at least one actionable command further comprises performing an actionable command included in audio input received at a time subsequent to the first time (FIG. 3 [0040] “The command processor being used (not shown in FIG. 3) may analyze the gaze, gesture and voice signals collected during the buffered signal window 360”, examiner interprets FIG. 3 to show the non-audio input occurring before or during the audio input).
Regarding claim 9, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 9 depends.
Additionally, Kienzle teaches determining, by the speech processing component, that the received audio input includes at least one actionable command (FIG. 9, 922 [0053] “the interaction interfaces used for disambiguation (e.g., whether a disambiguation query and the corresponding response involve the use of a visual display 610, vocalized interactions are used, or both visual and vocalized interactions are used) with respect to a given query or command may be selected by the command processor depending on various factors”, examiner interprets disambiguation as a means for the command processor to determine the command from the user).
Regarding claim 10, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 10 depends.
Additionally, Kienzle teaches transmitting, by the attention tracking component, to the speech processing component, an audio signal indicating receipt of the non-audio input (FIG. 1, [0032] “In the depicted embodiment, one or more command processing devices (CPDs) 185 may be responsible for analyzing the collected signals from the various sources to generate responses to the command/queries issued by the individual”; FIG. 9, 913 [0027] “At least some of the signal detectors may store timestamps or other timing information as well as the raw signals themselves—e.g., it may be possible using the collected signals to determine the time at which a particular gesture was made, and/or to arrange events such as a head or neck movement (a nod or shake of the head), a torso movement (such as a bend of the body towards or away from some object), a change of gaze direction, and a vocalized query in temporal order”).
Regarding claim 11, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 11 depends.
Additionally, Kienzle teaches transmitting, by the attention tracking component, to the speech processing component, an indication of receipt of the non-audio input (FIG. 1, [0032] “In the depicted embodiment, one or more command processing devices (CPDs) 185 may be responsible for analyzing the collected signals from the various sources to generate responses to the command/queries issued by the individual”; FIG. 9, 913 [0027] “At least some of the signal detectors may store timestamps or other timing information as well as the raw signals themselves—e.g., it may be possible using the collected signals to determine the time at which a particular gesture was made, and/or to arrange events such as a head or neck movement (a nod or shake of the head), a torso movement (such as a bend of the body towards or away from some object), a change of gaze direction, and a vocalized query in temporal order”).
Regarding claim 12, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 12 depends.
Additionally, Kienzle teaches transmitting, by the attention tracking component, to the results processing component, an indication of receipt of the non-audio input (FIG. 1, [0032] “In the depicted embodiment, one or more command processing devices (CPDs) 185 may be responsible for analyzing the collected signals from the various sources to generate responses to the command/queries issued by the individual”; FIG. 9, 913 [0027] “At least some of the signal detectors may store timestamps or other timing information as well as the raw signals themselves—e.g., it may be possible using the collected signals to determine the time at which a particular gesture was made, and/or to arrange events such as a head or neck movement (a nod or shake of the head), a torso movement (such as a bend of the body towards or away from some object), a change of gaze direction, and a vocalized query in temporal order”).
Regarding claim 13, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 13 depends.
Additionally, Kienzle teaches storing, by the speech processing component, the received audio input for subsequent processing ([0031] “Data from the various signal detectors (those focused on the individual's movements/behaviors, such as the gaze, gesture and command detectors, as well as those focused on the external environment) may be buffered temporarily in at least some embodiments”; FIG. 9 [0092] “In some embodiments, main memory 9020 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described above for FIG. 1 through FIG. 11 for implementing embodiments of the corresponding methods and apparatus”).
Regarding claim 14, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 14 depends.
Additionally, Kienzle teaches wherein determining, by the speech processing component, that the received audio input includes at least one actionable command further comprises processing a subset of the received audio input ([0032] “In the depicted embodiment, one or more command processing devices (CPDs) 185 may be responsible for analyzing the collected signals from the various sources to generate responses to the command/queries issued by the individual”).
Regarding claim 15, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 15 depends.
Additionally, Kienzle teaches wherein determining, by the speech processing component, that the received audio input includes at least one actionable command further comprises processing all of the received audio input (FIG. 9, 922 [0053] “the interaction interfaces used for disambiguation (e.g., whether a disambiguation query and the corresponding response involve the use of a visual display 610, vocalized interactions are used, or both visual and vocalized interactions are used) with respect to a given query or command may be selected by the command processor depending on various factors”, examiner interprets disambiguation as a means for the command processor to determine the command from the user).
Regarding claim 16, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 16 depends.
Additionally, Kienzle teaches determining, by the speech processing component, that the received audio input includes a keyword ([0074] “the rankings (e.g., interest scores or relevance scores) predicted for several different objects happen to be close to one another, or if a wrong selection of a target object may have substantial negative side effects, the system may request the individual from whom the query or command was detected to disambiguate or confirm the choice of the target made by the system (element 922). The requested action or operation (which may for example be something as simple as naming the target of the word “that” in the query “What was that?”, or something more substantial such as parking a vehicle in response to the command “Park over there”)”, examiner interprets the target as the keyword).
Regarding claim 19, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 18, upon which claim 19 depends.
Additionally, Kienzle teaches wherein the attention tracking component comprises a gaze tracker (FIG. 1[0025] “system 100 may comprise several types of signal detectors for detecting human movements and other human behaviors, including one or more gaze detectors 150”).
Regarding claim 20, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 18, upon which claim 20 depends.
Additionally, Kienzle teaches wherein the attention tracking component comprises functionality for identifying when the user gazes at the visual focal point ([0028] “The gaze detectors 150 may capture information regarding the directions in which the individual's eyes are pointing at various points in time in the depicted embodiment”).
Claims 7 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Kienzle in view of Sztuk in view of Lovitt, as shown above for claim 1, and further in view of Li (US 20190050196 A1).
Regarding claim 7, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 7 depends.
Kienzle in view of Sztuk in view of Lovitt fails to teach wherein performing the at least one actionable command further comprises performing an actionable command included in audio input received at a time that preceded the first time.
However, Li teaches wherein performing the at least one actionable command further comprises performing an actionable command included in audio input received at a time that preceded the first time ([0042] “When both a gaze sensor and a voice recognition system are turned on from the beginning, a method may be arranged where either a gazing act or a voice input act may happen first… assume that a user gazes at a device during a first time period from time-A1 to time-A2 and issues a voice command during a second time period from time-B1 to time-B2. The device may be arranged to implement the command if the two time periods overlap either fully or partially or a gap value between the two time periods along a timeline is smaller than a given value, say five to ten seconds, where it doesn't matter which period happens first”).
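To illustrate the timing test described in Li [0042], the following sketch acts on a voice command when the gaze period and the voice period overlap, or when the gap between them is below a threshold, regardless of which occurred first. The function name and the five-second default are illustrative assumptions, not Li's code.

    def should_execute(gaze_start, gaze_end, voice_start, voice_end, max_gap=5.0):
        """Illustrative sketch of the check in Li [0042]: act on the voice command
        if the gaze period [gaze_start, gaze_end] and the voice period
        [voice_start, voice_end] overlap fully or partially, or if the gap between
        them is smaller than max_gap seconds, in either order."""
        overlap = gaze_start <= voice_end and voice_start <= gaze_end
        gap = max(voice_start - gaze_end, gaze_start - voice_end)
        return overlap or gap < max_gap

    # Example: the voice command ends three seconds before the gaze begins; it is still executed.
    print(should_execute(gaze_start=10.0, gaze_end=12.0, voice_start=5.0, voice_end=7.0))  # -> True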
Kienzle in view of Sztuk in view of Lovitt and Li are considered analogous to the claimed invention because all are in the same field of performing actions based on user signal inputs. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems of Kienzle in view of Sztuk in view of Lovitt, which identify and act upon entities of interest to an individual using potentially imprecise cues obtained from a combination of several types of signals such as gestures and gaze directions, with the technique of presenting information when an audio signal is detected before a non-audio signal taught by Li in order to present information utilizing gaze detection (see Li [0004]).
Regarding claim 24, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 1, upon which claim 24 depends.
Kienzle in view of Sztuk in view of Lovitt fails to teach wherein performing the at least one actionable command comprises: directing the at least one actionable command to be executed by an application running on a different computing device.
However, Li teaches wherein performing the at least one actionable command comprises: directing the at least one actionable command to be executed by an application running on a different computing device ([0051] “When multiple devices are involved, an on-site or remote control system may be arranged. The control system may receive, collect, and analyze data sent from gaze sensors and voice sensing detectors of the devices… When a user gazes at the first device and mentions a name of a second device, the control system may carry out the command either at the first device or the second device depending on set-up selection. It may be arranged that a user may choose a mode or switch from a mode to another one”).
Kienzle in view of Sztuk in view of Lovitt and Li are considered analogous to the claimed invention because all are in the same field of performing actions based on user signal inputs. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems of Kienzle in view of Sztuk in view of Lovitt, which identify and act upon entities of interest to an individual using potentially imprecise cues obtained from a combination of several types of signals such as gestures and gaze directions, with the technique of directing actionable commands to a different computing device taught by Li in order to present information utilizing gaze detection (see Li [0004]).
Claims 21 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Kienzle in view of Sztuk in view of Lovitt, as shown above for claim 1, and further in view of Wen (US 20190235710 A1).
Regarding claim 21, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 18, upon which claim 21 depends.
Kienzle in view of Sztuk in view of Lovitt fails to teach wherein the attention tracking component comprises a button for receiving non-audio input.
However, Wen teaches wherein the attention tracking component comprises a button for receiving non-audio input (Figure 1B, 130 [0018] “a user's foot 128 operates an electro-mechanical button fashioned in the form of a foot pedal 130. Foot pedal 130 is connected to electronic device 120 through a wireless data connection represented by signals 132”).
Kienzle in view of Sztuk in view of Lovitt and Wen are considered analogous to the claimed invention because all are in the same field of performing actions based on user signal inputs. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems of Kienzle in view of Sztuk in view of Lovitt, which identify and act upon entities of interest to an individual using potentially imprecise cues obtained from a combination of several types of signals such as gestures and gaze directions, with the technique of using a button to trigger an update of the content of the displayed page taught by Wen in order to improve a method and system for traversing through digital content displayed on a digital device in a piecewise manner (see Wen [Abstract]).
Regarding claim 22, Kienzle in view of Sztuk in view of Lovitt teaches all of the limitations of claim 18, upon which claim 22 depends.
Kienzle in view of Sztuk in view of Lovitt fails to teach wherein the attention tracking component comprises a foot-pedal for receiving non-audio input.
However, Wen teaches wherein the attention tracking component comprises a foot-pedal for receiving non-audio input (Figure 1B, 130 [0018] “a user's foot 128 operates an electro-mechanical button fashioned in the form of a foot pedal 130. Foot pedal 130 is connected to electronic device 120 through a wireless data connection represented by signals 132”).
Kienzle in view of Sztuk in view of Lovitt and Wen are considered analogous to the claimed invention because all are in the same field of performing actions based on user signal inputs. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the systems of Kienzle in view of Sztuk in view of Lovitt, which identify and act upon entities of interest to an individual using potentially imprecise cues obtained from a combination of several types of signals such as gestures and gaze directions, with the technique of using a foot pedal to trigger an update of the content of the displayed page taught by Wen in order to improve a method and system for traversing through digital content displayed on a digital device in a piecewise manner (see Wen [Abstract]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Keating et al. (US 20140028712 A1) teaches a method and apparatus for controlling an augmented reality interface are disclosed. In one embodiment, a method for use with an augmented reality enabled device (ARD) comprises receiving image data for tracking a plurality of objects, identifying an object to be selected from the plurality of objects, determining whether the object has been selected based at least in part on a set of selection criteria, and causing an augmentation to be rendered with the object if it is determined that the object has been selected.
Huber et al. (US 20190311718 A1) teaches a voice-interaction device includes a plurality of input and output components configured to facilitate interaction between the voice-interaction device and a target user. The plurality of input and output components may include a microphone configured to sense sound and generate an audio input signal, a speaker configured to output an audio signal to the target user, and an input component configured to sense at least one non-audible interaction from the target user. A context controller monitors the plurality of input and output components and determines a current use context. A virtual assistant module facilitates voice communications between the voice-interaction device and the target user and configures one or more of the input and output components in response to the current use context. The current use context may include whisper detection, target user proximity, gaze direction tracking and other use contexts.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ZEESHAN SHAIKH whose telephone number is (703)756-1730. The examiner can normally be reached Monday-Friday 7:30AM-5:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ZEESHAN MAHMOOD SHAIKH/Examiner, Art Unit 2658
/RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658