DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/02/2025 has been entered.
This communication is in response to the Amendments and Arguments filed on 12/02/2025.
Claims 1-7, 10-15, 17-20, and 22-24 are pending and have been examined.
All previous objections/rejections not mentioned in this Office Action have been withdrawn by the examiner.
Notice of Pre-AIA or AIA Status
The present application is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 12/02/2025 have been fully considered but they are not persuasive and/or are moot.
Applicant’s arguments with respect to claim(s) 1 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Please see the updated mappings citing Panchaksharaiah for further detail.
Applicant’s arguments with respect to claims 11 and 18 have been fully considered but they are not persuasive.
Regarding claim 11, Applicant asserts on pages 17-18 that Panchaksharaiah does not teach or suggest a gaze direction. The Examiner respectfully disagrees with this assertion. Panchaksharaiah teaches capturing a user in a zoom mode to track the user's eye movements to determine that the user is directing attention at a target (see [0016],[0018-9],[0036],[0044],[0048]), and using the eye movements in combination with images from a first-person gaze that show what the user is looking at to identify the target object (see [0016],[0018-9],[0036],[0044],[0048]), which reads on the BRI of determining a gaze direction using image data and projecting the gaze direction to determine a virtual representation of the gaze direction directed towards a POI.
Regarding claim 18, Applicant asserts on pages 15-16 that El Dokor does not teach an intent or parameters for one or more slots. The Examiner respectfully disagrees with this assertion. El Dokor teaches the identification of voice commands, which reads on the determination of intent, and further teaches identifying terms in a character string that can be used to narrow the results of a POI search, such as the term “building”, which reads on parameters for one or more slots (see (3:66-4:10),(6:12-19),(7:28-43),(8:1-19),(8:59-9:27),(10:11-25)). El Dokor is further combined with Kale, which teaches that the NLU component parses user inputs to determine the user intent and intent-related parameters, such as attributes and attribute values related to a particular object (see (17:23-18:28)). Thus, the combination of El Dokor and Kale teaches the claims as recited.
Hence, Applicant’s arguments are not persuasive.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-7, 10, and 22 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 recites “one or more machine learning models” in the second-to-last limitation, after having recited “one or more first machine learning models” in the first limitation. This leads to a lack of clear antecedent basis as to whether the “one or more machine learning models” refers to a second set of machine learning models or to the first machine learning models. In the interest of compact prosecution, the Examiner will interpret the “one or more machine learning models” as --one or more second machine learning models--; however, the Examiner suggests amending the claim to clarify the relationship between the different machine learning models.
Claims 2-7, 10, and 22 are rejected as being dependent upon a rejected base claim.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 18, 20, and 24 is/are rejected under 35 U.S.C. 103 as being unpatentable over El Dokor et al. (U.S. Patent No. 8,818,716), hereinafter El Dokor, in view of Kale et al. (U.S. Patent No. 11,804,035), hereinafter Kale.
Regarding claim 18, El Dokor teaches
One or more processors comprising (multiple processors may be included in each device (4:48-59)):
processing circuitry to (the processors include various architectures that can process signals (4:48-59)):
receive audio data obtained using one or more microphones of a vehicle (the voice recognition module receives an output signal from the microphone in the vehicle that captures voice signals, i.e. receive audio data obtained using one or more microphones of a vehicle (3:66-4:10));
determine, based at least on the audio data, at least an intent associated with the speech and one or more parameters for one or more slots corresponding to the intent (the voice recognition module receives an output signal from the microphone in the vehicle that captures voice signals, i.e. audio data, and performs a voice recognition algorithm on the received signal to identify voice commands that provide additional information about the object being identified, i.e. determine an intent associated with the speech, such as a skyscraper in the city the vehicle is driving through, and that the person spoke the word “building” to identify the particular object, i.e. one or more parameters for one or more slots corresponding to the intent (3:66-4:10),(6:12-19),(7:28-43),(8:1-19),(8:59-9:27),(10:11-25));
determine, based at least on image data obtained using one or more image sensors of the vehicle, a direction indicated by a user located within an interior of the vehicle (a camera system captures images of a gesture performed by a user in the vehicle, i.e. determining based at least on image data obtained using one or more image sensors of the vehicle, and the gesture recognition module analyzes the gesture to determine a direction vector representing the direction of the gesture, i.e. determine…a direction indicated by a user located within an interior of the vehicle (5:64-6:11));
determine, based at least on the direction indicated by the user being associated with a point of interest (POI) as represented by a map, contextual information from the map that is associated with the POI (the input analysis module receives input data from the gesture recognition module and the location module to identify a target region and micromap with POIs, i.e. based at least on the direction indicated by the user being associated with a point of interest (POI) as represented by a map, and perform a POI search for information related to a POI and stored in the micromap, i.e. determine…contextual information from the map that is associated with the POI (5:64-6:11),(6:20-67),(7:1-17,28-43),(8:20-62),(9:3-7),(10:11-51));
apply, to one or more…models, first data representative of text describing at least the (the input analysis module receives input data, including from the voice recognition module, i.e. first data representative of text describing at least the intent and the one or more parameters for the one or more slots as determined using the speech, and the POI search module queries a database to retrieve information related to the POI, i.e. second data representative of second text describing at least the second contextual information from the map, which can be merged by an artificial intelligence unit, i.e. apply to one or more models, and which can be received by the data output module for output to the user (6:12-19),(6:31-51),(7:6-17),(8:47-9:27),(9:27-67));
determine, based at least on the one or more … models processing the first data and the second data, an output associated with the speech that includes at least a portion of the contextual information (the input analysis module receives input data, including from the voice recognition module, i.e. first data, and the POI search module queries a database to retrieve information related to the POI, i.e. second data, which can be merged by an artificial intelligence unit, i.e. based at least on the one or more models processing, and which can be received by the data output module for output to the user, i.e. determine…an output associated with the speech that includes at least a portion of the contextual information (6:12-19),(6:31-51),(7:6-17),(8:47-9:27),(9:27-67)); and
cause the vehicle to provide the output associated with the speech (the data output module receives information related to the POI from the search modules, and sends the information to the in-vehicle communications system or display, i.e. causing the vehicle to provide the output, in response to the voice command and gesture, i.e. associated with the speech (5:64-6:19),(7:6-17),(9:51-60)).
While El Dokor provides recognizing voice commands using an algorithm and providing information to a user in response to the command, El Dokor does not specifically teach using a machine learning model to process two types of data to determine an output, and thus does not teach
apply, to one or more machine learning models, first data representative of text describing at least the
determine, based at least on the one or more machine learning models processing the first data and the second data, an output associated with the speech that includes at least a portion of the contextual information.
Kale, however, teaches apply, to one or more machine learning models, first data representative of text describing at least the (the NLG component generates language for a textual or spoken reply based on the decisions made by the artificial intelligence framework, i.e. apply to one or more machine learning models, where the NLU component may parse user inputs to determine the user intent and intent-related parameters, such as a dominant object and a variety of attributes related to that object, such as looking for a dress for a wedding in June in Italy, i.e. first data representative of the intent and the one or more parameters for the one or more slots as determined using the speech, where the result of the search may include an item list of candidate products, i.e. second text describing at least the contextual information, and the output can include the search results, i.e. output that includes at least a portion of the contextual information (1:60-2:7),(10:12-27,50-65),(17:23-18:28),(20:2-5,14-36));
determine, based at least on the one or more machine learning models processing the first data and the second data, an output associated with the speech that includes at least a portion of the contextual information (the NLG component generates language for a textual or spoken reply, i.e. determine…an output, based on the decisions made by the artificial intelligence framework, i.e. based at least on the one or more machine learning models, where the NLU component may parse user inputs to determine the user intent and intent-related parameters, such as a dominant object and a variety of attributes related to that object, such as looking for a dress for a wedding in June in Italy, i.e. processing the first data, where the result of the search may include an item list of candidate products, i.e. processing…the second data, and the output can include the search results, i.e. output associated with the speech that includes at least a portion of the contextual information (1:60-2:7),(10:12-27,50-65),(17:23-18:28),(20:2-5,14-36)).
El Dokor and Kale are analogous art because they are from a similar field of endeavor in providing responses to users based on multimodal input. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify El Dokor's teachings of recognizing voice commands using an algorithm and providing information to a user in response to the command with the use of an artificial intelligence framework to determine a reply and generate a response using the intent and various parameters, as taught by Kale. It would have been obvious to combine the references to learn from user intents to enhance user understanding and provide an improved user experience (Kale (1:60-2:7)).
Regarding claim 20, El Dokor in view of Kale teaches claim 18, and El Dokor further teaches
the system is comprised in at least one of:
a control system for an autonomous or semi-autonomous machine;
a perception system for the autonomous or semi-autonomous machine;
a system for performing simulation operations;
a system for presenting virtual reality (VR) content;
a system for presenting augmented reality (AR) content;
a system for presenting mixed reality (MR) content;
a system for performing digital twin operations;
a system for performing light transport simulation;
a system for performing collaborative content creation for 3D assets;
a system for performing deep learning operations;
a system implemented using an edge device (the system may be performed by a device in the operating environment, such as an in-vehicle communications system, a wireless mobile communication device, and/or a remote server, i.e. edge devices (2:54-3:13));
a system implemented using a robot;
a system for performing conversational AI operations;
a system for generating synthetic data;
a system incorporating one or more virtual machines (VMs);
a system implemented at least partially in a data center (the system may be performed by a device in the operating environment, such as an in-vehicle communications system, a wireless mobile communication device, and/or a remote server, and may further access a database on the server or a service operating on a third-party server, such as Yelp or Google, i.e. at least partially in a data center (2:54-3:13),(6:58-51));
or a system implemented at least partially using cloud computing resources (the system may be performed by a device in the operating environment, such as an in-vehicle communications system, a wireless mobile communication device, and/or a remote server, i.e. at least partially using cloud computing resources (2:54-3:13)).
Regarding claim 24, El Dokor in view of Kale teaches claim 18, and El Dokor further teaches
determine the POI from the map based at least on the direction indicated by the user being associated with the POI, the intent, and the one or more parameters for the one or more slots (the input analysis module receives input data from the gesture recognition module and the location module to identify a target region and micromap with POIs, i.e. based at least on the direction indicated by the user being associated with a point of interest (POI) as represented by a map, and using the character string representing the voice command to narrow the results of the POI search, such as only searching for “buildings” if the character string includes the term “building”, i.e. the intent and the one or more parameters for the one or more slots (5:64-6:11),(6:20-67),(7:1-17,28-43),(8:20-62),(8:63-9:7),(10:11-51)),
wherein the contextual information from the map is determined based at least on the determination of the POI from the map (when the single POI is determined from the target region of the micromap, i.e. determined based at least on the determination of the POI from the map, the POI information stored in the micromap is sent to the data output module to be provided to the user, i.e. contextual information from the map (8:47-9:56)).
Claim(s) 1, 3-6, 10-15, 17, 19, and 22 is/are rejected under 35 U.S.C. 103 as being unpatentable over El Dokor, in view of Kale, and further in view of Panchaksharaiah et al. (U.S. PG Pub No. 2024/0073518), hereinafter Panchaksharaiah.
Regarding claims 1 and 11, El Dokor teaches
(claim 1) A method comprising (method of the present invention (12:45-54)):
(claim 11) A system comprising (a system (2:54-3:7)):
(claim 11) one or more processors to (multiple processors may be included in each device (4:48-59)):
(claims 11) receive audio data obtained using one or more microphones of a vehicle, the audio data representing speech (the voice recognition module receives an output signal from the microphone in the vehicle that captures voice signals, i.e. receive audio data obtained using one or more microphones of the vehicle…representing speech (3:66-4:10));
determining, using one or more first … models and based at least on audio data obtained using one or more microphones of a vehicle, that speech represented by the audio data indicates an intent associated with a point of interest (POI) located external to the vehicle and one or more words that describe the POI (the voice recognition module receives an output signal from the microphone in the vehicle that captures voice signals, i.e. audio data obtained using one or more microphones of a vehicle and speech represented by the audio data, and performs a voice recognition algorithm on the received signal to identify voice commands that provide additional information about the object being identified, i.e. determining, using one or more first … models, an intent associated with a point of interest (POI), such as a skyscraper in the city the vehicle is driving through, i.e. a point of interest (POI) located external to the vehicle, and that the person spoke the word “building” to identify the particular object, i.e. one or more words that describe the POI/one or more parameters for one or more slots corresponding to the intent (3:66-4:10),(6:12-19),(7:28-43),(8:1-19),(10:11-25));
obtaining image data (a camera system captures images of a gesture performed by a user in the vehicle, i.e. obtaining image data using one or more image sensors of the vehicle…a user located within an interior of the vehicle (5:64-6:11));
determining, based at least on the image data, a … direction of the user within the interior of the vehicle (a camera system captures images of a gesture performed by a user in the vehicle, i.e. determining based at least on the image data, and the gesture recognition module analyzes the gesture to determine a direction vector representing the direction of the gesture, i.e. determining…a direction of the user within the interior of the vehicle (5:64-6:11));
projecting the … direction to determine a virtual representation of the … direction that is relative to the vehicle (the input analysis module receives input data from the gesture recognition module and the location module to identify a target region and micromap with POIs, using the location, orientation, and speed of the vehicle to automatically identify micromaps, i.e. projecting the direction to determine a virtual representation of the…direction that is relative to the vehicle (5:64-6:11),(6:20-67),(7:1-17,28-43),(8:20-62),(9:3-7),(10:11-51));
determining, based at least the virtual representation of the … direction being directed towards the POI as represented by a map, contextual information from at least the map that is associated with the POI (the input analysis module receives input data from the gesture recognition module and the location module to identify a target region and micromap with POIs, i.e. based at least the virtual representation of the…direction being directed towards the POI as represented by a map, and perform a POI search for information related to a POI and stored in the micromap, i.e. contextual information from at least the map that is associated with the POI (5:64-6:11),(6:20-67),(7:1-17,28-43),(8:20-62),(9:3-7),(10:11-51));
determining, based at least on one or more second … models processing first data representative of the intent and the one or more words describing the POI and second data representative of the contextual information associated with the POI, an output that includes at least a portion of the contextual information (the input analysis module receives input data, including from the voice recognition module, i.e. first data representative of the intent and the one or more words describing the POI, and the POI search module queries a database to retrieve information related to the POI, i.e. second data representative of the contextual information associated with the POI, which can be merged by an artificial intelligence unit, and which can be received by the data output module for output to the user, i.e. determining…based at least on processing first data…and second data…an output that includes at least a portion of the contextual information (6:12-19),(6:31-51),(7:6-17),(8:47-9:27),(9:27-67)); and
causing the vehicle to provide the output associated with the speech (the data output module receives information related to the POI from the search modules, and sends the information to the in-vehicle communications system or display, i.e. causing the vehicle to provide the output, in response to the voice command and gesture, i.e. associated with the speech (5:64-6:19),(7:6-17),(9:51-60)).
While El Dokor provides recognizing voice commands using an algorithm and providing information to a user in response to the command, El Dokor does not specifically teach identifying intent using a machine learning model or using a machine learning model to process two types of data to determine an output, and thus does not teach
one or more first machine learning models…;
determining, based at least on one or more second machine learning models….
Kale, however, teaches one or more first machine learning models…(the artificial intelligence framework includes an NLU component, i.e. one or more first machine learning models, that operates to extract user intent and various intent parameters (1:60-2:7),(7:63-67),(17:34-45));
determining, based at least on one or more second machine learning models processing first data representative of the intent and the one or more words describing the POI and second data representative of the contextual information associated with the POI, an output that includes at least a portion of the contextual information (the NLG component generates language for a textual or spoken reply, i.e. determining…an output, based on the decisions made by the artificial intelligence framework, i.e. based at least on the one or more second machine learning models, where the NLU component may parse user inputs to determine the user intent and intent-related parameters, such as a dominant object and a variety of attributes related to that object, such as looking for a dress for a wedding in June in Italy, i.e. first data representative of the intent and the one or more words describing the POI, where the result of the search may include an item list of candidate products, i.e. second data representative of the contextual information associated with the POI, and the output can include the search results, i.e. output that includes at least a portion of the contextual information (1:60-2:7),(10:12-27,50-65),(17:23-18:28),(20:2-5,14-36)).
El Dokor and Kale are analogous art because they are from a similar field of endeavor in providing responses to users based on multimodal input. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify El Dokor's teachings of recognizing voice commands using an algorithm and providing information to a user in response to the command with the use of an artificial intelligence framework to determine a reply and generate a response using the intent and various parameters, as taught by Kale. It would have been obvious to combine the references to learn from user intents to enhance user understanding and provide an improved user experience (Kale (1:60-2:7)).
While El Dokor in view of Kale provides determining a direction of a user gesture based on images, El Dokor in view of Kale does not specifically teach determining a gaze direction, and thus does not teach
determining, based at least on the image data, a gaze direction of the user…;
projecting the gaze direction to determine a virtual representation of the gaze direction that is relative…;
determining, based at least the virtual representation of the gaze direction being directed towards the POI as represented by a map, contextual information from at least the map that is associated with the POI.
(claim 11) determine, based at least on second image data obtained using one or more external image sensors of the vehicle, that the virtual representation is towards a point of interest (POI) represented by the second image data.
Panchaksharaiah, however, teaches determining, based at least on the image data, a gaze direction of the user…(images captured from multiple directions can be parsed, such as a first camera capturing a user in zoom-in mode, i.e. based at least on the image data, to track their eye movements, where the user is directing attention at a target, i.e. determining…a gaze direction of the user [0016],[0018-9],[0036],[0044],[0048]);
projecting the gaze direction to determine a virtual representation of the gaze direction that is relative…(images captured from multiple directions can be parsed, such as a first camera capturing a user in zoom-in mode to track their eye movements, where the user is directing attention at a target, i.e. gaze direction, and a second camera may capture a first-person gaze showing what the user is looking at, where the supplemental data including an image of the target is used to identify the target object, i.e. projecting the gaze direction to determine a virtual representation of the gaze direction that is relative [0016],[0018-9],[0036],[0044],[0048]);
determining, based at least the virtual representation of the gaze direction being directed towards the POI as represented by a map, contextual information from at least the map that is associated with the POI (the detected target based on the images, i.e. based at least the virtual representation of the gaze direction being directed towards the POI, can be processed to identify contextual information, i.e. determining…contextual information…that is associated with the POI, based on the location of the user, such as a specific room, i.e. as represented by a map [0016],[0018-21],[0042]);
(claim 11) determine, based at least on second image data obtained using one or more external image sensors of the vehicle, that the virtual representation is towards a point of interest (POI) represented by the second image data (images captured from multiple directions can be parsed, such as a first camera capturing a user in zoom-in mode to track their eye movements, i.e. first image, where the user is directing attention at a target, i.e. the virtual representation is towards a point of interest (POI), and a second camera may capture a first-person gaze, i.e. second image data obtained using one or more external image sensors, showing what the user is looking at, where the supplemental data including an image of the target is used to identify the target object, i.e. determine based at least on second image data obtained using one or more external image sensors of the vehicle that the virtual representation is towards a point of interest (POI) represented by the second image data [0016],[0018-9],[0036],[0044],[0048]).
Where El Dokor specifically teaches that the POI information is stored in a micromap (6:52-64).
El Dokor, Kale, and Panchaksharaiah are analogous art because they are from a similar field of endeavor in providing responses to users based on multimodal input. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify El Dokor's teachings, as modified by Kale, of determining a direction of a user gesture based on images with the use of multiple cameras taking images in different modes to identify a user directing attention to a target through tracking eye movements and then to identify the target, as taught by Panchaksharaiah. It would have been obvious to combine the references to capture frames of the user and environment based on query type and to select modes in a manner that optimizes extraction of specific features and associated metadata (Panchaksharaiah [0017-9]).
Regarding claim 3, El Dokor in view of Kale and Panchaksharaiah teaches claim 1, and El Dokor further teaches
the contextual information includes at least an identifier associated with a landmark as represented by the map (the POI information, i.e. contextual information, may include a name for the POI, such as a skyscraper, i.e. at least an identifier associated with a landmark (6:52-64),(8:47-62),(10:11-25)).
Regarding claim 4, El Dokor in view of Kale and Panchaksharaiah teaches claim 1, and El Dokor further teaches
determining at least one of a geographic area associated with the user or a time period (location sensors are physical sensors of an in-vehicle system, and the output data is used by the location module to determine the current location and orientation of the vehicle that has a driver or passengers, i.e. user, where the location, orientation, and speed are related to a micromap the vehicle is likely to enter, i.e. determining at least one of a geographic area associated with the user/environment (3:14-24,53-65),(6:20-30),(6:65-7:5)), wherein the determining of the output associated with the speech is further based at least on the at least one of the geographic area or the time period (the input analysis module receives input data, including from the voice recognition module, i.e. associated with the speech, and the POI or micromap search module queries a database to retrieve information related to POIs within the target region for output, i.e. determining of the output…is further based at least on the at least one of the geographic area (6:12-19,31-51),(7:6-17)).
Regarding claim 5, El Dokor in view of Kale and Panchaksharaiah teaches claim 1, and Panchaksharaiah further teaches
receiving second image data representing an image depicting an environment (a second camera may capture a frame, i.e. receiving second image data representing an image, of the environment, i.e. depicting an environment [0018],[0046-8]); and
determining, based at least on the second image data, additional information associated with the POI, wherein the second data further represents the additional information (the initial query directed at a target, i.e. associated with the POI, and images, i.e. based at least on the second image data, are received and processed to generate a data structure relating the context and information presented in the content, where an analysis of the images includes a list of objects and/or actions that were detected, such as that the user is pointing to a “Pompeian” oil bottle or the Mona Lisa, i.e. additional information associated with the POI wherein the second data further represents the additional information [0019-21]).
Where the motivation to combine is the same as previously presented.
Regarding claim 6, El Dokor in view of Kale and Panchaksharaiah teaches claim 1, and El Dokor further teaches
the one or more words are associated with one or more parameters for one or more slots associated with the intent (the voice recognition module receives an output signal from the microphone in the vehicle that captures voice signals and performs a voice recognition algorithm on the received signal to identify voice commands, i.e. intent, that provide additional information about the object being identified, i.e. at least one or more parameters associated with the intent, and outputs the words as a character string that can be used by the search modules in a query to obtain more accurate results, such as ignoring information for non-building objects when the user says “building”, i.e. one or more words are associated with one or more parameters for one or more slots associated with the intent (2:42-52),(3:66-4:10),(6:12-19,41-51),(8:1-12)).
Regarding claim 10, El Dokor in view of Kale and Panchaksharaiah teaches claim 1, and El Dokor further teaches
the output associated with the speech comprises at least one of:
audio data representing one or more second words that provide the contextual information (the data output module speaks out information related to the point of interest, such as the name of, i.e. audio data representing one or more second words that provide, the restaurant or building, i.e. contextual information (2:42-53),(6:12-19,41-51),(7:6-17),(8:1-12),(9:50-67));
or content data representing one or more images depicting content associated with the contextual information (the data output module may display information, such as reviews and photos of the restaurant, or information about the building, i.e. content data representing one or more images depicting content associated with the contextual information (2:42-53),(6:12-19,41-51),(7:6-17),(8:1-12),(9:50-67),(10:1-25)).
Regarding claim 12, El Dokor in view of Kale and Panchaksharaiah teaches claim 11, and El Dokor further teaches
determine, based at least on the virtual representation being towards the POI, the contextual information using the second data, the contextual information including at least an identifier associated with the POI (the input analysis module receives input data from the gesture recognition module and the location module to identify a target region and micromap with POIs, i.e. based at least the virtual representation being towards the POI, and perform a POI search for information related to a POI and stored in the micromap, i.e. determine the contextual information using the second data, where the POI information may include a name for the POI, such as a skyscraper, i.e. at least an identifier associated with the POI (5:64-6:11),(6:20-67),(7:1-17,28-43),(8:20-62),(9:3-7),(10:11-51)).
Regarding claim 13, El Dokor in view of Kale and Panchaksharaiah teaches claim 11, and Kale further teaches
determine, using one or more second machine learning models and based at least on the audio data, an intent associated with the speech (the artificial intelligence framework includes an NLU component, i.e. one or more second machine learning models, that operates to extract user intent and various intent parameters, i.e. determine an intent associated with the speech (1:60-2:7),(7:63-67),(17:34-45));
append the contextual information to the intent (the extracted data includes the intent and intent-related parameters such as the attributes and attribute values related to a dominant object, i.e. contextual information, and the extracted data is provided to the dialog manager as a formal, machine-readable, structured representation of the query including the user query and further data, i.e. append the contextual information to the intent (17:23-59),(17:60-18:28)); and
apply, as an input to the one or more machine learning models, the contextual information appended to the intent (the NLG component generates language for a textual or spoken reply based on the decisions made by the artificial intelligence framework, i.e. the one or more machine learning models, where the NLU component may parse user inputs to determine the user intent and intent-related parameters, such as a dominant object and a variety of attributes related to that object, such as looking for a dress for a wedding in June in Italy, where the result of the search may include an item list of candidate products, and the output can include the search results and be based on the information from the context manager including the relevant intent and all parameters and related results, i.e. apply as an input…the contextual information appended to the intent (1:60-2:7),(10:12-27,50-65),(17:23-18:28),(20:2-5,14-36)).
Where the motivation to combine is the same as previously presented.
Regarding claim 14, El Dokor in view of Kale and Panchaksharaiah teaches claim 11, and El Dokor further teaches
determine, using one or more second …models and based at least on the audio data, at least one of an intent associated with the speech or one or more parameters for one or more slots associated with the intent (the voice recognition module receives an output signal from the microphone in the vehicle that captures voice signals, i.e. audio data, and performs a voice recognition algorithm on the received signal to identify voice commands that provide additional information about the object being identified, i.e. determine an intent associated with the speech, such as a skyscraper in the city the vehicle is driving through, and that the person spoke the word “building” to identify the particular object, i.e. one or more parameters for one or more slots corresponding to the intent (3:66-4:10),(6:12-19),(7:28-43),(8:1-19),(10:11-25));
wherein the determination of the output associated with the speech is based at least on the contextual information and the at least one of the intent or the one or more parameters (the input analysis module receives input data, including from the voice recognition module, i.e. the at least one of the intent or the one or more parameters, and the POI search module queries a database to retrieve information related to the POI, i.e. contextual information, which can be merged by an artificial intelligence unit, and which can be received by the data output module for output to the user, i.e. determination of the output associated with the speech is based at least on (6:12-19),(6:31-51),(7:6-17),(8:47-9:27),(9:27-67)).
Where Kale further teaches determine, using one or more second machine learning models and based at least on the audio data, at least one of an intent associated with the speech or one or more parameters for one or more slots associated with the intent (the NLG component generates language for a textual or spoken reply based on the decisions made by the artificial intelligence framework, i.e. using one or more second machine learning models, where the NLU component may parse user inputs to determine the user intent and intent-related parameters, such as a dominant object and a variety of attributes related to that object, such as looking for a dress for a wedding in June in Italy, i.e. determine…at least one of an intent associated with the speech or one or more parameters for one or more slots associated with the intent (1:60-2:7),(10:12-27,50-65),(17:23-18:28),(20:2-5,14-36));
And where the motivation to combine is the same as previously presented.
Regarding claim 15, El Dokor in view of Kale and Panchaksharaiah teaches claim 11, and Panchaksharaiah further teaches
determining, based at least on the second image data, that one or more images represented by the second image data depict the POI (images captured from multiple directions can be parsed, such as a first camera capturing a user in zoom-in mode to track their eye movements, i.e. first image, where the user is directing attention at a target, and a second camera may capture a first-person gaze, i.e. second image data, showing what the user is looking at, where the supplemental data includes an image of only the target when other objects may be present in the room, i.e. determining…that one or more images…depict the POI Fig. 1, [0016],[0018-21],[0036],[0044],[0048]); and
determining the POI that the virtual representation is towards the POI as depicted by the one or more images (images captured from multiple directions can be parsed, such as a first camera capturing a user in zoom-in mode to track their eye movements, i.e. first image, where the user is directing attention at a target, and a second camera may capture a first-person gaze, i.e. one or more images, showing what the user is looking at, i.e. determining the POI that the virtual representation is towards the POI, where the supplemental data includes an image of only the target when other objects may be present in the room, i.e. towards the POI as depicted by the one or more images Fig. 1, [0016],[0018-21],[0036],[0044],[0048]).
Where the motivation to combine is the same as previously presented.
Regarding claim 17, El Dokor in view of Kale and Panchaksharaiah teaches claim 11, and El Dokor further teaches
the system is comprised in at least one of:
a control system for an autonomous or semi-autonomous machine;
a perception system for the autonomous or semi-autonomous machine;
a system for performing simulation operations;
a system for presenting virtual reality (VR) content;
a system for presenting augmented reality (AR) content;
a system for presenting mixed reality (MR) content;
a system for performing digital twin operations;
a system for performing light transport simulation;
a system for performing collaborative content creation for 3D assets;
a system for performing deep learning operations;
a system implemented using an edge device (the system may be performed by a device in the operating environment, such as an in-vehicle communications system, a wireless mobile communication device, and/or a remote server, i.e. edge devices (2:54-3:13));
a system implemented using a robot;
a system for performing conversational AI operations;
a system for generating synthetic data;
a system incorporating one or more virtual machines (VMs);
a system implemented at least partially in a data center (the system may be performed by a device in the operating environment, such as an in-vehicle communications system, a wireless mobile communication device, and/or a remote server, and may further access a database on the server or a service operating on a third-party server, such as Yelp or Google, i.e. at least partially in a data center (2:54-3:13),(6:58-51));
or a system implemented at least partially using cloud computing resources (the system may be performed by a device in the operating environment, such as an in-vehicle communications system, a wireless mobile communication device, and/or a remote server, i.e. at least partially using cloud computing resources (2:54-3:13)).
Regarding claim 19, El Dokor in view of Kale teaches claim 18.
While El Dokor in view of Kale provides using cameras and sensors to determine a POI, El Dokor in view of Kale does not specifically teach the use of image data to determine the speech is associated with a user of one or more users, and thus does not teach
determine, based at least on the image data, that the speech is associated with the user of one or more users located within the interior of the vehicle, wherein the determination of the direction indicated by the user is based at least on the speech being associated with the user.
Panchaksharaiah, however, teaches determine, based at least on the image data, that the speech is associated with the user of one or more users located within the interior of the vehicle (the camera may capture initial frames in a standard mode, i.e. based at least on the image data, to identify the user as the source of the voice query, i.e. determine…that the speech is associated with the user of one or more users [0018],[0032],[0044]),
wherein the determination of the direction indicated by the user is based at least on the speech being associated with the user (subsequent frames are captured to extract details about the target, such as a user’s pointing gesture being directed to the target object, i.e. wherein the determination of the direction indicated by the user, where multiple users may be looking at different objects, but only one user issues a query and is identified as the source of the voice query, i.e. based at least on the speech being associated with the user [0018-9],[0044]).
El Dokor, Kale, and Panchaksharaiah are analogous art because they are from a similar field of endeavor in providing responses to users based on multimodal input. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify El Dokor's teachings, as modified by Kale, of using cameras and sensors to determine a POI with the use of multiple cameras taking images in different modes to identify a user directing attention to a target and then to identify the target, as taught by Panchaksharaiah. It would have been obvious to combine the references to capture frames of the user and environment based on query type and to select modes in a manner that optimizes extraction of specific features and associated metadata (Panchaksharaiah [0017-9]).
Regarding claim 22, El Dokor in view of Kale and Panchaksharaiah teaches claim 1, and El Dokor further teaches
at least one of the determining the gaze direction of the user or the determining the contextual information using the map is further based at least on the intent associated with the speech (the voice recognition module receives an output signal from the microphone in the vehicle that captures voice signals and performs a voice recognition algorithm on the received signal to identify voice commands that provide additional information about the object being identified, i.e. the intent associated with the speech, and outputs the words as a character string that can be used by the search modules in a query to obtain more accurate results, such as ignoring information for non-building objects when the user says “building”, when the POI information gathered is stored in a micromap, i.e. determining the contextual information using the map is further based at least on the intent associated with the speech (2:42-52),(3:66-4:10),(6:5-19,41-64),(8:1-12)).
Claim(s) 2 is/are rejected under 35 U.S.C. 103 as being unpatentable over El Dokor, in view of Kale, in view of Panchaksharaiah, and further in view of Zhao et al. (U.S. PG Pub No. 2021/0248376), hereinafter Zhao.
Regarding claim 2, El Dokor in view of Kale and Panchaksharaiah teaches claim 1.
While El Dokor in view of Kale and Panchaksharaiah provides the use of vectors, El Dokor in view of Kale and Panchaksharaiah does not specifically teach that the first data and second data represent vectors, and thus does not teach
the first data represents one or more first vectors corresponding to at least the intent and the one or more words that describe the POI and the second data represents one or more second vectors corresponding to at least the contextual information associated with the POI.
Zhao, however, teaches the first data represents one or more first vectors corresponding to at least the intent and the one or more words that describe the POI and the second data represents one or more second vectors corresponding to at least the contextual information associated with the POI (the system extracts a query vector, generates textual-context vectors, i.e. first data represents one or more first vectors corresponding to at least the intent and the one or more words, as well as generates candidate-response vectors representing candidate responses to the question, i.e. the second data represents one or more second vectors corresponding to at least the contextual information [0023-5]).
Where Kale teaches that the information is related to the intent, parameters, and results (1:60-2:7),(10:12-27,50-65), and El Dokor teaches that all the information is related specifically to a POI (2:24-41).
El Dokor, Kale, Panchaksharaiah, and Zhao are analogous art because they are from a similar field of endeavor in providing responses to users based on multimodal input. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of El Dokor, as modified by Kale and Panchaksharaiah, regarding the use of vectors with the use of vectors to describe the information processed by the system, as taught by Zhao. It would have been obvious to combine the references to enable analyzing both audio and visual cues to provide answers to user questions with accuracy and multiple contextual modes for questions (Zhao [0021]).
Claim(s) 7 and 23 is/are rejected under 35 U.S.C. 103 as being unpatentable over El Dokor, in view of Kale, in view of Panchaksharaiah, and further in view of Powderly et al. (U.S. PG Pub No. 2018/0307303), hereinafter Powderly.
Regarding claims 7 and 23, El Dokor in view of Kale and Panchaksharaiah teaches claims 1 and 11.
While El Dokor in view of Kale and Panchaksharaiah provides identifying a target object using a gaze direction, El Dokor in view of Kale and Panchaksharaiah does not specifically teach the identification of a threshold period of time related to the gaze direction, and thus does not teach
determining that the user includes the gaze direction for at threshold period of time,
wherein the determining the contextual information from the at least the map is further based at least on the user including the gaze direction for the threshold period of time.
Powderly, however, teaches determining that the user includes the gaze direction for at threshold period of time (the user’s eyes can be tracked to determine if the user’s gaze at an object is for longer than a threshold time, where cone casting techniques are used to determine objects along the direction of the user’s eye pose [0125-6]),
wherein the determining the contextual information from the at least the map is further based at least on the user including the gaze direction for the threshold period of time (the user’s eyes can be tracked to determine if the user’s gaze at an object is for longer than a threshold time, at which point the object is selected as the user input or target object with which the user would like to interact, such as a POI, i.e. determining…information…is further based at least on the user including the gaze direction for the threshold period of time [0125-6],[0148],[0200-2]).
Where El Dokor teaches that the POI has associated information stored within the micromap and received once the POI is identified (8:59-9:60).
El Dokor, Kale, Panchaksharaiah, and Powderly are analogous art because they are from a similar field of endeavor in using multi-modal user input to interact with objects in an environment. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of El Dokor, as modified by Kale and Panchaksharaiah, of identifying a target object using a gaze direction with the use of a gaze lasting longer than a threshold time to determine a target object, as taught by Powderly. It would have been obvious to combine the references to improve interactions with objects in 3D space while reducing user fatigue (Powderly [0147]).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICOLE A K SCHMIEDER whose telephone number is (571)270-1474. The examiner can normally be reached 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NICOLE A K SCHMIEDER/Primary Examiner, Art Unit 2659