Prosecution Insights
Last updated: April 19, 2026
Application No. 17/886,747

GAZE AND AWARENESS PREDICTION USING A NEURAL NETWORK MODEL

Final Rejection — §102, §103

Filed: Aug 12, 2022
Examiner: LINHARDT, LAURA E
Art Unit: 3663
Tech Center: 3600 — Transportation & Electronic Commerce
Assignee: Waymo LLC
OA Round: 4 (Final)
Grant Probability: 70% (Favorable)
Expected OA Rounds: 5-6
Time to Grant: 3y 1m
Grant Probability with Interview: 92%

Examiner Intelligence

Career Allow Rate: 70% — above average (155 granted / 223 resolved; +17.5% vs TC avg)
Interview Lift: +22.7% — strong (allow rate among resolved cases with an interview vs. without)
Typical Timeline: 3y 1m average prosecution; 51 applications currently pending
Career History: 274 total applications across all art units
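
These headline figures follow from simple arithmetic on the counts shown above. The sketch below is a plausible reconstruction of how the tool appears to derive them (allow rate = grants / resolved; with-interview probability = base allow rate + interview lift); the variable names, the implied Tech Center average, and the rounding convention are illustrative assumptions, not the vendor's actual formula.

```python
# Plausible reconstruction of the examiner metrics shown above (assumptions noted in comments).
granted = 155            # "155 granted / 223 resolved"
resolved = 223
interview_lift = 0.227   # "+22.7% interview lift" among resolved cases with an interview
tc_avg_allow = 0.525     # implied by the rounded 70% allow rate minus the +17.5% delta

allow_rate = granted / resolved               # 0.695 -> displayed as 70%
delta_vs_tc = allow_rate - tc_avg_allow       # ~ +17 points (page shows +17.5% after rounding)
with_interview = allow_rate + interview_lift  # 0.922 -> displayed as 92%

print(f"Career allow rate:          {allow_rate:.1%}")
print(f"Delta vs TC average:        {delta_vs_tc:+.1%}")
print(f"Grant prob. with interview: {with_interview:.1%}")
```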

Statute-Specific Performance

§101: 5.4% (-34.6% vs TC avg)
§103: 72.8% (+32.8% vs TC avg)
§102: 5.4% (-34.6% vs TC avg)
§112: 14.4% (-25.6% vs TC avg)
Tech Center averages are estimates • Based on career data from 223 resolved cases

Office Action

Final Rejection — §102, §103
DETAILED ACTION Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Status of Claims Claims 1-20 are pending in this application. Claim 10 is amended. Claims 1-20 are presented for examination. Response to Amendments Claim Rejections - 35 USC § 102 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention. The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action. Claims 1, 10-11, 13-14, 16 and 20 are rejected under 35 U.S.C. 102(a)(2) as being unpatentable over Austin et al. (US Publication 2021/0394793 A1). Regarding claim 1, Austin teaches a method performed by one or more computers, the method comprising: obtaining sensor data (i) that is captured by multiple sensors of an autonomous vehicle (Austin: Para. 38, 53, 63; the autonomous vehicle continuously takes at least one of camera, LiDAR and radar images of the surrounding environment to monitor for road users; images may be timestamped by an image processor and analyzed for changes in motion, head pose and body posture in order to determine a gaze direction of the road user; computer vision techniques may be applied to the image data to identify road users, such as pedestrians, bicyclists and non-autonomous vehicles) and (ii) that characterizes an agent that is in a vicinity of the autonomous vehicle in an environment at a current time point, wherein the sensor data comprises at least an image patch from a camera image captured by a camera sensor and at least a portion of a point cloud captured by a laser sensor (Austin: Para. 
38, 53, 63; the autonomous vehicle continuously takes at least one of camera, LiDAR and radar images of the surrounding environment to monitor for road users; images may be timestamped by an image processor and analyzed for changes in motion, head pose and body posture in order to determine a gaze direction of the road user; computer vision techniques may be applied to the image data to identify road users, such as pedestrians, bicyclists and non-autonomous vehicles); and processing the sensor data comprising at least the image patch from the camera image captured by the camera and at least the portion of the point cloud captured by the laser sensor using a gaze prediction neural network to generate a gaze prediction that predicts a gaze of the agent at the current time point (Austin: Para. 42, 66; determine the position, body posture and head pose of the road user in order to determine the gaze direction of the road user; determine the gaze direction, the computer system may use the cameras images, LiDAR data; gaze recognition examples through machine learning based techniques for example, convolutional neural networks), wherein the gaze prediction neural network comprises: an embedding subnetwork that is configured to process at least the image patch from the camera image captured by the camera sensor and at least the portion of the point cloud captured by the laser sensor to generate an embedding characterizing the agent (Austin: Para. 43, 66; use data gathered by camera image analysis, LiDAR 3D point cloud analysis and radar and/or millimeter wave radar images to determine the gaze direction of the road user; LiDAR and camera recognition processes can be performed based on trained and/or predefined libraries of data, with known and recognizable shapes and edges of obstacles; machine learning based techniques for example, convolutional neural networks); and a gaze subnetwork that is configured to process the embedding generated by processing at least the image patch from the camera image captured by the camera sensor and at least the portion of the point cloud captured by the laser sensor to generate the gaze prediction (Austin: Para. 38, 43, 63, 66; computer of the vehicle is configured to use data gathered by camera image analysis, LiDAR 3D point cloud analysis and radar and/or millimeter wave radar images to determine the gaze direction of the road user; gaze recognition examples through machine learning based techniques for example, convolutional neural networks). Regarding claim 10, Austin teaches the method of claim 1, wherein the embedding subnetwork is configured to: process at least the image patch from the camera image captured by the camera sensor to generate a respective initial camera embedding (Austin: Para. 43; computer of the vehicle is configured to use data gathered by camera image analysis, LiDAR 3D point cloud analysis and radar and/or millimeter wave radar images to determine the gaze direction of the road user; Both LiDAR and camera recognition processes can be performed based on trained and/or predefined libraries of data; a camera, computer vision techniques are used to detect road users); process at least the portion of the point cloud captured by the laser sensor to generate a respective initial point cloud embedding (Austin: Para. 
42; LiDAR enables the vehicle to differentiate between a person on a bicycle or a person walking; 3D representation may be used to determine the position, body posture and head pose of the road user in order to determine the gaze direction of the road user); and combine the respective initial embeddings to generate the embedding characterizing the agent (Austin: Para. 43; computer of the vehicle is configured to use data gathered by camera image analysis, LiDAR 3D point cloud analysis and radar and/or millimeter wave radar images to determine the gaze direction of the road user; Both LiDAR and camera recognition processes can be performed based on trained and/or predefined libraries of data), wherein the combining the respective initial embeddings comprises summing, averaging, or concatenating the respective initial embeddings (Austin: Para. 97; correlation module is configured to concatenate the views from each autonomous vehicle, correlate these views with the trajectories of each autonomous vehicle and each road user). Regarding claim 11, Austin teaches the method of claim 10, wherein the sensor data comprises an image patch depicting the agent generated from an image of the environment captured by the camera sensor and a portion of a point cloud generated by the laser sensor (Austin: Para. 43; computer of the vehicle is configured to use data gathered by camera image analysis, LiDAR 3D point cloud analysis and radar and/or millimeter wave radar images to determine the gaze direction of the road user; Both LiDAR and camera recognition processes can be performed based on trained and/or predefined libraries of data). Regarding claim 13, Austin teaches the method of claim 1, wherein the gaze prediction neural network has been trained on one or more auxiliary tasks (Austin: Para. 66; gaze recognition examples through machine learning based techniques, for example, convolutional neural networks; once a person's eyes are located, a sufficiently powerful camera may track the center of the pupil to detect gaze direction). Regarding claim 14, Austin teaches the method of claim 13, wherein the one or more auxiliary tasks include a heading prediction task (Austin: Para. 82; an environment mapping module incorporates a global model and then focuses on determining the gaze direction, wherein the trajectory module predicts the future path of the road user). Regarding claim 16, Austin teaches a system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: obtaining sensor data (i) that is captured by multiple sensors of an autonomous vehicle (Austin: Para. 43; computer of the vehicle is configured to use data gathered by camera image analysis, LiDAR 3D point cloud analysis) and (ii) that characterizes an agent that is in a vicinity of the autonomous vehicle in an environment at a current time point, wherein the sensor data comprises at least an image patch from a camera image captured by a camera and a laser sensor (Austin: Para. 
38, 53, 63; the autonomous vehicle continuously takes at least one of camera, LiDAR and radar images of the surrounding environment to monitor for road users; images may be timestamped by an image processor and analyzed for changes in motion, head pose and body posture in order to determine a gaze direction of the road user; computer vision techniques may be applied to the image data to identify road users, such as pedestrians, bicyclists and non-autonomous vehicles); and processing the sensor data comprising at least the image patch from the camera image captured by the camera sensor and at least the portion of the point cloud captured by the laser sensor using a gaze prediction neural network to generate a gaze prediction that predicts a gaze of the agent at the current time point (Austin: Para. 42, 66; determine the position, body posture and head pose of the road user in order to determine the gaze direction of the road user; determine the gaze direction, the computer system may use the cameras images, LiDAR data; gaze recognition examples through machine learning based techniques for example, convolutional neural networks), wherein the gaze prediction neural network comprises: an embedding subnetwork that is configured to process at least the image patch from the camera image captured by the camera sensor and at least the portion of the point cloud captured by the laser sensor to generate an embedding characterizing the agent (Austin: Para. 43, 66; use data gathered by camera image analysis, LiDAR 3D point cloud analysis and radar and/or millimeter wave radar images to determine the gaze direction of the road user; LiDAR and camera recognition processes can be performed based on trained and/or predefined libraries of data, with known and recognizable shapes and edges of obstacles; machine learning based techniques for example, convolutional neural networks); and a gaze subnetwork that is configured to process the embedding generated by processing at least the image patch from the camera image captured by the camera sensor and at least the portion of the point cloud and captured by the laser sensor to generate the gaze prediction (Austin: Para. 38, 43, 63, 66; computer of the vehicle is configured to use data gathered by camera image analysis, LiDAR 3D point cloud analysis and radar and/or millimeter wave radar images to determine the gaze direction of the road user; gaze recognition examples through machine learning based techniques for example, convolutional neural networks). Regarding claim 20, Austin teaches one or more non-transitory computer storage media encoded with computer program instructions that when executed by a plurality of computers cause the plurality of computers to perform operations comprising: obtaining sensor data (i) that is captured by multiple sensors of an autonomous vehicle (Austin: Para. 43; computer of the vehicle is configured to use data gathered by camera image analysis, LiDAR 3D point cloud analysis) and (ii) that characterizes an agent that is in a vicinity of the autonomous vehicle in an environment at a current time point, wherein the sensor data comprises at least an image patch from a camera image captured by a camera sensor and at least a portion of a point cloud captured by a laser sensor (Austin: Para. 
38, 53, 63; the autonomous vehicle continuously takes at least one of camera, LiDAR and radar images of the surrounding environment to monitor for road users; images may be timestamped by an image processor and analyzed for changes in motion, head pose and body posture in order to determine a gaze direction of the road user; computer vision techniques may be applied to the image data to identify road users, such as pedestrians, bicyclists and non-autonomous vehicles); and processing the sensor data comprising at least the image patch from the camera image captured by the camera sensor and at least the portion of the point cloud captured by the laser sensor using a gaze prediction neural network to generate a gaze prediction that predicts a gaze of the agent at the current time point (Austin: Para. 42, 66; determine the position, body posture and head pose of the road user in order to determine the gaze direction of the road user; determine the gaze direction, the computer system may use the cameras images, LiDAR data; gaze recognition examples through machine learning based techniques for example, convolutional neural networks), wherein the gaze prediction neural network comprises: an embedding subnetwork that is configured to process at least the image patch from the camera image captured by the camera sensor and at least the portion of the point cloud captured by the laser sensor to generate an embedding characterizing the agent (Austin: Para. 43, 66; use data gathered by camera image analysis, LiDAR 3D point cloud analysis and radar and/or millimeter wave radar images to determine the gaze direction of the road user; LiDAR and camera recognition processes can be performed based on trained and/or predefined libraries of data, with known and recognizable shapes and edges of obstacles; machine learning based techniques for example, convolutional neural networks); and a gaze subnetwork that is configured to process the embedding generated by processing at least the image patch from the camera image captured by the camera sensor and at least the portion of the point cloud captured by the laser sensor to generate the gaze prediction (Austin: Para. 38, 43, 63, 66; computer of the vehicle is configured to use data gathered by camera image analysis, LiDAR 3D point cloud analysis and radar and/or millimeter wave radar images to determine the gaze direction of the road user; gaze recognition examples through machine learning based techniques for example, convolutional neural networks). Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. 
Patentability shall not be negated by the manner in which the invention was made. The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention. Claims 2-9, 12, 15 and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Austin et al. (US Publication 2021/0394793 A1) in view of Benou et al. (US Publication 2022/0332349 A1). Regarding claim 2, Austin teaches the method of claim 1, further comprising: determining, from the gaze prediction, an awareness signal that indicates whether the agent is aware of a presence of one or more entities in the environment (Austin: Para. 48; a controller of a vehicle computing system in order to provide a road user the eHMI notification in the field of view indicated by his/her gaze direction). Austin doesn’t explicitly teach using the awareness signal to determine a future trajectory of the autonomous vehicle after the current time point. However Benou, in the same field of endeavor, teaches using the awareness signal to determine a future trajectory of the autonomous vehicle after the current time point (Benou: Para. 434; cause the host vehicle to perform at least one of: maintaining a current speed of the host vehicle or maintaining a current heading direction of the host vehicle). It would have been obvious to one having ordinary skill in the art to modify the road user’s gaze direction determination through camera image analysis and LIDAR point cloud analysis (Austin: Para. 43) by adding the future autonomous vehicle action determination based on the direction of the pedestrian’s gaze (Benou: Para. 434) with a reasonable expectation of success because when the pedestrian is looking away from the host vehicle, the host vehicle may navigate by using a greater margin of safety than it otherwise might use if the pedestrian was looking at or in the direction of the host vehicle as taught by Benou (Benou: Para. 397). Regarding claim 3, Austin teaches the method of claim 2, wherein the awareness signal indicates whether the agent is aware of a presence of the autonomous vehicle (Austin: Para. 48; a controller of a vehicle computing system in order to provide a road user the eHMI notification in the field of view indicated by his/her gaze direction). Regarding claim 4, Austin teaches the method of claim 2, wherein the awareness signal indicates whether the agent is aware of a presence of one or more other agents in the environment (Austin: Para. 
48; a controller of a vehicle computing system in order to provide a road user the eHMI notification in the field of view indicated by his/her gaze direction). Regarding claim 5, Austin doesn’t explicitly teach providing an input comprising the awareness signal to a machine learning model that is used by a planning system of the autonomous vehicle to plan the future trajectory of the autonomous vehicle. However Benou, in the same field of endeavor, teaches providing an input comprising the awareness signal to a machine learning model that is used by a planning system of the autonomous vehicle to plan the future trajectory of the autonomous vehicle (Benou: Para. 434; the recognized gesture may be indicative of the pedestrian turning to look toward the host vehicle; cause the host vehicle to perform at least one of: maintaining a current speed of the host vehicle or maintaining a current heading direction of the host vehicle). It would have been obvious to one having ordinary skill in the art to modify the road user’s gaze direction determination through camera image analysis and LIDAR point cloud analysis (Austin: Para. 43) by adding the future autonomous vehicle action determination based on the direction of the pedestrian’s gaze (Benou: Para. 434) with a reasonable expectation of success because when the pedestrian is looking away from the host vehicle, the host vehicle may navigate by using a greater margin of safety than it otherwise might use if the pedestrian was looking at or in the direction of the host vehicle as taught by Benou (Benou: Para. 397). Regarding claim 6, Austin doesn’t explicitly teach wherein the gaze prediction comprises a predicted gaze direction in a horizontal plane and a predicted gaze direction in a vertical axis. However Benou, in the same field of endeavor, teaches wherein the gaze prediction comprises a predicted gaze direction in a horizontal plane and a predicted gaze direction in a vertical axis (Benou: Para. 397; looking direction of the pedestrian may be estimated based on the rotational angle and pitch angle; head pose may be represented by a rotational angle (also called yaw angle) in a horizontal plane; a pitch angle in a vertical plane). It would have been obvious to one having ordinary skill in the art to modify the road user’s gaze direction determination through camera image analysis and LIDAR point cloud analysis (Austin: Para. 43) by representing the head pose by the horizontal and the vertical planes (Benou: Para. 397) with a reasonable expectation of success because when the pedestrian is looking away from the host vehicle, the host vehicle may navigate by using a greater margin of safety than it otherwise might use if the pedestrian was looking at or in the direction of the host vehicle as taught by Benou (Benou: Para. 397). Regarding claim 7, Austin doesn’t explicitly teach wherein determining, from the gaze prediction, the awareness signal of the presence of the one or more entities in the environment comprises: determining that the predicted gaze direction in the vertical axis is horizontal; determining that the one or more entities is within a predetermined range centered at the predicted gaze direction in the horizontal plane. However Benou, in the same field of endeavor, teaches wherein determining, from the gaze prediction, the awareness signal of the presence of the one or more entities in the environment comprises: determining that the predicted gaze direction in the vertical axis is horizontal (Benou: Para. 
397; head pose may be represented by a rotational angle (also called yaw angle) in a horizontal plane parallel to the ground surface); determining that the predicted gaze direction in the vertical axis is horizontal; determining that the one or more entities is within a predetermined range centered at the predicted gaze direction in the horizontal plane (Benou: Para. 397; a pitch angle in a vertical plane that extends from the pedestrian's nose and the back of the head). It would have been obvious to one having ordinary skill in the art to modify the road user’s gaze direction determination through camera image analysis and LIDAR point cloud analysis (Austin: Para. 43) by representing the head pose by the horizontal and the vertical planes (Benou: Para. 397) with a reasonable expectation of success because when the pedestrian is looking away from the host vehicle, the host vehicle may navigate by using a greater margin of safety than it otherwise might use if the pedestrian was looking at or in the direction of the host vehicle as taught by Benou (Benou: Para. 397). In the following limitation, Austin teaches in response, determining that the agent is aware of the presence of the one or more entities in the environment (Austin: Para. 48; a controller of a vehicle computing system in order to provide a road user the eHMI notification in the field of view indicated by his/her gaze direction). Regarding claim 8, Austin teaches the method of claim 2, wherein the awareness signal comprises one or more of an active awareness signal and a historical awareness signal (Austin: Para. 73; environment mapping module may access a database of stored sets of images associated with poses, body posture, walking speeds, and the like, and may match each stitched image to a stored image to determine the gaze direction; predict the trajectory of the road user from the gaze direction), wherein the active awareness signal indicates whether the agent is aware of the presence of the one or more entities in the environment at the current time point (Austin: Para. 48; a controller of a vehicle computing system in order to provide a road user the eHMI notification in the field of view indicated by his/her gaze direction) wherein the historical awareness signal (i) is determined from one or more gaze predictions at one or more previous time points in a previous time window that precedes the current time point (Austin: Para. 73; environment mapping module may access a database of stored sets of images associated with poses, body posture, walking speeds, and the like, and may match each stitched image to a stored image to determine the gaze direction; predict the trajectory of the road user from the gaze direction) and (ii) indicates whether the agent is aware of the presence of the one or more entities in the environment during the previous time window (Austin: Para. 73; environment mapping module may access a database of stored sets of images associated with poses, body posture, walking speeds, and the like, and may match each stitched image to a stored image to determine the gaze direction; predict the trajectory of the road user from the gaze direction). Regarding claim 9, Austin doesn’t explicitly teach using both the gaze prediction and the awareness signal to determine a future trajectory of the autonomous vehicle after the current time point. 
However Benou, in the same field of endeavor, teaches using both the gaze prediction and the awareness signal to determine a future trajectory of the autonomous vehicle after the current time point (Benou: Para. 434; the recognized gesture may be indicative of the pedestrian turning to look toward the host vehicle; cause the host vehicle to perform at least one of: maintaining a current speed of the host vehicle or maintaining a current heading direction of the host vehicle). It would have been obvious to one having ordinary skill in the art to modify the road user’s gaze direction determination through camera image analysis and LIDAR point cloud analysis (Austin: Para. 43) by adding the future autonomous vehicle action determination based on the direction of the pedestrian’s gaze (Benou: Para. 434) with a reasonable expectation of success because when the pedestrian is looking away from the host vehicle, the host vehicle may navigate by using a greater margin of safety than it otherwise might use if the pedestrian was looking at or in the direction of the host vehicle as taught by Benou (Benou: Para. 397). Regarding claim 12, Austin teaches the method of claim 10, wherein the gaze prediction neural network has been trained on one or more auxiliary tasks (Austin: Para. 66; gaze recognition examples through machine learning based techniques, for example, convolutional neural networks). Austin doesn’t explicitly teach wherein the one or more auxiliary tasks include one or more auxiliary tasks that measure respective initial gaze predictions made directly from each of the initial embeddings. However Benou, in the same field of endeavor, teaches wherein the one or more auxiliary tasks include one or more auxiliary tasks that measure respective initial gaze predictions made directly from each of the initial embeddings (Benou: Para. 397; navigating a host vehicle based on a pedestrian's head pose (i.e., head orientation) detected from image data; when the looking direction of the pedestrian is determined to be away from the host vehicle, the host vehicle may navigate by using a greater margin of safety than it otherwise might use if the pedestrian was looking at or in the direction of the host vehicle). It would have been obvious to one having ordinary skill in the art to modify the road user’s gaze direction determination through camera image analysis and LIDAR point cloud analysis (Austin: Para. 43) by representing the head pose by the horizontal and the vertical planes (Benou: Para. 397) with a reasonable expectation of success because when the pedestrian is looking away from the host vehicle, the host vehicle may navigate by using a greater margin of safety than it otherwise might use if the pedestrian was looking at or in the direction of the host vehicle as taught by Benou (Benou: Para. 397). Regarding claim 15, Austin teaches the method of claim 1, wherein the gaze prediction neural network comprises a regression output layer and a classification output layer (Austin: Para. 43, 66; gaze recognition examples through machine learning based techniques, for example, convolutional neural networks; LiDAR and camera recognition processes can be performed based on trained and/or predefined libraries of data, with known and recognizable shapes and edges of obstacles (e.g. vehicles, cyclists, etc.)). 
Austin doesn’t explicitly teach wherein the regression output layer is configured to generate a predicted gaze direction in a horizontal plane and the classification output layer is configured to generate a predicted gaze direction in a vertical axis. However Benou, in the same field of endeavor, teaches wherein the regression output layer is configured to generate a predicted gaze direction in a horizontal plane (Benou: Para. 397; head pose may be represented by a rotational angle (also called yaw angle) in a horizontal plane parallel to the ground surface) and the classification output layer is configured to generate a predicted gaze direction in a vertical axis (Benou: Para. 397; a pitch angle in a vertical plane that extends from the pedestrian's nose and the back of the head). It would have been obvious to one having ordinary skill in the art to modify the road user’s gaze direction determination through camera image analysis and LIDAR point cloud analysis (Austin: Para. 43) by representing the head pose by the horizontal and the vertical planes (Benou: Para. 397) with a reasonable expectation of success because when the pedestrian is looking away from the host vehicle, the host vehicle may navigate by using a greater margin of safety than it otherwise might use if the pedestrian was looking at or in the direction of the host vehicle as taught by Benou (Benou: Para. 397). Regarding claim 17, Austin teaches the system of claim 16, the operations further comprise: determining, from the gaze prediction, an awareness signal that indicates whether the agent is aware of a presence of one or more entities in the environment (Austin: Para. 48; a controller of a vehicle computing system in order to provide a road user the eHMI notification in the field of view indicated by his/her gaze direction). Austin doesn’t explicitly teach using the awareness signal to determine a future trajectory of the autonomous vehicle after the current time point. However Benou, in the same field of endeavor, teaches using the awareness signal to determine a future trajectory of the autonomous vehicle after the current time point (Benou: Para. 434; cause the host vehicle to perform at least one of: maintaining a current speed of the host vehicle or maintaining a current heading direction of the host vehicle). It would have been obvious to one having ordinary skill in the art to modify the road user’s gaze direction determination through camera image analysis and LIDAR point cloud analysis (Austin: Para. 43) by adding the future autonomous vehicle action determination based on the direction of the pedestrian’s gaze (Benou: Para. 434) with a reasonable expectation of success because when the pedestrian is looking away from the host vehicle, the host vehicle may navigate by using a greater margin of safety than it otherwise might use if the pedestrian was looking at or in the direction of the host vehicle as taught by Benou (Benou: Para. 397). Regarding claim 18, Austin teaches the system of claim 17, wherein the awareness signal indicates whether the agent is aware of a presence of the autonomous vehicle (Austin: Para. 48; a controller of a vehicle computing system in order to provide a road user the eHMI notification in the field of view indicated by his/her gaze direction). Regarding claim 19, Austin teaches the system of claim 17, wherein the awareness signal indicates whether the agent is aware of a presence of one or more other agents in the environment (Austin: Para. 
48; a controller of a vehicle computing system in order to provide a road user the eHMI notification in the field of view indicated by his/her gaze direction). Response to Arguments Applicant’s arguments, filed 8 September 2025, with respect to the rejection of claims 1, 10-11, 13-14, 16, and 20 under 35 U.S.C. 102(a) have been fully considered, but they are not persuasive. The applicant’s attorney argues that Austin fails to describe generating an embedding characterizing an agent from at least both an “image patch from the camera image captured by the camera sensor” and a “portion of the point cloud captured by the laser sensor,” much less generating a gaze prediction from this embedding. In response to the applicant’s argument above, there are two ways to look at embedding in the claimed invention. There is embedded machine learning that is deployed at a local location with limited computing resources, or the mathematical embedding process where a group or subgroup is defined and traced between two different instances. Austin teaches a computing system that locates a face and matches images from a plurality of sources to determine the gaze through machine learning based techniques. This is with the incorporation of US 2018/0276986 A1 (Austin: Para. 66). Delp US 2018/0276986 A1 teaches a processor on a vehicle control unit (Delp: Para. 80) with a memory storing operational instructions embedded in the system (Delp: Para. 85) for detecting a gaze direction (Delp: Para. 102). This would be an example of memory and instructions on a vehicle using only the stored machine learning to detect a person and determine the gaze direction. Austin teaches locating a face and determining the gaze through machine learning based techniques (Austin: Para. 66). Austin teaches an autonomous vehicle identifies a road user by an image recorded by a vehicle camera (Austin: Para. 38). The vehicle takes the camera image and uses computer vision and object recognition to identify road users, such as a pedestrian (Austin: Para. 63). The applicant’s specification includes “the embedding subnetwork can be configured to generate a first initial embedding, e.g., the camera embedding 212, characterizing the pedestrian from an image patch 207 depicting the pedestrian” (Para. 63). The specification includes “The image patch 207 can depict a torso portion of the pedestrian, e.g., the upper 50% of the pedestrian detected in the camera image 208” (Para. 55). The image patch is just the relevant part of the camera image. It is well known in the art of computer vision to identify a pedestrian from a camera image taken by an autonomous vehicle when the pedestrian is only in a portion of the image. Austin teaches a computer system, using the machine learning based technique of convolutional neural networks, performing gaze detection by matching gaze patterns to the detected facial area, locating the person’s eyes, and center of pupil, to determine the gaze direction (Austin: Para. 66). Austin does use a locally stored machine learning process to determine the gaze as cited from Austin with help from incorporated Delp. Austin also uses embedding as a subgroup of data, the person’s face, that is traced through other data to determine the gaze direction. Therefore, without using the word embedded, Austin does use an embedding subnetwork to generate an embedding characterizing the agent. Austin teaches both reasonable definitions of embedding. 
Austin defines the field of view for a person as 114 degrees horizontally centered about the eye position (Austin: Para. 77). The applicant’s specification includes “the gaze prediction can be in 2.5D, and if the vertical gaze direction of the agent is horizontal and the entity is within a predetermined range centered at the predicted gaze direction in the horizontal plane at the current time point, e.g., within 120 degrees vision span centered at the gaze direction, the system can determine that the agent is aware of the entity in the environment at the current time point.” (Para. 42). Based on the applicant’s specification, Austin does teach “at least the image patch from the camera image captured by the camera sensor … using a gaze prediction neural network to generate a gaze prediction that predicts a gaze of the agent at the current time point”. Austin teaches a LIDAR system, which sends thousands of laser pulses every second, to create a 3D point cloud that detects pedestrian candidate regions via a bounded point cloud that are used to determine the position, body posture, and head pose of the road user in order to determine the gaze direction of the road user (Austin: Para. 41-42). Therefore Austin teaches using a camera and LIDAR data in order to determine the gaze direction of the road user. The applicant next argues that Austin does not describe generating any embedding from any portion of a point cloud captured by a laser sensor. In response to the applicant’s argument above, Austin teaches a LIDAR system, which sends thousands of laser pulses every second, to create a 3D point cloud that detects pedestrian candidate regions via a bounded point cloud that are used to determine the position, body posture, and head pose of the road user in order to determine the gaze direction of the road user (Austin: Para. 41-42). Austin uses embedded machine learning techniques to determine the head pose from the point cloud in the hardware embedding case. Austin teaches a bounded point cloud that defines the subregion of the head pose of the road user in order to determine the gaze direction of the road user (Austin: Para. 41-42) as the tracked subgroup embedding. The applicant next argues that Austin also does not describe generating any embedding from any image patch of an image captured by a camera sensor. In response to the applicant’s argument above, the arguments on embedding are responded to above by the examiner. Austin teaches an autonomous vehicle identifies a road user by an image recorded by a vehicle camera (Austin: Para. 38). The vehicle takes the camera image and uses computer vision and object recognition to identify road users, such as a pedestrian (Austin: Para. 63). The image patch is just the relevant part of the camera image. It is well known in the art of computer vision to identify a pedestrian from a camera image taken by an autonomous vehicle when the pedestrian is only in a portion of the image. Austin teaches a computer system, using the machine learning based technique of convolutional neural networks, performing gaze detection by matching gaze patterns to the detected facial area, locating the person’s eyes, and center of pupil, to determine the gaze direction (Austin: Para. 66). The applicant next argues that Austin fails to explain generating any embeddings, much less an embedding from “at least the portion of the point cloud captured by the laser sensor.” In response to the applicant’s argument above, the arguments on embedding are responded to above by the examiner. 
Austin teaches a LIDAR system, which sends thousands of laser pulses every second, to create a 3D point cloud that detects pedestrian candidate regions via a bounded point cloud that are used to determine the position, body posture, and head pose of the road user in order to determine the gaze direction of the road user (Austin: Para. 41-42). Austin does teach “generating an embedding of any point cloud captured by a laser sensor”. The applicant next argues Austin fails to disclose how such determinations are used to generate gaze predictions. In response to the applicant’s argument above, Austin teaches a LIDAR system that detects pedestrian candidate regions via a bounded point cloud that are used to determine the position, body posture, and head pose of the road user in order to determine the gaze direction of the road user (Austin: Para. 41-42). Austin teaches a computer system, using the machine learning based technique of convolutional neural networks, performing gaze detection by matching gaze patterns to the detected facial area, locating the person’s eyes, and center of pupil, to determine the gaze direction (Austin: Para. 66). Austin defines the field of view for a person as 114 degrees horizontally centered about the eye position (Austin: Para. 77). The applicant’s specification includes “the gaze prediction can be in 2.5D, and if the vertical gaze direction of the agent is horizontal and the entity is within a predetermined range centered at the predicted gaze direction in the horizontal plane at the current time point, e.g., within 120 degrees vision span centered at the gaze direction, the system can determine that the agent is aware of the entity in the environment at the current time point” (Para. 42). Austin takes the facial area of a camera image to predict a pedestrian’s gaze by the location of the pedestrian’s eyes through machine learning neural networks. The applicant next argues that Austin does not describe generating any embeddings, much less combining two different embeddings generated from two different types of sensor data. In response to the applicant’s argument above, combining two different embeddings generated from two different types of sensor data would be supported by the mathematical definition of embedding, where a subgroup is identified, correlated with the same subgroup from a different source, and tracked. Austin teaches a LIDAR system detecting the head pose of the road user in order to determine the gaze direction of the road user (Austin: Para. 41-42). Austin teaches an autonomous vehicle identifies a road user by an image recorded by a vehicle camera (Austin: Para. 38). The vehicle takes the camera image and uses computer vision and object recognition to identify road users, such as a pedestrian (Austin: Para. 63). Austin teaches locating a face and determining the gaze through machine learning based techniques (Austin: Para. 66) using the known data. Austin collects and identifies the face region of a person by both LIDAR and camera. The applicant next argues that Austin does not disclose combining “the respective initial embeddings to generate the embedding characterizing the agent, wherein the combining the respective initial embeddings comprises summing, averaging, or concatenating the respective initial embeddings.” In response to the applicant’s argument above, Austin teaches a LIDAR system detecting the head pose of the road user in order to determine the gaze direction of the road user (Austin: Para. 41-42). 
Austin also teaches identifying vehicle camera images to locate the face and determine the gaze (Austin: Para. 38, 63). Austin teaches correlating the eye position and body posture to determine the gaze direction of the road user (Austin: Para. 94). Austin teaches locating a face and determining the gaze through machine learning based techniques (Austin: Para. 66) using the known data. Austin collects and identifies the face region of a person by both LIDAR and camera and correlates the face region through concatenating the images. The applicant next argues that Austin does not use the concatenation described in Austin to predict gaze. In response to the applicant’s argument above, Webster’s dictionary defines concatenate as to link together in a series or chain. The same dictionary defines concatenation, the noun form of the verb concatenate, as a group of things linked together in a series. Austin teaches a correlation module that is configured to concatenate the views from a plurality of sources and determine the road user needing a notification based on the gaze direction of the road user (Austin: Para. 97). The correlation module configured to concatenate performs the concatenation of the images. Austin teaches correlating the eye position and body posture to determine the gaze direction of the road user (Austin: Para. 94). Therefore Austin does teach the concatenation of images to determine the gaze direction of the road user. The applicant’s arguments have failed to point out the distinguishing characteristics of the amended claim language over the prior art. For the above reasons, Austin’s gaze prediction reads on applicant’s gaze and awareness prediction. The rejection is maintained. Conclusion Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to LAURA E LINHARDT whose telephone number is (571)272-8325. The examiner can normally be reached on M-TR, M-F: 8am-4pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Angela Ortiz, can be reached on (571) 272-1206. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. 
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /L.E.L./Examiner, Art Unit 3663 /ANGELA Y ORTIZ/Supervisory Patent Examiner, Art Unit 3663
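
For readers trying to follow the claim mapping above, the sketch below renders the disputed claim language as code: an embedding subnetwork over a camera image patch and a point-cloud portion, the claim 10 combination step, the claim 15 regression/classification heads, and the 2.5D awareness rule the examiner quotes from the specification (Para. 42). It is an illustration of the claim language only; every layer size, backbone choice, class index, and threshold is an assumption, not the applicant's or Austin's actual implementation.

```python
# Minimal sketch of the architecture recited in claims 1, 10, and 15, plus the
# specification's 2.5D awareness rule as quoted in the Office Action.  All
# hyperparameters, shapes, and conventions below are illustrative assumptions.
import math
import torch
import torch.nn as nn


class EmbeddingSubnetwork(nn.Module):
    """Embeds a camera image patch and a laser point-cloud portion (claims 1 and 10)."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Initial camera embedding from the image patch (claim 10); the backbone is assumed.
        self.camera_branch = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        # Initial point-cloud embedding from (x, y, z, intensity) points (claim 10); assumed.
        self.lidar_branch = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, embed_dim)
        )

    def forward(self, image_patch: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        cam = self.camera_branch(image_patch)               # (B, 3, H, W) -> (B, D)
        pts = self.lidar_branch(points).max(dim=1).values   # (B, N, 4) -> (B, D), pooled over points
        # Claim 10: combine the initial embeddings by summing, averaging, or concatenating.
        return torch.cat([cam, pts], dim=-1)                # (B, 2D)


class GazeSubnetwork(nn.Module):
    """Maps the agent embedding to a gaze prediction (claims 1 and 15)."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(2 * embed_dim, 64), nn.ReLU())
        self.horizontal_head = nn.Linear(64, 1)  # regression head: yaw in the horizontal plane
        self.vertical_head = nn.Linear(64, 3)    # classification head: e.g. up / horizontal / down

    def forward(self, embedding: torch.Tensor):
        h = self.trunk(embedding)
        return self.horizontal_head(h).squeeze(-1), self.vertical_head(h)


def awareness_signal(gaze_yaw_rad: torch.Tensor,
                     vertical_logits: torch.Tensor,
                     entity_bearing_rad: torch.Tensor,
                     span_deg: float = 120.0) -> torch.Tensor:
    """2.5D awareness rule quoted from the specification (Para. 42): the agent is deemed
    aware of an entity if the vertical gaze class is 'horizontal' and the entity lies
    within a vision span centered on the predicted horizontal gaze direction.
    The class index and angle conventions are assumptions.  Inputs are (B,) tensors
    of angles in radians, plus (B, 3) vertical-class logits."""
    vertical_is_horizontal = vertical_logits.argmax(dim=-1) == 1   # assumed 'horizontal' index
    # Smallest signed angular difference between the entity bearing and the gaze yaw.
    delta = torch.remainder(entity_bearing_rad - gaze_yaw_rad + math.pi, 2 * math.pi) - math.pi
    within_span = delta.abs() <= math.radians(span_deg / 2)
    return vertical_is_horizontal & within_span
```

In this reading, the boolean returned by awareness_signal would correspond to claim 2's awareness signal, which a planning system could take as an input when selecting the vehicle's future trajectory (claims 2, 5, and 9).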

Prosecution Timeline

Aug 12, 2022
Application Filed
Sep 24, 2024
Non-Final Rejection — §102, §103
Dec 05, 2024
Applicant Interview (Telephonic)
Dec 05, 2024
Examiner Interview Summary
Dec 17, 2024
Response Filed
Feb 03, 2025
Final Rejection — §102, §103
Apr 14, 2025
Request for Continued Examination
Apr 15, 2025
Response after Non-Final Action
Apr 29, 2025
Non-Final Rejection — §102, §103
Sep 04, 2025
Applicant Interview (Telephonic)
Sep 04, 2025
Examiner Interview Summary
Sep 15, 2025
Response Filed
Dec 17, 2025
Final Rejection — §102, §103 (current)
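
The current final rejection (mailed Dec 17, 2025 per the timeline) starts the reply clock described at the end of the Office Action: a three-month shortened statutory period, extendable with fees up to an absolute six-month cap. A quick sketch of that date arithmetic, assuming the Dec 17, 2025 mailing date and ignoring the advisory-action adjustment the examiner describes:

```python
# Reply-deadline arithmetic for the final rejection (3-month SSP, 6-month statutory cap),
# assuming a Dec 17, 2025 mailing date; the advisory-action wrinkle is not modeled here.
from datetime import date

def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping to the last day of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = [31, 29 if year % 4 == 0 and (year % 100 != 0 or year % 400 == 0) else 28,
                31, 30, 31, 30, 31, 31, 30, 31, 30, 31][month - 1]
    return date(year, month, min(d.day, last_day))

mailed = date(2025, 12, 17)
ssp_deadline = add_months(mailed, 3)   # 2026-03-17: reply due without extension fees
statutory_cap = add_months(mailed, 6)  # 2026-06-17: six-month maximum per the Office Action
print(f"Shortened statutory period ends:      {ssp_deadline}")
print(f"Statutory deadline (with extensions): {statutory_cap}")
```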

Precedent Cases

Applications granted by this examiner involving similar technology

Patent 12586463
DETERMINATION DEVICE, DETERMINATION METHOD, AND PROGRAM
2y 5m to grant • Granted Mar 24, 2026
Patent 12578197
Tandem Riding Detection on Personal Mobility Vehicles
2y 5m to grant • Granted Mar 17, 2026
Patent 12540822
WATER AREA OBJECT DETECTION SYSTEM AND MARINE VESSEL
2y 5m to grant • Granted Feb 03, 2026
Patent 12517275
SUBMARINE EXPLORATION SYSTEM COMPRISING A FLEET OF DRONES
2y 5m to grant • Granted Jan 06, 2026
Patent 12459564
ELECTRONIC STEERING APPARATUS OF VEHICLE AND CONTROL METHOD THEREOF
2y 5m to grant • Granted Nov 04, 2025
Study what changed to get past this examiner. Based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 70%
Grant Probability with Interview: 92% (+22.7%)
Median Time to Grant: 3y 1m
PTA Risk: High
Based on 223 resolved cases by this examiner. Grant probability derived from career allow rate.
