Prosecution Insights
Last updated: April 19, 2026
Application No. 18/054,321

SYSTEMS AND METHODS TO INCREASE ENVIRONMENT AWARENESS ASSOCIATED WITH REMOTE DRIVING APPLICATIONS

Status: Non-Final OA (§103)
Filed: Nov 10, 2022
Examiner: NGUYEN, STEVEN VU
Art Unit: 3668
Tech Center: 3600 (Transportation & Electronic Commerce)
Assignee: Vay Technology GmbH
OA Round: 3 (Non-Final)
Grant Probability: 78% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 9m
Grant Probability with Interview: 86%

Examiner Intelligence

Career Allow Rate: 78% (125 granted / 160 resolved), +26.1% vs Tech Center average (above average)
Interview Lift: +7.9% (moderate), measured across resolved cases with an interview
Average Prosecution: 2y 9m typical timeline; 25 applications currently pending
Total Applications: 185 across all art units

Statute-Specific Performance

§101: 14.3% (-25.7% vs TC avg)
§103: 44.6% (+4.6% vs TC avg)
§102: 17.3% (-22.7% vs TC avg)
§112: 18.9% (-21.1% vs TC avg)
Tech Center averages are estimates. Based on career data from 160 resolved cases.
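The derived figures above follow from simple arithmetic on the examiner's raw counts. A minimal sketch of that reconciliation, assuming the dashboard computes the metrics this way (the Tech Center baseline is inferred from the displayed delta, not reported directly):

```python
# Reconciling the dashboard's headline metrics with the raw career
# counts shown above. The formulas are assumptions inferred from the
# displayed values, not the vendor's actual implementation.

granted, resolved = 125, 160

allow_rate = granted / resolved        # 0.781 -> shown as "78%"
tc_avg = allow_rate - 0.261            # implied by "+26.1% vs TC avg"
with_interview = allow_rate + 0.079    # "+7.9% interview lift" -> "86%"

print(f"career allow rate:  {allow_rate:.1%}")      # 78.1%
print(f"implied TC average: {tc_avg:.1%}")          # 52.0%
print(f"with interview:     {with_interview:.1%}")  # 86.0%
```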

Office Action

§103
DETAILED ACTION

This Office action is in response to the RCE filed on 10/21/2025. Claims 1-11, 13-18, and 20 remain pending for examination.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 10/21/2025 has been entered.

Response to Arguments

Applicant's arguments with respect to the §103 rejection of claim 1 have been considered but are moot in view of the new ground of rejection necessitated by Applicant's amendment.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-11, 13-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chebiyyam et al. (Patent No. 11480961 B1; hereinafter Chebiyyam) in view of Kim, Young Eon (Publication No. US 20230074274 A1; hereinafter Kim).

Regarding claim 1, Chebiyyam teaches a method to increase environment awareness in a remote driving system, comprising:

receiving, by a processor at a teleoperator station via a communication network, video data from an imaging device associated with a vehicle and having a field of view, the vehicle being positioned within an environment remote from the teleoperator station ([Col. 22, lines 31-37]: "the camera sensors can generate visual data. Such visual data can be images, videos, or other visual content that represent the environment of the vehicle 502. In some examples, such visual data can include subsets or interpretations of images, videos, or other visual content (e.g., bounding boxes, classifications, etc.). Like with the audio sensors, the camera sensors can include multiple cameras positioned at various locations about the exterior and/or interior of the vehicle 502. Each camera sensor can output its own captured visual data, which can depict the environment of the vehicle 502 from the position of the camera sensor on the vehicle 502."; [Col. 22, lines 55-60]: "causing the processed visual data to be output via a display device proximate the teleoperator. In at least one example, the teleoperator management system 558 associated with the teleoperator computing device(s) 550 can receive the processed visual data and can cause the processed visual data to be output via the input/output component(s) 556.");

receiving, by the processor, audio data captured by a microphone array associated with the vehicle ([Col. 19, lines 40-51]: "the audio data 202 can be processed by the audio data processing system 114 in near real-time (e.g., as the audio data 202 is received from the microphones 112). In other examples, the audio data 202 can be processed by the audio data processing system 114 upon detection of an event that invokes teleoperator services. As described above, in at least one example, an event can be detected in the audio data 202 based at least in part on analyzing the audio data 202 using one or more machine learned models that are trained to detect particular events.");

processing, by the processor, the audio data to identify a known sound and a relative location of the known sound with respect to the vehicle ([Col. 19, lines 40-51], quoted above; [Col. 20, lines 1-20]: "when the processed audio data is output, the resulting sound can be localized for the teleoperator (e.g., relative to the vehicle 502). That is, the resulting sound can be output as a spatialized, three-dimensional scene such that the teleoperator is immersed with sound. The resulting sound can simulate the real-world environment within which the vehicle 502 is positioned. That is, as a result of causing the processed audio data to be output via respective speakers, the resulting sound can be localized for the teleoperator such that the teleoperator can perceive sound in the environment of the vehicle 502 from the perspective of the vehicle 502 (e.g., the resulting sound can be localized relative to the vehicle 502 for the teleoperator). As such, the teleoperator can perceive sound as if they were located in the vehicle 502, even though the teleoperator is remotely located from the vehicle 502. In some examples, as the teleoperator changes his or her position, orientation, etc., the resulting sound can be recast such to track the position, orientation, etc. of the teleoperator (e.g., in near real-time)." This should be understood to mean that the detected sounds must be localized by the processing system in order to simulate localized sounds for the teleoperator.);

determining, by the processor, a relevant object associated with the known sound at the relative location ([Col. 20, lines 40-55]: "determining whether a resulting sound is associated with an event. In at least one example, the teleoperator can listen to the output to determine whether the resulting sound is associated with an event. In at least one example, the teleoperator can identify when an emergency vehicle is present in an environment of the vehicle 502, whether a pedestrian or other object yells at (or otherwise interacts with) the vehicle 502, whether the vehicle 502 is driving in a construction zone or other area where a construction worker, police officer, or other individual is directing traffic, whether another vehicle honks at the vehicle 502, whether passengers are entering or exiting the vehicle, or the like. In at least one example, the resulting sound can additionally convey information associated with the event (e.g., contextual information).");

generating, by the processor, a visualization of the relevant object ([Col. 22, lines 38-50]: "Block 708 illustrates processing the visual data. In at least one example, the visual data processing system 562 can receive the visual data and perform one or more image processing techniques to generate processed visual data, which can be used for creating a virtual environment based at least in part on the visual data. In some examples, such image processing can include blending individual images, videos, or other visual content to avoid duplication in the virtual environment and/or aligning individual frames of the visual data. In additional or alternative examples, such image processing can perform transformations and/or apply filters in an effort to generate a virtual environment that is consistent and accurate. The visual data processing system 562 can output processed visual data.");

causing presentation, via a presentation device at the teleoperator station, of the visualization of the relevant object overlaid onto the video data in a position within the field of view based on the relative location of the sound relative to the vehicle ([Col. 22, lines 55-67]: "Block 712 illustrates causing the processed visual data to be output via a display device proximate the teleoperator. In at least one example, the teleoperator management system 558 associated with the teleoperator computing device(s) 550 can receive the processed visual data and can cause the processed visual data to be output via the input/output component(s) 556. For example, the teleoperator management system 558 can cause the processed visual data to be output via a display device, such as a VR display device"; [Col. 23, lines 31-55]: "Block 714 illustrates determining whether a resulting output is associated with an event. In at least one example, the teleoperator can listen to and/or view the output to determine whether the resulting output is associated with an event. In at least one example, the teleoperator can identify when an emergency vehicle is present in an environment of the vehicle 502, whether a pedestrian or other object yells at (or otherwise interacts with) the vehicle 502, whether the vehicle 502 is driving in a construction zone or other area where a construction worker, police officer, or other individual is directing traffic, whether another vehicle honks at the vehicle 502, whether passengers are entering or exiting the vehicle, or the like." The mapping is interpreted as follows: the teleoperator management system causes processed visual data, derived from video input including bounding boxes and classifications ([Col. 22, lines 31-37]), to be presented via a display device such as a VR headset ([Col. 22, lines 55-67]). This processed data enables the teleoperator to identify relevant objects in the environment, such as emergency vehicles or pedestrians ([Col. 23, lines 31-55]). The immersive scene is constructed from the underlying video data and visually highlights relevant entities, which under the broadest reasonable interpretation constitutes a visualization of the object overlaid onto the video data.);

and causing emission, via the presentation device, of the known sound associated with the relevant object ([Col. 20, lines 1-20], quoted above).
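For context on the localization step the examiner leans on: recovering a sound's relative location from a vehicle microphone array is classically done with time-difference-of-arrival (TDOA) estimation. A minimal sketch of that technique, not drawn from either reference; the pair spacing, sample rate, and far-field model are illustrative assumptions:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 C

def tdoa_bearing(mic_a, mic_b, spacing_m, fs_hz):
    """Estimate a bearing from one microphone pair via cross-correlation.

    Positive bearings point toward mic_a's side of the array. A real
    system would apply GCC-PHAT weighting and fuse several pairs
    mounted around the vehicle; this is the bare-bones version.
    """
    corr = np.correlate(mic_b, mic_a, mode="full")
    lag = int(np.argmax(corr)) - (len(mic_a) - 1)  # > 0: mic_b lags mic_a
    delay_s = lag / fs_hz
    # Far-field model: delay = spacing * sin(bearing) / c
    sin_theta = np.clip(delay_s * SPEED_OF_SOUND / spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Toy check: a noise burst reaching mic_a three samples before mic_b.
rng = np.random.default_rng(0)
a = rng.standard_normal(2048)
b = np.roll(a, 3)
print(f"bearing ~ {tdoa_bearing(a, b, spacing_m=0.2, fs_hz=48_000):.1f} deg")  # ~6.2
```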
Chebiyyam thus teaches processing sounds and visual information to produce a three-dimensional presentation of the scene for the teleoperator, as described above, but does not explicitly disclose determining, by the processor, a relevant object associated with the known sound at the relative location, the relevant object being outside the field of view of the imaging device; or generating, by the processor, a visualization of the relevant object.

However, Kim teaches determining, by the processor, a relevant object associated with the known sound at the relative location, the relevant object being outside the field of view of the imaging device ([Par. 0041]: "The abnormality context extraction unit 151 extracts matched abnormality context information based on the detailed information of an auditory abnormality signal including the type, magnitude, and source of the detected auditory abnormality signal. The conversion unit 153 converts the auditory abnormality signal into a visual output object or an auditory output object matched based on the detailed information of the detected auditory abnormality signal. For example, when a collision sound is generated around a vehicle, the generated collision sound is not output into the vehicle without change, but an abnormality context and a visual object matching the frequency and magnitude of the collision sound are extracted and displayed, and an auditory abnormality signal matching the collision sound is extracted and output. In an embodiment, the cycle of a signal to be output may be adjusted according to the magnitude and frequency of a recognized sound."

Note: Paragraph [0041] teaches that the system detects an auditory abnormality signal occurring around the vehicle and extracts an abnormality context including the type, magnitude, and source of the sound. The system then converts the auditory abnormality into a corresponding "visual output object," which represents the object or event associated with the sound. Because the detection and determination of this object are based solely on the external sound, without reliance on any imaging input, it is understood that the system determines the relevant object irrespective of the field of view of an imaging device. A POSITA would therefore reasonably interpret the disclosed operation as determining an object located outside the imaging device's field of view.);

and generating, by the processor, a visualization of the relevant object ([Par. 0026]: "a table in which visual objects, such as icons, representing the meanings of external sounds are stored to be matched to auditory abnormality signals generated outside vehicles in order to provide visual notification is generated and stored in the database, as shown in Table 2 below", the table matching, for example, a sound caused by the collision of a vehicle to an icon meaning or representing a "collision", a sound caused by a drone to a drone image or icon meaning or representing a "drone", and a sound caused by a motorcycle to a motorcycle image or icon meaning or representing a "motorcycle"; [Par. 0042]: "The notification unit 155 outputs the visual output object or auditory output object obtained through the conversion.").

Note: Chebiyyam is relied upon as the primary reference for teaching a remote driving system in which audio data from a plurality of microphones on the vehicle is processed to detect events in the vehicle's environment (e.g., sirens, honks, pedestrians yelling) and to determine contextual information such as the type of event and the direction of arrival of the sound relative to the vehicle ([Col. 19, lines 40-51; Col. 20, lines 40-55]). Chebiyyam further teaches providing the teleoperator synchronized audio and visual data so that the teleoperator can perceive the environment from the perspective of the vehicle and decide how to control the vehicle based on such events ([Col. 20, lines 1-20; Col. 23, lines 31-55]). However, Chebiyyam does not expressly disclose determining a relevant object associated with a detected sound that is outside the field of view of the imaging device and generating a visual representation of that object for display. Kim cures this deficiency. Kim discloses recognizing external sounds around a vehicle (e.g., collision sounds, drone sounds, motorcycle sounds, sirens) using multiple external microphones, determining detailed sound information such as type, magnitude, and source location, and then converting the detected sound into a corresponding "visual output object" or icon that represents the external sound on a display (Par. [0024]-[0026]). Kim further teaches that the notification module can visually indicate the type and source location of the external sound on a display, independent of what is captured by any imaging device, by placing an icon representing the external object (e.g., collision icon, drone icon, motorcycle icon) at a position that conveys its direction and distance relative to the vehicle (Par. [0026], [0041]).

It would have been obvious to a person of ordinary skill in the art to incorporate Kim's sound-based visual notification technique into Chebiyyam's teleoperation environment so that, when Chebiyyam detects an event based on audio data and determines the direction of arrival of the sound relative to the vehicle, the system also generates and displays a visual object associated with that sound (as taught by Kim) on the teleoperator's display. Doing so would have predictably improved the teleoperator's environmental awareness by providing an explicit visual indication of sound-based events, including events originating from regions outside the current camera field of view, while still overlaying that indication within the teleoperator's displayed visual scene based on the relative location of the sound. This is a simple substitution and integration of Kim's known sound-to-visual-object mapping and display into Chebiyyam's known teleoperation visualization framework to solve the known problem of making off-camera or sound-only events more readily apparent to the remote operator, and thus would have been obvious.
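The proposed combination boils down to a small piece of UI logic: take the sound's bearing from the audio pipeline, look up a class-specific icon in a Kim-style table, and either project it into the video frame or pin it to the frame edge when the source is outside the field of view. A minimal sketch of that logic; the icon names, linear projection, and 120-degree FOV are illustrative assumptions, not taken from the record:

```python
from dataclasses import dataclass

# Sound-class -> icon lookup, in the spirit of Kim's Table 2.
SOUND_ICONS = {"collision": "icon_collision", "siren": "icon_emergency",
               "drone": "icon_drone", "motorcycle": "icon_motorcycle"}

@dataclass
class Overlay:
    icon: str
    x_frac: float     # horizontal frame position, 0.0 (left) to 1.0 (right)
    off_screen: bool  # True when the source lies outside the camera FOV

def place_overlay(sound_class: str, bearing_deg: float,
                  fov_deg: float = 120.0) -> Overlay:
    """Map a localized sound to an overlay position on the video feed.

    Bearings inside the horizontal FOV are projected linearly onto the
    image width (a real HUD would use the camera intrinsics); bearings
    outside it are pinned to the nearest edge and flagged so the UI can
    draw an off-screen indicator such as an arrow.
    """
    icon = SOUND_ICONS.get(sound_class, "icon_generic")
    half = fov_deg / 2.0
    if -half <= bearing_deg <= half:
        return Overlay(icon, 0.5 + bearing_deg / fov_deg, off_screen=False)
    return Overlay(icon, 0.0 if bearing_deg < 0 else 1.0, off_screen=True)

print(place_overlay("siren", 30.0))   # in view -> x_frac 0.75
print(place_overlay("siren", 150.0))  # behind the car -> pinned, off_screen=True
```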
Regarding claim 2, the combination of Chebiyyam and Kim teaches the method of claim 1. Chebiyyam further teaches wherein the known sound is identified using at least one of an audio pattern matching algorithm, a machine learning model, or a neural network ([Col. 19, lines 40-51], quoted above: an event can be detected in the audio data "using one or more machine learned models that are trained to detect particular events").

Regarding claim 3, the combination of Chebiyyam and Kim teaches the method of claim 1. Chebiyyam further teaches instructing, by the processor, beamforming of the microphone array to capture the audio data at the relative location with respect to the vehicle ([Col. 6, lines 55-60]: "the audio data processing system 114 can perform one or more beamforming techniques such to cause the resulting sound to be output from a particular source and in a particular direction.").

Regarding claim 4, the combination of Chebiyyam and Kim teaches the method of claim 1. Chebiyyam further teaches wherein the visualization of the relevant object comprises at least one of a visual indicator, a symbol, a color, or a bounding box ([Col. 22, lines 26-37]: "at least one example, the camera sensors can generate visual data. Such visual data can be images, videos, or other visual content that represent the environment of the vehicle 502. In some examples, such visual data can include subsets or interpretations of images, videos, or other visual content (e.g., bounding boxes, classifications, etc.). Like with the audio sensors, the camera sensors can include multiple cameras positioned at various locations about the exterior and/or interior of the vehicle 502. Each camera sensor can output its own captured visual data, which can depict the environment of the vehicle 502 from the position of the camera sensor on the vehicle 502.").

Regarding claim 5, the combination of Chebiyyam and Kim teaches the method of claim 1. Chebiyyam further teaches amplifying, by the processor, the known sound associated with the relevant object ([Col. 8, lines 38-45]: "The output components can provide information to the teleoperator 118. In at least one example, a speaker can receive audio data 202 and/or processed audio data 204 (e.g., an audio channel) and can transform such data into sound waves, that when connected to an amplifier, outputs a sound. In FIG. 2, two speakers are depicted in association with a headset 214 or pair of earphones worn by the teleoperator 118."; [Col. 20, lines 1-20], quoted above).
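Claim 2 (and claim 10 below) recites identifying the sound with "an audio pattern matching algorithm, a machine learning model, or a neural network," and Chebiyyam's Col. 19 passage names machine-learned event detectors. A minimal sketch of the simplest of those options, spectral template matching; the template set, FFT size, and threshold are illustrative assumptions:

```python
import numpy as np

def spectral_signature(audio: np.ndarray, n_fft: int = 1024) -> np.ndarray:
    """L2-normalized magnitude spectrum used as a crude audio fingerprint."""
    spec = np.abs(np.fft.rfft(audio, n=n_fft))
    norm = np.linalg.norm(spec)
    return spec / norm if norm > 0 else spec

def match_known_sound(audio, templates, threshold=0.8):
    """Return the best-matching known sound class, or None.

    Compares the clip's signature to each class template by cosine
    similarity; a learned classifier, as in Chebiyyam, would replace
    this lookup wholesale.
    """
    sig = spectral_signature(audio)
    scores = {name: float(sig @ tmpl) for name, tmpl in templates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Toy check: a 700 Hz "siren" template against the same tone plus noise.
fs = 16_000
t = np.arange(2048) / fs
templates = {"siren": spectral_signature(np.sin(2 * np.pi * 700 * t))}
noisy = np.sin(2 * np.pi * 700 * t) \
        + 0.1 * np.random.default_rng(1).standard_normal(t.size)
print(match_known_sound(noisy, templates))  # -> siren
```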
Regarding claim 6, the combination of Chebiyyam and Kim teaches a method, comprising:

receiving, by a processor associated with a teleoperator station via a network, imaging data from an imaging device associated with a vehicle and having a field of view, the vehicle being positioned within an environment remote from the teleoperator station ([Col. 22, lines 31-37] and [Col. 22, lines 55-60], quoted above with respect to claim 1);

receiving, by the processor, audio data captured by an audio sensor associated with the vehicle ([Col. 19, lines 40-51], quoted above);

processing, by the processor, the audio data to identify a sound ([Col. 19, lines 40-51] and [Col. 20, lines 1-20], quoted above; the detected sounds must be localized by the processing system in order to simulate localized sounds for the teleoperator);

determining, by the processor, an object associated with the sound ([Col. 20, lines 40-55], quoted above);

generating, by the processor, a visualization of the object ([Col. 22, lines 38-50], quoted above);

causing presentation, via a presentation device associated with the teleoperator station, of the visualization of the object overlaid onto the imaging data at a position within the field of view based on a location of the sound relative to the vehicle ([Col. 22, lines 55-67] and [Col. 23, lines 31-55], quoted above; as with claim 1, the immersive scene is constructed from the underlying video data and visually highlights relevant entities, which under the broadest reasonable interpretation constitutes a visualization of the object overlaid onto the video data);

and causing emission, via the presentation device, of the sound associated with the object ([Col. 20, lines 1-20], quoted above).

Chebiyyam teaches processing sounds and visual information to produce a three-dimensional presentation of the scene for the teleoperator, as described above, but does not explicitly disclose determining, by the processor, an object associated with the sound, the object being outside the field of view of the imaging device, or generating, by the processor, a visualization of the object. However, Kim teaches these limitations (Par. [0026], [0041], [0042], quoted and discussed above with respect to claim 1), and the same rationale for combining the references applies: incorporating Kim's sound-based visual notification into Chebiyyam's teleoperation environment would have predictably improved the teleoperator's environmental awareness by providing an explicit visual indication of sound-based events, including events originating from regions outside the current camera field of view, and thus would have been obvious.
Regarding claim 7, the combination of Chebiyyam and Kim teaches the method of claim 6. Chebiyyam further teaches wherein the audio sensor comprises an audio sensor array ([Col. 18, lines 45-67]: "Block 602 illustrates receiving audio data from a plurality of audio sensors associated with a vehicle... in at least one example, the vehicle 502 can have a plurality of audio sensors (e.g., microphones) that can be located at various positions on the corners, front, back, sides, and/or top of the exterior of the vehicle 502. Audio sensors can additionally be positioned at various locations about the interior of the vehicle 502. As described above with reference to FIGS. 1-4, in at least one example, each audio sensor can be associated with its own captured audio channel. One or more captured audio channels can be referred to herein as 'audio data.'").

Regarding claim 8, the combination of Chebiyyam and Kim teaches the method of claim 7. Chebiyyam further teaches processing, by the processor, the audio data from the audio sensor array to determine the location of the sound relative to the vehicle ([Col. 20, lines 1-20], quoted above; the detected sounds must be localized in order to simulate localized sounds for the teleoperator); wherein the object associated with the sound is determined based at least in part on the location of the sound ([Col. 20, lines 40-55], quoted above).

Regarding claim 9, the combination of Chebiyyam and Kim teaches the method of claim 7. Chebiyyam further teaches instructing, by the processor, beamforming of the audio sensor array to capture the audio data at the location relative to the vehicle ([Col. 6, lines 55-60], quoted above).

Regarding claim 10, the combination of Chebiyyam and Kim teaches the method of claim 6. Chebiyyam further teaches wherein the sound is identified using at least one of an audio pattern matching algorithm, a machine learning model, or a neural network ([Col. 19, lines 40-51], quoted above).

Regarding claim 11, the combination of Chebiyyam and Kim teaches the method of claim 6. Chebiyyam further teaches wherein the visualization of the object comprises at least one of a visual indicator, a symbol, a color, or a bounding box ([Col. 22, lines 26-37], quoted above).
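Claims 3 and 9 both recite steering the microphone array at the already-estimated location, matching Chebiyyam's beamforming passage at Col. 6. A minimal delay-and-sum sketch; the linear array geometry, integer-sample delays, and single-source simulation are illustrative assumptions:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(channels, mic_positions_m, bearing_deg, fs_hz):
    """Steer a linear array toward a bearing by aligning and averaging.

    Delays each channel so a far-field wavefront from `bearing_deg`
    lines up across microphones, then averages. Integer-sample shifts
    keep the sketch short; production beamformers use fractional
    delays applied in the frequency domain.
    """
    sin_t = np.sin(np.radians(bearing_deg))
    out = np.zeros(channels.shape[1])
    for ch, pos in zip(channels, mic_positions_m):
        shift = int(round(pos * sin_t / SPEED_OF_SOUND * fs_hz))
        out += np.roll(ch, shift)
    return out / len(mic_positions_m)

# Toy check: 4 mics at 0.1 m pitch, source 20 degrees off broadside.
fs, mics = 48_000, np.array([-0.15, -0.05, 0.05, 0.15])
src = np.random.default_rng(2).standard_normal(4096)
delays = [int(round(p * np.sin(np.radians(20.0)) / SPEED_OF_SOUND * fs))
          for p in mics]
channels = np.stack([np.roll(src, -d) for d in delays])  # simulate arrivals
aligned = delay_and_sum(channels, mics, 20.0, fs)
print(f"alignment: {np.corrcoef(aligned, src)[0, 1]:.3f}")  # ~1.0
```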
Regarding claim 13, the combination of Chebiyyam and Kim teaches the method of claim 6. Chebiyyam further teaches, prior to causing emission of the sound associated with the object, amplifying, by the processor, the sound associated with the object ([Col. 8, lines 38-45] and [Col. 20, lines 1-20], quoted above).

Regarding claim 14, the combination of Chebiyyam and Kim teaches the method of claim 6. Chebiyyam further teaches, prior to causing emission of the sound associated with the object, generating, by the processor, a synthetic sound associated with the object, wherein causing emission of the sound comprises causing emission of the synthetic sound ([Col. 20, lines 1-20], quoted above).
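Claims 13 and 14 cover the two ways the station can voice a detected event: boost the captured clip or substitute a clean synthetic cue. A minimal sketch of both paths with a constant-power stereo pan so playback keeps the sound's bearing; the gain, tone, and pan law are illustrative assumptions, not from the record:

```python
import numpy as np

def prepare_emission(clip, bearing_deg, gain=2.0, synthesize=False, fs=48_000):
    """Build the stereo signal played back at the teleoperator station.

    Either amplifies the captured clip (per claim 13) or swaps in a
    synthetic alert tone for the detected class (per claim 14), then
    pans left/right from the sound's bearing so the cue stays spatial.
    """
    if synthesize:
        t = np.arange(int(0.5 * fs)) / fs
        clip = 0.5 * np.sin(2 * np.pi * 880.0 * t)      # stand-in alert tone
    mono = np.clip(gain * np.asarray(clip), -1.0, 1.0)  # amplify, guard clipping
    pan = (np.clip(bearing_deg, -90.0, 90.0) + 90.0) / 180.0  # 0=left, 1=right
    return np.stack([mono * np.sqrt(1.0 - pan),         # constant-power pan law
                     mono * np.sqrt(pan)])

stereo = prepare_emission(np.zeros(480), bearing_deg=45.0, synthesize=True)
print(stereo.shape)  # (2, 24000): half a second of stereo audio, panned right
```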
Regarding claim 15, the combination of Chebiyyam and Kim teaches the method of claim 6. Chebiyyam further teaches processing, by the processor, the imaging data to identify the object ([Col. 22, lines 31-37] and [Col. 22, lines 55-60], quoted above); and processing, by the processor, the imaging data to determine a location of the object ([Col. 22, lines 55-67], quoted above, which continues: "That is, the resulting image can be output as a spatialized, three-dimensional scene such that the teleoperator is immersed in a virtual environment that simulates the real-world environment within which the vehicle 502 is positioned. As such, the teleoperator can perceive the environment as if they were located in the vehicle 502, even though the teleoperator is remotely located from the vehicle 502. As a result, the teleoperator can identify and/or determine information associated with events occurring in and/or around the vehicle 502, as if the teleoperator were present in the vehicle 502." This means that the image data identifies the object's location, enabling the system to render its position and simulate the real-world environment within the vehicle for the teleoperator.).

Regarding claim 16, Chebiyyam teaches a remote driving system, comprising: a vehicle within an environment, the vehicle comprising an imaging device having a field of view and an audio sensor ([Col. 18, lines 38-45]: "receiving audio data from a plurality of audio sensors associated with a vehicle. As described above, in at least one example, a vehicle 502 can be associated with one or more sensor systems 506 and one or more vehicle computing device(s) 504. In at least one example, the sensor system(s) 506 can include lidar sensors, radar sensors, ultrasonic transducers, sonar sensors, location sensors (e.g., GPS, compass, etc.), inertial sensors (e.g., inertial measurement units, accelerometers, magnetometers, gyroscopes, etc.), cameras (e.g., RGB, IR, intensity, depth, etc.), wheel encoders, audio sensors, environment sensors (e.g., temperature sensors, humidity sensors, light sensors, pressure sensors, etc.), ToF sensors, etc."); and a teleoperator station that is remote from the vehicle, [...]

Prosecution Timeline

Nov 10, 2022
Application Filed
Dec 05, 2024
Non-Final Rejection — §103
Apr 24, 2025
Response Filed
Jun 19, 2025
Final Rejection — §103
Sep 24, 2025
Response after Non-Final Action
Oct 21, 2025
Request for Continued Examination
Oct 30, 2025
Response after Non-Final Action
Dec 06, 2025
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12589883
FLIGHT RECORDER SYSTEM AND METHOD
Granted Mar 31, 2026 (2y 5m to grant)
Patent 12567291
SETTING A MODE OF A VEHICLE
Granted Mar 03, 2026 (2y 5m to grant)
Patent 12565100
IMMERSIVE VEHICLE COMPONENT CONFIGURATION AND OPERATION USING OPERATIONAL PROFILES
Granted Mar 03, 2026 (2y 5m to grant)
Patent 12565118
METHOD OF PROVIDING INFORMATION WHEN AN ELECTRIC VEHICLE IS USED AS A BATTERY PACK
Granted Mar 03, 2026 (2y 5m to grant)
Patent 12534092
AUTOMATED ADJUSTMENT OF VEHICLE DIRECTION BASED ON ENVIRONMENT ANALYSIS
Granted Jan 27, 2026 (2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 78%
Grant Probability with Interview: 86% (+7.9%)
Median Time to Grant: 2y 9m
PTA Risk: High
Based on 160 resolved cases by this examiner; grant probability derived from the career allow rate.
