Prosecution Insights
Last updated: April 18, 2026
Application No. 18/484,935

Immersive Teleconferencing within Shared Scene Environments

Status: Final Rejection (§103)
Filed: Oct 11, 2023
Examiner: KIM, WESLEY LEO
Art Unit: 2648
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 2 (Final)
Grant Probability: 60% (Moderate)
Expected OA Rounds: 3-4
Time to Grant: 4y 5m
With Interview: 93%

Examiner Intelligence

Career Allow Rate: 60% (208 granted / 344 resolved; -1.5% vs TC avg)
Interview Lift: +32.8% allowance lift for resolved cases with an interview (strong)
Avg Prosecution: 4y 5m typical timeline; 16 applications currently pending
Career History: 360 total applications across all art units

Statute-Specific Performance

§101: 5.6% (-34.4% vs TC avg)
§103: 51.9% (+11.9% vs TC avg)
§102: 20.1% (-19.9% vs TC avg)
§112: 14.9% (-25.1% vs TC avg)
Comparisons are against the Tech Center average estimate; based on career data from 344 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after October 11, 2023, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statements submitted on June 09, 2025 and April 18, 2024 have been considered by the Examiner and made of record in the application file.

Specification

The disclosure is objected to because of the following informalities: Page 5, [0023], lines 2 and 5, “participantdevice” should read “participant device”. Appropriate correction is required.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 9, and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Roper (US 20230126108 A1) in view of Li (US 20210321157 A1) and Chaturvedi (US 10210664 B1).

Regarding claim 1, Roper discloses a computer-implemented method for immersive teleconferencing within a shared scene environment, the method comprising: receiving, by a computing system (Fig. 1, 100; Roper) comprising one or more computing devices (Fig. 1, 140, 150, 160; Roper), a plurality of streams for presentation (Fig. 4, 404; Roper) at a teleconference (Fig. 3B, 360; Roper), wherein each of the plurality of streams represents a participant of a respective plurality of participants of the teleconference (Fig. 4, 404; Roper); and for each of the plurality of participants of the teleconference: and based at least in part on the scene data (Fig. 5, 540; Roper) within the scene environment, modifying, by the computing system, the stream that represents the participant (Fig. 5, 570; Roper. Based on the characteristics of the video stream, the stream is modified by providing a virtual background that best suits the participant. This is different from the original video stream, which is why this teaches the claimed modification of the video stream.).

Roper does not expressly teach “determining, by the computing system, scene data descriptive of a scene environment, the scene data comprising at least one of lighting characteristics, acoustic characteristics, or perspective characteristics of the scene environment”, “determining, by the computing system, a position of the participant within the scene environment”, and “the position of the participant”. However, Chaturvedi does teach determining, by the computing system, scene data descriptive of a scene environment, the scene data comprising at least one of lighting characteristics (“The intensity model, in an example, has other uses as well. For example, a photographer can capture an image of an object. With a determination of the direction of lighting and, potentially, the intensity of the lighting, and/or other such aspects” Chaturvedi (28)), acoustic characteristics, or perspective characteristics of the scene environment.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper with Chaturvedi so as to improve the viewing experience for all participants. In the case where a user is in a darker environment, it could be useful for the system to provide lighting correction to ensure the participant is clearly visible to all other participants. Additionally, Li does teach determining, by the computing system, a position of the participant within the scene environment (“At block 102, positions of key points of a human body contained in each frame of the video stream are acquired.” Li [0029]) and the position of the participant (Fig. 1, 102, 103, 105; Li). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper, in view of Chaturvedi, with Li so as to better calibrate a user’s appearance to a joint stream. Position within a conference is crucial to understanding how to better conform a video stream to a joint stream of multiple other participants.

Regarding claim 2, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 1, wherein the stream that represents the participant comprises at least one of the following elements (the Examiner chooses to reject element (a)): video data that depicts the participant (Fig. 5, 520; Roper); audio data that corresponds to the participant; pose data indicative of a pose of the participant; or Augmented Reality (AR) / Virtual Reality (VR) data indicative of a three-dimensional representation of the participant.

Regarding claim 3, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 2, wherein modifying the stream that represents the participant comprises modifying, by the computing system, the stream using one or more machine-learned models, wherein each of the machine-learned models are trained to process at least one of the following elements (the Examiner chooses to reject element (b)): scene data; video data (“Visual characteristics may be determined based on analyzing pixels in the video stream associated with the user. For example, the video conferencing application 360 executes a trained ML model to identify which portions of the video stream include pixels corresponding to the user.” Roper [0080]. The machine learning model is identifying portions of the video stream which correspond to the user. To do so, the model must be trained on video data.); audio data; pose data; or AR/VR data.

Regarding claim 4, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 3, wherein: the one or more machine-learned models comprises a machine-learned semantic segmentation model trained to perform semantic segmentation tasks (“Visual characteristics may be determined based on analyzing pixels in the video stream associated with the user. For example, the video conferencing application 360 executes a trained ML model to identify which portions of the video stream include pixels corresponding to the user.” Roper [0080]. The machine learning model is identifying the user on a pixel-by-pixel basis, thereby teaching semantic segmentation.); the stream that represents the participant comprises the video data that depicts the participant (Fig. 4, 404; Roper); and wherein modifying the stream that represents the participant comprises segmenting, by the computing system, the video data of the stream that represents the participant into a foreground portion and a background portion (“As a possible implementation, the frame may be used as a background, and the target virtual object corresponding to the frame may be used as a foreground.” Li [0050]) using the machine-learned semantic segmentation model (“Visual characteristics may be determined based on analyzing pixels in the video stream associated with the user. For example, the video conferencing application 360 executes a trained ML model to identify which portions of the video stream include pixels corresponding to the user.” Roper [0080]. The machine learning model is identifying the user on a pixel-by-pixel basis, thereby teaching semantic segmentation.).

Regarding claim 5, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 2, wherein: the stream that represents the participant comprises the video data that depicts the participant (Fig. 4, 434; Roper); the scene data describes the lighting characteristics of the scene environment, the lighting characteristics comprising a location and intensity of one or more light sources within the scene environment (“The intensity model, in an example, has other uses as well. For example, a photographer can capture an image of an object. With a determination of the direction of lighting and, potentially, the intensity of the lighting, and/or other such aspects” Chaturvedi (28)); and wherein modifying the stream that represents the participant comprises: based at least in part on the scene data and the position of the participant, applying, by the computing system, a lighting correction to the video data that represents the participant based at least in part on the position of the participant within the scene environment relative to the one or more light sources (“With a determination of the direction of lighting and, potentially, the intensity of the lighting, and/or other such aspects, a computing device can determine the location of various shadows or shading and can make adjustments accordingly. For example, the computing device is able to utilize an algorithm, such as the occlusion process previously explained, to remove shadows, highlights, glint, or otherwise adjust the brightness or contrast of portions of an image digitally based upon the relative location of the light source.” Chaturvedi (28)).

Regarding claim 9, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 1, wherein receiving the plurality of streams further comprises receiving, by the computing system for each of the plurality of streams, scene environment data for the stream descriptive of lighting characteristics (“The intensity model, in an example, has other uses as well. For example, a photographer can capture an image of an object. With a determination of the direction of lighting and, potentially, the intensity of the lighting, and/or other such aspects” Chaturvedi (28)), acoustic characteristics, or perspective characteristics of the participant represented by the stream; and wherein modifying the stream that represents the participant comprises: based at least in part on the scene data, the position of the participant within the scene environment, and the environment data for the stream, modifying, by the computing system, the stream that represents the participant (“With a determination of the direction of lighting and, potentially, the intensity of the lighting, and/or other such aspects, a computing device can determine the location of various shadows or shading and can make adjustments accordingly. For example, the computing device is able to utilize an algorithm, such as the occlusion process previously explained, to remove shadows, highlights, glint, or otherwise adjust the brightness or contrast of portions of an image digitally based upon the relative location of the light source.” Chaturvedi (28)).

Regarding claim 15, Roper discloses a computing system for immersive teleconferencing within a shared scene environment, comprising: one or more processors (“The example computing device 600 includes a processor 610 which is in communication with the memory 620 and other components of the computing device 600 using one or more communications buses 602.” Roper [0095]); and one or more memory elements including instructions that when executed cause the one or more processors to (“The processor 610 is configured to execute processor-executable instructions stored in the memory 620 to perform one or more methods” Roper [0095]): receive a plurality of streams for presentation (Fig. 4, 404; Roper) at a teleconference (Fig. 3B, 360; Roper), wherein each of the plurality of streams represents a participant of a respective plurality of participants of the teleconference (Fig. 4, 404; Roper); and for each of the plurality of participants of the teleconference: and based at least in part on the scene data (Fig. 5, 540; Roper), modify the stream that represents the participant (Fig. 5, 570; Roper. Based on the characteristics of the video stream, the stream is modified by providing a virtual background that best suits the participant. This is different from the original video stream, which is why this teaches the claimed modification of the video stream.). Roper does not expressly teach “determine scene data descriptive of a scene environment, the scene data comprising at least one of lighting characteristics, acoustic characteristics, or perspective characteristics of the scene environment”, “determine a position of the participant within the scene environment”, and “the position of the participant”. However, Chaturvedi does teach determine scene data descriptive of a scene environment, the scene data comprising at least one of lighting characteristics (“The intensity model, in an example, has other uses as well. For example, a photographer can capture an image of an object. With a determination of the direction of lighting and, potentially, the intensity of the lighting, and/or other such aspects” Chaturvedi (28)), acoustic characteristics, or perspective characteristics of the scene environment. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper with Chaturvedi so as to improve the viewing experience for all participants.
In the case where a user is in a darker environment, it could be useful for the system to provide lighting correction to ensure the participant is clearly visible to all other participants. Additionally, Li does teach determine a position of the participant within the scene environment (“At block 102, positions of key points of a human body contained in each frame of the video stream are acquired.” Li [0029]) and the position of the participant within the scene environment (Fig. 1, 102, 103, 105; Li). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper, in view of Chaturvedi, with Li so as to better calibrate a user’s appearance to a joint stream. Position within a conference is crucial to understanding how to better conform a video stream to a joint stream of multiple other participants.

Regarding claim 16, Roper, in view of Chaturvedi and Li, discloses the computing system of claim 15, wherein the stream that represents the participant comprises at least one of the following elements (the Examiner chooses to reject element (a)): video data that depicts the participant (Fig. 5, 520; Roper); audio data that corresponds to the participant; pose data indicative of a pose of the participant; or Augmented Reality (AR) / Virtual Reality (VR) data indicative of a three-dimensional representation of the participant.

Regarding claim 17, Roper, in view of Chaturvedi and Li, discloses the computing system of claim 16, wherein modifying the stream that represents the participant comprises modifying, by the computing system, the stream using one or more machine-learned models, wherein each of the machine-learned models are trained to process at least one of the following elements (the Examiner chooses to reject element (b)): scene data; video data (“Visual characteristics may be determined based on analyzing pixels in the video stream associated with the user. For example, the video conferencing application 360 executes a trained ML model to identify which portions of the video stream include pixels corresponding to the user.” Roper [0080]. The machine learning model is identifying portions of the video stream which correspond to the user. To do so, the model must be trained on video data.); audio data; pose data; or AR/VR data.

Regarding claim 18, Roper, in view of Chaturvedi and Li, discloses the computing system of claim 17, wherein: the one or more machine-learned models comprises a machine-learned semantic segmentation model trained to perform semantic segmentation tasks (“Visual characteristics may be determined based on analyzing pixels in the video stream associated with the user. For example, the video conferencing application 360 executes a trained ML model to identify which portions of the video stream include pixels corresponding to the user.” Roper [0080]. The machine learning model is identifying the user on a pixel-by-pixel basis, thereby teaching semantic segmentation.); the stream that represents the participant comprises the video data that depicts the participant (Fig. 4, 404; Roper); and wherein modifying the stream that represents the participant comprises segmenting the video data of the stream that represents the participant into a foreground portion and a background portion (“As a possible implementation, the frame may be used as a background, and the target virtual object corresponding to the frame may be used as a foreground.” Li [0050]) using the machine-learned semantic segmentation model (“Visual characteristics may be determined based on analyzing pixels in the video stream associated with the user. For example, the video conferencing application 360 executes a trained ML model to identify which portions of the video stream include pixels corresponding to the user.” Roper [0080]. The machine learning model is identifying the user on a pixel-by-pixel basis, thereby teaching semantic segmentation.).

Regarding claim 19, Roper, in view of Chaturvedi and Li, discloses the computing system of claim 16, wherein: the stream that represents the participant comprises the video data that depicts the participant (Fig. 4, 434; Roper); the scene data describes the lighting characteristics of the scene environment, the lighting characteristics comprising a location and intensity of one or more light sources within the scene environment (“The intensity model, in an example, has other uses as well. For example, a photographer can capture an image of an object. With a determination of the direction of lighting and, potentially, the intensity of the lighting, and/or other such aspects” Chaturvedi (28)); and wherein modifying the stream that represents the participant comprises: based at least in part on the scene data and the position of the participant, applying a lighting correction to the video data that represents the participant based at least in part on the position of the participant within the scene environment relative to the one or more light sources (“With a determination of the direction of lighting and, potentially, the intensity of the lighting, and/or other such aspects, a computing device can determine the location of various shadows or shading and can make adjustments accordingly. For example, the computing device is able to utilize an algorithm, such as the occlusion process previously explained, to remove shadows, highlights, glint, or otherwise adjust the brightness or contrast of portions of an image digitally based upon the relative location of the light source.” Chaturvedi (28)).

Regarding claim 20, Roper discloses a non-transitory computer readable medium that, when executed by a processor, cause the processor to (“In a tenth aspect, a non-transitory computer-readable medium includes processor-executable instructions configured to cause one or more processors to” Roper [0107]): receive a plurality of streams (Fig. 4, 404; Roper) for presentation at a teleconference (Fig. 3B, 360; Roper), wherein each of the plurality of streams represents a participant of a respective plurality of participants of the teleconference (Fig. 4, 404; Roper); and based at least in part on the scene data (Fig. 5, 540; Roper), modify the stream that represents the participant (Fig. 5, 570; Roper. Based on the characteristics of the video stream, the stream is modified by providing a virtual background that best suits the participant. This is different from the original video stream, which is why this teaches the claimed modification of the video stream.). Roper does not expressly teach “determine scene data descriptive of a scene environment, the scene data comprising at least one of lighting characteristics, acoustic characteristics, or perspective characteristics of the scene environment”, “determine a position of the participant within the scene environment”, and “the position of the participant”. However, Chaturvedi does teach determine scene data descriptive of a scene environment, the scene data comprising at least one of lighting characteristics (“The intensity model, in an example, has other uses as well. For example, a photographer can capture an image of an object. With a determination of the direction of lighting and, potentially, the intensity of the lighting, and/or other such aspects” Chaturvedi (28)), acoustic characteristics, or perspective characteristics of the scene environment. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper with Chaturvedi so as to improve the viewing experience for all participants. In the case where a user is in a darker environment, it could be useful for the system to provide lighting correction to ensure the participant is clearly visible to all other participants. Additionally, Li does teach determine a position of the participant within the scene environment (“At block 102, positions of key points of a human body contained in each frame of the video stream are acquired.” Li [0029]) and the position of the participant within the scene environment (Fig. 1, 102, 103, 105; Li). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper, in view of Chaturvedi, with Li so as to better calibrate a user’s appearance to a joint stream. Position within a conference is crucial to understanding how to better conform a video stream to a joint stream of multiple other participants.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Roper (US 20230126108 A1), in view of Li (US 20210321157 A1) and Chaturvedi (US 10210664 B1), and further in view of Sommerlade et al. (US 20220400228 A1, hereinafter Sommerlade).

Regarding claim 6, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 2. Roper, in view of Chaturvedi and Li, does not expressly teach “the stream that represents the participant comprises the video data that depicts the participant, wherein the video data further depicts a gaze of the participant; and wherein modifying the stream that represents the participant comprises: determining, by the computing system, a direction of a gaze of the participant; determining, by the computing system, a gaze correction for the gaze of the participant based at least in part on the position of the participant within the scene environment and the gaze of the participant; and applying, by the computing system, the gaze correction to the video data to adjust the gaze of the participant depicted by the video data.” However, Sommerlade does teach the stream that represents the participant comprises the video data that depicts the participant, wherein the video data further depicts a gaze of the participant (Fig. 1, 110, 119; Sommerlade); and wherein modifying the stream that represents the participant comprises: determining, by the computing system, a direction of a gaze of the participant (“computing an eye gaze direction of the first participant based on the location displaying images of the second participant” Sommerlade [0005]); determining, by the computing system, a gaze correction for the gaze of the participant based at least in part on the position of the participant within the scene environment and the gaze of the participant (“Accordingly, the eye gaze, pose, head, and/or other body position adjustment techniques described herein may include an eye gaze, head pose, and/or other body position adjustment technique for each user at each of the one or more computing systems 102, 104, and/or 106.” Sommerlade [0037]); and applying, by the computing system, the gaze correction to the video data to adjust the gaze of the participant depicted by the video data (“generating gaze-adjusted images based on the eye gaze direction of the first participant, wherein the gaze-adjusted images include at least one of an adjusted eye gaze direction of the first participant or an adjusted head pose of the first participant; and replacing the images within the video stream with the gaze-adjusted images.” Sommerlade [0005]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper, in view of Chaturvedi and Li, with Sommerlade so as to make the immersive teleconference more realistic. In the case where a user’s gaze is directed in a different direction relative to the point-of-view of the camera, it could be useful for the system to provide a gaze correction to depict to the other participants a gaze directed to the point-of-view of the camera. Consequently, all users will have a more realistic teleconference experience.

Claims 7 and 10-14 are rejected under 35 U.S.C. 103 as being unpatentable over Roper (US 20230126108 A1), in view of Li (US 20210321157 A1) and Chaturvedi (US 10210664 B1), and further in view of Makker et al. (US 20240242717 A1, hereinafter Makker).

Regarding claim 7, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 2, wherein: the stream that represents the participant comprises the video data that depicts the participant (Fig. 4, 434; Roper).
Roper, in view of Chaturvedi and Li, does not expressly teach “the scene data comprises the perspective characteristics of the scene environment, wherein the perspective characteristics indicate a perspective from which the scene environment is viewed; and wherein modifying the stream that represents the participant comprises: based at least in part on the perspective characteristics and the position of the participant within the scene environment, determining, by the computing system, that a portion of the participant that is visible from the perspective from which the scene environment is viewed is not depicted in the video data; generating, by the computing system, a predicted rendering of the portion of the participant; and applying, by the computing system, the predicted rendering of the portion of the participant to the video data.” However, Makker does teach the scene data comprises the perspective characteristics of the scene environment, wherein the perspective characteristics indicate a perspective from which the scene environment is viewed (“In some embodiments, physical furnishings are deployed in the local environment in ways that provide additional cues that further enhance the illusion. For example, a table or desk in the local environment placed in front of the media display may be oriented in an alignment that would extend into a plausible juxtaposition with the remote participant.” Makker [0138]. The feature essentially allows features from the actual scene environment (i.e., a desk or chair) to be displayed in a participant’s background.); and wherein modifying the stream that represents the participant comprises: based at least in part on the perspective characteristics and the position of the participant within the scene environment, determining, by the computing system, that a portion of the participant that is visible from the perspective from which the scene environment is viewed is not depicted in the video data; generating, by the computing system, a predicted rendering of the portion of the participant; and applying, by the computing system, the predicted rendering of the portion of the participant to the video data (“In some embodiments, virtual overlays are added to the displayed media stream that are configured to imitate the local environment (e.g., a ledge, a plant, or any other object). The virtual overlay object may match the aesthetics of the local environment.” Makker [0138]. The virtual overlay, which is based on real objects in the participant’s background, could be selected by the user to better display an object which is partially or fully cut off from the camera to the other meeting participants.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper, in view of Chaturvedi and Li, with Makker so as to provide participants with an “improved digital experience that provides users an enhanced immersive experience, which simulates common presence of a virtual participant (and optional related virtual auxiliary content) and physically present participants in a conference” (Makker [0002]).

Regarding claim 10, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 1, wherein modifying the stream that represents the participant comprises: the position of the participant within the scene environment (“At block 102, positions of key points of a human body contained in each frame of the video stream are acquired.” Li [0029]). Roper, in view of Chaturvedi and Li, does not expressly teach “based at least in part on the scene data” and “and a position of at least one other participant of the plurality of participants within the scene environment, modifying, by the computing system, the stream that represents the participant.” However, Makker does teach based at least in part on the scene data (“For example, a table or desk in the local environment placed in front of the media display may be oriented in an alignment that would extend into a plausible juxtaposition with the remote participant.” Makker [0138]) and a position of at least one other participant of the plurality of participants within the scene environment, modifying, by the computing system, the stream that represents the participant (“As another example, by including matching furnishings at multiple endpoints of a video conference (e.g., including a real table or desk in front of each real media display), each combined field of view for each respective participant (e.g., their view of their local environment combined with the virtual objects generated on their media display) can include the same matching desk or table on both sides of a video conference for creating a convincing telepresence illusion.” Makker [0138]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper, in view of Chaturvedi and Li, with Makker so as to provide participants with an “improved digital experience that provides users an enhanced immersive experience, which simulates common presence of a virtual participant (and optional related virtual auxiliary content) and physically present participants in a conference” (Makker [0002]).

Regarding claim 11, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 1. Roper, in view of Chaturvedi and Li, does not expressly teach “wherein determining, by the computing system, the scene data descriptive of the scene environment comprises: determining, by the computing system, a plurality of participant scene environments for the plurality of streams; and based at least in part on the plurality of participant scene environments, selecting, by the computing system, the scene environment from a plurality of candidate scene environments.” However, Makker does teach wherein determining, by the computing system, the scene data descriptive of the scene environment comprises: determining, by the computing system, a plurality of participant scene environments for the plurality of streams (“As another example, by including matching furnishings at multiple endpoints of a video conference (e.g., including a real table or desk in front of each real media display), each combined field of view for each respective participant (e.g., their view of their local environment combined with the virtual objects generated on their media display) can include the same matching desk or table on both sides of a video conference for creating a convincing telepresence illusion.” Makker [0138]); and based at least in part on the plurality of participant scene environments, selecting, by the computing system, the scene environment from a plurality of candidate scene environments (“In some embodiments, virtual overlays are added to the displayed media stream that are configured to imitate the local environment (e.g., a ledge, a plant, or any other object). The virtual overlay object may match the aesthetics of the local environment. The virtual overlay may be added automatically and/or per user's request.” Makker [0139]. The user or system can pick the scene environment of the user.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper, in view of Chaturvedi and Li, with Makker so as to provide participants with an “improved digital experience that provides users an enhanced immersive experience, which simulates common presence of a virtual participant (and optional related virtual auxiliary content) and physically present participants in a conference” (Makker [0002]).

Regarding claim 12, Roper, in view of Chaturvedi, Li, and Makker, discloses the computer-implemented method of claim 11, wherein the plurality of candidate scene environments comprises at least some of the plurality of participant scene environments (“In some embodiments, virtual overlays are added to the displayed media stream that are configured to imitate the local environment (e.g., a ledge, a plant, or any other object). The virtual overlay object may match the aesthetics of the local environment.” Makker [0139]).

Regarding claim 13, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 1. Roper, in view of Chaturvedi and Li, does not expressly teach “wherein: modifying the stream that represents the participant comprises, based at least in part on the scene data and the position of the participant within the scene environment, modifying, by the computing system, the stream that represents the participant in relation to a position of another participant of the plurality of participants; and wherein the method further comprises broadcasting, by the computing system, the stream to a participant device respectively associated with the other participant.” However, Makker does teach wherein: modifying the stream that represents the participant comprises, based at least in part on the scene data and the position of the participant within the scene environment, modifying, by the computing system, the stream that represents the participant in relation to a position of another participant of the plurality of participants (“In some embodiments, a transparent media display is used to enhance the immersive experience of a collaborative digital communication (e.g., video conference), e.g., by emulating the virtual participant's image with the local environment of the local participant(s), while stripping away incongruous elements of the remote environment of the remote participant(s) and/or auxiliary content to be presented (e.g., presentation, data sheet, article, picture, video, or any other document or exhibit). The media stream from the remote participant(s) may be altered before being displayed on the local transparent media display, e.g., by having a portion of the incoming information surrounding the material to be communicated (e.g., the virtual participant's image and/or auxiliary presentable content) removed.” Makker [0138]); and wherein the method further comprises broadcasting, by the computing system, the stream to a participant device respectively associated with the other participant (“Pod 1600 includes at least two transparent media displays such as 1630 and 1640 for displaying media streams from respective remote users (e.g., at different remote locations).” Makker [0169]; Fig. 16, 1635, 1640 (This remote user is not numbered in the drawing.)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper, in view of Chaturvedi and Li, with Makker so as to provide participants with an “improved digital experience that provides users an enhanced immersive experience, which simulates common presence of a virtual participant (and optional related virtual auxiliary content) and physically present participants in a conference” (Makker [0002]).

Regarding claim 14, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 1. Roper, in view of Chaturvedi and Li, does not expressly teach “wherein the method further comprises: generating, by the computing system, a shared stream that comprises the plurality of streams depicted within a virtualized representation of the scene environment based at least in part on the position of each of the plurality of participants within the scene environment; and broadcasting, by the computing system, the shared stream to a plurality of participant devices respectively associated with the plurality of participants.” However, Makker does teach wherein the method further comprises: generating, by the computing system, a shared stream that comprises the plurality of streams depicted within a virtualized representation of the scene environment based at least in part on the position of each of the plurality of participants within the scene environment (“As another example, by including matching furnishings at multiple endpoints of a video conference (e.g., including a real table or desk in front of each real media display), each combined field of view for each respective participant (e.g., their view of their local environment combined with the virtual objects generated on their media display) can include the same matching desk or table on both sides of a video conference for creating a convincing telepresence illusion.” Makker [0138]); and broadcasting, by the computing system, the shared stream to a plurality of participant devices respectively associated with the plurality of participants (“Pod 1600 includes at least two transparent media displays such as 1630 and 1640 for displaying media streams from respective remote users (e.g., at different remote locations).” Makker [0169]; Fig. 16, 1635, 1640 (This remote user is not numbered in the drawing.)). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper, in view of Chaturvedi and Li, with Makker so as to provide participants with an “improved digital experience that provides users an enhanced immersive experience, which simulates common presence of a virtual participant (and optional related virtual auxiliary content) and physically present participants in a conference” (Makker [0002]).

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Roper (US 20230126108 A1), in view of Li (US 20210321157 A1) and Chaturvedi (US 10210664 B1), and further in view of Dunn et al. (US 5991385 A, hereinafter Dunn).

Regarding claim 8, Roper, in view of Chaturvedi and Li, discloses the computer-implemented method of claim 2, wherein: the stream that represents the participant comprises the audio data that corresponds to the participant (“During video conferences, participants use client software provided by their video conference provider to share data streams of video and audio (collectively “multimedia streams”) to interact with each other.” Roper [0010]). Roper, in view of Chaturvedi and Li, does not expressly teach “the scene data comprises the acoustic characteristics of the scene environment; and wherein modifying the stream that represents the participant comprises modifying, by the computing system, the audio data based at least in part on the position of the participant within the scene environment relative to the acoustic characteristics of the scene environment.” However, Dunn does teach the scene data comprises the acoustic characteristics of the scene environment (“These and other objects, features and advantages of the invention are best achieved in audio teleconferencing apparatus in which each participant participates in a teleconference through standard telephone including at one least one an enhanced speakerphone including a programmable Digital Signal Processor (DSP) which receives a conference audio signal.” Dunn (23)); and wherein modifying the stream that represents the participant comprises modifying, by the computing system, the audio data based at least in part on the position of the participant (“Another object is enhanced audio teleconferencing apparatus and method generating positional information correlated in time with audio information to identify conference participants in groups seated about a virtual conference table.” Dunn (18)) within the scene environment relative to the acoustic characteristics of the scene environment (“In order to create a more realistic sense of a virtual conference among the participants, a need exists to add a sound field effect to the conference phone capability to create a sense of spatial location among the participants, as if all the teleconference participants were seated around a virtual conference table.” Dunn (7)). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Roper, in view of Chaturvedi and Li, with Dunn so as “to create a more realistic sense of a virtual conference among the participants” (Dunn (7)) by providing a more realistic acoustic field to better simulate the participants speaking together in person.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SAAD AHMED SYED whose telephone number is (571) 272-6777. The examiner can normally be reached Monday - Friday 8:30 am - 5:00 pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Duc Nguyen, can be reached at (571) 272-7503. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SAAD AHMED SYED/
Examiner, Art Unit 2691

/DUC NGUYEN/
Supervisory Patent Examiner, Art Unit 2691
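
For readers who want a concrete picture of the claim 1 structure at the center of the §103 dispute (determine scene data, locate the participant, then modify that participant's stream), the sketch below is a minimal, purely illustrative Python rendering of that flow. It is not drawn from the application or from Roper, Li, or Chaturvedi; every function, parameter, and key name is hypothetical, and the person-segmentation step is assumed to be supplied by any off-the-shelf model.

```python
# Illustrative sketch only; not the applicant's or any cited reference's implementation.
# Assumptions: `frame` is an HxWx3 uint8 image, `segment_person` is any callable that
# returns a boolean HxW foreground mask, and `scene_data` carries hypothetical lighting
# characteristics (light position and intensity) for the shared scene environment.
import numpy as np

def lighting_gain(participant_pos, light_pos, light_intensity):
    """Simple inverse-distance falloff: the farther the participant is from the
    scene's light source, the stronger the brightening applied to their pixels."""
    distance = float(np.linalg.norm(np.asarray(participant_pos) - np.asarray(light_pos)))
    return 1.0 + light_intensity / (1.0 + distance)

def modify_participant_stream(frame, scene_data, participant_pos, segment_person):
    """Claim-1-shaped flow: segment the participant (claims 3-4), then apply a
    position-dependent lighting correction to the foreground only (claim 5)."""
    mask = segment_person(frame)                       # foreground/background split
    gain = lighting_gain(participant_pos,
                         scene_data["light_position"],
                         scene_data["light_intensity"])
    out = frame.astype(np.float32)
    out[mask] = np.clip(out[mask] * gain, 0.0, 255.0)  # relight only the participant's pixels
    return out.astype(np.uint8)
```

The sketch is only meant to show where the combination argument bites: in the Examiner's mapping, Roper supplies the stream modification, Li supplies the participant's position, and Chaturvedi supplies the lighting data that the correction consumes.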

Prosecution Timeline

Oct 11, 2023
Application Filed
Aug 08, 2025
Non-Final Rejection — §103
Oct 21, 2025
Applicant Interview (Telephonic)
Oct 22, 2025
Examiner Interview Summary
Nov 12, 2025
Response Filed
Apr 09, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12598543
SYSTEMS AND METHODS FOR RETRIEVING RAN INFORMATION
Granted Apr 07, 2026 (2y 5m to grant)
Patent 12550217
METHOD FOR NETWORK CONFIGURATION, NON-TRANSITORY COMPUTER READABLE STORAGE MEDIUM, BASE STATION, CLEANING DEVICE AND CLEANING SYSTEM
Granted Feb 10, 2026 (2y 5m to grant)
Patent 12505558
METHOD, COMPUTER PROGRAM, DEVICE, AND SYSTEM FOR TRACKING A TARGET OBJECT
Granted Dec 23, 2025 (2y 5m to grant)
Patent 12341920
VEHICLE IMMERSIVE COMMUNICATION SYSTEM
Granted Jun 24, 2025 (2y 5m to grant)
Patent 9723429
METHOD FOR DELIVERING NOTIFICATION MESSAGES IN M2M SYSTEM AND DEVICES FOR SAME
Granted Aug 01, 2017 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 60%
With Interview: 93% (+32.8%)
Median Time to Grant: 4y 5m
PTA Risk: Moderate
Based on 344 resolved cases by this examiner. Grant probability derived from career allow rate.
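
The note above says the grant probability is derived from the career allow rate, but the tool does not publish its model. The headline figures are at least consistent with simple arithmetic on the examiner statistics shown earlier; the snippet below reproduces that assumed relationship and should be read as an illustration, not as the product's actual formula.

```python
# Reproduces the arithmetic implied by the dashboard figures; the report's actual
# projection model is not published, so treat this as an assumption.
granted, resolved = 208, 344
career_allow_rate = granted / resolved               # 0.605 -> displayed as 60%
interview_lift = 0.328                               # +32.8 percentage points
with_interview = career_allow_rate + interview_lift  # 0.933 -> displayed as 93%

print(f"Grant probability: {career_allow_rate:.1%}")
print(f"With interview:    {with_interview:.1%}")
```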
