Prosecution Insights
Last updated: April 19, 2026
Application No. 18/772,805

Transparent Frames In Video Conferencing

Non-Final OA: §102, §103
Filed: Jul 15, 2024
Examiner: PATEL, YOGESHKUMAR G
Art Unit: 2691
Tech Center: 2600 — Communications
Assignee: Zoom Video Communications, Inc.
OA Round: 1 (Non-Final)
Grant Probability: 83% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 4m
With Interview: 86%

Examiner Intelligence

Career Allow Rate: 83% (538 granted / 650 resolved; +20.8% vs TC avg, above average)
Interview Lift: +3.4% across resolved cases with interview (a minimal lift)
Typical Timeline: 2y 4m average prosecution; 17 applications currently pending
Career History: 667 total applications across all art units
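
The headline figures above are simple ratios that can be checked directly from the displayed counts. Here is a minimal sketch in Python; the interview lift is taken from the displayed "+3.4%" figure rather than recomputed, since the underlying per-case interview data is not shown:

    # Reproduce the dashboard's headline figures from the displayed counts.
    granted = 538            # from "538 granted / 650 resolved"
    resolved = 650
    interview_lift = 0.034   # displayed "+3.4% Interview Lift"

    allow_rate = granted / resolved               # 0.8277 -> shown as 83%
    with_interview = allow_rate + interview_lift  # 0.8617 -> shown as 86%

    print(f"Career allow rate: {allow_rate:.1%}")
    print(f"Grant probability with interview: {with_interview:.1%}")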

Statute-Specific Performance

§101: 4.7% (-35.3% vs TC avg)
§103: 61.9% (+21.9% vs TC avg)
§102: 14.4% (-25.6% vs TC avg)
§112: 14.2% (-25.8% vs TC avg)
Tech Center averages shown are estimates; based on career data from 650 resolved cases.
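
A quick way to read the "vs TC avg" deltas is to invert them and recover the baseline each comparison assumes. A minimal sketch (Python, hypothetical variable names):

    # Recover the implied Tech Center baseline behind each displayed delta.
    examiner_rate = {"101": 4.7, "103": 61.9, "102": 14.4, "112": 14.2}    # percent
    delta_vs_tc   = {"101": -35.3, "103": 21.9, "102": -25.6, "112": -25.8}

    for statute in examiner_rate:
        baseline = examiner_rate[statute] - delta_vs_tc[statute]
        print(f"§{statute}: implied TC average {baseline:.1f}%")

All four deltas imply the same 40.0% baseline, consistent with the Tech Center average being a single estimate rather than per-statute data.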

Office Action

Grounds: §102, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claim(s) 1-4, 6-8, and 10-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lim et al. (US #2020/0045261).

Regarding Claim 1, Lim discloses a method (title, abstract, figs. 1-12), comprising:

transmitting an indication to provide a spatial configuration within a physical space for a video conference (Lim ¶0075 discloses the video conferencing system 102 enables such spatial arrangements to be dynamically created, communicated, and controlled by participants, thereby further improving the perceived quality, comfort, and effectiveness of video conferencing sessions. Figs. 6A-6D illustrate techniques for selecting and changing RGB cameras that further support providing gaze-correct video conferencing sessions among and between various participants at various geographic locations during a single video conferencing session. ¶0081 discloses due to the new physical position of the participant 155 in fig. 6B, the participant 155 has turned slightly to continue a gaze direction 622 directed at the RGB camera 180c, and a gaze-correct video conferencing session is maintained. Additionally, in response to the detected movement and change in physical position of the participant 155, the multimedia communication devices 100 and 160 each convey an off-axis spatial arrangement to each of the participants 132 and 155 that is responsive to movements of the participant 132 and/or 155);

[Figure: media_image1.png]

displaying, at a transparent screen, a video stream of remote participant frames and transparent frames (Lim ¶0028 discloses the multimedia communication device 100 can be configured to display images and/or video streams from one or more remote devices or systems participating in a video conferencing session with the multimedia communication device 100, such as from the multimedia communication device 160; figs. 1-3, 5-7, and 9);

capturing, via a camera positioned behind the transparent screen (Lim ¶0029 discloses each of the RGB cameras 110 are positioned behind the display device 105 to capture images from light received through the display device 105, and accordingly are not directly visible in fig. 1. By positioning the RGB cameras 110 behind the display device 105, images can be displayed on the display device 105 over the physical positions of the RGB cameras 110. By placing the RGB cameras 110 behind the display device 105, subject gazes may be directed at the RGB cameras 110, enabling gaze-correct multi-party video conferencing), video frames of a physical space participant when the transparent frames are displayed (Lim ¶0031 discloses the RGB cameras 110 are configured and operated to periodically capture images at a frame rate suitable for video conferencing. The multimedia communication device 160 similarly includes RGB cameras 180a, 180b, 180c, and 180d. ¶0033 discloses the video conferencing system 102 [for example, the multimedia communication device 100] is configured to select one or more foreground cameras from the multiple RGB cameras 110 for capturing one or more images of one or more identified foreground subjects [for example, a human subject]); and

transmitting the captured video frames of the physical space participant for display at a remote screen (Lim ¶0037 discloses the encoded image is then provided to the remote multimedia communication device 160, thereby causing the composite image 145 to be displayed, at least in part, by the remote multimedia communication device 160, such as via a video conferencing application program executed by the remote multimedia communication device 160; fig. 1).

Regarding Claim 2, Lim discloses the method of claim 1, further comprising:

assigning the physical space participant to the remote screen with a position located in a remote physical space associated with a remote participant (Lim ¶0037 discloses similar processing may be performed to generate a sequence of multiple such images, based on images captured by the RGB cameras 110, used for a sequence of frames that are encoded in one or more video streams transmitted to participants of the video conferencing session; fig. 1);

capturing a live audio recording of the physical space participant (Lim ¶0092 discloses fig. 9 illustrates an example of gaze-correct multiparty video conferencing among five participants each at a different geographic location. In some examples, similar techniques and advantages may be realized with three or more participants each at different locations. Fig. 9 illustrates a scenario 900 including five scenes 900a, 900b, 900c, 900d, and 900e at respective different geographic locations 910, 912, 914, 916, and 918 with respective multimedia communication devices 930, 932, 934, 936, and 938 used by respective participants 920, 922, 924, 926, and 928 to participate in a single multi-party video conference session. Each of the multimedia communication devices 930, 932, 934, 936, and 938 may be configured as described for the multimedia communication devices 100 and 160 in figs. 1-8);

processing, in real time, the captured live audio recording for spatial positioning within a generated audio mix based on the assignment of the physical space participant to the remote screen and the position (Lim ¶0083 discloses in response to the detected movement and change in physical position of the participant 155, the multimedia [audio recording is inherent in multimedia] communication devices 100 and 160 each convey a more oblique off-axis spatial arrangement to each of the participants 132 and 155 than in fig. 6B. ¶0086 discloses thus, as illustrated by the examples shown in figs. 6A-6D, the video conferencing system 102, via the multimedia communication devices 100 and 160, enables spatial arrangements to be dynamically created, communicated, and controlled by video conferencing session participants); and

transmitting the generated audio mix for streamed playback at the video conference such that the spatially positioned captured live audio recording is localized for playback within a spatial position corresponding to the position of the remote screen (Lim ¶0037 discloses the encoded image is then provided to the remote multimedia communication device 160 [audio recording is inherent in multimedia], thereby causing the composite image 145 to be displayed, at least in part, by the remote multimedia communication device 160, such as via a video conferencing application program executed by the remote multimedia communication device 160; figs. 1 and 9).

Regarding Claim 3, Lim discloses the method of claim 2, wherein the processing of the captured live audio recording utilizes spatial audio techniques (Lim ¶0086 discloses further, when the techniques of figs. 5A-5D are combined with the techniques of figs. 6A-6D, spatial arrangements can be controlled and perceived in further detail, further enhancing interactions).

Regarding Claim 4, Lim discloses the method of claim 2, wherein the capturing of the live audio recording is performed via at least one of: the camera (Lim ¶0082 discloses in fig. 6C, the video conferencing system 102 has determined that the participant 155 has moved to another new physical position, which is still within an FOV 184c of the RGB camera 180c. Based on the new physical position, the video conferencing system 102 determines that the participant 155 is at a lateral physical position relative to the multimedia communication device 160 that is more aligned with the RGB camera 180a than the previous RGB camera 180b. In response to this determination, the video conferencing system 102 selects the corresponding RGB camera 110a as the foreground camera for the participant 132, changing from the RGB camera 110b selected in fig. 6B) or one or more microphones in proximity to the transparent screen (Lim ¶0116 discloses a microphone to identify tagged audio signals. ¶0066 discloses fig. 5C illustrates a second scenario 550 occurring at about a second time after the first time in fig. 5A and during the video conferencing session shown in fig. 5A in which the second participant 514 has moved closer to the second multimedia communication device 512 [audio recording with a microphone in proximity of a video capture device is inherent in multimedia recording], including a scene 550a for the first participant 504 and a scene 550b for the second participant 514, and a resulting composite image 562).

Regarding Claim 6, Lim discloses the method of claim 1, wherein the transparent screen is a transparent organic light-emitting diode (OLED) screen (Lim ¶0039, ¶0040 discloses OLED display) or a transparent active-matrix organic light-emitting diode (AMOLED) screen.

Regarding Claim 7, Lim discloses the method of claim 1, further comprising: transmitting the indication to provide a spatial configuration to a server hosting the video conference (Lim ¶0094 discloses at the time shown in fig. 9, the participant 924 is currently speaking, and accordingly may be referred to as the "active speaker" in the video conferencing session. The video conferencing system 102 may be configured to provide a visual indication of the active speaker, to assist participant identification of and focus on the active speaker [it is inherent in a video conferencing session for sending an indication of assigning an active speaker]).

Regarding Claim 8, Lim discloses the method of claim 1, wherein a background of a remote participant in the remote participant frames is replaced with a transparent background (Lim ¶0034 discloses a background image 140d has been obtained from the selected RGB camera 110d. Various techniques and details for dynamically selecting RGB cameras to capture foreground subject images and/or background images, segmenting foreground images can be used. ¶0051 discloses first background mask 318 and/or second background mask 329; fig. 3B).

Regarding Claim 10, Lim discloses the method of claim 1, further comprising: verifying that the transparent screen displays the remote participant frames such that a representation of eye contact of a remote participant is directed at the physical space participant (Lim ¶0034 discloses a background image 140d has been obtained from the selected RGB camera 110d. Various techniques and details for dynamically selecting RGB cameras to capture foreground subject images and/or background images, segmenting foreground images can be used. ¶0051 discloses first background mask 318 and/or second background mask 329; fig. 3B).

Regarding Claim 11, Lim discloses the method of claim 1, further comprising: receiving, from a server, verification that the physical space participant is displayed within a remote physical space such that a representation of eye contact of the physical space participant is directed at a remote participant (Lim ¶0034 discloses a background image 140d has been obtained from the selected RGB camera 110d. Various techniques and details for dynamically selecting RGB cameras to capture foreground subject images and/or background images, segmenting foreground images can be used. ¶0051 discloses first background mask 318 and/or second background mask 329; fig. 3B).

Regarding Claim 12, Lim discloses the method of claim 1, wherein the video stream is alternately displayed between the remote participant frames and the transparent frames at every other frame (Lim ¶0034 discloses a background image 140d has been obtained from the selected RGB camera 110d. Various techniques and details for dynamically selecting RGB cameras to capture foreground subject images and/or background images, segmenting foreground images can be used. ¶0051 discloses first background mask 318 and/or second background mask 329; fig. 3B).

Regarding Claim 13, Lim discloses the method of claim 1, further comprising: processing the captured video frames to optimize for remote real-time streaming playback (Lim figs. 1, 8, and 9).

Regarding Claim 14, Lim discloses the method of claim 1, wherein the transparent screen is configured to output video at a refresh rate of 120 hertz or greater (Lim ¶0031 discloses the RGB cameras 110 are configured and operated to periodically capture images at a frame rate suitable for video conferencing. A video refresh rate of 120 hertz or greater is an inherent feature of a video conference).

Claims 15-19 are rejected for the same reasons as set forth in Claims 1-4, 6-8, and 10-14.

Regarding Claim 20, Lim discloses the system of claim 18, wherein the spatial configuration represents one or more spatialized separations in the physical space between positions of two or more of: at least one participant (Lim ¶0031 discloses as illustrated by the examples in figs. 1-3 and 5-10, the RGB cameras 110 are positioned such that, when the multimedia communication device 100 is operated, a leftmost RGB camera 110 [in fig. 1, the RGB camera 110a] and a rightmost RGB camera 110 [in fig. 1, the RGB camera 110d] span a horizontal distance that is at least large enough, in most conditions, to obtain a view around a human subject located close to and within a field of view [FOV] of one or more of the RGB cameras 110. For example, in fig. 1, an image of the standing participant 132 is included in an image 140b captured by the RGB camera 110b, whereas the standing participant 132 is not visible in an image 140d captured by the RGB camera 110d at approximately the same time. ¶0049 discloses based on, for example, the determined distance d305 being less than the threshold distance 302 and/or the determined physical position being within the foreground space 303, the video conferencing system 102 identifies a foreground subject corresponding to the participant 132), at least one camera (Lim ¶0096 discloses the two multimedia communication devices 1020 and 1040 are communicatively coupled to operate together as a single larger multimedia communication device or system 1010, which is configured to make use of the RGB cameras 1030a, 1030b, 1030c, 1030d, 1050a, 1050b, 1050c, and 1050d, and the depth cameras 1035a, 1035b, 1055a, and 1055b, arranged behind display devices 1025 and 1045), and at least one transparent screen (Lim ¶0096 discloses the RGB cameras 1030a, 1030b, 1030c, 1030d, 1050a, 1050b, 1050c, and 1050d, and the depth cameras 1035a, 1035b, 1055a, and 1055b, arranged behind display devices [i.e., transparent screens] 1025 and 1045).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5 and 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lim et al. (US #2020/0045261) in view of Cassini et al. (US #2018/0098036).

Regarding Claim 5, Lim discloses the method of claim 2, but may not explicitly disclose wherein the spatially positioned captured live audio within the generated audio mix is optimized for one or more of: stereo, multichannel surround sound, binaural audio, three-dimensional sound, Ambisonics, and 360 degree sound.

However, Cassini (title, abstract, figs. 1-11) teaches wherein the spatially positioned (Cassini ¶0007 discloses the control unit includes a processor and a computer readable medium comprising an imaging controller. The imaging controller causes the processor to move the camera from a first spatial location to a second spatial location relative to a selected point of reference to enable the camera to capture different images for display, by a remote display device, to a remote participant to the video communication session. ¶0062 discloses the imaging controller 256 can determine the current object of interest and spatial location by one or more techniques including acquiring the facial images of each participant in the captured image using face detection techniques, motion detection, an active speaker tracking using speaker localization and a microphone array, gaze detection of one or more participants, and the like. The imaging controller 256, using face detection techniques, microphone array analysis, or motion detection, can detect a participant entering or leaving the monitored communication session area, e.g., room; figs. 2-6) captured live audio within the generated audio mix is optimized (Cassini ¶0066 discloses the local audio controller 258 can control the pan and/or tilt and/or spatial position in the remote room of the remote image capturing device 216 to provide better eye contact between local and remote participant active speakers. The local audio controller 258, by local audio tracking, can have information about local active speaker location relative to the current position of the local image capturing device 216, or point of reference of the remote participants. The local audio controller 258 can control the pan and/or tilt and/or spatial position in the remote room of the remote image capturing device 216 so as to align, by the local and remote display devices 200, an image of the active speaker participant in the local room with a selected active speaker or non-speaker participant in the remote room) for one or more of: stereo (Cassini ¶0028, ¶0055, ¶0064, ¶0072, ¶0073 discloses stereophonic sound), multichannel surround sound (Cassini ¶0028 discloses surround sound), binaural audio (Cassini ¶0064 discloses binaural recording), three-dimensional sound (Cassini ¶0055 discloses the data can be decoded into a multi-channel [e.g., 5.1 channel] audio source), Ambisonics, and 360 degree sound.

Lim and Cassini are analogous art as they pertain to video conferencing. Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the spatial positioning of participants (as taught by Lim) to capture audio stereophonically in the local monitored session location, which can be used for active speaker tracking [through speaker localization] to identify and locate a current active speaker, capture a video image of a selected area with the image capturing device, and a processor to correlate the captured video image of the area and extract the digital data as an AC-3 bitstream (as taught by Cassini, ¶0055), to overcome issues such as providing remote participants with an undesirable video of the meeting, including blind spots, hidden participants, and misleading gaze direction (Cassini, ¶0004).

Regarding Claim 9, Lim discloses the method of claim 1, but may not explicitly disclose wherein the captured video frames to be transmitted are optimized for one or more of: multichannel video, spatial video, and 360 degree video.

However, Cassini (title, abstract, figs. 1-11) teaches wherein the captured video frames to be transmitted are optimized for one or more of: multichannel video, spatial video, and 360 degree video (Cassini ¶0055 discloses the audio capturing device 220 can capture audio stereophonically in the local monitored session location, which can be used for active speaker tracking [through speaker localization] to identify and locate a current active speaker. For example, the image capturing device can capture a video image of a selected area, and a processor can correlate the captured video image of the area and extract the digital data as an AC-3 bitstream. The data can be decoded into a multi-channel [e.g., 5.1 channel] audio source. Commonly, the AC-3 bitstream is interleaved with the video and control bitstreams).

Lim and Cassini are analogous art as they pertain to video conferencing. Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the spatial positioning of participants (as taught by Lim) to capture audio stereophonically in the local monitored session location, which can be used for active speaker tracking [through speaker localization] to identify and locate a current active speaker, capture a video image of a selected area with the image capturing device, and a processor to correlate the captured video image of the area and extract the digital data as an AC-3 bitstream (as taught by Cassini, ¶0055), to overcome issues such as providing remote participants with an undesirable video of the meeting, including blind spots, hidden participants, and misleading gaze direction (Cassini, ¶0004).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to YOGESHKUMAR G PATEL whose telephone number is (571) 272-3957. The examiner can normally be reached 7:30 AM-4 PM PST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Duc Nguyen, can be reached at (571) 272-7503. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/YOGESHKUMAR PATEL/
Primary Examiner, Art Unit 2691
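
For readers less familiar with the claimed technique, here is a minimal illustrative sketch of the frame interleaving recited in Claims 1, 12, and 14: remote-participant frames alternate with transparent frames at every other frame, and a camera mounted behind the screen captures only while the panel is transparent. The sketch is Python with hypothetical screen/camera/network interfaces; neither the application nor Lim publishes code, so this is only a plausible reading of the claim language, not either party's implementation:

    import time

    REFRESH_HZ = 120              # Claim 14 recites 120 hertz or greater
    FRAME_PERIOD = 1.0 / REFRESH_HZ

    def conferencing_loop(screen, camera, remote_stream, uplink):
        # `screen`, `camera`, `remote_stream`, and `uplink` are hypothetical
        # stand-ins for real display, capture, and network APIs.
        frame_index = 0
        while True:
            started = time.monotonic()
            if frame_index % 2 == 0:
                # Even frames: display the remote participant frames.
                screen.show(remote_stream.next_frame())
            else:
                # Odd frames: render a transparent frame (Claim 12's
                # every-other-frame alternation) so the behind-screen
                # camera has an unobstructed view (Claim 1), then
                # transmit the capture for display at the remote screen.
                screen.show_transparent()
                uplink.send(camera.capture())
            frame_index += 1
            # Pace the loop to the panel's refresh rate.
            time.sleep(max(0.0, FRAME_PERIOD - (time.monotonic() - started)))

At 120 Hz, this alternation still yields an effective 60 Hz for both display and capture, which is one way to read why the claims pair the every-other-frame limitation with a minimum refresh rate.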

Prosecution Timeline

Jul 15, 2024: Application Filed
Mar 19, 2026: Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications involving similar technology granted by the same examiner

Patent 12598426: CHANGE OF A MODE FOR CAPTURING IMMERSIVE AUDIO (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596525: METHOD TO DETERMINE INTENDED DIRECTION OF A VOCAL COMMAND AND TARGET FOR VOCAL INTERACTION (granted Apr 07, 2026; 2y 5m to grant)
Patent 12592675: AUDIO DEVICE WITH MICROPHONE AND MEDIA MIXING (granted Mar 31, 2026; 2y 5m to grant)
Patent 12593010: COMMUNICATION ASSEMBLY (granted Mar 31, 2026; 2y 5m to grant)
Patent 12587448: AI-BASED NETWORK TROUBLESHOOTING WITH EXPERT FEEDBACK (granted Mar 24, 2026; 2y 5m to grant)
Study what changed in these applications to get past this examiner; the list reflects the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 83%
With Interview: 86% (+3.4%)
Median Time to Grant: 2y 4m
PTA Risk: Low
Based on 650 resolved cases by this examiner. Grant probability derived from career allow rate.
