Last updated: May 29, 2026
Application No. 18/464,017
AUTONOMOUS VIDEO CONFERENCING SYSTEM WITH VIRTUAL DIRECTOR ASSISTANCE

Non-Final OA §102 §103
Filed: Sep 08, 2023
Priority: Feb 23, 2022 (continuation of 12/041,347)
Examiner: MILLER, JOHN W
Art Unit: 2422
Tech Center: 2400 (Computer Networks)
Assignee: Huddly AS
OA Round: 3 (Non-Final)
Interview Optional

Interview lift is +2.9%, which is below the 15.0% threshold. A written response is recommended.
Based on 32 resolved cases, 2023–2026
Examiner Intelligence

MILLER, JOHN W
Career Allowance Rate: 41% of resolved cases granted (13 granted / 32 resolved), -17.4% vs TC avg
Interview Lift: +2.9% (minimal), among resolved cases with interview
Avg Prosecution (typical timeline): 2y 7m
8 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101: 1.1% (-38.9% vs TC avg)
§103: 83.9% (+43.9% vs TC avg)
§102: 11.8% (-28.2% vs TC avg)
§112: 2.2% (-37.8% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 32 resolved cases
Office Action

§102 §103
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments with respect to claim(s) 36-67, 69, and 72-75 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 60, 62, and 65-67 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lavelle (US 9,270,941 B1).
As to claim 60, Lavelle discloses: A multi-camera system, comprising: a plurality of cameras in a video conferencing space, (see Figure 1 and col. 5, lines 31-57) each of the plurality of cameras comprising: an application programming interface for communicating with other cameras among the plurality of cameras and a display; (as noted in col. 5, lines 31-57, video conferencing endpoints include one or more endpoints 140 interconnected through a network 135; in some configurations, each of the endpoints 140 includes one or more display devices for at least displaying received video and audio data and video and audio capture devices for capturing video data to send to the other video conferencing endpoints 110, 140; the video conferencing endpoint could represent a video conferencing software application (e.g. Microsoft Skype); as another example, the endpoint 140 could represent a dedicated video conferencing environment in which multiple cameras are installed) an image sensor configured to capture an overview video stream representative of the video conferencing space; (see col. 6, lines 4-11; endpoint 110 includes a wide angle camera device 115 configured to capture a video stream of the environment, and preferably is positioned such that all users within the environment are depicted within the captured video stream) at least one microphone configured to provide direction of audio information (see one or more microphones 131, Figure 1; also see Figure 5 and col. 11, line 45, to col. 12, line 40; additionally, the camera controller component 230 collects audio data using microphones sensors within the physical environment (block 515) and determines a direction from which at least a portion of the audio data originated (block 515)); and a video processing unit configured to: automatically select a portion of the overview video stream as a focus video stream, wherein the focus video stream is selected based on a combination of the direction of audio information and one or more detected characteristics of a subject represented in the overview video stream; and output the focus video stream to one or more of the other cameras and the display, (see Figure 5 and col. 11, line 45, to col. 12, line 40; one of the supplemental video streams could be the video stream using a wide angle camera sensor; the camera controller component 230 then manipulates the orientation of the camera device based on the determined measures of activity, the determined direction from which a portion of the audio data originated, and a mapping structure describing a layout of the physical environment; if the camera controller component 230 then detects activity within the frames of the supplemental video stream(s) that is indicative of a user moving to the predefined area of interest within the physical environment, the camera controller component 230 could manipulate the orientation of the camera device to face and capture video data of the predefined area of interest) wherein the focus video stream or one of a plurality of other focus video streams from the other cameras is shown on the display. (see col. 12, lines 27-40; the video streaming component 240 then facilitates the transmission of the encoded video data to a remote conferencing endpoint for display)
Claims 62, 66, and 67 are met as discussed above for claim 60.
Claim 65 is also met as discussed above for claim 60. Specifically, the characteristic of a subject moving to a predefined location within a physical environment, as captured by a wide-angle camera, is an extent of visibility of that subject within a captured image frame.
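The selection recited in claim 60, which combines a direction of audio with a detected characteristic of a subject, can be illustrated with a minimal sketch. It is not taken from the application or from Lavelle. The `Subject` fields, the `activity` score, and the 15-degree tolerance are hypothetical names and values assumed here for illustration only.

```python
from typing import NamedTuple, Optional, Sequence


class Subject(NamedTuple):
    name: str
    bearing_deg: float  # assumed: horizontal angle of the subject as seen from the camera
    activity: float     # assumed: 0..1 score for a detected visual characteristic


def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def select_focus_subject(
    subjects: Sequence[Subject],
    audio_bearing_deg: float,
    tolerance_deg: float = 15.0,  # hypothetical tolerance
) -> Optional[Subject]:
    """Pick the most active subject located near the direction of the audio."""
    candidates = [
        s for s in subjects
        if angular_distance(s.bearing_deg, audio_bearing_deg) <= tolerance_deg
    ]
    if not candidates:
        return None  # no subject lies in the direction the audio came from
    return max(candidates, key=lambda s: s.activity)


subjects = [
    Subject("A", -30.0, 0.9),
    Subject("B", 20.0, 0.4),
    Subject("C", 25.0, 0.7),
]
chosen = select_focus_subject(subjects, audio_bearing_deg=22.0)
```

In this sketch subject A has the highest activity score but is excluded because it lies well away from the audio direction, so neither signal alone determines the result.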

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 63 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lavelle (US 9,270,941 B1).
As to claim 63: The multi-camera system of claim 60, wherein the one or more detected characteristics include whether the subject is engaged in a gesture. Lavelle discloses that camera controller component 230 could select the participant whose movement closely matches a predefined movement profile indicative of user speech and that this may be the user who is determined to be speaking as opposed to a user who is simply moving throughout the environment (col. 12, line 60, to col. 13, line 3). The reference does not disclose a gesture per se, but the examiner gives Official Notice that it was notoriously well-known to make gestures while speaking (that is, to ‘speak with one’s hands’) and to raise one’s hand when determined to speak within a group. Accordingly, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to implement Lavelle so as to detect these characteristics in order to facilitate the capture of the focus stream.


Claim(s) 36-52, 54-59, 61, 64, 73, and 74 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lavelle (US 9,270,941 B1) in view of Harrison et al (US 20190313058 A1).
As to claim 36, Lavelle discloses: A multi-camera system, comprising: a plurality of cameras in a video conferencing space, each of the plurality of cameras comprising: (see Figure 1 and col. 5, lines 31-57) an application programming interface for communicating with other cameras among the plurality of cameras and a display; (as noted in col. 5, lines 31-57, video conferencing endpoints include one or more endpoints 140 interconnected through a network 135; in some configurations, each of the endpoints 140 includes one or more display devices for at least displaying received video and audio data and video and audio capture devices for capturing video data to send to the other video conferencing endpoints 110, 140; the video conferencing endpoint could represent a video conferencing software application (e.g. Microsoft Skype); as another example, the endpoint 140 could represent a dedicated video conferencing environment in which multiple cameras are installed) an image sensor configured to capture an overview video stream; (see col. 6, lines 4-11; endpoint 110 includes a wide angle camera device 115 configured to capture a video stream of the environment, and preferably is positioned such that all users within the environment are depicted within the captured video stream) and a video processing unit configured to: automatically select a portion of the overview video stream to output as a focus video stream, wherein the focus video stream is selected based on one or more detected characteristics of a subject; and output the focus video stream to one or more of the other cameras and the display, (see col. 6, line 58, to col. 7, line 8; control device 130 analyzes the video stream captured by the wide angle camera device 115 to determine, for example, which user is speaking and adjusts the orientation of the pan and tilt of the camera device 120 to center the speaker and adjust the zoom level such that the speaker occupies 70% of captured video frames); and wherein the focus video stream or one of a plurality of other focus video streams from the other cameras is shown on the display. (see discussion above, and further col. 10, lines 31-46, upon determining that a user 305 is currently speaking, the camera controller 230 could determine a portion of the captured video data to extract as a stream to be transmitted to the remote video conferencing device)
As to the claimed:  wherein the one or more detected characteristics include a location of eyes of the subject, Lavelle determines that a particular user is speaking based on an analysis of the captured wide angle video stream, but is silent as to the specific analysis.  However, it was notoriously well-known in the art of image analysis to extract a person from video images based on facial features including the location of eyes of the subject.  Accordingly, it would have been clearly obvious to one of ordinary skill in the art prior to the effective filing date of the invention to implement the analysis of Lavelle with this well-known technique to facilitate the pan and zoom process.
Lavelle fails to disclose:  and wherein the focus video stream is framed such that the eyes of the subject are aligned with a separation between a top one third and a middle one third of a frame of the focus video stream.  Harrison, in a similar field of endeavor, teaches a system and method for creating an automated television studio production for a video conferencing space with virtual director assistance ([0002, 0028, 0053]; Fig 5, intelligent director 530).  As described in [0037], Fig 5, [0053-0054] and as illustrated in Figs 6-10, Harrison discloses that 2D pose data can be used by the intelligent director to make cinematic decisions (where to direct the camera, how close to zoom).  This suggests that virtually any image capture (including an image framed such that the eyes of a person are aligned with a separation between a top one third and middle one third of the image) is a matter of design choice.  Therefore, although Harrison doesn’t specifically teach this limitation, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to implement Harrison in this manner as a matter of design choice so as to frame the focused video stream such that the resulting image captured would show the head and upper body of conferees in a way that mirrors persons sitting around a conference table. This would be more natural and pleasing than an image capture where a person's face fills the entire image.  It would have been further obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to modify Lavelle with that taught by Harrison and as a matter of design choice in order to automate the presentation of video conferencing communications.
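The framing limitation at issue in claim 36 (eyes aligned with the boundary between the top third and the middle third of the frame) reduces to simple geometry. The following is a minimal sketch only, not the applicant's or either reference's implementation. The function name, the pixel coordinate convention (origin at top left), and the default 3840x2160 overview size are assumptions.

```python
from typing import NamedTuple


class Crop(NamedTuple):
    left: float
    top: float
    width: float
    height: float


def frame_eyes_on_upper_third(
    eye_x: float,
    eye_y: float,
    crop_height: float,
    aspect_w: int = 16,
    aspect_h: int = 9,
    overview_w: float = 3840.0,  # assumed overview resolution
    overview_h: float = 2160.0,
) -> Crop:
    """Return a crop of the overview image whose upper-third line passes through the eyes."""
    crop_width = crop_height * aspect_w / aspect_h
    if crop_width > overview_w or crop_height > overview_h:
        raise ValueError("crop does not fit inside the overview image")
    # Place the eyes one third of the way down the crop, horizontally centered.
    top = eye_y - crop_height / 3.0
    left = eye_x - crop_width / 2.0
    # Clamp to the overview bounds; near an edge the alignment becomes approximate.
    top = min(max(top, 0.0), overview_h - crop_height)
    left = min(max(left, 0.0), overview_w - crop_width)
    return Crop(left, top, crop_width, crop_height)


crop = frame_eyes_on_upper_third(eye_x=1920.0, eye_y=900.0, crop_height=900.0)
eye_fraction = (900.0 - crop.top) / crop.height  # vertical position of the eyes within the crop
```

The clamping step is a design choice of this sketch: a subject close to the edge of the overview image cannot be framed exactly on the third line without leaving the image, so the crop is shifted back inside the bounds.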
As to claim 37, Lavelle discloses: A multi-camera system, comprising: a plurality of cameras in a video conferencing space, (see Figure 1 and col. 5, lines 31-57) each of the plurality of cameras comprising: an application programming interface for communicating with other cameras among the plurality of cameras and a display; (as noted in col. 5, lines 31-57, video conferencing endpoints include one or more endpoints 140 interconnected through a network 135; in some configurations, each of the endpoints 140 includes one or more display devices for at least displaying received video and audio data and video and audio capture devices for capturing video data to send to the other video conferencing endpoints 110, 140; the video conferencing endpoint could represent a video conferencing software application (e.g. Microsoft Skype); as another example, the endpoint 140 could represent a dedicated video conferencing environment in which multiple cameras are installed) an image sensor configured to capture an overview video stream including representations of at least a first subject and a second subject; (see col. 6, lines 4-11; endpoint 110 includes a wide angle camera device 115 configured to capture a video stream of the environment, and preferably is positioned such that all users within the environment are depicted within the captured video stream) and a video processing unit configured to: automatically select a portion of the overview video stream as a focus video stream; and output the focus video stream to one or more of the other cameras and the display, (see col. 6, line 58, to col. 7, line 8; control device 130 analyzes the video stream captured by the wide angle camera device 115 to determine, for example, which user is speaking and adjusts the orientation of the pan and tilt of the camera device 120 to center the speaker and adjust the zoom level such that the speaker occupies 70% of captured video frames); and wherein the focus video stream or one of a plurality of other focus video streams from the other cameras is shown on the display. (see discussion above, and further col. 10, lines 31-46, upon determining that a user 305 is currently speaking, the camera controller 230 could determine a portion of the captured video data to extract as a stream to be transmitted to the remote video conferencing device)
Lavelle fails to disclose: wherein, if a representation of the first subject does not overlap with a representation of the second subject in the overview video stream, a first frame of the focus video stream is generated including a representation of the first subject free from a partial representation of the second subject, and a second frame of the focus video stream is generated including a representation of the second subject free from a partial representation of the first subject.
Harrison, in a similar field of endeavor, teaches a system and method for creating an automated television studio production for a video conferencing space with virtual director assistance ([0002, 0028, 0053]; Fig 5, intelligent director 530). Harrison teaches: wherein, if a representation of the first subject does not overlap with a representation of the second subject in the overview video stream (see Fig. 3A, two subjects that are not overlapping), a first frame of the focus video stream is generated including a representation of the first subject free from a partial representation of the second subject (see Fig. 3B), and a second frame of the focus video stream is generated including a representation of the second subject free from a partial representation of the first subject (see Figs. 3A and 3B; [0034] discloses that either person could be zoomed in on and displayed separately; [0042], particularly note the last three lines: “As a result, the intelligent director may personalize the cinematic decisions for Alice, and may thus provide instructions to focus the camera more on Caroline than on other people or objects.”). It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to modify Lavelle with that taught by Harrison in order to automate the presentation of video conferencing communications.
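The conditional framing in claim 37, together with the combined frame recited for the overlapping case in claim 41, can be sketched as below. This is an illustration under stated assumptions, not the claimed implementation or Harrison's. It uses bounding-box intersection as a stand-in for "overlap", whereas claim 38 defines overlap in terms of one subject being only partially visible. The `Box` type and function names are hypothetical.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box of a subject in overview-image pixels."""
    left: int
    top: int
    right: int
    bottom: int


def overlaps(a: Box, b: Box) -> bool:
    """True when the two boxes share any area (assumed proxy for overlap)."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def union(a: Box, b: Box) -> Box:
    """Smallest box containing both inputs."""
    return Box(min(a.left, b.left), min(a.top, b.top),
               max(a.right, b.right), max(a.bottom, b.bottom))


def plan_focus_frames(first: Box, second: Box) -> List[Box]:
    """One shared frame when subjects overlap, otherwise one frame per subject."""
    if overlaps(first, second):
        return [union(first, second)]
    # Non-overlapping boxes: each frame holds one subject and no part of the other.
    return [first, second]


separate = plan_focus_frames(Box(100, 200, 400, 900), Box(600, 220, 900, 880))
shared = plan_focus_frames(Box(100, 200, 400, 900), Box(350, 220, 700, 880))
```

A practical system would pad each box before cropping, and would then need to re-check that the padded crop for one subject still excludes the other.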
As to claim 38, Harrison teaches: The multi-camera system of claim 37, wherein an overlap is present between the representation of the first subject and the representation of the second subject if a location of the representation of the second subject in the overview video stream causes the representation of the first subject in the overview video stream to be only partially visible. (see [0029, 0054, 0067])
As to claim 39, Harrison teaches: The multi-camera system of claim 37, wherein the first frame and the second frame of the focus video stream are shown in succession. (see Figures 3A and 3B, and [0036, 0056])
As to claim 40, Harrison teaches: The multi-camera system of claim 37, wherein the first subject is speaking. (see [0034] and Figures 3A and 3B, which illustrate an example user interaction with an example intelligent communication device. “In Figure 3A the intelligent communication device is displaying a scene with two people who are talking to each other”)
As to claim 41, Harrison teaches: The multi-camera system of claim 37, wherein, if a representation of the first subject overlaps with a representation of the second subject in the overview video stream, the focus video stream is generated to include at least one frame that includes the representation of the first subject and the representation of the second subject from the overview video stream. (see [0029, 0054, 0067])
As to claim 42, Harrison teaches: The multi-camera system of claim 37, wherein the first subject is determined to be speaking. (see Figures 3A and 3B and [0034])
Claim 43 is met as discussed above for claim 37.  
As to claim 44: The multi-camera system of claim 43, wherein the video processing unit is included in one of the plurality of cameras, Lavelle discloses, as discussed above in the rejection of claim 37, that the control device 130 analyzes video captured from the cameras. Lavelle does not disclose that the control device is physically included in one of the plurality of cameras; however, this is not considered to be a patentable distinction, as it has been held that the mere integration of parts does not in and of itself render a claim patentable. In the case of Lavelle, the integration of the control device 130 into one or more of the cameras would yield a simplified arrangement of components. Accordingly, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the combination, and particularly Lavelle, in this manner for the stated advantage.
As to claim 45, Lavelle discloses a video conferencing space (Figure 3) which meets the claimed:   The multi-camera system of claim 37, wherein the overview video stream includes representation of a meeting room, a workshop, or a classroom.
As to claim 46, Lavelle discloses: A multi-camera system, comprising: a plurality of cameras in a video conferencing space, (see Figure 1 and col. 5, lines 31-57) each of the plurality of cameras comprising: an application programming interface for communicating with other cameras among the plurality of cameras and a display; (as noted in col. 5, lines 31-57, video conferencing endpoints include one or more endpoints 140 interconnected through a network 135; in some configurations, each of the endpoints 140 includes one or more display devices for at least displaying received video and audio data and video and audio capture devices for capturing video data to send to the other video conferencing endpoints 110, 140; the video conferencing endpoint could represent a video conferencing software application (e.g. Microsoft Skype); as another example, the endpoint 140 could represent a dedicated video conferencing environment in which multiple cameras are installed) an image sensor configured to capture an overview video stream including a representation of at least one subject; (see col. 6, lines 4-11; endpoint 110 includes a wide angle camera device 115 configured to capture a video stream of the environment, and preferably is positioned such that all users within the environment are depicted within the captured video stream) and a video processing unit configured to: automatically select a portion of the overview video stream as a focus video stream, wherein the focus video stream features the at least one subject; and output the focus video stream to one or more of the other cameras and the display, (see col. 6, line 58, to col. 7, line 8; control device 130 analyzes the video stream captured by the wide angle camera device 115 to determine, for example, which user is speaking and adjusts the orientation of the pan and tilt of the camera device 120 to center the speaker and adjust the zoom level such that the speaker occupies 70% of captured video frames); and wherein the focus video stream or one of a plurality of other focus video streams from the other cameras is shown on the display. (see discussion above, and further col. 10, lines 31-46, upon determining that a user 305 is currently speaking, the camera controller 230 could determine a portion of the captured video data to extract as a stream to be transmitted to the remote video conferencing device)
Lavelle fails to disclose:  wherein the focus video stream is framed to provide a first amount of frame space in a gaze direction of the subject that is greater than a second amount of frame space in a non-gaze direction of the subject.  Harrison, in a similar field of endeavor, teaches a system and method for creating an automated television studio production for a video conferencing space with virtual director assistance ([0002, 0028, 0053]; Fig 5, intelligent director 530).  As described in [0037], Fig 5, [0053-0054] and as illustrated in Figs 6-10, Harrison discloses that 2D pose data can be used by the intelligent director to make cinematic decisions (where to direct the camera, how close to zoom).  This suggests that virtually any image capture (including an image framed such that a first amount of frame space in a gaze direction of the subject is greater than a second amount of frame space in a non-gaze direction of the subject) is a matter of design choice.  Therefore, although Harrison doesn’t specifically teach this limitation, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to implement Harrison in this manner as a matter of design choice in order to mirror the focus of persons sitting around a conference table. This would be more natural and pleasing to the viewer.  It would have been further obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to modify Lavelle with that taught by Harrison and as a matter of design choice in order to automate the presentation of video conferencing communications.
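The gaze-direction limitation of claim 46 (more frame space on the side the subject is looking toward than on the other side) can also be expressed as a short calculation. The sketch below is illustrative only and is not drawn from the application or the cited references. The function name, the `"left"`/`"right"` gaze labels, and the default 0.75 split are assumptions.

```python
from typing import Tuple


def frame_with_lead_room(
    subject_left: float,
    subject_right: float,
    crop_width: float,
    gaze: str,
    lead_fraction: float = 0.75,  # hypothetical share of free space given to the gaze side
) -> Tuple[float, float]:
    """Return (crop_left, crop_right) leaving more space on the gaze side of the subject."""
    if gaze not in ("left", "right"):
        raise ValueError("gaze must be 'left' or 'right'")
    if not 0.5 < lead_fraction < 1.0:
        raise ValueError("lead_fraction must be strictly between 0.5 and 1.0")
    free_space = crop_width - (subject_right - subject_left)
    if free_space <= 0:
        raise ValueError("crop is not wider than the subject")
    gaze_space = free_space * lead_fraction
    other_space = free_space - gaze_space
    # The space to the left of the subject is whichever side is not the gaze side.
    left_space = other_space if gaze == "right" else gaze_space
    crop_left = subject_left - left_space
    return crop_left, crop_left + crop_width


right_looking = frame_with_lead_room(1000.0, 1200.0, 800.0, gaze="right")
left_looking = frame_with_lead_room(1000.0, 1200.0, 800.0, gaze="left")
```

With the values above, a subject looking right gets 450 pixels of space to the right and 150 to the left, and the split is mirrored for a subject looking left.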
As to claim 47: The multi-camera system of claim 46, wherein the focus video stream also features a representation of an object in the gaze direction of the subject, and wherein the subject is a speaker, Lavelle discloses a whiteboard, an object in the gaze direction of both speaking and non-speaking participants. (see col. 11, line 45, to col. 12, line 26)
As to claim 48:  The multi-camera system of claim 46, wherein the subject is a listener or is not speaking, Lavelle discloses a whiteboard, an object in the gaze direction of both speaking and non-speaking participants. (see col. 11, line 45, to col. 12, line 26)
As to claim 49, Lavelle discloses: A multi-camera system, comprising: a plurality of cameras in a video conferencing space, (see Figure 1 and col. 5, lines 31-57) each of the plurality of cameras comprising: an application programming interface for communicating with other cameras among the plurality of cameras and a display; (as noted in col. 5, lines 31-57, video conferencing endpoints include one or more endpoints 140 interconnected through a network 135; in some configurations, each of the endpoints 140 includes one or more display devices for at least displaying received video and audio data and video and audio capture devices for capturing video data to send to the other video conferencing endpoints 110, 140; the video conferencing endpoint could represent a video conferencing software application (e.g. Microsoft Skype); as another example, the endpoint 140 could represent a dedicated video conferencing environment in which multiple cameras are installed) an image sensor configured to capture an overview video stream; (see col. 6, lines 4-11; endpoint 110 includes a wide angle camera device 115 configured to capture a video stream of the environment, and preferably is positioned such that all users within the environment are depicted within the captured video stream) and a video processing unit configured to: automatically select a portion of the overview video stream as a focus video stream, wherein the focus video stream is selected based on one or more detected characteristics of a non-speaking participant in a videoconference; and output the focus video stream to one or more of the other cameras and the display, (see col. 6, line 58, to col. 7, line 8; control device 130 analyzes the video stream captured by the wide angle camera device 115 to determine, for example, which user is speaking and adjusts the orientation of the pan and tilt of the camera device 120 to center the speaker and adjust the zoom level such that the speaker occupies 70% of captured video frames) and wherein the focus video stream or one of a plurality of other focus video streams from the other cameras is shown on the display. (see discussion above, and further col. 10, lines 31-46, upon determining that a user 305 is currently speaking, the camera controller 230 could determine a portion of the captured video data to extract as a stream to be transmitted to the remote video conferencing device)
Lavelle fails to disclose:  wherein the focus video stream frames the non-speaking participant.  Harrison, in a similar field of endeavor, teaches a system and method for creating an automated television studio production for a video conferencing space with virtual director assistance ([0002, 0028, 0053]; Fig 5, intelligent director 530).  With regard to the limitation: wherein the focus video stream frames the non-speaking participant, Harrison teaches at [0042], “If the person is inactive, the intelligent director may determine to zoom in on the person's face or upper torso and head. The intelligent director may also have rules based on the facial expressions a person makes. As an example, and not by way of limitation, if the person is laughing, the intelligent director may provide instructions to cut to that person and do a close crop of her face or upper torso and head so that the viewer may see the person laughing. The intelligent director may also have rules based on the gestures a person makes. Gestures may include anything from a hand wave, to a hug, to a head nod, to chopping vegetables in the kitchen. Depending on the gesture, the intelligent director may instruct the camera to do different things. For example, a hand wave or a hug may cause the intelligent director to instruct the camera to cut to the person waving his hand or hugging another person. But a gesture of chopping vegetables may cause the intelligent director to provide instructions to zoom in on the person's hands. So far, this discussion has focused on the actions of a participant who is sending visual data to a receiving participant”.  It would have been further obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to modify Lavelle with that taught by Harrison in order to automate the presentation of video conferencing communications where there are diverse forms of engagement apart from speaking per se.
As to claim 50, Harrison teaches:  The multi-camera system of claim 49, wherein the one or more detected characteristics include a posture change or a gaze direction associated with the non-speaking participant.   (for “a posture change”, see [0041] “The intelligent director may make cinematic decisions based on a person's location. For example, if the person is located far away from the intelligent communication device but is speaking, the intelligent director may make a determination to zoom in on the person.  A person's orientation may also factor into the cinematic decisions of the intelligent director.  For example, if a person is facing away from the intelligent communication device, the intelligent director may instruct the camera to focus elsewhere.”; for “gaze direction”, see [0045] “is the gaze of at least half the people in the environment directed toward the subject?” If the answer is yes, then the intelligent director may assign that feature a 1.”)
Claim 51 is met as discussed above for claim 49.
Regarding claim 52, Harrison teaches:  The camera system of claim 49, wherein the focus video stream is shown (Fig. 3B) on the display after a speaker shot (Fig. 3A, 310) or a presenter shot is displayed for a predefined length of time.  As discussed in [0034, 0045], “between users on either end of the video chat (with greater affinity scores being assigned a greater weight), the amount of time a person has been present in the environment during the audio-video communication session, the number of words a person has spoken during the audio-video communication session, only if the relevant users have opted in to sharing such information, a length of time during which a participant has made eye contact with an intelligent communication device, and contextual clues”. 
As to claim 54, Lavelle/Harrison fail to disclose:  The multi-camera system of claim 49, wherein the video processing unit and the at least one image sensor are located on a camera.  However, it is noted that the mere integration of parts is not considered to be a patentable distinction.  It would have been obvious to a person of ordinary skill in the art prior to the effective filing date of the invention, in an implementation of Lavelle using the Intelligent Director teachings of Harrison, to integrate the camera controller component 230 into one or more of sensors 210, 220 as a matter of design choice so as to facilitate the implementation of the invention with known devices having camera capabilities.    
Claim 55 is met as discussed above for claim 46.
As to claim 56: The multi-camera system of claim 49, wherein the focus video stream includes representation of an object in a gaze direction of the non-speaking participant, Lavelle discloses a whiteboard, an object in the gaze direction of both speaking and non-speaking participants. (see col. 11, line 45, to col. 12, line 26)
As to claim 57, Lavelle discloses: A multi-camera system, comprising: a plurality of cameras in a video conferencing space, (see Figure 1 and col. 5, lines 31-57) each of the plurality of cameras comprising: an application programming interface for communicating with other cameras among the plurality of cameras and a display; (as noted in col. 5, lines 31-57, video conferencing endpoints include one or more endpoints 140 interconnected through a network 135; in some configurations, each of the endpoints 140 includes one or more display devices for at least displaying received video and audio data and video and audio capture devices for capturing video data to send to the other video conferencing endpoints 110, 140; the video conferencing endpoint could represent a video conferencing software application (e.g. Microsoft Skype); as another example, the endpoint 140 could represent a dedicated video conferencing environment in which multiple cameras are installed) an image sensor configured to capture an overview video stream; (see col. 6, lines 4-11; endpoint 110 includes a wide angle camera device 115 configured to capture a video stream of the environment, and preferably is positioned such that all users within the environment are depicted within the captured video stream) and a video processing unit configured to: automatically detect a videoconference participant and one or more objects with which the videoconference participant interacts; (see Figure 4, col. 11, lines 5-44, a method for controlling a camera to capture video data based on activity detected within video data captured by another camera; camera controller component 230 analyzes the wide-angle stream to detect a region of activity within the frames of the video stream, activities for example including a user currently speaking and a user moving to a whiteboard) select a portion of the overview video stream as a focus video stream, wherein the focus video stream is selected based on the detected videoconference participant and the one or more objects with which the videoconference participant interacts (see col. 11, lines 5-44; the camera controller component 230 then manipulates a second camera device to capture a video stream corresponding to the detected region of activity); and output the focus video stream to one or more of the other cameras and the display, wherein the focus video stream or one of a plurality of other focus video streams from the other cameras is shown on the display. (see discussion above, and further col. 10, lines 31-46, upon determining that a user 305 is currently speaking, the camera controller 230 could determine a portion of the captured video data to extract as a stream to be transmitted to the remote video conferencing device)
Lavelle fails to explicitly disclose:  and wherein the focus video stream is framed to feature both the videoconference participant and the one or more objects with which the videoconference participant interacts.  Harrison, in a similar field of endeavor, teaches a system and method for creating an automated television studio production for a video conferencing space with virtual director assistance ([0002, 0028, 0053]; Fig 5, intelligent director 530).  As described in [0037], Fig 5, [0053-0054] and as illustrated in Figs 6-10, Harrison discloses that 2D pose data can be used by the intelligent director to make cinematic decisions (where to direct the camera, how close to zoom).  This suggests that virtually any image capture (including a focus stream framed to feature both a participant and the one or more objects with which the participant interacts) is a matter of design choice.  Therefore, although Harrison does not specifically teach this limitation, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to implement Harrison in this manner as a matter of design choice in order to mirror the focus of persons sitting around a conference table.  Naturally, one would focus, for example, on a speaker and the whiteboard with which they interact.  It would have been further obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to modify Lavelle with the teachings of Harrison, as a matter of design choice, in order to automate the presentation of video conferencing communications.
Claim 58 is met as discussed above for claim 57.
Claim 59 is met as discussed above for claim 57.  A user moving toward a whiteboard as disclosed in Lavelle is ‘a presenter determined to be speaking’.
Claim 64 is met as discussed above for claim 46.
Claim 61 is met as discussed above for claim 36.
Claim 73 is met as discussed above for claim 36.
As to claim 74, Lavelle discloses a video conferencing space (Figure 3) which meets the claimed:  The multi-camera system of claim 36, wherein the video conferencing space is a meeting space, meeting room, board room, lecture hall, a workshop, or a classroom. 

Claim 69 is rejected under 35 U.S.C. 103 as being unpatentable over Lavelle (US 9,270,941 B1) in view of Wysocki et al (WO 2020/208038 A1). 
Regarding claim 69, Wysocki teaches: “The camera system of claim 60, wherein the video processing unit includes a hardware accelerated convolutional neural network”.  As described in Wysocki, page 4, fifth paragraph, “The camera also includes a hardware accelerated programmable convolutional neural network (CNN). The CNN operates on a model designed using machine learning that allows the hardware to detect people in view of the camera using the overview stream. The CNN looks at the overview stream and detects where in the view of the camera people are detected.”  Wysocki is analogous art in that it discloses on page 4, fourth paragraph, Fig. 1, “generating an overview video stream and a focus video stream, wherein said focus video stream comprises sub-video images framing detected objects within said overview video stream”.  See also Figures 1-2.  It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to modify Lavelle with the teachings of Wysocki in order to provide, as Wysocki (page 2, fourth paragraph) teaches, “a solution avoiding sudden pan/tilt/zoom changes when altering the framing of people in a video stream, as this can disturb the experience of having a video call, using software to smoothly transition any adjustments to the framing of people across parameters that include pan, tilt and zoom.”

Claims 53, 72 and 75 are rejected under 35 U.S.C. 103 as being unpatentable over Lavelle (US 9,270,941 B1) in view of Harrison et al (US 20190313058 A1), and further in view of Wysocki et al (WO 2020/208038 A1). 
Regarding claim 53, Wysocki teaches:  The camera system of claim 49, wherein the video processing unit includes a hardware accelerated convolutional neural network.  As described in Wysocki, page 4, fifth paragraph, “The camera also includes a hardware accelerated programmable convolutional neural network (CNN). The CNN operates on a model designed using machine learning that allows the hardware to detect people in view of the camera using the overview stream. The CNN looks at the overview stream and detects where in the view of the camera people are detected.”  Wysocki is analogous art in that it discloses on page 4, fourth paragraph, Fig. 1, “generating an overview video stream and a focus video stream, wherein said focus video stream comprises sub-video images framing detected objects within said overview video stream”.  See also Figures 1-2.  It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to modify Lavelle/Harrison with the teachings of Wysocki in order to provide, as Wysocki (page 2, fourth paragraph) teaches, “a solution avoiding sudden pan/tilt/zoom changes when altering the framing of people in a video stream, as this can disturb the experience of having a video call, using software to smoothly transition any adjustments to the framing of people across parameters that include pan, tilt and zoom.”
Claims 72 and 75 are met as discussed above. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHN W MILLER whose telephone number is 571-272-7353. The examiner can normally be reached Monday - Friday 7:30 AM - 4:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Colleen Fauz, can be reached at 571-272-1667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/JOHN W MILLER/ Supervisory Patent Examiner, Art Unit 2422
Prosecution Timeline

Sep 08, 2023: Application Filed
Nov 18, 2024: Non-Final Rejection mailed (§102, §103)
Mar 17, 2025: Response Filed
Aug 27, 2025: Final Rejection mailed (§102, §103)
Feb 26, 2026: Request for Continued Examination
Mar 08, 2026: Response after Non-Final Action
May 06, 2026: Non-Final Rejection mailed (§102, §103) (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/784,794 (Patent 12598347): SIGNAL PROCESSING DEVICE AND IMAGE DISPLAY APPARATUS INCLUDING THE SAME. 1y 8m to grant, granted Apr 07, 2026.
18/704,438 (Patent 12556756): SYSTEM COMPRISING TV AND REMOTE CONTROL, AND CONTROL METHOD THEREFOR. 1y 9m to grant, granted Feb 17, 2026.
18/732,405 (Patent 12555179): DYNAMICALLY CONFIGURABLE VIDEO PROCESSING ARCHITECTURE. 1y 8m to grant, granted Feb 17, 2026.
18/628,791 (Patent 12515524): DISPLAY CONTROL DEVICE AND DISPLAY CONTROL METHOD THEREOF. 1y 9m to grant, granted Jan 06, 2026.
18/208,449 (Patent 12498782): Machine-Based Classification of Object Motion as Human or Non-Human as Basis to Facilitate Controlling Whether to Trigger Device Action. 2y 6m to grant, granted Dec 16, 2025.
Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 41%
With Interview (+2.9%): 44%
Median Time to Grant: 2y 7m (~0m remaining)
PTA Risk: High
Based on 32 resolved cases by this examiner. Grant probability derived from career allowance rate.
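As a rough check on the figures above, the sketch below reproduces the grant probability from the career allowance rate reported earlier in this report (13 granted out of 32 resolved cases). It assumes the interview lift of +2.9 is added in percentage points and that the displayed values are rounded to whole percentages. The report does not state its exact method, so the function and variable names are illustrative only.

```python
# Inputs taken from this report: 13 granted / 32 resolved, +2.9 interview lift.
GRANTED = 13
RESOLVED = 32
INTERVIEW_LIFT_POINTS = 2.9  # assumption: additive, in percentage points


def allowance_rate_percent(granted: int, resolved: int) -> float:
    """Return the allowance rate as a percentage of resolved cases."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return 100.0 * granted / resolved


base = allowance_rate_percent(GRANTED, RESOLVED)   # 40.625
with_interview = base + INTERVIEW_LIFT_POINTS      # about 43.5

print(f"Grant probability: {base:.0f}%")
print(f"With interview: {with_interview:.0f}%")
```

Under these assumptions the rounded results match the 41% and 44% shown above.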