Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 23-26, 34-39, 45 and 46 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lavelle (US 9,270,941 B1) in view of Wysocki et al (WO 2020/208038 A1).
As to claim 23, Lavelle discloses: A method for automatically controlling framing associated with a focus video stream representative of at least a portion of a video conferencing space (see the video conferencing system, Figure 2 and 7, the video conferencing environment, Figures 7 and 8, and the method of creating a video stream, Figure 6), the method comprising:
capturing video images representing at least a portion of the video conferencing space using a plurality of image sensors each located in a different area of the video conferencing space; (video conferencing enclosures 810A-C covering auditorium seating sections 801-803, each enclosure corresponding to the enclosure 720 of Figure 7)
capturing audio signals in the video conferencing space using a plurality of audio devices each located in a different area of the video conferencing space; (col. 14, lines 39-42: “a single enclosure that includes multiple camera devices (e.g., the enclosure 720 that includes camera devices 115, 120A and 120B and in some configurations a microphone)”)
generating a plurality of overview video streams based on the captured video images and generating the focus video stream, the focus video stream being a sub-view of one of the plurality of overview video streams; (see camera sensors 115, 120 of Figure 2; for overview streams see step 610, Figure 6, and for the focus stream see step 630; in steps 615-625, sub-video streams corresponding to focus streams are identified and extracted)
providing the plurality of overview video streams [to a machine learning model, wherein the machine learning model is configured] to analyze the plurality of overview video streams to detect one or more objects of interest present in the video conferencing space, [and wherein the machine learning model is configured] to generate an output indicative of the one or more detected objects of interest; (Figure 6, step 615: “perform facial recognition analysis to identify video conferencing participants within frames of the video data”; step 620: “determine a measure of motion for each identified participant across frames of the video data”; and step 625: “select one of the identified participants based on the determined measures of motion”)
using a virtual director to determine a framing change relative to the focus video stream, based on the one or more detected objects of interest and a particular scenario; (col. 15, lines 14-26: “control logic for the video conferencing enclosure 810C can analyze the received video data received from the other video conferencing enclosures, as well as the video data captured using the camera sensors 115 and 120 on the video conferencing enclosure 810C, to determine which video stream(s) to include in the generated video stream”)
generating an updated focus video stream based on the determined framing change; and (Figure 6, col. 12, lines 55+, the camera controller component can determine which of the participants is currently speaking within the physical environment, and may ignore other forms of motion; given that the system described is a video conferencing system, it is evident that the analysis discussed above updates as current speakers change)
outputting the updated focus video stream to a display. (col. 15, lines 27-28: “The generated video stream is then transmitted to the user device 125”)
Lavelle fails to disclose the use of a machine learning model to analyze the video streams to detect objects of interest in the video conferencing space. Wysocki teaches a system for transitioning between best overview frames in live video. Specifically, Wysocki teaches, page 4, 5th paragraph: “The camera also includes a hardware accelerated programmable convolutional neural network (CNN). The CNN operates on a model designed using machine learning that allows the hardware to detect people in view of the camera using the overview stream. The CNN looks at the overview stream and detects where in the view of the camera people are detected.” Wysocki is analogous art in that it likewise discloses a system for generating a focus video frame: page 4, 4th paragraph, “generating an overview video stream and a focus video stream, wherein said focus video stream comprises sub-video images framing detected objects within said overview video stream”. See also Figures 1-2. It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to modify Lavelle with that taught by Wysocki for the purpose of, as Wysocki teaches (page 2, 4th paragraph): “a solution avoiding sudden pan/tilt/zoom changes when altering the framing of people in a video stream, as this can disturb the experience of having a video call, using software to smoothly transition any adjustments to the framing of people across parameters that include pan, tilt and zoom”.
Claim 24 is met by that discussed above in claim 23.
As to claim 25, Wysocki teaches: The method of claim 24, wherein said neural network is pre-trained with a training set of video and audio data adapted to the particular scenario. (Wysocki, page 4, 5th paragraph, to page 6, 1st paragraph: the CNN can be trained to not be biased on parameters like gender, age, and race; the model will also be able to detect a partial view of a person and people viewed from different angles)
As to claim 26, Lavelle discloses: The method of claim 25, wherein said particular scenario is selected from the group consisting of a classroom, a workshop, a townhall, a newsroom, a boardroom, a courtroom, an interview studio, and a voting chamber. (the scenarios claimed are considered intended use and generally do not carry patentable weight; it is however noted that the auditorium disclosed with respect to Figure 8 of Lavelle can be applied to any of these scenarios)
As to claim 34, Lavelle discloses: The method of claim 23, further comprising capturing non-image signals in said video conferencing space using a plurality of smart sensors, each of said plurality of smart sensors comprising an application program interface connected with said virtual director to provide input to said virtual director. (Lavelle, col. 11, line 62, to col. 12, line 5: camera controller component 230 collects audio data using microphone sensors within the physical environment (block 515) and determines a direction from which at least a portion of the audio originated; a microphone array could be used to capture the audio data, and upon identifying that a portion of the audio data matches a predefined profile for user speech, the camera controller component 230 could use the data collected to determine the direction from which user speech originated; also, note that the virtual director is met by that discussed above with respect to the video conferencing enclosure 810C; an API is inherent to the arrangement in Figure 8 given the connections and successful communication between the video conferencing enclosures)
Claim 35 is met by that discussed above for claim 34.
As to claim 36, Lavelle discloses: The method of claim 23, wherein said one or more objects of interest comprise persons and non-person items. (Lavelle discloses, Figure 4, col. 11, lines 5-44, camera controller component 230 analyzing the wide-angle stream to detect a region of activity, including a user moving to a pre-defined location within the physical environment such as a whiteboard)
As to claim 37, Lavelle discloses: The method of claim 23, wherein the machine learning model is further configured to analyze the plurality of overview video streams to detect postures of the one or more objects of interest, and wherein said postures comprises positions, orientations, gestures, and directions of said detected objects. (As indicated above in the rejection of claim 23, Lavelle determines which participants are currently speaking in a physical environment and may ignore other forms of motion such as one of the participants nodding or scratching their head, col. 12, lines 55+; further, as indicated in the rejection of claim 36, Lavelle detects a user moving to a predefined region such as a whiteboard)
Claim 38 is met as discussed above for claim 23.
As to claim 39, Lavelle discloses: The method of claim 23, wherein the video conferencing space is selected from the group consisting of a classroom, a workshop, a townhall, a newsroom, a boardroom, a courtroom, an interview studio, and a voting chamber. (the type of video conferencing space claimed is considered intended use and generally does not carry patentable weight; it is however noted that the auditorium disclosed with respect to Figure 8 of Lavelle is applicable to any of these spaces)
Claim 45 is met as discussed above for claim 23.
As to claim 46, Lavelle/Wysocki fail to explicitly disclose: The method of claim 23, wherein the plurality of image sensors and the plurality of audio devices are included in a plurality of smart cameras. However, it is noted that the mere integration of parts is not considered to be a patentable distinction. It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention, in an implementation of the combination, to integrate cameras 115 and 120, control device 130, and microphones 131 as a matter of design choice in order to facilitate the implementation of the invention with known devices having these combinations of features.
Claim 47 is met as discussed above for claim 23.
Claim(s) 27-33 and 40-44 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lavelle (US 9,270,941 B1) in view of Wysocki et al (WO 2020/208038 A1) and in further view of Harrison et al (US 2019/0313058 A1).
As to claim 27, Harrison teaches an intelligent communication device for use in a video conferencing environment that has internal processing that enables automated cinematic decisions based on a descriptive model. Harrison teaches: The method of claim 23, wherein the method further comprises applying, using the virtual director, a predetermined rule set to determine the framing change, wherein the predetermined rule set comprises a first rule for evaluating possible framing for each object in the video conferencing space based on a first plurality of parameters to determine a best frame. (Note Harrison, Figure 5; [0053] teaches: “Once the intelligent director 530 has accessed the information in the descriptive model 520, it may generate a plan 540 for the camera and microphone to follow”; [0042] teaches: “The intelligent director may have rules based on movements the person is currently taking. Generally, if a person is active [e.g., moving quickly around the room, jumping, waving his arms], the intelligent director may determine to center the camera on the active person and zoom out so that the person's movements may be seen without quick and jerky camera movements”; the predetermined rule set comprises a first rule to zoom out based on a first plurality of parameters, such as a person moving quickly around the room, jumping, or waving arms---this first rule indicates the evaluation of possible framing for each object in the video conferencing space, as claimed) It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Lavelle/Wysocki with that taught by Harrison in order to automate the cinematic decision-making process to make the rendering of video more natural and pleasing to the participants.
As to claim 28, Wysocki teaches: The method of claim 27, wherein the predetermined rule set further comprises a second rule for detecting changes in the video conferencing space based on a second plurality of parameters to trigger transition of frames. (Note Wysocki, page 5, 2nd and 3rd paragraphs: “Once the number of people in view of the camera and their position has been established, the camera uses this information to run an algorithm designed to determine the appropriate and desired view to apply on the primary stream. The algorithm includes parameters that describe padding on all sides of the detected persons in view of the camera. It also includes parameters that describe how often the camera should react to change”; “By detecting where people are, the camera adapts the field of view for the best experience, at varied speed based on what is happening. For instance, if there are new people [Figure 2.1] then the camera is zoomed out, if there is one person [Figure 2.2] the camera frames that person, if there are two [Figure 2.3] then the camera frames both people and if the people move [Figure 2.4] the camera may update the framing as well”; that is, based on detected changes in the video conferencing space, the camera framing is updated---this indicates another rule (second rule) for triggering the transition of frames, such as updating the framing when the objects move.) It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to further modify Lavelle with these teachings for the advantage noted above in the rejection of claim 23.
As to claim 29, Harrison teaches: The method of claim 28, wherein the predetermined rule set further comprises a third rule for applying suitable shot types to each frame based on a third plurality of parameters. (see Harrison, [0042]: The intelligent director may also have rules based on the gestures a person makes. The intelligent director may also have rules based on the facial expressions a person makes. As an example and not by way of limitation, if the person is laughing, the intelligent director may provide instructions to cut to that person and do a close crop of her face or upper torso and head so that the viewer may see the person laughing; that is, based on facial expressions made, which are a third plurality of parameters, the third rule directs a close crop of her face---this third rule is for applying suitable shot types to each frame; also see [0028]: consistent with television studio production principles, such cinematic decisions may include any choice a human director would make if she were controlling the camera[s] and microphones [e.g., generating cinematic cuts], as well as any decision that might be available by way of a video editor [e.g., choosing to apply visual effects on the fly]; that is, the decisions are based on the predetermined rule set, and since the decisions are cinematic decisions which include television studio production principles, the predetermined rule set, including the third rule, is interpreted as conforming to television studio production principles) It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Lavelle/Wysocki with that taught by Harrison for the reasons indicated above.
As to claim 30, Harrison teaches: The method of claim 29, wherein said shot types are selected from the group consisting of a total shot, a medium shot, a close shot, an interest shot, a listening shot, a presenter shot, a speaker shot, and a group shot. It is noted that claim 30 is a Markush claim and as such the examiner need only meet one alternative of the group to satisfy the claim. Nevertheless, the alternatives are met as follows: the claimed total shot/group shot is met by the overall shot of people in conference (Figures 1 and 3A, [0042]); the claimed medium shot is met by ‘the intelligent director may determine to center the camera on the active person and zoom out’---the camera centered on the active person is interpreted as a medium shot; the claimed close shot/interest shot is met by ‘close crop of her face so that the viewer may see the person laughing’---this is also an interest shot due to framing an object of interest based on cues of the scene, i.e. laughing, see [0057]; the claimed listening shot and presenter shot/speaker shot are met by Figure 3A, which shows one person as a presenter and the other listening. It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Lavelle/Wysocki with that taught by Harrison for the reasons indicated above.
As to claim 31, Harrison teaches: The method of claim 30, wherein the predetermined rule set further comprises a fourth rule for applying a virtual Director's Cut based on a fourth plurality of parameters to the video conferencing space. (see [0002] This disclosure generally relates to video conferencing; that is the automated television studio production noted in [0028] is for a video conferencing space; also see [0042] which teaches various actions, movements, and gestures of persons and how they are factored in based on rule set for the cinematic decisions of the intelligent director) It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Lavelle/Wysocki with that taught by Harrison for the reasons indicated above.
As to claim 32, Lavelle discloses: The method of claim 31, wherein said video conferencing space comprises a classroom, a workshop, a meeting room, a broadcast, a bilateral negotiation, a court proceeding, a panel discussion, or a voting assembly. (the type of video conferencing space claimed is considered intended use and generally does not carry patentable weight; it is however noted that the auditorium disclosed with respect to Figure 8 of Lavelle is applicable to any of these spaces)
As to claim 33, Harrison teaches: The method of claim 32, wherein the predetermined rule set further comprises a fifth rule for framing clean shots for objects within the video conferencing space based on a fifth plurality of parameters. (see [0002] This disclosure generally relates to video conferencing; that is the automated television studio production noted in [0028] is for a video conferencing space; also see [0042] which teaches various actions, movements, and gestures of persons and how they are factored in based on rule set for the cinematic decisions of the intelligent director; additionally, the claimed ‘framing a clean shot’ is met by [0048]) It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Lavelle/Wysocki with that taught by Harrison for the reasons indicated above.
As to claim 40, Harrison teaches: The method of claim 27, wherein the first plurality of parameters comprises: (i) whether the one or more detected objects is speaking; (ii) a length of speaking time; (iii) a direction of the one or more detected objects' gaze; (iv) an extent of the one or more detected objects' visibility in the focus video stream; (v) a posture of the one or more detected objects; and (vi) what other objects are visible in the focus video stream. ((i) whether the object is speaking and (ii) the length of speaking time, is met by [0045]: “the amount of time a person has been present in the environment during the audio-video communication session, the number of words a person has spoken during the audio-video communication session”; (iii) the direction of the object's gaze, is met by [0045]: “is the gaze of at least half the people in the environment directed toward the subject? If the answer is yes, then the intelligent director may assign that feature”; (iv) the extent of the object's visibility in the frame, (v) the posture of the object, and (vi) what other objects are visible in the frame, are met by [0032, 0036, 0041]) It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Lavelle/Wysocki with that taught by Harrison for the reasons indicated above.
As to claim 41, Harrison teaches: The method of claim 28, wherein the second plurality of parameters comprises: (i) the one or more detected objects starts to speak; (ii) the one or more detected objects moves; (iii) the one or more detected objects stands up; (iv) the direction of the one or more detected objects' gaze changes; (v) the one or more detected objects shows a reaction; (vi) the one or more detected objects displays a new item in the scene; (vii) the one or more detected objects has spoken for a predefined length of time; and (viii) lack of meaningful reactions in other detected objects for a predefined length of time. ((i) an object starts to speak, (ii) an object moves, (iii) an object stands up, (iv) the direction of an object's gaze changes, and (v) an object shows a reaction, are met by [0045, 0048]; (vi) an object displays a new item in the scene, is met by [0056]: “The old background information for bounding box regions therefore remains unchanged. This is why bounding boxes 720 in FIG. 7 show no background—this is to illustrate that no new background information is gathered about the area inside the bounding boxes—old data from previous frames may still be kept”; (vii) an object has spoken for a predefined length of time, is met by [0045]: “the amount of time a person has been present in the environment during the audio-video communication session, the number of words a person has spoken during the audio-video communication session”; (viii) lack of meaningful reactions in other objects for a predefined length of time, is met by [0042]: “As an example and not by way of limitation, a first participant, Alice, may be having an AV communication session with a friend, Betsy, and Betsy's friend, Caroline. When Caroline comes into view on Alice's smart communication device, Alice may smile and say something like ‘Hi Caroline! It's so nice to see you!’ Alice's smart communication device may pick up on this reaction from Alice and store this as an increased affinity for Caroline during the duration of the communication session, but not for future communication sessions. As a result, the intelligent director may personalize the cinematic decisions for Alice, and may thus provide instructions to focus the camera more on Caroline than on other people or objects.”) It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Lavelle/Wysocki with that taught by Harrison for the reasons indicated above.
As to claim 42, Harrison teaches: The method of claim 29, wherein the third plurality of parameters comprises: (i) a total shot to frame substantially all detected objects and most of the video conferencing space thereby providing an overall context to the video conferencing space; (ii) a medium shot to frame a predefined number of detected objects and focus on one who is speaking, thereby featuring an active dialog; and (iii) a close shot to frame an object of the one or more detected objects speaking for a predefined length of time, thereby featuring a presenter. (the claimed total shot/group shot, the overall shot of people in conference, is met by Figures 1 and 3A, and [0042]; the claimed medium shot is met by: the intelligent director may determine to center the camera on the active person and zoom out---the camera is still centered on the active person and is interpreted as a medium shot; the claimed close shot is met by: close crop of her face so that the viewer may see the person laughing) It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Lavelle/Wysocki with that taught by Harrison for the reasons indicated above.
As to claim 43, Harrison teaches: The method of claim 42, wherein the third plurality of parameters further comprises (i) an interest shot to frame the one or more detected objects of interest based on cues of the scene in the video conferencing space, including the one or more detected objects being at the center of a gaze from every other detected object within the video conferencing space and an item held up by the one or more detected objects; (ii) a listening shot to frame at least one other detected object who is not speaking, thereby featuring engagement of non-speaking objects in the video conferencing space; and (iii) a presenter shot to frame the one or more detected objects who has been speaking for the longest length of time compared to other detected objects, thereby featuring the presenter from different camera angles and compositions within the video conferencing space, wherein said interest shot is adapted as a close shot, said listening shot is adapted as one of a close and medium shot, and said presenter shot is adapted as one of a close and medium shot. (an interest shot and a total shot/group shot (the overall shot of people in conference) are met by Figures 1 and 3A and [0042]; a medium shot is met by: the intelligent director may determine to center the camera on the active person and zoom out---the camera is still centered on the active person and is interpreted as a medium shot; a close shot is met by: close crop of her face so that the viewer may see the person laughing; an interest shot is met by: a close crop of her face so that the viewer may see the person laughing (this shot is also an interest shot due to framing an object of interest based on cues of the scene, i.e. laughing, see [0057]); a listening shot and a presenter shot/speaker shot are met by Figure 3A, in which one person is a presenter and the other is listening; a presenter shot to frame an object who has been speaking for the longest length of time compared to other objects, thereby featuring the presenter from different camera angles and compositions within the video conferencing space, wherein said interest shot is adapted as a close shot, said listening shot is adapted as one of a close and medium shot, and said presenter shot is adapted as one of a close and medium shot, is met by that noted above and in [0032, 0041-0042].) It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Lavelle/Wysocki with that taught by Harrison for the reasons indicated above.
As to claim 44, Harrison teaches: The method of claim 31, wherein the fourth plurality of parameters comprises: (i) a classroom production scenario starting with showing a presenter and an audience using total shots, then transitioning to framing the presenter in presenter shots for a predefined length of time, followed by switching between listening shots showing the audience and presenter shots showing the presenter; and (ii) a meeting room production scenario starting with total shots creating an understanding of the entire video conferencing space with all visible objects, after a predefined length of time transitioning to framing a group of objects with medium shots in a sub-location of the video conferencing space focusing on an active object, followed by framing an object who is speaking at the sub-location using medium shots that best display the front of the object's face, after another predefined length of time switching to framing other objects in the video conferencing space using listening shots that best display the front of the object's faces, and rotating back to total shots featuring all objects if no object is speaking in the video conferencing space. (as noted above, Harrison [0002]: “This disclosure generally relates to video conferencing”; that is, the automated television studio production noted in [0028] is for a video conferencing space; further, the limitation “any special-purpose video conferencing space” is interpreted as applicable to any intended use, such as a classroom, a workshop, a meeting room, a broadcast, a bilateral negotiation, a court proceeding, a panel discussion, and a voting assembly; the type of the conferencing space where the conference is held is considered intended use and generally carries no patentable weight, however, see Harrison [0079]; it was further noted above that the auditorium disclosed with respect to Figure 8 of Lavelle can be applied to any of these scenarios; the remaining limitations are met as discussed above for claims 41-43 and by Harrison [0032, 0041-0042].) It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Lavelle/Wysocki with that taught by Harrison for the reasons indicated above.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHN W MILLER whose telephone number is 571-272-7353. The examiner can normally be reached Monday - Friday 7:30 AM - 4:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Colleen Fauz, can be reached at 571-272-1667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JOHN W MILLER/ Supervisory Patent Examiner, Art Unit 2422