DETAILED ACTION
This action is in response to the application filed 05/20/2024. Claims 1-20 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 6-8, 10-13, 15-17, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Yu (U.S. Pub. No. 2023/0247073) in view of Oyman (EP Pub. No. 2989790).
Regarding Claim 1, Yu teaches
A method for selectively normalizing resolutions of video streams in a communication system (see Yu, Paragraph [0129], method comprises determining normalized resolutions for first and second regions of interest of an initial video stream captured by a video capture device located within a physical space), comprising:
capturing by a camera, during a communication session, an image of a scene including a subject of interest (SOI), the image being normalized in a first normalization resolution (see Yu Figure 10, conference connection request is sent, connection is established, video capture is initiated, initial video stream is captured, and regions of interest are determined, Paragraph [0106], At 1010, video capture is initiated at the physical space device 1002. At 1012, responsive to the initiation of the video capture, an initial video stream is captured by the video capture device 1000. At 1014, regions of interest of the initial video stream are determined at the physical space device 1002, Paragraph [0082], The initial video stream processing tool 702 processes an initial video stream obtained from a video capture device located within a physical space, for example, the video capture device 400 shown in FIG. 4. Processing the initial video stream includes determining regions of interest of the initial video stream. The initial video stream processing tool 702 may determine the regions of interest by performing object detection against one or more video frames of the initial video stream. For example, the initial video stream processing tool 702 may use a machine learning model trained for object detection to detect objects (e.g., partial or whole human faces) within the initial video stream, and Paragraph [0084], the object size processing tool 704 determines sizes of the objects at the regions of interest within the initial video stream at the resolution captured by the video capture device);
analyzing, by a processor in communication with the camera, the captured image normalized in the first normalization resolution to determine a size of the SOI (see Yu Paragraph [0082], The initial video stream processing tool 702 processes an initial video stream obtained from a video capture device located within a physical space, for example, the video capture device 400 shown in FIG. 4. Processing the initial video stream includes determining regions of interest of the initial video stream. The initial video stream processing tool 702 may determine the regions of interest by performing object detection against one or more video frames of the initial video stream. For example, the initial video stream processing tool 702 may use a machine learning model trained for object detection to detect objects (e.g., partial or whole human faces) within the initial video stream, and Paragraph [0084], the object size processing tool 704 determines sizes of the objects at the regions of interest within the initial video stream at the resolution captured by the video capture device);
selecting, by the processor, a second normalization resolution, different than the first normalization resolution, for a video stream of the SOI based on the determined size, wherein a lower normalization resolution is selected for SOIs with a smaller determined size and a higher normalization resolution is selected for SOIs with a higher determined size (see Yu Paragraph [0085], The normalized resolution determination tool 706 determines the normalized resolutions at which to capture individual video streams for each of the regions of interest. The normalized resolutions are generally different from resolutions of the initial video stream from which the subject regions of interest were detected. Generally, the normalized resolutions determined for a given region of interest may be the same as or higher than the resolution at which the video content of the region of interest was captured within the initial video stream. Thus, determining the normalized resolutions may include increasing the resolution of portions of the initial video stream corresponding to each of the regions of interest. In particular, the amount by which the resolution of a portion of the initial video stream corresponding to a given region of interest is to increase may be based on the size of the detected object (e.g., the conference participant) within that region of interest. This helps to ensure that, when video stream captured according to the normalized resolutions are later output for rendering within separate user interface tiles of conferencing software, the resulting sizes and quality levels of the conference participants within those separate user tiles conform to one another. 
For example, where there are two regions of interest determined within the initial video stream in which one corresponds to a first conference participant near the video capture device within the physical space and one corresponds to a second conference participant farther from the video capture device within the physical space, determining the normalized resolutions can include increasing the resolution for the region of interest of the first conference participant by a first amount and increasing the resolution for the region of interest of the second conference participant by a second amount which is greater than the first amount. As a result of the increases by the first amount and the second amount, the sizes of the first conference participant and of the second conference participant and the quality levels of the video streams representing those conference participants will be identical or within a threshold range of each other);
causing the camera to normalize and transmit the video stream of the SOI at the selected normalization resolution (see Yu Figure 11, obtain video streams at normalized resolutions from video capture device and Paragraph [0091], The instruction generation tool 708 generates instructions that, when processed by the video capture device, cause the video capture device to capture the individual video streams for each of the regions of interest at the normalized resolutions. The instructions, while referred to as instructions, may be or otherwise include one or more of instructions, commands, data, and/or other information which can be processed to cause the video capture device which receives the instructions to capture the video streams at the normalized resolutions. The instructions are generated based on the normalized resolutions determined by the normalized resolution determination tool 706.); and
causing transmission, by the communication system, of the normalized video stream of the SOI to one or more client devices participating in the communication session (see Yu Figure 11, transmit video streams for output within separate user tiles of software user interface, and Paragraph [0119], At 1112, the video streams are transmitted to a server device for output within separate user interface tiles of a software user interface. The server device runs the conferencing software to which the physical space device and one or more remote devices are connected. A user interface of the conferencing software is output at each of those devices. The video streams captured according to the instructions are rendered, based on output from the server device to each such device, within the separate user interface tiles. As disclosed above, the representations of the first and second conference participants within their respective user interface tiles appear to be the same size and quality level despite those conference participants being at different locations within the physical space and thus initially being of different sizes within the initial video stream).
Yu does not expressly teach
encoding video streams and an encoding resolution
However, Oyman teaches
encoding video streams (see Oyman Paragraph [0022], This feature, named as ROI-based zooming (ROIZoom) can provide better image quality for the selected region than with a simple graphical zoom, since the sending device in this case can use all of the available bandwidth for encoding and transmitting the ROI, which can therefore deliver higher bitrates and quality to the receiving terminal, and Paragraph [0023], The sending client 202 encodes and transmits video based on the indicated ROI) and an encoding resolution (see Oyman Claim 9, receive information from the other user equipment (106, 202), wherein the information defines a user-defined region of interest (110) within a field of view at the user equipment (108, 204); capturing a video that includes the field of view at the first user equipment (108, 204); characterized by encoding the video corresponding to the defined region of interest (110) only, wherein the defined region of interest is configured to be encoded with a negotiated resolution rather than a whole captured frame; and transmitting the encoded video corresponding to the defined region of interest (110) only to the second user equipment (106, 202))
It would have been obvious to one of ordinary skill in the art before the effective filing date of
the claimed invention to combine the teaching of capturing a video in a first resolution, selecting a second resolution for a subject of interest within a video conference according to the size of the SOI, and transmitting the video to the devices participating in the video conference at the second resolution (as taught in Yu), with encoding a video stream and altering an encoding resolution of a video stream (as taught in Oyman), the motivation being to increase bandwidth efficiency, because encoding allows more efficient use of the available bandwidth by allocating a higher bitrate to regions of interest and a lower bitrate to less important regions, thereby improving perceived quality at the receiving terminal without increasing overall bandwidth consumption (see Oyman Paragraph [0022]).
Regarding Claim 2, Yu in view of Oyman teaches
The method of claim 1, wherein the SOI is a human face (see Yu Paragraph [0123], At 1204, regions of interest of the initial video stream are determined based on the initial video stream. Determining the regions of interest may include performing object detection against the initial video stream, for example, using a machine learning model at or otherwise available to the video capture device. For example, the object detection may be performed to detect human faces representing conference participants within the physical space).
Regarding Claim 3, Yu in view of Oyman teaches
The method of claim 2, wherein analyzing, by the processor in communication with the camera, the captured image to determine the size of the SOI comprises using a facial recognition algorithm to determine a size of the human face (see Yu Paragraph [0074], In some cases, multiple regions of interest may be determined for a single conference participant. For example, a conference participant may be included within the fields of view of two or more different video capture devices 400. In such a case, those multiple regions of interest may be treated as candidate regions of interest for the conference participant and evaluated to select one for use in an output video stream for rendering within a user interface tile representing the conference participant. The candidate regions of interest may be evaluated using a machine learning model trained for facial recognition such as by scoring detections of a face of the subject conference participant within each of the candidate regions of interest according to one or more factors. Examples of the factors may include, but are not limited to, a size of the face of the conference participant, a percentage of the face of the conference participant which is visible (e.g., due to the conference participant facing one video capture device 400 and not another or due to differences in lighting captured by the video capture devices 400), and the presence of other conference participants within a threshold distance of the face of the conference participant. A candidate region of interest having the highest score may be selected and used for processing and rendering within a user interface tile representing the conference participant).
Regarding Claim 4, Yu in view of Oyman teaches
The method of claim 3, wherein using the facial recognition algorithms to determine the size of the human face comprises using a lookup table correlating head sizes to sizes for determining the second encoding resolution (see Yu Paragraph [0074], In some cases, multiple regions of interest may be determined for a single conference participant. For example, a conference participant may be included within the fields of view of two or more different video capture devices 400. In such a case, those multiple regions of interest may be treated as candidate regions of interest for the conference participant and evaluated to select one for use in an output video stream for rendering within a user interface tile representing the conference participant. The candidate regions of interest may be evaluated using a machine learning model trained for facial recognition such as by scoring detections of a face of the subject conference participant within each of the candidate regions of interest according to one or more factors. Examples of the factors may include, but are not limited to, a size of the face of the conference participant, a percentage of the face of the conference participant which is visible (e.g., due to the conference participant facing one video capture device 400 and not another or due to differences in lighting captured by the video capture devices 400), and the presence of other conference participants within a threshold distance of the face of the conference participant. A candidate region of interest having the highest score may be selected and used for processing and rendering within a user interface tile representing the conference participant and Paragraph [0032], The database server 110 stores, manages, or otherwise provides data for delivering software services of the application server 108 to a client, such as one of the clients 104A through 104D. 
In particular, the database server 110 may implement one or more databases, tables, or other information sources suitable for use with a software application implemented using the application server 108).
Regarding Claim 6, Yu in view of Oyman teaches
The method of claim 1, wherein the communication session is a video conference session (see Yu Paragraph [0021], Implementations of this disclosure address problems such as these by normalizing resolutions for video streams output for display within a software user interface. In particular, according to the implementations of this disclosure, resolutions at which a video capture device located within a physical space, for example, a conference room, concurrently captures multiple video streams are normalized based on regions of interest of an initial video stream captured by the video capture device).
Regarding Claim 7, Yu in view of Oyman teaches
The method of claim 1, wherein selecting, by the processor, the second encoding resolution, different than the first encoding resolution, for the video stream of the SOI based on the determined size comprises selecting the second encoding resolution also based upon a composited scene sent to the one or more client devices (see Yu Figure 5 and 6, Paragraph [0076] FIG. 5 is an illustration of an example of regions of interest of an initial video stream. Three conference participants are shown as being within a physical space, for example, the physical space 402 shown in FIG. 4. In the example shown, the three conference participants are located at different places around a conference room table and are facing a video capture device used to capture the initial video stream (e.g., one of the one or more video capture devices 400 shown in FIG. 4). For example, a front wall of the physical space which the three conference participants are facing may include the video capture device and a display at which a user interface of conferencing software (e.g., the conferencing software 406 shown in FIG. 4) is output. The initial video stream may be processed to determine three regions of interest 500, 502, and 504, in which the region of interest 500 corresponds to a first conference participant located closest to the video capture device near the front wall of the physical space, the region of interest 502 corresponds to a second conference participant located approximately halfway between the video capture device and a rear wall of the physical space, and the region of interest 504 corresponds to a third conference participant located farthest from the video capture device near the rear wall of the physical space, Paragraph [0077], The three conference participants appear as different sizes within the input video stream based on their proximity to the video capture device. 
As such, the first conference participant appears as a largest size, the second conference participant appears as an intermediate size, and the third conference participant appears as a smallest size. Accordingly, a size of the region of interest 500 (e.g., a number of pixels representing it within a given video frame of the initial video stream) is larger than a size of the region of interest 502, and a size of the region of interest 502 is similarly larger than a size of the region of interest 504. Without resolution normalization processing, video streams captured for each of the regions of interest 500 through 504 would cause the three conference participants to appear either as noticeably different sizes or at noticeably different quality levels within user interface tiles of the conferencing software. This difference in size or quality level may make it difficult to see the third conference participant, who would appear as the smallest of the three, and could ultimately cause some disruption or quality concerns with respect to the conference. However, using instructions for capturing the video streams of each of the regions of interest 500 through 504 at normalized resolutions, the three conference participants would appear to be the same or a similar size and quality level within their separate user interface tiles of the conferencing software, Paragraph [0078], FIG. 6 is an illustration of examples of user interface tiles of a software user interface 600 within which video streams concurrently captured for regions of interest are output. For example, the software user interface 600 may be a user interface of conferencing software, such as the conferencing software 406 shown in FIG. 4. The software user interface includes user interface tiles 602 associated with conference participants, in which some are remote conference participants and others are conference participants located within a physical space, such as the physical space 402 shown in FIG. 4. 
In particular, the user interface tiles 602 include a first user interface tile 604 at which a video stream captured for a first conference participant (e.g., the first conference participant associated with the region of interest 500 shown in FIG. 5) is output, a second user interface tile 606 at which a video stream captured for a second conference participant (e.g., the second conference participant associated with the region of interest 502 shown in FIG. 5) is output, and a third user interface tile 608 at which a video stream captured for a third conference participant (e.g., the third conference participant associated with the region of interest 504 shown in FIG. 5) is output, and Paragraph [0079], The user interface tiles 604 through 608 represent conference participants within a physical space. In particular, the video streams output within the user interface tiles 604 through 608 are captured at normalized resolutions determined for the regions of interest represented by the user interface tiles 604 through 608. Referring to the example in which the user interface tiles 604 through 608 respectively correspond to the first, second, and third conference participants referenced above in the discussion of FIG. 5, and despite those three conference participants appearing as noticeably different sizes in the initial video stream of FIG. 5, the video streams captured for the three conference participants according to the normalized resolutions conform their sizes and quality levels within the separate user interface tiles 604 through 608, in which the regions of interest are all displayed at normalized resolutions, and same or similar size and quality level within their separate user interface tiles of the conference software, therefore the composited video defines the required resolution).
Regarding Claim 8, Yu in view of Oyman teaches
The method of claim 1, wherein the scene includes a second SOI, and the method further comprises encoding the second SOI at a third resolution selected based on the determined size of the second SOI from the camera (see Yu Figure 5 and 6, Paragraph [0076] FIG. 5 is an illustration of an example of regions of interest of an initial video stream. Three conference participants are shown as being within a physical space, for example, the physical space 402 shown in FIG. 4. In the example shown, the three conference participants are located at different places around a conference room table and are facing a video capture device used to capture the initial video stream (e.g., one of the one or more video capture devices 400 shown in FIG. 4). For example, a front wall of the physical space which the three conference participants are facing may include the video capture device and a display at which a user interface of conferencing software (e.g., the conferencing software 406 shown in FIG. 4) is output. The initial video stream may be processed to determine three regions of interest 500, 502, and 504, in which the region of interest 500 corresponds to a first conference participant located closest to the video capture device near the front wall of the physical space, the region of interest 502 corresponds to a second conference participant located approximately halfway between the video capture device and a rear wall of the physical space, and the region of interest 504 corresponds to a third conference participant located farthest from the video capture device near the rear wall of the physical space, Paragraph [0077], The three conference participants appear as different sizes within the input video stream based on their proximity to the video capture device. As such, the first conference participant appears as a largest size, the second conference participant appears as an intermediate size, and the third conference participant appears as a smallest size. 
Accordingly, a size of the region of interest 500 (e.g., a number of pixels representing it within a given video frame of the initial video stream) is larger than a size of the region of interest 502, and a size of the region of interest 502 is similarly larger than a size of the region of interest 504. Without resolution normalization processing, video streams captured for each of the regions of interest 500 through 504 would cause the three conference participants to appear either as noticeably different sizes or at noticeably different quality levels within user interface tiles of the conferencing software. This difference in size or quality level may make it difficult to see the third conference participant, who would appear as the smallest of the three, and could ultimately cause some disruption or quality concerns with respect to the conference. However, using instructions for capturing the video streams of each of the regions of interest 500 through 504 at normalized resolutions, the three conference participants would appear to be the same or a similar size and quality level within their separate user interface tiles of the conferencing software, Paragraph [0078], FIG. 6 is an illustration of examples of user interface tiles of a software user interface 600 within which video streams concurrently captured for regions of interest are output. For example, the software user interface 600 may be a user interface of conferencing software, such as the conferencing software 406 shown in FIG. 4. The software user interface includes user interface tiles 602 associated with conference participants, in which some are remote conference participants and others are conference participants located within a physical space, such as the physical space 402 shown in FIG. 4. 
In particular, the user interface tiles 602 include a first user interface tile 604 at which a video stream captured for a first conference participant (e.g., the first conference participant associated with the region of interest 500 shown in FIG. 5) is output, a second user interface tile 606 at which a video stream captured for a second conference participant (e.g., the second conference participant associated with the region of interest 502 shown in FIG. 5) is output, and a third user interface tile 608 at which a video stream captured for a third conference participant (e.g., the third conference participant associated with the region of interest 504 shown in FIG. 5) is output, and Paragraph [0079], The user interface tiles 604 through 608 represent conference participants within a physical space. In particular, the video streams output within the user interface tiles 604 through 608 are captured at normalized resolutions determined for the regions of interest represented by the user interface tiles 604 through 608. Referring to the example in which the user interface tiles 604 through 608 respectively correspond to the first, second, and third conference participants referenced above in the discussion of FIG. 5, and despite those three conference participants appearing as noticeably different sizes in the initial video stream of FIG. 5, the video streams captured for the three conference participants according to the normalized resolutions conform their sizes and quality levels within the separate user interface tiles 604 through 608).
Regarding Claims 10-13, they are rejected under the same rationale as Claims 1-4, respectively. The system can be found in Yu (Figure 1, system).
Regarding Claims 15-17, they are rejected under the same rationale as Claims 6-8, respectively. The system can be found in Yu (Figure 1, system).
Regarding Claims 19-20, they are rejected under the same rationale as Claims 1-2, respectively. The machine-readable storage device can be found in Yu (Paragraph [0129], non-transitory computer readable media).
Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Yu (U.S. Pub. No. 2023/0247073) in view of Oyman (EP Pub. No. 2989790) and Vroom et al. (U.S. Pub. No. 2024/0137425, hereinafter “Vroom”).
Regarding Claim 5, Yu in view of Oyman teaches all the limitations of claim 1, but does not expressly teach
The method of claim 1, further comprising periodically re-analyzing the captured image to determine any change in the size of the SOI from the camera and adjusting the encoding resolution accordingly.
However, Vroom teaches
The method of claim 1, further comprising periodically re-analyzing the captured image to determine any change in the size of the SOI from the camera and adjusting the encoding resolution accordingly (see Vroom Paragraph [0085], The video/audio communication module 132 can update the quality and/or characteristics of the stream of video/audio media data periodically or continuously throughout the active session (e.g., lowering and/or raising the frame rate, resolution, etc. of the video/audio media data stream as the network connection quality fluctuates over the course of the active session)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of
the claimed invention to combine the teaching of capturing a video in a first resolution, selecting a second encoded resolution of a subject of interest within a video in a video conference according to the size of the SOI, and transmitting the video to devices participating in the video conference at the second encoded resolution (as taught in Yu in view of Oyman), with periodically re-analyzing the captured image to determine any change in the size of the SOI from the camera and adjusting the resolution accordingly (as taught in Vroom), the motivation being to dynamically adjust and update the video quality in real-time, to account for poor network connection quality or issues with network latency (see Vroom Paragraph [0085]).
Regarding Claim 14, it is rejected under the same rationale as Claim 5. The system can be found in Yu (Figure 1, system).
Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Yu (U.S. Pub. No. 2023/0247073) in view of Oyman (EP Pub. No. 2989790) and Wang et al. (U.S. Patent No. 10445402, hereinafter “Wang”).
Regarding Claim 9, Yu in view of Oyman teaches all the limitations of claim 1, and does teach using a machine learning model trained for object detection, facial recognition, or other segmentation that can process the video data of the input video stream to identify humans (see Yu Paragraph [0072]), but does not expressly teach
The method of claim 1, further comprising determining the subject of interest in the image based upon a convolutional neural network (CNN) trained to detect a particular type of object.
However, Wang teaches
The method of claim 1, further comprising determining the subject of interest in the image based upon a convolutional neural network (CNN) trained to detect a particular type of object (see Wang Column 1, lines 9 – 11, method and/or apparatus for implementing a fast and energy-efficient region of interest pooling for object detection with a convolutional neural network).
It would have been obvious to one of ordinary skill in the art before the effective filing date of
the claimed invention to combine the teaching of capturing a video in a first resolution, selecting a second encoded resolution of a subject of interest within a video in a video conference according to the size of the SOI, and transmitting the video to devices participating in the video conference at the second encoded resolution (as taught in Yu in view of Oyman), with determining a subject of interest in an image based upon a convolutional neural network (CNN) trained to detect a particular type of object (as taught in Wang), the motivation being to implement a fast, energy efficient, and real-time automation that detects subjects of interest within a video, by using a convolutional neural network, to provide a high performance image processing and computer vision pipeline in minimal area and with minimal power consumption (see Wang Column 2, lines 34 - 52).
Regarding Claim 18, it is rejected under the same rationale as Claim 9. The system can be found in Yu (Figure 1, system).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Refer to the PTO-892, Notice of References Cited, for a listing of analogous art.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CARISSA A JONES, whose telephone number is (703) 756-1677. The examiner can normally be reached via telework, M-F, 6:30 AM - 4:00 PM CT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Duc Nguyen, can be reached at 571-272-7503. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CARISSA A JONES/Examiner, Art Unit 2691
/DUC NGUYEN/Supervisory Patent Examiner, Art Unit 2691