DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on February 5, 2026, has been entered.
Response to Arguments
Applicant's arguments filed February 5, 2026, have been fully considered, but they are not persuasive.
With regard to claim 1, Applicant submits that the application of Govil to the combination relies on hindsight. Remarks, p. 12.
In response to applicant's argument that the examiner's conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning. But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from the applicant's disclosure, such a reconstruction is proper. See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971).
Applicant additionally submits that the cited prior art does not teach “excluding background visual information outside the identified pixel-level, arbitrarily shaped boundaries,” as recited in claim 1. Remarks, pp. 12-13.
Claim 1 is rejected under 35 U.S.C. 103 over a combination of Waggoner et al. (US 2015/0268822), Ramaswamy et al. (US 2018/0270515), Gondo et al. (US 2016/0088294), and Govil (US 2021/0120286).
Waggoner teaches wherein the stream is encoded to represent an enhanced visual quality for the object of interest and excluding background visual information ([0045], “Instead of streaming video switching between independent bit rates, a number of layers can be established. For example, there might be a 300 kb stream at a low resolution, and a 600 kb version functions as an enhancement, rather than a replacement, to the lower resolution stream. Each previous layer can similarly be increased as bitrates go higher. Such an approach enables a client device to only request as many layers as are appropriate for the device and/or settings. For a mobile device at a typical magnification level, such as zoomed all the way out, the minimum bit stream alone might be acceptable. If the user adjusts the magnification level, such that higher resolution (e.g., 4K resolution) is appropriate, one or more additive streams can be obtained to achieve the increased detail. If used with a tile approach, the additive streams can be requested for only those tiles that are currently being displayed at the magnified view.”).
Gondo teaches providing additional visual information spanning only a region of an identified area of interest ([0075], “Next, FIG. 12 is a diagram illustrating a distribution mode of the video stream in the streaming system according to the fifth embodiment. In the streaming system according to the fifth embodiment, the user receives a reduced entire image in the client device 2. The decoder 27 of the client device 2 decodes the received entire image, and displays the decoded image on the monitor device. Next, in the client device 2, when the user designates a desired area in the entire image, the decoder 27 makes a request for transmission of an image of a designated area designated by the user, to the server device 1. The server device 1 encodes a high-resolution zoomed image of the designated area corresponding to the request for transmission of the image, and transmit the encoded image to the client device 2. The first to n-th trimming areas illustrated in FIG. 11 each represent a high-resolution zoomed image (GOP of trimming area) of each designated area designated by the user. The decoder 27 receives and decodes the transmitted high-resolution zoomed image, and displays the decoded image on the monitor device. Therefore, the partial high-quality image corresponding to the desired part of the entire image can be seen without blur.”).
Govil teaches identifying pixel-level, arbitrarily shaped boundaries for an object of interest, and excluding background visual information outside the identified pixel-level, arbitrarily shaped boundaries ([0043], “Referring first to FIG. 4 with continued reference to FIGS. 1-2, upon analysis of the visual content of the video frame 400, the object detection module 120 may detect or otherwise identify a number of regions 402, 404, 406 within the video frame 400 that correspond to replaceable objects (e.g., task 202), for example, by detecting or otherwise differentiating the boundaries of the corresponding regions of pixels from the underlying background content of the video frame 400 using machine learning, artificial intelligence, object recognition, or other pattern recognition or image analysis techniques. For each detected object region 402, 404, 406, the object recognition module 122 analyzes the respective set of pixels that comprise the respective region 402, 404, 406 to determine the type or other taxonomic classification of the respective object and discern additional physical and/or visual attributes of the respective object (e.g., task 204).”).
Taking the teachings of the references together, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings to produce a combination including means for identifying pixel-level, arbitrarily shaped boundaries for the object of interest, wherein the additional visual information spans only the identified pixel-level, arbitrarily shaped boundaries of the object of interest, and wherein the object-specific coded stream is encoded to exclude background visual information outside the identified pixel-level, arbitrarily shaped boundaries. The modification would serve to provide an alternative and/or supplemental means of identifying objects in content.
In response to applicant's argument that the references fail to show certain features of the invention, it is noted that the features upon which applicant relies (i.e., “dynamically defining the shape of the coded stream to ensure background pixels remain excluded as the object moves and changes shape,” Remarks, p. 13) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 4-5, 7, 9-10, 13-14, and 23-25 are rejected under 35 U.S.C. 103 as being unpatentable over a combination of Waggoner et al. (US 2015/0268822), Ramaswamy et al. (US 2018/0270515), Gondo et al. (US 2016/0088294), and Govil (US 2021/0120286).
Regarding claim 1, Waggoner teaches a method comprising:
providing, via an application, a first video stream on a display panel of a user device ([0020], “As an example, FIG. 2(a) illustrates an example situation 200 wherein a user is able to view a presentation of video content 206 on a touch-sensitive display 204 of a computing device 202.” [0060], “User or client devices can include any of a number of general purpose personal computers, such as desktop or laptop computers running a standard operating system, as well as cellular, wireless and handheld devices running mobile software and capable of supporting a number of networking and messaging protocols.” Fig. 2a);
receiving, from a user, a selection of an object of interest associated with a portion of the first video stream ([0016], “In particular, various embodiments enable a user to specify/select one or more objects of interest to be tracked in video content displayed on a computing device or other presentation device. In some embodiments, a user can select an object by specifying, using an input element (e.g., two or more fingers), a boundary around the object, and then specify a magnification level by adjusting a separation of at least two of those fingers.” [0039], “FIG. 6 illustrates an example process 600 for determining a portion of video content to display based upon a selected object of interest that can be utilized in accordance with various embodiments. … In this example, user input is received 602 that indicates one or more points in a video frame.”);
in response to receiving the selection, continuously tracking movements of the object of interest across video frames and identifying boundaries for the object of interest ([0016], “In particular, various embodiments enable a user to specify/select one or more objects of interest to be tracked in video content displayed on a computing device or other presentation device. … A location of a representation of that object (e.g., the object of interest) within the video can be determined whenever the representation is determined to be present in a frame of video to be displayed. … Algorithms can be used to track the representation of the object between different frames, and track the representation of the object even if it undergoes various deformations of appearance (e.g., turns to the side).” [0039], “In this example, the tracking data for the object exists, such that the tracking data can be received 612 to the client device. The portion of the subsequent video frames that include a representation of the object of interest can then be displayed 614 with the object of interest approximately centered in the portion with the appropriate presentation size, to the extent possible and/or practical as discussed elsewhere herein.” [0047], “FIG. 10 illustrates an example situation 1000 wherein tracking data exists for multiple objects represented in the video content, or at least where objects have been identified that the user might be interested in viewing at a higher magnification level, etc. In this example, there are four bounding boxes 1002 indicating objects that have been selected by a user or provider, or that have been identified using an algorithm or process, as being potentially of interest to be tracked for the user.” Fig. 10),
wherein the object of interest is separated from a background and other objects ([0047], “FIG. 10 illustrates an example situation 1000 wherein tracking data exists for multiple objects represented in the video content, or at least where objects have been identified that the user might be interested in viewing at a higher magnification level, etc. In this example, there are four bounding boxes 1002 indicating objects that have been selected by a user or provider, or that have been identified using an algorithm or process, as being potentially of interest to be tracked for the user.” Fig. 10, each of the bounding boxes 1002 indicating objects are separate from the background and other objects);
providing, from a video server to the user device, a stream comprising additional visual information ([0037], “In some cases, the media service provider can provide the content directly, such as from a video data store 514 of the provider environment 506.”),
wherein the additional visual information comprises one of
an enhancement layer of a scalable video coding (SVC) scheme ([0016], “in some embodiments, the location of the representation of the object, when the object is included in a frame, may be approximately centered (in the displayed portion) and displayed with a presentation size that corresponds with the magnification level specified by the user. Such an approach provides what is referred to herein as a ‘smart zoom,’ as frames or segments of the video that include the object of interest can be ‘zoomed in,’ enabling a greater level of detail to be seen, particularly on devices with relatively small and/or low resolution display screens.” [0045], “Instead of streaming video switching between independent bit rates, a number of layers can be established. For example, there might be a 300 kb stream at a low resolution, and a 600 kb version functions as an enhancement, rather than a replacement, to the lower resolution stream. Each previous layer can similarly be increased as bitrates go higher. Such an approach enables a client device to only request as many layers as are appropriate for the device and/or settings. For a mobile device at a typical magnification level, such as zoomed all the way out, the minimum bit stream alone might be acceptable. If the user adjusts the magnification level, such that higher resolution (e.g., 4K resolution) is appropriate, one or more additive streams can be obtained to achieve the increased detail.”),
a higher adaptive bitrate streaming (ABR) variant, and
an arbitrarily shaped object-based coded stream representing the object of interest,
wherein the stream is encoded to represent an enhanced visual quality for the object of interest and excluding background visual information ([0045], “Instead of streaming video switching between independent bit rates, a number of layers can be established. For example, there might be a 300 kb stream at a low resolution, and a 600 kb version functions as an enhancement, rather than a replacement, to the lower resolution stream. Each previous layer can similarly be increased as bitrates go higher. Such an approach enables a client device to only request as many layers as are appropriate for the device and/or settings. For a mobile device at a typical magnification level, such as zoomed all the way out, the minimum bit stream alone might be acceptable. If the user adjusts the magnification level, such that higher resolution (e.g., 4K resolution) is appropriate, one or more additive streams can be obtained to achieve the increased detail. If used with a tile approach, the additive streams can be requested for only those tiles that are currently being displayed at the magnified view.”); and
rendering a region of interest on the display panel using the additional visual information to display the object of interest at an enhanced visual quality and resolution ([0045], “Instead of streaming video switching between independent bit rates, a number of layers can be established. For example, there might be a 300 kb stream at a low resolution, and a 600 kb version functions as an enhancement, rather than a replacement, to the lower resolution stream. Each previous layer can similarly be increased as bitrates go higher. Such an approach enables a client device to only request as many layers as are appropriate for the device and/or settings. For a mobile device at a typical magnification level, such as zoomed all the way out, the minimum bit stream alone might be acceptable. If the user adjusts the magnification level, such that higher resolution (e.g., 4K resolution) is appropriate, one or more additive streams can be obtained to achieve the increased detail.”).
Waggoner does not expressly teach identifying pixel-level, arbitrarily shaped boundaries for the object of interest. While Waggoner teaches providing, from the video server to the user device, a stream comprising additional visual information, Waggoner does not expressly teach providing, from a video server to the user device, an object-specific coded stream comprising additional visual information spanning only the identified pixel-level, arbitrarily shaped boundaries of the object of interest. Waggoner also does not expressly teach wherein the object-specific coded stream is encoded to represent an enhanced visual quality for the object of interest and excluding background visual information outside the identified pixel-level, arbitrarily shaped boundaries.
Ramaswamy teaches:
providing, from a video server to the user device, an object-specific coded stream, wherein the object-specific coded stream is encoded to represent an enhanced visual quality for an object of interest ([0063], “The zoom coding encoder 208 receives the source video stream either in uncompressed or a previously compressed format, encodes or transcodes the source video stream into a plurality of zoom coded streams 210, wherein each of the zoom coded streams represents a portion (e.g. a slice, a segment, or a quadrant) of the overall source video. The zoom coded streams may be encoded at a higher resolution than traditional reduced resolution ABR streams. In some embodiments, the zoom coded streams are encoded at the full capture resolution.” [0064], “The zoom coded streams are transmitted to or placed onto the streaming server for further transmission to the client devices.” [0071]).
In view of Ramaswamy’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination to include providing, from a video server to the user device, an object-specific coded stream, wherein the object-specific coded stream is encoded to represent an enhanced visual quality for an object of interest. The modification would serve to facilitate identification and tracking of objects in content.
The combination teaches the limitations specified above; however, the combination does not expressly teach identifying pixel-level, arbitrarily shaped boundaries for the object of interest. The combination also does not expressly teach the additional visual information spanning only the identified pixel-level, arbitrarily shaped boundaries of the object of interest. The combination also does not expressly teach wherein the object-specific coded stream is encoded to exclude background visual information outside the identified pixel-level, arbitrarily shaped boundaries.
Gondo teaches providing additional visual information spanning only a region of an identified area of interest ([0075], “Next, FIG. 12 is a diagram illustrating a distribution mode of the video stream in the streaming system according to the fifth embodiment. In the streaming system according to the fifth embodiment, the user receives a reduced entire image in the client device 2. The decoder 27 of the client device 2 decodes the received entire image, and displays the decoded image on the monitor device. Next, in the client device 2, when the user designates a desired area in the entire image, the decoder 27 makes a request for transmission of an image of a designated area designated by the user, to the server device 1. The server device 1 encodes a high-resolution zoomed image of the designated area corresponding to the request for transmission of the image, and transmit the encoded image to the client device 2. The first to n-th trimming areas illustrated in FIG. 11 each represent a high-resolution zoomed image (GOP of trimming area) of each designated area designated by the user. The decoder 27 receives and decodes the transmitted high-resolution zoomed image, and displays the decoded image on the monitor device. Therefore, the partial high-quality image corresponding to the desired part of the entire image can be seen without blur.”).
In view of Gondo’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination to include providing additional visual information spanning only the bounding region of the identified object of interest. The modification would serve to allow users to zoom in on objects of interest while enabling efficient use of available bandwidth.
The combination teaches the limitations specified above; however, the combination does not expressly teach identifying pixel-level, arbitrarily shaped boundaries for the object of interest, and that the additional visual information spans only the identified pixel-level, arbitrarily shaped boundaries of the object of interest. The combination also does not expressly teach wherein the object-specific coded stream is encoded to exclude background visual information outside the identified pixel-level, arbitrarily shaped boundaries.
Govil teaches identifying pixel-level, arbitrarily shaped boundaries for an object of interest, and excluding background visual information outside the identified pixel-level, arbitrarily shaped boundaries ([0043], “Referring first to FIG. 4 with continued reference to FIGS. 1-2, upon analysis of the visual content of the video frame 400, the object detection module 120 may detect or otherwise identify a number of regions 402, 404, 406 within the video frame 400 that correspond to replaceable objects (e.g., task 202), for example, by detecting or otherwise differentiating the boundaries of the corresponding regions of pixels from the underlying background content of the video frame 400 using machine learning, artificial intelligence, object recognition, or other pattern recognition or image analysis techniques. For each detected object region 402, 404, 406, the object recognition module 122 analyzes the respective set of pixels that comprise the respective region 402, 404, 406 to determine the type or other taxonomic classification of the respective object and discern additional physical and/or visual attributes of the respective object (e.g., task 204).”).
In view of Govil’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination to include identifying pixel-level, arbitrarily shaped boundaries for the object of interest, wherein the additional visual information spans only the identified pixel-level, arbitrarily shaped boundaries of the object of interest, and wherein the object-specific coded stream is encoded to exclude background visual information outside the identified pixel-level, arbitrarily shaped boundaries. The modification would serve to provide an alternative and/or supplemental means of identifying objects in content.
Regarding claim 2, the combination further teaches wherein the first video stream comprises one of:
a file on a file system in the user device;
an Internet video that is being delivered over an internet protocol in a managed network or in an over-the-top (OTT) (Waggoner: [0036], “FIG. 5 illustrates an example environment 500 in which aspects of various embodiments can be implemented. In this example, users are able to utilize various types of electronic devices 502 to request delivery of content over at least one network 504, such as the Internet, a cellular network, a local area network, and the like. As known for such purposes, a user can utilize a client device to request video content, and in response the video content can be downloaded, streamed, or otherwise transferred to the device”); and
a video within a video collaboration tool, whereby the user zooms into a specific object/region of interest to examine details of the video, infographics, text, or other visual media being communicated via the video collaboration tool.
Regarding claim 4, the combination further teaches wherein tracking the movements of the object of interest across the video frames comprises tracking the object of interest as the object of interest moves or changes across the video frames and rendering the tracked object in a zoomed-in view (Waggoner: [0016], “In particular, various embodiments enable a user to specify/select one or more objects of interest to be tracked in video content displayed on a computing device or other presentation device. In some embodiments, a user can select an object by specifying, using an input element (e.g., two or more fingers), a boundary around the object, and then specify a magnification level by adjusting a separation of at least two of those fingers. A location of a representation of that object (e.g., the object of interest) within the video can be determined whenever the representation is determined to be present in a frame of video to be displayed. Likewise, in some embodiments, the location of the representation of the object, when the object is included in a frame, may be approximately centered (in the displayed portion) and displayed with a presentation size that corresponds with the magnification level specified by the user. Such an approach provides what is referred to herein as a ‘smart zoom,’ as frames or segments of the video that include the object of interest can be ‘zoomed in,’ enabling a greater level of detail to be seen, particularly on devices with relatively small and/or low resolution display screens. Algorithms can be used to track the representation of the object between different frames, and track the representation of the object even if it undergoes various deformations of appearance (e.g., turns to the side). In some embodiments, different magnification levels can be set for different objects, or types of objects. For scenes without representations of those objects, the magnification level can be set to a default level, such as a level defined by a source of the content, an original content level, a fully zoomed out level, or full screen view.” [0035]).
Regarding claim 5, the combination further teaches wherein rendering the region of interest on the display panel comprises:
in response to receiving the selection of the object of interest, sending a request for the additional visual information to the video server; based on the request, receiving a second video stream directed to the same video content as the first video stream, wherein the second video stream comprises the additional visual information spanning only the pixel-level, arbitrarily shaped boundaries of the object of interest, wherein the additional visual information comprises a higher bitrate/resolution variant including enhanced visual details to enable the user to zoom into the object of interest on the display panel; and rendering the region of interest associated with a portion of the second video stream on the display panel (Waggoner: [0045], “Instead of streaming video switching between independent bit rates, a number of layers can be established. For example, there might be a 300 kb stream at a low resolution, and a 600 kb version functions as an enhancement, rather than a replacement, to the lower resolution stream. Each previous layer can similarly be increased as bitrates go higher. Such an approach enables a client device to only request as many layers as are appropriate for the device and/or settings. For a mobile device at a typical magnification level, such as zoomed all the way out, the minimum bit stream alone might be acceptable. If the user adjusts the magnification level, such that higher resolution (e.g., 4K resolution) is appropriate, one or more additive streams can be obtained to achieve the increased detail.” Gondo: [0075]; Govil: [0043]).
Regarding claim 7, the combination further teaches wherein rendering the region of interest on the display panel comprises:
receiving, from the video server, the first video stream along with the additional visual information associated with objects in the first video stream (Waggoner: [0036], “FIG. 5 illustrates an example environment 500 in which aspects of various embodiments can be implemented. In this example, users are able to utilize various types of electronic devices 502 to request delivery of content over at least one network 504, such as the Internet, a cellular network, a local area network, and the like. As known for such purposes, a user can utilize a client device to request video content, and in response the video content can be downloaded, streamed, or otherwise transferred to the device.” [0038], “In this example, the tracking data can be stored in a location such as a metadata repository 516, which can be transferred with the video content in order to allow the selected portion(s) of the video to be displayed on the appropriate user device.”),
the additional visual information comprising metadata including a position information of the objects across the video frames of the first video stream that can be used to track the movements of the object of interest (Waggoner: [0039], “The tracking data in some embodiments includes the position of the object, or the appropriate position of the center of the appropriate portion to be displayed, such that the portion can be displayed at the appropriate magnification level.”); and
in response to receiving the selection of the object of interest, rendering the region of interest including the object of interest using the metadata associated with the object of interest (Waggoner: [0039], “In this example, the tracking data for the object exists, such that the tracking data can be received 612 to the client device. The portion of the subsequent video frames that include a representation of the object of interest can then be displayed 614 with the object of interest approximately centered in the portion with the appropriate presentation size, to the extent possible and/or practical as discussed elsewhere herein. The tracking data in some embodiments includes the position of the object, or the appropriate position of the center of the appropriate portion to be displayed, such that the portion can be displayed at the appropriate magnification level.”).
Regarding claim 9, the combination further teaches wherein the object of interest is signaled to the user device in the form of the metadata conveying the identified pixel-level, arbitrarily shaped boundaries of the object of interest (Waggoner: [0047], “FIG. 10 illustrates an example situation 1000 wherein tracking data exists for multiple objects represented in the video content, or at least where objects have been identified that the user might be interested in viewing at a higher magnification level, etc. In this example, there are four bounding boxes 1002 indicating objects that have been selected by a user or provider, or that have been identified using an algorithm or process, as being potentially of interest to be tracked for the user.” Fig. 10; Govil: [0043]).
Regarding claim 10, the combination further teaches wherein the object of interest is signaled to the user device in the form of the metadata conveying boundaries of the object of interest in the form of an object mask (Waggoner: [0047], “FIG. 10 illustrates an example situation 1000 wherein tracking data exists for multiple objects represented in the video content, or at least where objects have been identified that the user might be interested in viewing at a higher magnification level, etc. In this example, there are four bounding boxes 1002 indicating objects that have been selected by a user or provider, or that have been identified using an algorithm or process, as being potentially of interest to be tracked for the user.” Fig. 10).
Regarding claim 13, the combination further teaches wherein rendering the region of interest on the display panel comprises:
providing the first video stream including a base layer of the scalable video coding scheme on the display panel; and in response to receiving the selection of the object of interest, providing the additional visual information including the at least one enhancement layer of the scalable video coding scheme spanning only the pixel-level, arbitrarily shaped boundaries for the object of interest on the display panel, wherein the at least one enhancement layer provides details associated with the object of interest (Waggoner: [0045], “Instead of streaming video switching between independent bit rates, a number of layers can be established. For example, there might be a 300 kb stream at a low resolution, and a 600 kb version functions as an enhancement, rather than a replacement, to the lower resolution stream. Each previous layer can similarly be increased as bitrates go higher. Such an approach enables a client device to only request as many layers as are appropriate for the device and/or settings. For a mobile device at a typical magnification level, such as zoomed all the way out, the minimum bit stream alone might be acceptable. If the user adjusts the magnification level, such that higher resolution (e.g., 4K resolution) is appropriate, one or more additive streams can be obtained to achieve the increased detail.” Gondo: [0075]; Govil: [0043]).
Regarding claim 14, the combination further teaches wherein the base layer comprises a first visual quality or resolution for the video frames of the first video stream, and wherein the enhancement layer comprises an enhanced visual quality or resolution for the pixel-level, arbitrarily shaped boundaries of the object of interest (Waggoner: [0045], “Instead of streaming video switching between independent bit rates, a number of layers can be established. For example, there might be a 300 kb stream at a low resolution, and a 600 kb version functions as an enhancement, rather than a replacement, to the lower resolution stream. Each previous layer can similarly be increased as bitrates go higher. Such an approach enables a client device to only request as many layers as are appropriate for the device and/or settings. For a mobile device at a typical magnification level, such as zoomed all the way out, the minimum bit stream alone might be acceptable. If the user adjusts the magnification level, such that higher resolution (e.g., 4K resolution) is appropriate, one or more additive streams can be obtained to achieve the increased detail.” Govil: [0043]).
Regarding claim 23, the combination further teaches wherein the additional visual information is generated at the user device or at a serving entity that serves the first video stream (Waggoner: [0040], “In some embodiments video content can be analyzed to identify commonly selected objects such as people and faces. … In some embodiments, a user can subscribe to a service that provides such tracking data. A user might select an option to obtain tracking data for any video content that includes the user's favorite actor, for example, and can have this data provided automatically any time the user downloads, streams, or otherwise obtains video content including that actor.”).
Regarding claim 24, the combination teaches a video processing system comprising: a display panel; a processor; and memory coupled to the processor, wherein the memory comprises an object-based video processing module (Waggoner: [0058], [0064], Figs. 12a-12b). The rejection of claim 1 under 35 USC §103 is similarly applied to the remaining limitations of claim 24.
Regarding claim 25, the combination teaches a non-transitory computer-readable storage medium having instructions executable by a processor of a video processing system (Waggoner: [0058], [0064], Figs. 12a-12b). The rejection of claim 1 under 35 USC §103 is similarly applied to the remaining limitations of claim 25.
Claim(s) 3 is/are rejected under 35 U.S.C. 103 as being unpatentable over a combination of Waggoner, Ramaswamy, Gondo, Govil, and Boskovich (US 2019/0238952).
Regarding claim 3, Waggoner further teaches wherein the object of interest is obtained based on partitions contained in the first video stream ([0044], “In this example, the video content 802 is comprised of a set of video tiles 804. These tiles each represent a portion of the video content, where the tiles are organized spatially.” [0045], “If the user adjusts the magnification level, such that higher resolution (e.g., 4K resolution) is appropriate, one or more additive streams can be obtained to achieve the increased detail. If used with a tile approach, the additive streams can be requested for only those tiles that are currently being displayed at the magnified view.” Fig. 8). However, Waggoner does not expressly teach wherein the first video stream is a compressed video stream containing I-frame data, P-frame data, B-frame data, or any combination thereof.
Boskovich teaches wherein a first video stream is a compressed video stream containing I-frame data, P-frame data, B-frame data, or any combination thereof ([0126], “video decoding generally comprises three frame types, Intra frames (I-frames), Predictive frames (P-frames), and Bi-directional frames (B-frames). H.264 allows other types of coding such as Switching I (SI) and Switching P (SP) in the Extended Profile (EP).” [0128], “Open GOPs generally provide better compression than do closed GOPs of the same structure and size, due in part to the fact that a closed GOP contains one more P-frame than does an open GOP of the same length. Since P-frames generally require more bits than do B-frames, the open GOP achieves better compression.”).
In view of Boskovich’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Waggoner wherein the first video stream is a compressed video stream containing I-frame data, P-frame data, B-frame data, or any combination thereof in order to allow for more efficient distribution of content to users.
Claim(s) 6, 8, and 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over a combination of Waggoner, Ramaswamy, Gondo, Govil, and Guntur et al. (US 2014/0199043).
Regarding claim 6, the combination further teaches wherein rendering the region of interest on the display panel comprises:
in response to receiving the selection of the object of interest:
detecting a plurality of objects in the first video stream (Waggoner: [0046], “In some embodiments, multiple objects can be selected with different presentation sizes, and with each of these being represented in a different region of the display.” [0047], “FIG. 10 illustrates an example situation 1000 wherein tracking data exists for multiple objects represented in the video content, or at least where objects have been identified that the user might be interested in viewing at a higher magnification level, etc.” Fig. 10);
determining an object of the plurality of objects that corresponds to the selected object of interest (Waggoner: [0046], “In some embodiments, multiple objects can be selected with different presentation sizes, and with each of these being represented in a different region of the display.” [0047], “FIG. 10 illustrates an example situation 1000 wherein tracking data exists for multiple objects represented in the video content, or at least where objects have been identified that the user might be interested in viewing at a higher magnification level, etc.” Fig. 10);
generating the additional visual information spanning only the identified pixel-level, arbitrarily shaped boundaries of the object of interest (Gondo: [0075]; Govil: [0043]), wherein the additional visual information comprises
the enhancement layer of the scalable video coding (SVC) scheme (Waggoner: [0045], “Instead of streaming video switching between independent bit rates, a number of layers can be established. For example, there might be a 300 kb stream at a low resolution, and a 600 kb version functions as an enhancement, rather than a replacement, to the lower resolution stream. Each previous layer can similarly be increased as bitrates go higher. Such an approach enables a client device to only request as many layers as are appropriate for the device and/or settings. For a mobile device at a typical magnification level, such as zoomed all the way out, the minimum bit stream alone might be acceptable. If the user adjusts the magnification level, such that higher resolution (e.g., 4K resolution) is appropriate, one or more additive streams can be obtained to achieve the increased detail.”); and
zooming the first video stream to display the object of interest on the display panel based on the additional visual information (Waggoner: [0027], “An object of interest may be any object or region within a video that a user desires to track. For example, the object of interest may be an object that moves in the video with respect to other objects, such as representations of humans, dogs, cars, boats, planes, etc.” [0045], “Instead of streaming video switching between independent bit rates, a number of layers can be established. For example, there might be a 300 kb stream at a low resolution, and a 600 kb version functions as an enhancement, rather than a replacement, to the lower resolution stream. Each previous layer can similarly be increased as bitrates go higher. Such an approach enables a client device to only request as many layers as are appropriate for the device and/or settings. For a mobile device at a typical magnification level, such as zoomed all the way out, the minimum bit stream alone might be acceptable. If the user adjusts the magnification level, such that higher resolution (e.g., 4K resolution) is appropriate, one or more additive streams can be obtained to achieve the increased detail.” Gondo: [0075]).
However, Waggoner does not expressly teach panning the first video stream to display the object on the display panel based on the additional visual information.
Guntur teaches panning a video stream to display an object on a display panel ([0025], “In order to help users interact and have a better experience with large format (e.g., High Definition) videos on small screen devices, a video player that performs virtual camera functions in a pre-recorded video is disclosed. The virtual camera automatically zooms-in/zooms-out/pans of a region of interest within the high resolution video, there by retargeting a viewport to screen dimensions. Objects of interest appear magnified and in focus.” [0068]).
In view of Guntur’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Waggoner to include panning the first video stream to display the object on the display panel based on the additional visual information. The modification would further facilitate viewing on small screen devices, thereby improving the user experience.
Regarding claim 8, the combination further teaches further comprising:
determining the identified pixel-level, arbitrarily shaped boundaries for the object of interest or an object mask of the object of interest using the received metadata (Waggoner: [0047], “FIG. 10 illustrates an example situation 1000 wherein tracking data exists for multiple objects represented in the video content, or at least where objects have been identified that the user might be interested in viewing at a higher magnification level, etc. In this example, there are four bounding boxes 1002 indicating objects that have been selected by a user or provider, or that have been identified using an algorithm or process, as being potentially of interest to be tracked for the user.” Fig. 10; Govil: [0043]); and
zooming the first video stream according to determined boundaries to display the region of interest including the object of interest on the display panel (Waggoner: [0016], “In particular, various embodiments enable a user to specify/select one or more objects of interest to be tracked in video content displayed on a computing device or other presentation device. In some embodiments, a user can select an object by specifying, using an input element (e.g., two or more fingers), a boundary around the object, and then specify a magnification level by adjusting a separation of at least two of those fingers. A location of a representation of that object (e.g., the object of interest) within the video can be determined whenever the representation is determined to be present in a frame of video to be displayed. Likewise, in some embodiments, the location of the representation of the object, when the object is included in a frame, may be approximately centered (in the displayed portion) and displayed with a presentation size that corresponds with the magnification level specified by the user. Such an approach provides what is referred to herein as a ‘smart zoom,’ as frames or segments of the video that include the object of interest can be ‘zoomed in,’ enabling a greater level of detail to be seen, particularly on devices with relatively small and/or low resolution display screens. Algorithms can be used to track the representation of the object between different frames, and track the representation of the object even if it undergoes various deformations of appearance (e.g., turns to the side). In some embodiments, different magnification levels can be set for different objects, or types of objects. For scenes without representations of those objects, the magnification level can be set to a default level, such as a level defined by a source of the content, an original content level, a fully zoomed out level, or full screen view.” [0047]).
However, the combination does not expressly teach panning the first video stream according to the geometrical boundary or the object mask to display the region of interest including the object on the display panel.
Guntur teaches panning a video stream to display a region of interest including an object on the display panel ([0025], “In order to help users interact and have a better experience with large format (e.g., High Definition) videos on small screen devices, a video player that performs virtual camera functions in a pre-recorded video is disclosed. The virtual camera automatically zooms-in/zooms-out/pans of a region of interest within the high resolution video, there by retargeting a viewport to screen dimensions. Objects of interest appear magnified and in focus.” [0068]).
In view of Guntur’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Waggoner to include panning the first video stream according to the geometrical boundary or the object mask to display the region of interest including the object on the display panel. The modification would further facilitate viewing on small screen devices, thereby improving the user experience.
Regarding claim 11, the combination further teaches wherein rendering the region of interest on the display panel comprises:
receiving, from the video server, the first video stream along with the additional visual information (Waggoner: [0036], “FIG. 5 illustrates an example environment 500 in which aspects of various embodiments can be implemented. In this example, users are able to utilize various types of electronic devices 502 to request delivery of content over at least one network 504, such as the Internet, a cellular network, a local area network, and the like. As known for such purposes, a user can utilize a client device to request video content, and in response the video content can be downloaded, streamed, or otherwise transferred to the device.” [0038], “In this example, the tracking data can be stored in a location such as a metadata repository 516, which can be transferred with the video content in order to allow the selected portion(s) of the video to be displayed on the appropriate user device.”),
wherein the additional visual information comprises object-based coded streams representing objects in the first data stream (Waggoner: [0038], “the metadata can also be used to indicate to a user which objects have magnification information available for selection by a user.”); and
in response to receiving the selection of the object of interest, zooming the first video stream to display a boundary of the object of interest on the display panel (Waggoner: [0016], “In particular, various embodiments enable a user to specify/select one or more objects of interest to be tracked in video content displayed on a computing device or other presentation device. In some embodiments, a user can select an object by specifying, using an input element (e.g., two or more fingers), a boundary around the object, and then specify a magnification level by adjusting a separation of at least two of those fingers. A location of a representation of that object (e.g., the object of interest) within the video can be determined whenever the representation is determined to be present in a frame of video to be displayed. Likewise, in some embodiments, the location of the representation of the object, when the object is included in a frame, may be approximately centered (in the displayed portion) and displayed with a presentation size that corresponds with the magnification level specified by the user. Such an approach provides what is referred to herein as a ‘smart zoom,’ as frames or segments of the video that include the object of interest can be ‘zoomed in,’ enabling a greater level of detail to be seen, particularly on devices with relatively small and/or low resolution display screens. Algorithms can be used to track the representation of the object between different frames, and track the representation of the object even if it undergoes various deformations of appearance (e.g., turns to the side). In some embodiments, different magnification levels can be set for different objects, or types of objects. For scenes without representations of those objects, the magnification level can be set to a default level, such as a level defined by a source of the content, an original content level, a fully zoomed out level, or full screen view.” [0047]),
the boundary including an object-based coded stream representing a zoomed portion of the object of interest (Waggoner: [0047], “FIG. 10 illustrates an example situation 1000 wherein tracking data exists for multiple objects represented in the video content, or at least where objects have been identified that the user might be interested in viewing at a higher magnification level, etc. In this example, there are four bounding boxes 1002 indicating objects that have been selected by a user or provider, or that have been identified using an algorithm or process, as being potentially of interest to be tracked for the user.” Fig. 10).
However, the combination does not expressly teach, in response to receiving the selection of the object of interest, panning the first video stream to display a boundary of the object on the display panel.
Guntur teaches, in response to receiving a selection of an object of interest, panning a video stream to display an object on a display panel ([0025], “In order to help users interact and have a better experience with large format (e.g., High Definition) videos on small screen devices, a video player that performs virtual camera functions in a pre-recorded video is disclosed. The virtual camera automatically zooms-in/zooms-out/pans of a region of interest within the high resolution video, there by retargeting a viewport to screen dimensions. Objects of interest appear magnified and in focus.” [0068]).
In view of Guntur’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Waggoner to include, in response to receiving the selection of the object of interest, panning the first video stream to display a boundary of the object on the display panel. The modification would further facilitate viewing on small screen devices, thereby improving the user experience.
Claim(s) 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over a combination of Waggoner, Ramaswamy, Gondo, Govil, and Park et al. (US 2011/0012994).
Regarding claim 15, the combination teaches the limitations specified above; however, the combination does not expressly teach:
wherein the base layer comprises a 2D base view forming “one eye view” of a stereo-3D video, and
wherein the enhancement layer comprises incremental information pertaining to a depth of the object of interest,
wherein depth information associated with the depth of the object of interest is conveyed via an “other eye view” or a depth map.
Park teaches wherein a base layer comprises a 2D base view forming “one eye view” of a stereo-3D video, and wherein an enhancement layer comprises incremental information pertaining to a depth, wherein depth information associated with the depth is conveyed via an “other eye view” or a depth map ([0036], “A two-dimensional (2D) picture of one view may be reconstructed by taking a base layer's bitstream from the bitstream and decoding the base layer's bitstream, and an enhancement layer picture having a different view in, for example, a 3D picture may be reconstructed by decoding the base layer's bitstream and then combining a prediction picture generated by performing view conversion according to an exemplary embodiment with a residual picture generated by decoding an enhancement layer's bitstream.”).
In view of Park’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination wherein the base layer comprises a 2D base view forming "one eye view" of a stereo-3D video, and wherein the enhancement layer comprises incremental information pertaining to a depth of the object of interest, wherein depth information associated with the depth of the object of interest is conveyed via an "other eye view" or a depth map. The modification would enable users to access and view 3D content, thereby improving the user experience.
Claim(s) 16-18 is/are rejected under 35 U.S.C. 103 as being unpatentable over a combination of Waggoner, Ramaswamy, Gondo, Govil, and Cohen-Tidhar et al. (US 2022/0353435).
Regarding claim 16, the combination teaches the limitations specified above; however, the combination does not expressly teach wherein the first video stream is associated with an augmented reality (AR), virtual reality (VR), or gaming application.
Cohen-Tidhar teaches wherein a video stream is associated with an augmented reality (AR), virtual reality (VR), or gaming application ([0033], “System 100 further comprises an End-User Device 150, which may be an electronic device or a computerized device capable of playing video; for example, a smartphone, a tablet, a laptop computer, a desktop computer, a smart-watch, a wearable device, an Augmented Reality (AR) helmet or headset or glasses or gear, a Virtual Reality (VR) helmet or headset or glasses or gear, a smart television, a smart display unit, an Internet connected display unit, or dedicated video playback device, or the like.”).
In view of Cohen-Tidhar’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Waggoner wherein the first video stream is associated with an augmented reality (AR), virtual reality (VR), or gaming application. The modification would allow users to access and view additional forms of content, thereby improving the user experience.
Regarding claim 17, the combination teaches using an analysis tool to:
enable recognition and segmentation of the object of interest in the first video stream (Waggoner: [0033], “In one embodiment, a user selecting two points on the display can cause that frame of video to be analyzed using at least one object recognition process, such as an object identification process or computer vision process, among others, to attempt to identify a representation of an object that has edges or other features proximate to the selected points. In some embodiments, the process can take a portion of the frame of video corresponding to the points and utilize an image matching process to attempt to match the portion against a library of images in order to identify the object of interest.”); and
determining the pixel-level, arbitrarily shaped boundaries of the object of interest (Waggoner: [0047], “FIG. 10 illustrates an example situation 1000 wherein tracking data exists for multiple objects represented in the video content, or at least where objects have been identified that the user might be interested in viewing at a higher magnification level, etc. In this example, there are four bounding boxes 1002 indicating objects that have been selected by a user or provider, or that have been identified using an algorithm or process, as being potentially of interest to be tracked for the user.” Fig. 10. Govil: [0043], “Referring first to FIG. 4 with continued reference to FIGS. 1-2, upon analysis of the visual content of the video frame 400, the object detection module 120 may detect or otherwise identify a number of regions 402, 404, 406 within the video frame 400 that correspond to replaceable objects (e.g., task 202), for example, by detecting or otherwise differentiating the boundaries of the corresponding regions of pixels from the underlying background content of the video frame 400 using machine learning, artificial intelligence, object recognition, or other pattern recognition or image analysis techniques. For each detected object region 402, 404, 406, the object recognition module 122 analyzes the respective set of pixels that comprise the respective region 402, 404, 406 to determine the type or other taxonomic classification of the respective object and discern additional physical and/or visual attributes of the respective object (e.g., task 204).”).
However, the combination does not expressly teach: using an artificial intelligence (AI)-based analysis tool to: enable recognition and segmentation of the object of interest in the first video stream; and determining a window that covers the object of interest.
Cohen-Tidhar teaches using an artificial intelligence (AI)-based analysis tool to:
enable recognition and segmentation of the object of interest in the first video stream ([0014], “The server performs content processing of the uploaded high-resolution video, using a computer vision algorithm and/or object recognition algorithm and/or object identification algorithm (e.g., optionally utilizing Machine Learning (ML) or Artificial Intelligence (AI) or image comparison or other suitable methods), and/or optionally takes into account or utilizes manual input provided by a human moderator (e.g., who reviewed the video and indicated the presence and/or location of object(s) to be tracked), and detects or identifies or recognizes particular objects that are depicted in the video content (‘objects-of-interest’).”); and
determining a window that covers the object of interest ([0020], “An Object-of-Interest Recognition Unit 104 performs an object recognition process, utilizing computer vision and/or other suitable methods, and generates a List of Objects-of-Interest 105. Such list indicates, for example: a serial number or ID number for each recognized object (e.g., object 1 being a goalkeeper; object 2 being a forward player); the in-frame location of the central pixel of the recognized object-of-interest, or a bounding box or bounding rectangle that contains the object and that is defined via parameters (e.g., coordinates of top-left corner and coordinates of top-right corner; or, coordinates of top-left corner, and rectangular length and rectangular width);….”).
In view of Cohen-Tidhar’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Waggoner using an artificial intelligence (AI)-based analysis tool to: enable recognition and segmentation of the object of interest in the first video stream; and determining a window that covers the object of interest. The modification would serve to facilitate object recognition by the system.
Regarding claim 18, the combination further teaches wherein the recognition and segmentation can be performed at a serving entity that generates the first video stream or at the user device that displays the first video stream (Cohen-Tidhar: [0014], “The server performs content processing of the uploaded high-resolution video, using a computer vision algorithm and/or object recognition algorithm and/or object identification algorithm (e.g., optionally utilizing Machine Learning (ML) or Artificial Intelligence (AI) or image comparison or other suitable methods), and/or optionally takes into account or utilizes manual input provided by a human moderator (e.g., who reviewed the video and indicated the presence and/or location of object(s) to be tracked), and detects or identifies or recognizes particular objects that are depicted in the video content (‘objects-of-interest’). … The server then produces or generates multiple (two or more) variants or versions of the original high-resolution video, at different bitrates and at different resolutions, by taking into account the detected object(s)-of-interests and its (their) in-frame location(s).” Waggoner: [0045]).
Claim(s) 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over a combination of Waggoner, Ramaswamy, Gondo, Govil, and Liu et al. (US 2007/0200953).
Regarding claim 19, the combination teaches the limitations specified above; however, the combination does not expressly teach further comprising: receiving feedback associated with the object of interest selected by the user; and using the feedback for further analytics pertaining to the object of interest.
Liu teaches receiving feedback associated with an object of interest selected by a user, and using the feedback for further analytics pertaining to the object of interest ([0054], “Please refer to FIG. 15. FIG. 15 shows a fourth zoom operation performed on the image of FIG. 14 according to the second embodiment of the present invention. The viewer decides to pan right because the resulting zoomed video image 60 is not exactly the desired region of interest 1400. Note that the thumbnail 1301 provides excellent visual feedback for the viewer to select a correct zoom or pan command to achieve the exact desired region of interest.” Fig. 15).
In view of Liu’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Waggoner to include receiving feedback associated with the object of interest selected by the user; and using the feedback for further analytics pertaining to the object of interest. The modification would serve to allow users to correct zoom or pan commands, thereby improving the user experience.
Claim(s) 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over a combination of Waggoner, Ramaswamy, Gondo, Govil, and Hendry et al. (US 2018/0103199).
Regarding claim 20, the combination teaches the limitations specified above; however, the combination does not expressly teach:
wherein the first video stream comprises digital video that is encapsulated in adaptive bitrate streams, on which the region of interest would be achieved using client-side processing,
wherein the adaptive bitrate streams comprise chunks of video data, each chunk encapsulating independently decodable Advanced Video Coding (AVC), High Efficiency Video Coding (HEVC), AOMedia Video (AV1 or AV2), or Versatile Video Coding (VVC).
Hendry teaches:
wherein a first video stream comprises digital video that is encapsulated in adaptive bitrate streams, on which a region of interest would be achieved using client-side processing ([0005], “Regions referred to as ‘most interested regions’ can also be determined based on user statistics or can be user-defined. For instance, a most interested region in a 360-degree video picture can include one of the regions (e.g., covered by one or more tiles) that are statistically most likely to be rendered to the user at the presentation time of the picture. The most interested regions can be used for various purposes, such as for data pre-fetching in 360-degree video adaptive streaming, for transcoding optimization when a 360-degree video is transcoded, for cache management, for content management, among others.” [0102], “The decoding device 112 may output the decoded video to a video destination device 122, which may include a display or other output device for displaying the decoded video data to a consumer of the content.” [0127], “To enable high quality streaming of media content using conventional HTTP web servers, adaptive bitrate streaming can be used.”),
wherein the adaptive bitrate streams comprise chunks of video data, each chunk encapsulating independently decodable Advanced Video Coding (AVC), High Efficiency Video Coding (HEVC), AOMedia Video (AV1 or AV2), or Versatile Video Coding (VVC) ([0078], “The encoding device 104 (or encoder) can be used to encode video data using a video coding standard or protocol to generate an encoded video bitstream. Examples of video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including its Scalable Video Coding (SVC) and Multiview Video Coding (MVC) extensions, and High Efficiency Video Coding (HEVC) or ITU-T H.265.”).
In view of Hendry’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Waggoner wherein the first video stream comprises digital video that is encapsulated in adaptive bitrate streams, on which the region of interest would be achieved using client-side processing, wherein the adaptive bitrate streams comprise chunks of video data, each chunk encapsulating independently decodable Advanced Video Coding (AVC), High Efficiency Video Coding (HEVC), AOMedia Video (AV1 or AV2), or Versatile Video Coding (VVC). The modification would serve to allow for more efficient distribution of content to users.
Claim(s) 21 is/are rejected under 35 U.S.C. 103 as being unpatentable over a combination of Waggoner, Ramaswamy, Gondo, Govil, and Yamagishi et al. (US 2023/0239245).
Regarding claim 21, the combination teaches the limitations specified above; however, the combination does not expressly teach wherein rendering a region of interest on the display panel comprises: based on the additional visual information, generating high quality video frames using a deep learning based super resolution to render the object of interest on the client side.
Yamagishi teaches:
rendering a region of interest on a display panel comprises generating high quality video images using a deep learning based super resolution to render an object of interest ([0357], “The analysis module 82 includes an AI engine using machine learning such as deep learning, acquires the ROI super-resolution image stream from the super-resolution processing module 72, and performs image analysis processing of analyzing the super-resolution image of the region of interest ROI. For example, the analysis module 82 performs a process of identifying (recognizing) a person of an object OBJ included in a super-resolution image of the region of interest ROI, predicting (determining) an action (a dangerous action) of the person, and the like”).
In view of Yamagishi’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Waggoner wherein rendering a region of interest on the display panel comprises: based on the additional visual information, generating high quality video frames using a deep learning based super resolution to render the object of interest on the client side. The modification would serve to provide an alternative and/or supplemental means of providing higher quality content to users.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL R TELAN whose telephone number is (571)270-5940. The examiner can normally be reached 9:30AM-6:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Nasser Goodarzi can be reached at (571) 272-4195. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHAEL R TELAN/ Primary Examiner, Art Unit 2426