DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 10/24/2025 has been entered.
Response to Arguments
Examiner incorporates herein previous Responses to Arguments. Particularly, the following explanation was provided in the record and is memorialized, infra, for expediency.
Applicant previously contended the teachings of Wadhwa are deficient because, “Wadhwa would need to disclose that the machine learning technique is used to determine, based on the depth map, a set of pixels in a warped target frame that correspond to an occlusion of the object in a target frame.” Such an argument is unreasonable in view of the knowledge possessed by one of ordinary skill. There are several modular processes at work in this art, any of which can benefit from employing a neural network or other algorithm.

First, there is the process of frame interpolation, wherein a reference frame (or frames) is used to predict a target frame using optical flow. This simply means that objects in a scene are likely to move, such that when predicting a target frame from a reference frame, movements between successive frames can be used to estimate where the objects are likely to be in subsequent frames. The rejection relies on Zhao and Niklaus to essentially teach these basic features of video coding.

It is important to note that all scenes have objects therein that are in the foreground and occlude the background. Think of a person sitting in a chair. A depth map identifies objects in a scene and segments those objects (person) from the background (chair). Depth dimensions for pixels can be determined by hardware, such as time-of-flight sensors, or can be determined by image and video analytics. Setting aside depth data obtained by a sensor, video analysis can segment an image by tracking groups of pixels that move with respect to the background and assigning shallower depth to objects moving with respect to the background. A neural network can perform such a video analysis, determine groups of pixels that move together with respect to a background, and create a depth map indicating which image regions are background (deeper) and which image regions represent objects. As a foreground object moves in front of a background region of the image, the foreground object occludes the background pixels and may later reintroduce those background pixels as the object first covers and then uncovers the background image region. Obviously, as more complex images with multiple moving objects are encountered, foreground objects can themselves become background image regions for yet closer foreground objects. As the prior art demonstrates, a neural network can be good at identifying and distinguishing those complex foreground and background relationships and creating a depth map of the scene, which essentially identifies objects according to depth. Wadhwa is relied upon in the rejection to teach this aspect of the art.

As already alluded to, as a foreground object moves in front of a background object, occlusions occur at the boundaries between the foreground object and the background. Those occlusion regions are easily identified by finding the optical flow (trajectory) of the foreground object and recognizing that the leading edge of the foreground object in the direction of travel will occlude background pixels along that trajectory in future images. Even without optical flow and a determination of the direction of travel of the foreground object, merely having the depth map identify, according to depth, which pixel regions belong to foreground regions and which belong to background regions identifies a penumbra around the objects that is subject to occlusion should the foreground object move in any manner with respect to the background.
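For illustration only, the following minimal sketch (hypothetical code, not drawn from Wadhwa or any other cited reference; the function name and depth threshold are assumptions) derives, from a depth map alone, the band of background pixels around a foreground object that is subject to occlusion if the object moves:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def occlusion_penumbra(depth_map: np.ndarray, fg_threshold: float, band: int = 3) -> np.ndarray:
    """Return background pixels bordering a foreground object (hypothetical).

    depth_map   : per-pixel depth, smaller values = closer to the camera
                  (e.g., from a time-of-flight sensor or a depth network).
    fg_threshold: depth below which a pixel is treated as foreground.
    band        : half-width, in pixels, of the penumbra around the object.
    """
    foreground = depth_map < fg_threshold            # segment object from background
    dilated = binary_dilation(foreground, iterations=band)
    # Background pixels adjacent to the object are covered or uncovered
    # first if the foreground object moves in any direction.
    return dilated & ~foreground
```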
Thus, to one of ordinary skill in the art, Wadhwa’s teaching of a depth map and determining ordering and overlap on content within an image, which inherently identifies and segments foreground objects from background objects within a scene according to depth, teaches or suggests determining a set of pixels that correspond to an occlusion area using a machine learning technique. The information provided here is basic knowledge the skilled artisan must possess in order to understand the contributions described in the cited prior art. Applicant’s insistence that such fundamental knowledge is not taught by the references represents an unreasonable demotion of the level of skill in this art.
On page 9 of the Remarks, Applicant contends Applicant’s Specification includes sufficient information to enable a person skilled in the art to make and use the claimed invention. Remarks, 8–9. To support that contention, Applicant cites paragraphs [0042]–[0044] and Figs. 1A–1D. While Examiner finds these portions of the Specification provide little information about the specifically recited features of the independent claims, the issues of lack of written description and enablement were primarily the result of claim language now removed from the claims. Accordingly, in view of the amendments to the claims, the rejections under 35 U.S.C. 112(a) are withdrawn.
On page 11 of the Remarks, Applicant contends Niklaus fails to teach or suggest inputting an occlusion mask into the frame synthesis network. Such an argument fails to respond to the rejection of record. The rejection relies on the teachings of Swisher and Staranowicz to teach the obviousness of including an occlusion mask layer in a neural network to handle occlusions for frame interpolation. The combination of Swisher and Staranowicz teaches or suggests using an occlusion mask and transmitting such a mask to another processing module within an optical flow and frame interpolation scheme. It is the combination of these teachings with those of Niklaus regarding warped frames fed “to a video frame synthesis neural network to produce the interpolated frame,” wherein handling occlusions is also performed using the neural network, that would teach or suggest to the skilled artisan Applicant’s claimed features. In other words, the skilled artisan, taking Niklaus’s teachings regarding the handling of occlusions within the neural network, would have consulted relevant art like Swisher and Staranowicz to achieve the occlusion handling suggested by Niklaus. Because Applicant attacks the references individually and fails to account for how one skilled in the art would have combined prior art teachings to arrive at an obvious system that includes an occlusion mask for the optical-flow-enabled frame interpolation scheme represented by Applicant’s claimed features, Applicant’s argument against the rejection under 35 U.S.C. 103 is unpersuasive of error.
On page 12 of the Remarks, Applicant contends Zhao fails to teach predicting a flow vector between two frames and applying a residual motion correction. Examiner finds the argument moot in view of the new grounds of rejection necessitated by amendment. Zhao was never relied upon to teach these features, as these features are newly presented in the currently amended claims. The rejection now relies principally on the additionally cited teachings of Staranowicz to teach or suggest the features added by way of amendment. See rejections, infra.
On page 12 of the Remarks, Applicant contends that Niklaus does not teach or suggest producing a warped target frame using optical flow and supports that contention by arguing that Niklaus generates an interpolation frame without using pixel-wise blending. It is unclear why Applicant believes Niklaus’s statement regarding pixel-wise blending is relevant to whether Niklaus teaches or suggests using optical flow estimation to produce a warped target frame. Indeed, the two concepts are not mutually exclusive. Attorney arguments and conclusory statements unsupported by factual evidence are entitled to little probative value. In re Geisler, 116 F.3d 1465, 1470 (Fed. Cir. 1997). As the prior art explains, pixel-wise blending is a separate post-processing step and does not constrain any teaching of Niklaus in the way Applicant suggests. See rejection, infra.
On pages 13–14 of the Remarks, Applicant contends Staranowicz is deficient for failing to teach or suggest encoding an occlusion mask based on the target frame. Applicant admits that Staranowicz teaches transmitting the occlusion mask. Remarks, 13. Applicant also admits that Staranowicz teaches using the occlusion mask to perform edge correction in the interpolated frame (i.e., Applicant’s target frame); the mask therefore corresponds to the target frame, i.e., the mask identifies occlusion areas in the interpolated frame. So, of course the mask is coded “based on” Applicant’s target frame. How can one transmit digital data, like the occlusion mask, without it being encoded? Applicant has provided no evidence or reasoning to support such a contention. Attorney arguments and conclusory statements unsupported by factual evidence are entitled to little probative value. In re Geisler, 116 F.3d 1465, 1470 (Fed. Cir. 1997).
On page 15 of the Remarks, Applicant contends, “Staranowicz fails to disclose any use of an occlusion mask that involves encoding or converting the occlusion mask into a digital format for the purpose of compressing, storing, or distributing digital video content….” This argument is unreasonable in view of the level of skill in the art. How would one go about transmitting the occlusion mask as taught by Staranowicz without digitizing the data (which is already in a digital format), storing the data, and serializing the data for serial data transmission as suggested by Staranowicz? Applicant offers no explanation for how Staranowicz can achieve such transmission without encoding, and the skilled artisan immediately recognizes that the transmission of any digital data includes an encoding step as broadly claimed. Applicant’s argument amounts to a mere semantic argument about language and does not come close to asserting a technological difference between the prior art and the claimed embodiment. The claim merely requires encoding, and Applicant has pointed to no concrete aspect of Applicant’s encoding, supported by Applicant’s disclosure, that differs from the prior art. Applicant’s argument relies on a web definition of video encoding. Does the claim say the occlusion mask is encoded using video encoding? Chien (US 2017/0323471 A1) teaches, “Through the use of the compact encoding scheme and two-level structure, the occlusion data can be stored with moderate memory usage.” (¶ 0053). Sugano (US 2021/0134049 A1) teaches, “Three-dimensional data such as three-dimensional occlusion data, if any, is encoded together with the two-dimensional image data and the depth data.” (¶ 0201). Georgakis (US 2014/0192148 A1) teaches an encoder sending occlusion data to a decoder (¶ 0053). Tian (US 2013/0162774 A1) is titled, “Compression Methods and Apparatus For Occlusion Data.” Klein Gunnewiek (US 2010/0195716 A1) teaches “an occlusion data frame [is] encoded.” (Abstract). The foregoing references are in addition to the two references, Pang and Doyen, already provided for the record and cited under the Conclusion Section of this Office Action. Therefore, Applicant’s argument that the teaching of encoding an occlusion map “is provided solely by the Examiner” is belied by the overwhelming evidence provided on this record. Accordingly, Applicant’s argument is not persuasive of patentability.
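For illustration of the point that transmitting or storing a binary occlusion mask necessarily entails an encoding step, the following is a minimal run-length encoding sketch (hypothetical code; none of the cited references is represented as using this particular scheme):

```python
import numpy as np

def rle_encode_mask(mask: np.ndarray) -> list:
    """Run-length encode a binary occlusion mask for transmission or storage.

    The flattened mask is serialized as alternating run lengths, beginning
    with a run of zeros (a zero-length first run is permitted).
    """
    flat = mask.astype(np.uint8).ravel()
    runs, current, count = [], 0, 0
    for bit in flat:
        if bit == current:
            count += 1
        else:
            runs.append(count)        # close out the current run
            current, count = bit, 1
    runs.append(count)
    return runs                       # digital, serialized, and compressed
```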
Other claims are not argued separately. Remarks, 16.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1–4, 6, 8–13, 16, and 18–20 are rejected under 35 U.S.C. 103 as being unpatentable over Niklaus et al., “Context-aware Synthesis for Video Frame Interpolation,” March 2018 (herein “Niklaus”), Zhao et al., “CNN-based Bi-Directional Motion Compensation for High Efficiency Video Coding,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2018, pp. 1–4 (herein “Zhao”), Swisher (US 2018/0300540 A1), and Staranowicz (US 2018/0315174 A1).
Regarding claim 1, the combination of Zhao, Niklaus, Swisher, and Staranowicz teaches or suggests a computer-implemented method for compressing a target video, the computer-implemented method comprising (Examiner finds, as evidenced by the prior art listed under the Conclusion Section of this Office Action, that motion compensation for B-frames is essentially the same as frame interpolation and that the skilled artisan would find them to be equivalents; Therefore, in teaching frame interpolation by CNN, Niklaus teaches using CNN optical flow for video compression; However, absent the aforementioned understanding, it could incorrectly be argued that neither Niklaus nor Meister teaches using optical flow for video compression; Therefore, the rejection relies on Zhao to explicitly teach using optical flow specifically for the motion compensation component of the video compression algorithm; Zhao, Title and Abstract: explains bi-directional optical flow can be used for video compression): determining a first estimated optical flow based on a displacement of pixels between a first reference frame included in the target video and a target frame included in the target video (Examiner notes the term bi-directional optical flow means there is an optical flow estimated in both the forward and reverse directions by virtue of using a preceding reference frame and a succeeding reference frame; Examiner also notes that optical flow characterizes motion, and motion is, by definition, a displacement (mathematical translation) of pixels between two frames; Zhao, Section 1: teaches optical flow; Niklaus, Abstract and Introduction: teaches computing bi-directional optical flow estimations that are warped to produce interpolated frames (i.e., Applicant’s target frame)), wherein determining the first estimated optical flow comprises: predicting a flow vector from the first reference frame to a second reference frame, and applying a residual motion correction to the predicted flow vector, the residual motion correction being calculated between the first reference frame and the target frame (Staranowicz, ¶¶ 0041–0044: teaches that errors occurring in optical flow estimations can be corrected by obtaining two or more reference frames (the reference frames can be original frames or the result of previous frame interpolation) from the video sequence and calculating optical flow errors (i.e., motion residuals) between forward and backward optical flows; As explained in the cited portion of the reference, a comparison between the optical flow calculation and the actual frame can quantify an error (i.e., residual motion) between the optical flow prediction and the actual frame; see also Staranowicz, ¶ 0056: explaining the frame error is calculated by finding the difference between a starting pixel location and an ending pixel location by subtracting estimated from actual (i.e., the definition of a residual motion); Staranowicz, ¶ 0059: teaches “correcting” the inaccurate optical flows); applying the first estimated optical flow to the first reference frame to produce a first warped target frame (Niklaus, Abstract and Introduction: teaches computing bi-directional optical flow estimations that are warped to produce interpolated frames (i.e., Applicant’s target frame)), wherein producing the first warped target frame comprises warping pixels in the first reference frame to a new position in the target frame based on the first estimated optical flow (Examiner notes this is simply the definition of warping a frame) with a pixel contributing to a value of a neighboring pixel with a bilinear interpolation weight (Staranowicz, ¶ 0048: teaches a post-processing step in which color values of pixels that do not agree with neighboring pixels can be corrected; Examiner finds bilinear interpolation weighting is the most obvious type of filter to the skilled artisan and is widely represented in the art, but Staranowicz nevertheless still teaches it explicitly; Staranowicz, ¶ 0050: teaches linear pixel color blending using a weighted average to interpolate color between two pixels; While Staranowicz’s paragraph [0050] teaches the color interpolation of a pixel occurring by bilinear interpolation of the color value between frames (i.e., temporal bilinear interpolation using a temporal neighbor), Staranowicz’s paragraph [0048] nevertheless also teaches spatial bilinear interpolation using a spatial neighbor; Likewise, Zhao, Section 1: describes Yan’s work in which a CNN-based interpolation filter is added to the CODEC for improving half-pel estimation accuracy; Niklaus, Sections 3.1 and 4.4 and Fig. 3: teaches including bilinear upsampling (i.e., interpolation) in the frame interpolation algorithm to address checkerboard artifacts; Furthermore, the prior art is replete with examples of bilinear pixel interpolation such that asserting such a feature as nonobvious is unreasonable to one of ordinary skill in the art; see, e.g., Liu, ¶ 0103, cited under the Conclusion Section of this Office Action, teaching upsampling in a frame interpolation scheme utilizing neural networks by using bilinear interpolation to correct for artifacts; It is noted Liu has the same authorship as Niklaus); determining, via a trained machine learning model, a mask that identifies a set of pixels corresponding to a region that is occluded in the first reference frame (Swisher, ¶ 0053: teaches a neural network may include “an occlusion estimation layer that estimates a binary mask to indicate [occlusions].”; Staranowicz, ¶ 0070: teaches the output of an optical flow process can be fed into an occlusion mask generating process); synthesizing, via a second trained machine learning model, an estimate of the target frame based on the first warped target frame (Niklaus, Abstract: teaches warped frames fed “to a video frame synthesis neural network to produce the interpolated frame” and teaches handling occlusions using the neural network); encoding the target frame based on the estimate of the target frame (Zhao, Title and Abstract: explains bi-directional optical flow can be used for video compression); and encoding the mask based on the target frame (Staranowicz, ¶ 0064: teaches “generated occlusion masks may optionally be transmitted….”).
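For illustration of the warping feature mapped above, the following is a minimal backward-warping sketch (hypothetical code, not drawn from the cited references; the function name and array conventions are assumptions) in which each warped pixel value is assembled from four neighboring reference pixels, each contributing with a bilinear interpolation weight:

```python
import numpy as np

def warp_frame(reference: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp a grayscale reference frame with a dense flow field.

    reference: H x W reference frame.
    flow     : H x W x 2 field mapping each target pixel to a (possibly
               fractional) source location in the reference frame.
    """
    h, w = reference.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    src_x = np.clip(xs + flow[..., 0], 0, w - 1.001)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1.001)
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    fx, fy = src_x - x0, src_y - y0    # fractional offsets
    # Four neighboring reference pixels each contribute to the warped
    # value according to a bilinear interpolation weight.
    return ((1 - fx) * (1 - fy) * reference[y0, x0]
            + fx * (1 - fy) * reference[y0, x0 + 1]
            + (1 - fx) * fy * reference[y0 + 1, x0]
            + fx * fy * reference[y0 + 1, x0 + 1])
```

Forward warping with scatter-splatting is an alternative design; the backward formulation shown avoids leaving holes in the warped frame.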
One of ordinary skill in the art, before the effective filing date of the claimed invention, would have been motivated to combine the elements taught by Zhao, with those of Niklaus, because both references are drawn to the same field of endeavor such that one wishing to practice optical flow estimation would be led to their relevant teachings, because Zhao itself references Niklaus’s related work (endnote 12 of Zhao), and because, as evidenced by other prior art cited under the Conclusion Section of this Office Action, the skilled artisan finds motion compensation for video coding is the same as frame interpolation due to their recognized substantial overlap in concept. Thus, the combination is nothing more than a mere combination of prior art elements, according to known methods, to yield a predictable result. This rationale applies to all combinations of Zhao and Niklaus used in this Office Action unless otherwise noted.
One of ordinary skill in the art, before the effective filing date of the claimed invention, would have been motivated to combine the elements taught by Zhao and Niklaus, with those of Swisher, because all three references are drawn to the same field of endeavor such that one wishing to practice image processing using machine learning would be led to their relevant teachings and because, as evidenced by Niklaus, optical flow analysis requires handling the challenges of occlusions such that one would be led to Swisher’s machine learning techniques addressing occlusions using, for example, binary filtering (mask). Thus, the combination is nothing more than a mere combination of prior art elements, according to known methods, to yield a predictable result. This rationale applies to all combinations of Zhao, Niklaus, and Swisher used in this Office Action unless otherwise noted.
One of ordinary skill in the art, before the effective filing date of the claimed invention, would have been motivated to combine the elements taught by Zhao, Niklaus, and Swisher, with those of Staranowicz, because all four references are drawn to the same field of endeavor such that one wishing to practice frame interpolation (which includes handling occlusions and inpainting) using optical flow would be led to their relevant teachings and because, as evidenced by Niklaus, optical flow analysis requires handling the challenges of occlusions such that one would be led to Staranowicz’s optical flow and masking techniques (¶¶ 0064 and 0070) addressing occlusions using the transmission of the occlusion mask to assist with other image processing tasks. Thus, the combination is nothing more than a mere combination of prior art elements, according to known methods, to yield a predictable result. This rationale applies to all combinations of Zhao, Niklaus, Swisher, and Staranowicz used in this Office Action unless otherwise noted.
Regarding claim 2, the combination of Zhao, Niklaus, Swisher, and Staranowicz teaches or suggests the computer-implemented method of claim 1, further comprising synthesizing the estimate of the target frame based on a second warped target frame, wherein the second warped target frame is generated based on a second reference frame included in the target video (Zhao, Abstract: teaches the optical flow estimation is for “bi-directional motion compensation” (i.e. B-frames); Examiner notes the skilled artisan knows B-frames conventionally means predicting a target frame based on a first reference frame and a second reference frame).
Regarding claim 3, the combination of Zhao, Niklaus, Swisher, and Staranowicz teaches or suggests the computer-implemented method of claim 2, wherein the first reference frame precedes the target frame within the target video and the second reference frame succeeds the target frame within the target video (Examiner notes the recited feature is simply the definition of a B-frame; see evidence provided under the Conclusion Section of this Office Action; Zhao, Abstract: teaches the optical flow estimation is for “bi-directional motion compensation” (i.e. B-frames); Examiner notes the skilled artisan knows B-frames conventionally means predicting a target frame based on a first reference frame and a second reference frame).
Regarding claim 4, the combination of Zhao, Niklaus, Swisher, and Staranowicz teaches or suggests the computer-implemented method of claim 1, further comprising training a machine learning model based on interpolation training data and one or more losses to generate the trained machine learning model (Zhao, Section II(B): teaches the CNN is trained using losses calculated by comparing the current frame to its prediction; Niklaus, Section on Loss Functions: teaches the synthesis neural network is trained using loss functions; Examiner notes using losses to train a machine learning model is manifestly obvious given the sheer ubiquity of the concept in the art), wherein the interpolation training data comprises one or more training reference frames and a training target frame (Zhao, Section II(B): teaches the CNN is trained using bi-prediction, which, as already explained supra, means one or more reference frames and a target frame).
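For illustration, one loss-driven training step of the type described by Zhao and Niklaus may be sketched as follows (hypothetical PyTorch code; the model interface and variable names are assumptions, and the cited references are not represented as using this exact loop):

```python
import torch

def train_step(model, optimizer, ref_prev, ref_next, target):
    """One loss-driven update of a frame interpolation model (hypothetical).

    (ref_prev, ref_next, target) is one sample of interpolation training
    data: two training reference frames and the training target frame.
    """
    optimizer.zero_grad()
    prediction = model(ref_prev, ref_next)             # interpolate target
    loss = torch.mean(torch.abs(prediction - target))  # L1 reconstruction loss
    loss.backward()                                    # backpropagate the loss
    optimizer.step()                                   # update model parameters
    return loss.item()
```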
Regarding claim 6, the combination of Zhao, Niklaus, Swisher, and Staranowicz teaches or suggests the computer-implemented method of claim 1, further comprising determining, via an occlusion estimation technique, the set of pixels that correspond to the region that is occluded in the first reference frame (Niklaus, Section 3: teaches the neural network identifies (i.e. estimates) occluded regions; Swisher, ¶ 0053: teaches a neural network may include “an occlusion estimation layer that estimates a binary mask to indicate [occlusions].”; Staranowicz, ¶ 0070: teaches the output of an optical flow process can be fed into an occlusion mask generating process).
Regarding claim 8, the combination of Zhao, Niklaus, Swisher, and Staranowicz teaches or suggests the computer-implemented method of claim 1, further comprising encoding the first estimated optical flow based on the target frame (Zhao, Title and Abstract: explains bi-directional optical flow can be used for video compression).
Regarding claim 9, the combination of Zhao, Niklaus, Swisher, and Staranowicz teaches or suggests the computer-implemented method of claim 1, wherein encoding the target frame comprises encoding a residual between the target frame and the estimate of the target frame (Zhao, Title and Abstract: explains bi-directional optical flow can be used for video compression; Examiner finds encoded residuals are ubiquitous in the art of video compression).
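For illustration, the ubiquitous practice of residual coding may be sketched as follows (hypothetical code; the function names and the scalar quantizer are assumptions, and real codecs additionally transform and entropy-code the residual):

```python
import numpy as np

def encode_residual(target: np.ndarray, estimate: np.ndarray, q: int = 8) -> np.ndarray:
    """Quantize the residual between the actual target frame and its estimate."""
    residual = target.astype(np.int16) - estimate.astype(np.int16)
    return np.round(residual / q).astype(np.int16)     # coarse scalar quantization

def decode_residual(estimate: np.ndarray, coded: np.ndarray, q: int = 8) -> np.ndarray:
    """Reconstruct the target frame from the estimate plus the coded residual."""
    rec = estimate.astype(np.int16) + coded.astype(np.int16) * q
    return np.clip(rec, 0, 255).astype(np.uint8)
```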
Regarding claim 10, the combination of Zhao, Niklaus, Swisher, and Staranowicz teaches or suggests the computer-implemented method of claim 1, wherein the trained machine learning model comprises a convolutional neural network (Zhao, Title and Abstract: explains bi-directional optical flow using a CNN).
Claim 11 lists the same elements as claim 1, but in CRM form rather than method form. Therefore, the rationale for the rejection of claim 1 applies to the instant claim.
Claim 12 lists the same elements as claim 2, but in CRM form rather than method form. Therefore, the rationale for the rejection of claim 2 applies to the instant claim.
Claim 13 lists the same elements as claim 6, but in CRM form rather than method form. Therefore, the rationale for the rejection of claim 6 applies to the instant claim.
Claim 16 lists the same elements as claim 8, but in CRM form rather than method form. Therefore, the rationale for the rejection of claim 8 applies to the instant claim.
Regarding claim 18, the combination of Zhao, Niklaus, Swisher, and Staranowicz teaches or suggests the one or more non-transitory computer-readable media of claim 11, wherein the second trained machine learning model comprises a GridNet neural network (Niklaus, Description for Figure 2: explains GridNet is used for the frame synthesis network).
Regarding claim 19, the combination of Zhao, Niklaus, Swisher, and Staranowicz teaches or suggests the one or more non-transitory computer-readable media of claim 11, wherein the first reference frame comprises a key frame (Examiner notes B-frames are known to be predicted from key frames; Zhao, Introduction: explains the bi-directional optical flow prediction produces B-frames, which can obviously be predicted from key frames).
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Niklaus, Zhao, Swisher, Staranowicz, and Ye (US 2020/0221122 A1).
Regarding claim 5, the combination of Zhao, Niklaus, Swisher, Staranowicz, and Ye teaches or suggests the computer-implemented method of claim 4, wherein the one or more losses comprise an L1 norm between a first set of pixels generated by the machine learning model based on the one or more training reference frames and a second set of pixels included in the training target frame (Ye, ¶ 0127: teaches the L1 norm can be the distortion measure (e.g., sum of absolute differences), which the skilled artisan would find essentially equivalent, as a distortion measure, to a mean square error determination; see Liu and Podilchuk under the Conclusion Section of this Office Action).
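For clarity of the record, the L1 norm loss recited in the claim is conventionally written as follows (a standard formulation; this expression is not quoted from Ye):

```latex
\mathcal{L}_{1} = \sum_{i} \left| \hat{p}_{i} - p_{i} \right|
```

where \hat{p}_i is a pixel generated by the machine learning model from the one or more training reference frames and p_i is the corresponding pixel of the training target frame; unnormalized, this sum is the sum of absolute differences (SAD), and normalized by the number of pixels it is the mean absolute difference (MAD) referenced by Podilchuk.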
One of ordinary skill in the art, before the effective filing date of the claimed invention, would have been motivated to combine the elements taught by Zhao, Niklaus, Swisher, and Staranowicz, with those of Ye, because all five references are drawn to the same field of endeavor such that one wishing to practice optical flow estimation would be led to their relevant teachings, because Ye explains that the most ubiquitous comparison technique used in the art, the sum of absolute differences (i.e., SAD), is an L1 norm, and because, as evidenced by other prior art cited under the Conclusion Section of this Office Action, the skilled artisan would find using an L1 norm for evaluating loss in machine learning for optical flow estimation an obvious way to score performance. Thus, the combination is nothing more than a mere combination of prior art elements, according to known methods, to yield a predictable result. This rationale applies to all combinations of Zhao, Niklaus, Swisher, Staranowicz, and Ye used in this Office Action unless otherwise noted.
Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Niklaus, Zhao, Swisher, Staranowicz, and Meister et al., “UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss,” Nov 2017 (herein “Meister”).
Regarding claim 7, the combination of Zhao, Niklaus, Swisher, Staranowicz, and Meister teaches or suggests the computer-implemented method of claim 6, wherein the additional occlusion estimation technique comprises at least one of determining a difference between a first pixel value from the first reference frame and a second pixel value from the target frame, determining a magnitude of motion between the first pixel value and the second pixel value (Examiner notes a difference in position between two corresponding pixels is the definition of motion; Meister, Section on Unsupervised Loss: “We mark pixels as becoming occluded whenever the mismatch between these two flows is too large.”; Examiner finds this means the forward and backward motions are compared, with large mismatches indicating occlusions rather than legitimate motions), or performing a depth test associated with the depth map (Examiner notes it is prevalent in the art of optical flow to distinguish between “layers” (e.g., foreground and background), which indicates different depths as a closer object moves in front of a farther away (deeper) object in the image).
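For illustration, the forward-backward consistency check described by Meister may be sketched as follows (hypothetical code; a constant threshold is assumed here, whereas Meister’s actual criterion scales the threshold with the flow magnitude):

```python
import numpy as np

def occlusion_by_fb_check(flow_fw: np.ndarray, flow_bw: np.ndarray,
                          threshold: float = 1.0) -> np.ndarray:
    """Mark pixels as occluded where forward and backward flows disagree.

    flow_fw: H x W x 2 flow from frame 1 to frame 2.
    flow_bw: H x W x 2 flow from frame 2 to frame 1.
    A pixel p is marked occluded when flow_fw(p) + flow_bw(p + flow_fw(p))
    is large, i.e., the round trip does not return near p.
    """
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Look up the backward flow at the (rounded) forward-flow destination.
    tx = np.clip(np.rint(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    ty = np.clip(np.rint(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    mismatch = flow_fw + flow_bw[ty, tx]
    return np.linalg.norm(mismatch, axis=-1) > threshold
```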
One of ordinary skill in the art, before the effective filing date of the claimed invention, would have been motivated to combine the elements taught by Zhao, Niklaus, Swisher, and Staranowicz, with those of Meister, because all five references are drawn to the same field of endeavor such that one wishing to practice optical flow estimation would be led to their relevant teachings, because Meister explains how to address occlusions, which are also the subject of both Zhao and Niklaus, and because, as demonstrated by the prior art, the skilled artisan instinctively knows that outlier motions in an optical flow analysis are likely to indicate an occlusion. Thus, the combination is nothing more than a mere combination of prior art elements, according to known methods, to yield a predictable result. This rationale applies to all combinations of Zhao, Niklaus, Swisher, Staranowicz, and Meister used in this Office Action unless otherwise noted.
Claim 14 lists the same elements as claim 7, but in CRM form rather than method form. Therefore, the rationale for the rejection of claim 7 applies to the instant claim.
Claims 15, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao, Niklaus, Swisher, Staranowicz, and Lee (US 2019/0311202 A1).
Regarding claim 15, the combination of Zhao, Niklaus, Swisher, Staranowicz, and Lee teaches or suggests the one or more non-transitory computer-readable media of claim 11, wherein encoding the mask comprises: inputting the target frame and the mask into an encoder neural network; and generating, via the encoder neural network, an encoded representation of the mask based on (i) features extracted from the target frame and (ii) the mask (Lee, ¶ 0038: teaches the term “neural network” in this field means an “encoder neural network” based on its function or output; The independent claim already addressed encoding the mask and inputting the target frame and mask, and so the rationale for those features is not repeated here; Examiner notes a previous version of this claim required a second encoder neural network; The teachings of the prior art with respect to a second neural network also apply to a single neural network performing the recited functions; Lee, ¶ 0097: teaches a mask associated with the target frame is input into a (second) neural network; Lee, ¶ 0097 and claim 3: teaches that various component aspects of the overall encoding scheme can be broken out into separate neural network encoders; This teaching, combined with the teachings of Zhao, Niklaus, and Staranowicz, would teach or suggest an HEVC video encoder wherein a first neural network may perform the motion estimation (using optical flow) component of video coding and then optionally feed that result (output) into subsequent components of the overall HEVC video encoder to produce a coded video stream).
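For illustration, one conventional way to input a frame and a mask into a single encoder neural network is to concatenate the mask with the frame as an additional input channel; the following is a minimal sketch (hypothetical PyTorch code, not Lee’s disclosed architecture; the class name and layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class MaskEncoder(nn.Module):
    """Hypothetical encoder neural network for an RGB target frame plus a
    one-channel occlusion mask, concatenated into four input channels."""

    def __init__(self, feat: int = 32, code: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, code, 3, stride=2, padding=1),
        )

    def forward(self, frame: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = torch.cat([frame, mask], dim=1)  # features drawn from frame + mask
        return self.net(x)                   # encoded representation of the mask
```

A usage example would be MaskEncoder()(frame, mask) with frame shaped (N, 3, H, W) and mask shaped (N, 1, H, W).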
One of ordinary skill in the art, before the effective filing date of the claimed invention, would have been motivated to combine the elements taught by Zhao, Niklaus, Swisher, and Staranowicz, with those of Lee, because all five references are drawn to the same field of endeavor such that one wishing to practice optical flow estimation would be led to their relevant teachings, because Lee explains how information about a target frame generated by one neural network can be fed into another encoder neural network, and because separating things that are separable, or combining things that are combinable, is obvious absent some particular, nonobvious discovery regarding how the separation or combination yields something more than the constituent parts. This rationale applies to all combinations of Zhao, Niklaus, Swisher, Staranowicz, and Lee used in this Office Action unless otherwise noted.
Regarding claim 17, the combination of Zhao, Niklaus, Swisher, Staranowicz, and Lee teaches or suggests the one or more non-transitory computer-readable media of claim 11, wherein encoding the target frame based on the estimate of the target frame comprises: inputting the target frame and the estimate of the target frame into an encoder neural network; and generating, via the encoder neural network, an encoded representation of the target frame based on features extracted from the estimate of the target frame and the target frame (Lee, ¶ 0038: teaches the term “neural network” in this field means an “encoder neural network” based on its function or output; The independent claim already addressed inputting the target frame and the estimate of the target frame and generating an encoded representation of the target frame based on the estimated target frame (i.e., frame interpolation), and so the rationale for those features is not repeated here; Examiner notes a previous version of this claim required a second encoder neural network; The teachings of the prior art with respect to a second neural network also apply to a single neural network performing the recited functions; Lee, ¶ 0097: teaches a mask associated with the target frame is input into a (second) neural network; Lee, ¶ 0097 and claim 3: teaches that various component aspects of the overall encoding scheme can be broken out into separate neural network encoders; This teaching, combined with the teachings of Zhao, Niklaus, and Staranowicz, would teach or suggest an HEVC video encoder wherein a first neural network may perform the motion estimation (using optical flow) component of video coding and then optionally feed that result (output) into subsequent components of the overall HEVC video encoder to produce a coded video stream).
Claim 20 lists the same elements as claim 15, but in system form rather than CRM form. Therefore, the rationale for the rejection of claim 15 applies to the instant claim.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Bross et al., “Versatile Video Coding (Draft 2),” JVET-K1001-v7, July 2018. Section 8.3.4.5 is titled “Bi-directional optical flow prediction process,” which demonstrates that the state-of-the-art video compression algorithm already employs bi-directional optical flow using L0 and L1 samples (i.e., frame interpolation).
Z. Zhao, S. Wang, S. Wang, X. Zhang, S. Ma, and J. Yang, “CNN-based bi-directional motion compensation for high efficiency video coding,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 27 May 2018, pp. 1–4.
Alshin et al., “Bi-Directional Optical Flow for Improving Motion Compensation,” 28th Picture Coding Symposium (PCS 2010), December 8–10, 2010, Nagoya, Japan.
Ljunggren (US 2010/0332675 A1) teaches coding B-frames is the same as frame interpolation (e.g. ¶ 0006).
Cedervall (US 2010/0091772 A1) teaches “B-frames interpolate data in the following frame as well as the preceding frame.” (¶ 0005).
Ho (US 2007/0140351 A1) teaches, “Motion compensation based prediction and interpolation for B frames is similar to that of P frames except that for each B frame, motion vectors are generated relative to a successive I or P frame and a prior I or P frame.” (¶ 0007).
Liu (US 2020/0012940 A1) has the same authorship as “Niklaus” used in the rejection of the claims and teaches, among other things, using the L1 norm to evaluate loss (Liu, ¶ 0061).
Podilchuk (US 2007/0286497 A1) teaches, “The most commonly used cost functions for the block matching algorithm are the mean square error (MSE) or L2 norm and the mean absolute difference (MAD) or L1 norm.” (¶ 0040). Ye explains SAD is also an L1 norm, so MAD and SAD are equivalent up to normalization.
Socek (US 2018/0315199 A1) teaches occlusion can be identified using depth or incoherent motion (¶ 0003).
Swisher (US 2018/0300540 A1) teaches a neural network may include “an occlusion estimation layer that estimates a binary mask to indicate [occlusions].” (¶ 0053).
Jiang (US 2020/0242777 A1) teaches a CNN or other machine learning implementation that generates a depth map (¶ 0031).
Argiro (US 11,826,636 B2) teaches generating a depth map using a neural network (claim 42).
Lee (US 2018/0302606 A1) teaches recognizing and correcting occlusion regions using depth maps (claim 1).
Sinha (US 2017/0168586 A1) teaches inference (interpolation) based on a depth map using a neural network (¶ 0038) and teaches “robustness against occlusion due to discriminative training of neural networks.” (¶ 0006).
Palou et al., “Depth Ordering on Image Sequences Using Motion Occlusions,” 2012 19th IEEE International Conference on Image Processing, September 30, 2012. Teaches “optical flow algorithms can be used directly to estimate occlusion points” and teaches determining depth orders based on occlusions.
Wadhwa (US 2021/0183089 A1) was used to reject a previous version of claims 1 and 6. Wadhwa teaches using machine learning techniques to determine a depth map and teaches a machine learning technique for identifying objects and their ordering and/or overlap within the image (Wadhwa, ¶ 0037). The skilled artisan would find it obvious that the use of depth ordering produces a depth map and that such depth data can be used to identify occlusion regions around objects as they move throughout the scene; thus, a correlation exists between objects at varying depths and their surrounding pixels being occlusion regions, which is the most basic of concepts. Examiner further notes those skilled in the art interpret ordering as depth ordering.
Doyen (US 2019/0385323 A1) teaches an occlusion mask generated from a depth occlusion map is transmitted to a rendering device (¶ 0008).
Pang (US 2018/0097867 A1) teaches compression coding of video data includes encoding side information (¶ 0382) and teaches an occlusion mask indicating occluded regions between a reference vantage and a target vantage wherein the occlusion mask “may be provided as part of the compressed data stream…[to be] available to the decoder…” (¶ 0432).
Chien (US 2017/0323471 A1) teaches, “Through the use of the compact encoding scheme and two-level structure, the occlusion data can be stored with moderate memory usage.” (¶ 0053).
Sugano (US 2021/0134049 A1) teaches, “Three-dimensional data such as three-dimensional occlusion data, if any, is encoded together with the two-dimensional image data and the depth data.” (¶ 0201).
Georgakis (US 2014/0192148 A1) teaches an encoder sending occlusion data to a decoder (¶ 0053).
Tian (US 2013/0162774 A1) is titled, “Compression Methods and Apparatus For Occlusion Data.”
Klein Gunnewiek (US 2010/0195716 A1) teaches “an occlusion data frame [is] encoded.” (Abstract).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael J Hess whose telephone number is (571)270-7933. The examiner can normally be reached on Mon - Fri 9:00am-5:30pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, William Vaughn can be reached on (571)272-3922. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHAEL J HESS/Primary Examiner, Art Unit 2481