Prosecution Insights
Last updated: April 19, 2026
Application No. 17/949,156

ADAPTIVE VIDEO FRAME BLENDING

Final Rejection §103

Filed: Sep 20, 2022
Examiner: CHOWDHURY, NIGAR
Art Unit: 2484
Tech Center: 2400 — Computer Networks
Assignee: Nvidia Corporation
OA Round: 2 (Final)

Grant Probability: 69% (Favorable)
OA Rounds: 3-4
To Grant: 3y 3m
With Interview: 86%

Examiner Intelligence

Career Allow Rate: 69%, above average (490 granted / 713 resolved; +10.7% vs TC avg)
Interview Lift: +17.3% for resolved cases with an interview (a strong lift)
Typical Timeline: 3y 3m average prosecution, 21 applications currently pending
Career History: 734 total applications across all art units
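
The headline figures above can be reproduced directly from the counts shown on this card. The short sketch below (Python, purely illustrative) assumes the interview lift is additive in percentage points on top of the career allow rate, which is consistent with the 86% with-interview figure shown; the underlying per-interview case counts are not published on this page.

```python
# Recompute the examiner's headline stats from the counts shown above.
granted, resolved = 490, 713
career_allow_rate = granted / resolved            # 0.687... -> displayed as 69%

# Assumption: the +17.3% "interview lift" is additive, in percentage points.
interview_lift = 0.173
allow_rate_with_interview = career_allow_rate + interview_lift   # ~0.860 -> 86%

print(f"Career allow rate:       {career_allow_rate:.1%}")
print(f"Allow rate w/ interview: {allow_rate_with_interview:.1%}")
```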

Statute-Specific Performance

§101: 7.3% (-32.7% vs TC avg)
§103: 50.7% (+10.7% vs TC avg)
§102: 29.4% (-10.6% vs TC avg)
§112: 3.3% (-36.7% vs TC avg)

Tech Center averages are estimates. Based on career data from 713 resolved cases.
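
The per-statute deltas can be checked the same way. A minimal sketch, assuming each "% vs TC avg" value is a simple percentage-point difference between the examiner's rate and the Tech Center baseline (the page does not state this explicitly):

```python
# Examiner's per-statute rates (%) and their stated deltas vs the TC average.
examiner_rate = {"§101": 7.3, "§103": 50.7, "§102": 29.4, "§112": 3.3}
delta_vs_tc   = {"§101": -32.7, "§103": 10.7, "§102": -10.6, "§112": -36.7}

# Back out the implied Tech Center baseline for each statute.
implied_tc_avg = {s: round(examiner_rate[s] - delta_vs_tc[s], 1) for s in examiner_rate}
print(implied_tc_avg)   # {'§101': 40.0, '§103': 40.0, '§102': 40.0, '§112': 40.0}
```

All four implied baselines come out to 40.0%, which suggests the deltas were computed against a single Tech Center-wide estimate rather than per-statute averages.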

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant’s arguments with respect to claim(s) 1, 8, 15 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20230217001 by Holzer et al. in view of US 2023/0344962 by Tran et al.

Regarding claim 1, Holzer et al. discloses one or more processors comprising: circuitry to: identify one or more pixels of a first video frame of a video sequence or a second video frame of the video sequence having depth value; and generate one or more intermediate video frames using the identified one or more pixels (paragraph 0190 teaches “In the present embodiment, an artificially rendered image is generated as intermediate frame X by tracking a set of control points between the first frame and the second frame, and interpolating individual control points for the third location X using the set of control points and interpolating pixel locations using the individual control points. As shown, the first frame includes a first image captured from a first location and the second frame includes a second image captured from a second location. In one example, interpolating the pixel locations includes interpolation using barycentric coordinates based on three or more control points. In another example, interpolating the pixel locations includes interpolation using splines. In yet another example, interpolating the pixel locations includes interpolation using finite elements. In yet another example, interpolating the pixel locations includes using motion of the nearest control point. In some examples, interpolating pixel locations using the individual control points includes using dense optical flow. In other examples, interpolating pixel locations using the individual control points includes using depth data.”, paragraph 0235). Holzer et al. fails to disclose identify one or more pixels of the video sequence having a similar depth value; and generate pixel values for one or more intermediate video frames between the first video frame and the second video frame using the identified one or more pixels.
Tran et al. discloses identify one or more pixels of the video sequence having a similar depth value; and generate pixel values for one or more intermediate video frames between the first video frame and the second video frame using the identified one or more pixels (fig. 12, paragraph 0079-0085 teaches “The method may begin at step 1210, where a convolutional neural network (CNN) is trained to perform video frame interpolation. This may include training the CNN to predict non-linear movements that occur over two or more consecutive image frames of a video stream…... At step 1240, the method may (optionally) include detecting, by each of one or more two-dimensional (2D) filters of the CNN, a respective image feature of interest in the stack of consecutive image frames….. At step 1250, method 1200 may include performing, for each channel, a 3D space-time convolution operation in which a 3D filter is passed over the stack of consecutive image frames in a width dimension common to each of the image frames in the stack of image frames, a height dimension common to each of the image frames in the stack of image frames, and a depth dimension representing the number of image frames in the stack of image frames, the convolution predicting any non-linear movement to be depicted in the resulting interpolated image frames. For example, generating the interpolated image frames may include generating, by respective convolutional layers of the CNN, image data for each of a plurality of channels of the interpolated image frames based on image data in one or more of the plurality of channels in the stack of consecutive image frames…… At step 1260, the method may include generating, by a two-dimensional (2D) prediction layer of the CNN based on a 3D output of the 3D space-time convolution, n 2D interpolated image frames…… At step 1270, method 1200 may include outputting a video stream in which the n 2D interpolated image frames are inserted between the image frames of the target pair.”, paragraph 0035 teaches “As used herein, the terms “feature” or “features” may refer to input data or output data associated with a convolution operation. In particular embodiments, the output of each layer of a convolutional neural network may be represented by features that no longer resemble the original input in content, size, and/or shape. For example, an input image including 10×10 pixels with RGB channels may be represented by 10×10×3 features. After one round of convolution, the output may be represented by 4×4×2 features that might or might not look like an image.”). It would have been obvious to one of ordinary skill in the art before the effective filling date of the claimed invention to incorporate to include generate pixel values for one or more intermediate video frames between the first video frame and the second video frame using the identified one or more pixels, as taught by Tran et al. into the system of Holzer et al., because such incorporation would allow for the benefit of increasing the frame rate for low frame rate video, thus improving video quality for the user during display of video. 
Regarding claim 2, the processor wherein the circuitry is to generate the one or more intermediate video frames by at least using the depth value to generate one or more pixels in at least one of the one or more intermediate video frames, wherein the one or more pixels lack one or more corresponding pixels in at least one of the first video frame and the second video frame (in addition to discussion above, Holzer et al., paragraph 0186-0187 teaches “In the present example, a layer is a region of image pixels that share a similar motion, based on factors such as depth and or speed of the corresponding real world object. In the present example, viewpoint-interpolation can be used to generate an artificially rendered image as described above with regard to FIG. 14. In addition, because different motions of the layers can create occlusions and/or regions with missing data, the process can also include methods for filling the regions corresponding to any detected occlusions. Whether data is occluded or missing is based on the ordering of layers. Although the layers are expected to stay in the same order, they can switch order if the objects change their depth location within the scene”; Tran et al., fig. 12 as discussed above). Regarding claim 3, the processor wherein the one or more intermediate video frames are to be generated using a neural network (in addition to discussion above, Holzer et al., paragraph 0077 teaches “In some embodiments, the system may include a neural network that is trained to recognize the object type of the object of interest to generate the user template. In some embodiments, the neural network may further be trained to detect and recognize particular features or components and prompt the user to capture corresponding specific view MIDMRs. The system may also automatically generate and place tags corresponding to the specific view MIDMRs. The tags may be automatically placed on a particular feature or component based on neural network recognition.”). Regarding claim 4, the processor wherein the circuitry are to blend the one or more intermediate video frames based, at least in part, on one or more blending factors (in addition to discussion above, Holzer et al., paragraph 0129 teaches “The information in the overlaps of the series of images, including interpolation information for generating artificial frames in between captured frames, is stored. Matching keypoints are identified to compute intermediate frames and linear blending is implemented to transform an image between two capture frames.”, paragraph 0212 teaches “Method 2100 may then be implemented for the transition between each image frame in the sequence. Various embodiments of method 2100 may provide advantages over existing methods of rendering artificial images, such as alpha blending. Especially in the case of concave MIDMRs, existing methods result in artifacts or ghosting effect from improperly aligned image frames.”, paragraph 0308 teaches “However, transformation T1-2_N may be applied to keyframe 2 to obtain pixel R1-2, and/or transformation T1-3_M may be applied to keyframe 3 to obtain pixel R1-3. The synthetic pixels R1-2 and R1-3 are then blended at 3021 to render a final rendering pixel, PR based on the following blending algorithm”, paragraph 0336). 
Regarding claim 5, the processor wherein the circuitry are to generate the one or more intermediate frames by at least using the depth value to, in an intermediate frame, select one or more pixels of the intermediate frame to use to generate one or more other pixels of the intermediate frame (in addition to discussion above, Holzer et al., paragraph 0129 teaches “In contrast, MIDMRs, as described herein, comprise a series of images that are presented to the user as a user interacts with the MIDMR or viewing device. The information in the overlaps of the series of images, including interpolation information for generating artificial frames in between captured frames, is stored. Matching keypoints are identified to compute intermediate frames and linear blending is implemented to transform an image between two capture frames. To compute intermediate frames, transformations are implemented, such as homography which may be used for stabilization, as well as scaling, which allows interpolated keypoints in images to match up. No part of any image frame is discarded. This causes parallax to be visible in MIDMRs generated by systems and methods described herein, in contrast to existing panoramas,”, paragraph 0227 teaches “As previously described, intermediate images can be rendered between image frames in a MIDMR. Such intermediate images correspond to viewpoints located between the viewpoints of the existing image frames. In some embodiments, stereo pairs may be generated for each of these intermediate images and presented to the user by method 2200.”; Tran et al., fig. 12 as discussed above). Regarding claim 6, the processor wherein the circuitry are to use the depth value to generate one or more filters to be used by the circuitry to calculate one or more pixels of the one or more intermediate frames based, at least in part, on one or more neighboring pixels of the one or more pixels (in addition to discussion above, Holzer et al., paragraph 0243-0244 teaches “If the calculated transformation T1 applied at step 2409 does result in the largest group of inlier keypoints, the image intensity difference between the first frame and the second frame is calculated at step 2415. In some embodiments, the calculated transformation T1 may be applied to every pixel in the first frame to warp the first frame onto the second frame, and an image intensity difference is calculated between the transformed first frame and the second frame. For example, an error or difference for each color channel is calculated between each corresponding pixel in the two frames. Then an average error of the RGB value of all color channels is derived for each pixel. Next, the average error for all pixels within the first and second frames is calculated. In some embodiments, the image intensity difference between two frames will be between 0 and 255. A lower image intensity difference value corresponds to a closer image match between frames, and consequently a more accurate transformation.”; Tran et al., fig. 12 as discussed above). Regarding claim 7, the processor wherein the one or more intermediate video frames are to be blended based, at least in part, on one or more motion types (in addition to discussion above, Holzer et al., paragraph 0182 teaches “With reference to FIG. 14, shown is an example of a process 1400 for generating an artificially rendered image from two frames. In this example, a transformation is estimated between a first frame and a second frame at 1402. 
The transformation is estimated using methods such as homography, affine, similarity, translation, rotation, or scale. As described above with regard to FIG. 4, the first frame includes a first image captured from a first location (Location A 422) and the second frame includes a second image captured from a second location (Location B 424). In some examples, the first image and second image are associated with the same layer. Next, an artificially rendered image is rendered corresponding to a third location. This third location is positioned on a trajectory between the first location and the second location, as shown as Location X 426 in FIG. 4.”, paragraph 0186-0187 teaches “In some examples, foreground and/or background can include one or more layer. In the present example, a layer is a region of image pixels that share a similar motion, based on factors such as depth and or speed of the corresponding real world object. In the present example, viewpoint-interpolation can be used to generate an artificially rendered image as described above with regard to FIG. 14. In addition, because different motions of the layers can create occlusions and/or regions with missing data, the process can also include methods for filling the regions corresponding to any detected occlusions. Whether data is occluded or missing is based on the ordering of layers.”). Regarding claim 8, a computer-implemented method comprising: identifying one or more pixels of a first video frame of a video sequence or a second video frame of the video sequence having a depth value; and generating for one or more intermediate video frames between the first video frame and the second video frame using identified one or more pixels (in addition to discussion above, paragraph 0190 teaches “In the present embodiment, an artificially rendered image is generated as intermediate frame X by tracking a set of control points between the first frame and the second frame, and interpolating individual control points for the third location X using the set of control points and interpolating pixel locations using the individual control points. As shown, the first frame includes a first image captured from a first location and the second frame includes a second image captured from a second location. In one example, interpolating the pixel locations includes interpolation using barycentric coordinates based on three or more control points. In another example, interpolating the pixel locations includes interpolation using splines. In yet another example, interpolating the pixel locations includes interpolation using finite elements. In yet another example, interpolating the pixel locations includes using motion of the nearest control point. In some examples, interpolating pixel locations using the individual control points includes using dense optical flow. In other examples, interpolating pixel locations using the individual control points includes using depth data.”, paragraph 0192 teaches “In the present embodiment an artificially rendered image is generated using this parallax effect. In particular, a set of control points (not shown) is moved perpendicular to a trajectory between a first frame and a second frame. In this example, the first frame includes a first image captured from a first location and the second frame includes a second image captured from a second location and each control point is moved based on an associated depth of the control point. 
Specifically, control points located at a further depth are moved less than control points at a closer depth in some examples. Additionally, in some applications, each pixel in the first image corresponds to a control point in the set of control points. Furthermore, in addition to having a concrete depth associated with a point, it is also possible to infer the depth from the amount of frame-to-frame motion of the control points.”). Holzer et al. fails to disclose identifying one or more pixels of the video sequence having a similar depth value; and generating pixel values for one or more intermediate video frames between the first video frame and the second video frame using identified one or more pixels. Tran et al. discloses identifying one or more pixels of the video sequence having a similar depth value; and generating pixel values for one or more intermediate video frames between the first video frame and the second video frame using identified one or more pixels (fig. 12, paragraph 0079-0085 teaches “The method may begin at step 1210, where a convolutional neural network (CNN) is trained to perform video frame interpolation. This may include training the CNN to predict non-linear movements that occur over two or more consecutive image frames of a video stream…... At step 1240, the method may (optionally) include detecting, by each of one or more two-dimensional (2D) filters of the CNN, a respective image feature of interest in the stack of consecutive image frames….. At step 1250, method 1200 may include performing, for each channel, a 3D space-time convolution operation in which a 3D filter is passed over the stack of consecutive image frames in a width dimension common to each of the image frames in the stack of image frames, a height dimension common to each of the image frames in the stack of image frames, and a depth dimension representing the number of image frames in the stack of image frames, the convolution predicting any non-linear movement to be depicted in the resulting interpolated image frames. For example, generating the interpolated image frames may include generating, by respective convolutional layers of the CNN, image data for each of a plurality of channels of the interpolated image frames based on image data in one or more of the plurality of channels in the stack of consecutive image frames…… At step 1260, the method may include generating, by a two-dimensional (2D) prediction layer of the CNN based on a 3D output of the 3D space-time convolution, n 2D interpolated image frames…… At step 1270, method 1200 may include outputting a video stream in which the n 2D interpolated image frames are inserted between the image frames of the target pair.”, paragraph 0035 teaches “As used herein, the terms “feature” or “features” may refer to input data or output data associated with a convolution operation. In particular embodiments, the output of each layer of a convolutional neural network may be represented by features that no longer resemble the original input in content, size, and/or shape. For example, an input image including 10×10 pixels with RGB channels may be represented by 10×10×3 features. After one round of convolution, the output may be represented by 4×4×2 features that might or might not look like an image.”). 
It would have been obvious to one of ordinary skill in the art before the effective filling date of the claimed invention to incorporate to include generating pixel values for one or more intermediate video frames between the first video frame and the second video frame using the identified one or more pixels, as taught by Tran et al. into the system of Holzer et al., because such incorporation would allow for the benefit of increasing the frame rate for low frame rate video, thus improving video quality for the user during display of video. Regarding claim 9, the computer-implemented method further comprising: generating one or more additional pixels based, at least in part, on the depth value; and adding the one or more additional pixels to at least one of the intermediate video frames (in addition to discussion above, Holzer et al., paragraph 0169 teaches “Thus, in some embodiments, outputs from intermediate convolution-pool layer pairs, which include higher resolutions, may be added together to refine the output of the segmenting neural network.”; Tran et al., fig. 12 as discussed above). Regarding claim 10, the computer-implemented method wherein the depth value is used to determine one or more pixels adjacent to the one or more pixels (in addition to discussion above, Holzer et al., paragraph 0211 teaches “At step 2109, the optimal transformation candidate is applied to each pixel. Each pixel in an image may experience a different transformation between frames. In some embodiments, each of the transformation candidates is applied to each pixel. The transformation candidate that results in the least difference between frames may be selected. In some embodiments, each of the transformation candidates is applied to a group, or “community,” of pixels. For example, a community of pixels may comprise a 7×7 (−3, +3) group of pixels. Once an optimal transformation is applied to each pixel, an artificial image may be rendered at step 2111. In various embodiments, steps 2109 and 2111 may be performed during runtime when the user is viewing the sequence of images. In such embodiments, the transformation may be a function of frame number of the frame between N and N+1.”; Tran et al., fig. 12 as discussed above). Regarding claim 11, the computer-implemented method further comprising generate at least some of the depth value based, at least in part, on optical flow between the first video frame and the second video frame (in addition to discussion above, Holzer et al., fig. 16, paragraph 0189-0190 teaches “As shown, the first frame includes a first image captured from a first location and the second frame includes a second image captured from a second location. In one example, interpolating the pixel locations includes interpolation using barycentric coordinates based on three or more control points. In another example, interpolating the pixel locations includes interpolation using splines. In yet another example, interpolating the pixel locations includes interpolation using finite elements. In yet another example, interpolating the pixel locations includes using motion of the nearest control point. In some examples, interpolating pixel locations using the individual control points includes using dense optical flow. In other examples, interpolating pixel locations using the individual control points includes using depth data.”; Tran et al., fig. 12 as discussed above). 
Regarding claim 12, the computer-implemented method further comprising: receiving one or more first motion vectors from the first video frame to the second video frame; generating one or more second motion vectors from the second video frame to the first video frame based, at least in part, on the first motion vectors; and generating the one or more intermediate video frames based, at least in part, on blending the first motion vectors and the second motion vectors (in addition to discussion above, Holzer et al., paragraph 0211 teaches “In various embodiments, steps 2109 and 2111 may be performed during runtime when the user is viewing the sequence of images. In such embodiments, the transformation may be a function of frame number of the frame between N and N+1. The number of frames between N and N+1 may be determined based on various considerations, such as the speed of movement and/or the distance between frames N and N+1. Because method 2100 may generate any number of frames between frames N and N+1, the user may perceive a smooth transition as the user view different viewpoints of the three-dimensional model of an object of interest, as an image frame may be rendered for virtually any viewpoint position the user is requesting to view. Furthermore, because the artificial image frames may be rendered based on the calculated transformation parameters, storage of such artificial image frames is not required. This enhances the functioning of image processing computer systems by reducing storage requirements.”). Regarding claim 13, the computer-implemented method generating the one or more intermediate video frames comprises generating the one or more intermediate video frames using a neural network (in addition to discussion above, Holzer et al., paragraph 0077 teaches “In some embodiments, the system may include a neural network that is trained to recognize the object type of the object of interest to generate the user template. In some embodiments, the neural network may further be trained to detect and recognize particular features or components and prompt the user to capture corresponding specific view MIDMRs. The system may also automatically generate and place tags corresponding to the specific view MIDMRs. The tags may be automatically placed on a particular feature or component based on neural network recognition.”). Regarding claim 14, the computer-implemented method further comprising generating a filter to be used to generate the one or more intermediate frames (in addition to discussion above, Holzer et al., paragraph 0243-0244 teaches “If the calculated transformation T1 applied at step 2409 does result in the largest group of inlier keypoints, the image intensity difference between the first frame and the second frame is calculated at step 2415. In some embodiments, the calculated transformation T1 may be applied to every pixel in the first frame to warp the first frame onto the second frame, and an image intensity difference is calculated between the transformed first frame and the second frame. For example, an error or difference for each color channel is calculated between each corresponding pixel in the two frames. Then an average error of the RGB value of all color channels is derived for each pixel. Next, the average error for all pixels within the first and second frames is calculated. In some embodiments, the image intensity difference between two frames will be between 0 and 255. 
A lower image intensity difference value corresponds to a closer image match between frames, and consequently a more accurate transformation.”). Regarding claim 15, a computer system comprising: one or more processors and memory storing instructions that, if performed by the one or more processors are to: identify one or more pixels of a first video frame of a video sequence or a second video frame of the video sequence having depth value; and generate one or more intermediate video frames using the identified one or more pixels (in addition to discussion above, paragraph 0190 teaches “In the present embodiment, an artificially rendered image is generated as intermediate frame X by tracking a set of control points between the first frame and the second frame, and interpolating individual control points for the third location X using the set of control points and interpolating pixel locations using the individual control points. As shown, the first frame includes a first image captured from a first location and the second frame includes a second image captured from a second location. In one example, interpolating the pixel locations includes interpolation using barycentric coordinates based on three or more control points. In another example, interpolating the pixel locations includes interpolation using splines. In yet another example, interpolating the pixel locations includes interpolation using finite elements. In yet another example, interpolating the pixel locations includes using motion of the nearest control point. In some examples, interpolating pixel locations using the individual control points includes using dense optical flow. In other examples, interpolating pixel locations using the individual control points includes using depth data.”, paragraph 0192 teaches “In the present embodiment an artificially rendered image is generated using this parallax effect. In particular, a set of control points (not shown) is moved perpendicular to a trajectory between a first frame and a second frame. In this example, the first frame includes a first image captured from a first location and the second frame includes a second image captured from a second location and each control point is moved based on an associated depth of the control point. Specifically, control points located at a further depth are moved less than control points at a closer depth in some examples. Additionally, in some applications, each pixel in the first image corresponds to a control point in the set of control points. Furthermore, in addition to having a concrete depth associated with a point, it is also possible to infer the depth from the amount of frame-to-frame motion of the control points.”). Holzer et al. fails to disclose identify one or more pixels of the video sequence having a similar depth value; and generate pixel values for one or more intermediate video frames between the first video frame and the second video frame using the identified one or more pixels. Tran et al. discloses identify one or more pixels of the video sequence having a similar depth value; and generate pixel values for one or more intermediate video frames between the first video frame and the second video frame using the identified one or more pixels (fig. 12, paragraph 0079-0085 teaches “The method may begin at step 1210, where a convolutional neural network (CNN) is trained to perform video frame interpolation. 
This may include training the CNN to predict non-linear movements that occur over two or more consecutive image frames of a video stream…... At step 1240, the method may (optionally) include detecting, by each of one or more two-dimensional (2D) filters of the CNN, a respective image feature of interest in the stack of consecutive image frames….. At step 1250, method 1200 may include performing, for each channel, a 3D space-time convolution operation in which a 3D filter is passed over the stack of consecutive image frames in a width dimension common to each of the image frames in the stack of image frames, a height dimension common to each of the image frames in the stack of image frames, and a depth dimension representing the number of image frames in the stack of image frames, the convolution predicting any non-linear movement to be depicted in the resulting interpolated image frames. For example, generating the interpolated image frames may include generating, by respective convolutional layers of the CNN, image data for each of a plurality of channels of the interpolated image frames based on image data in one or more of the plurality of channels in the stack of consecutive image frames…… At step 1260, the method may include generating, by a two-dimensional (2D) prediction layer of the CNN based on a 3D output of the 3D space-time convolution, n 2D interpolated image frames…… At step 1270, method 1200 may include outputting a video stream in which the n 2D interpolated image frames are inserted between the image frames of the target pair.”, paragraph 0035 teaches “As used herein, the terms “feature” or “features” may refer to input data or output data associated with a convolution operation. In particular embodiments, the output of each layer of a convolutional neural network may be represented by features that no longer resemble the original input in content, size, and/or shape. For example, an input image including 10×10 pixels with RGB channels may be represented by 10×10×3 features. After one round of convolution, the output may be represented by 4×4×2 features that might or might not look like an image.”). It would have been obvious to one of ordinary skill in the art before the effective filling date of the claimed invention to incorporate to include generate pixel values for one or more intermediate video frames between the first video frame and the second video frame using the identified one or more pixels, as taught by Tran et al. into the system of Holzer et al., because such incorporation would allow for the benefit of increasing the frame rate for low frame rate video, thus improving video quality for the user during display of video. Regarding claim 16, the computer system wherein the one or more intermediate video frames are to be blended based, at least in part, on one or more motions of dynamic objects displayed in at least one of the first video frame and the second video frame (in addition to discussion above, Holzer et al., paragraph 0063 teaches “According to various embodiments, a multi-view interactive digital media (MIDM) is used herein to describe any one of various images (or other media data) used to represent a dynamic surrounding view of an object of interest and/or contextual background. Such dynamic surrounding view may be referred to herein as multi-view interactive digital media representation (MIDMR).”, paragraph 0089 teaches “According to various examples, computer vision techniques can also be used to provide stabilization for MIDM. 
For instance, keypoints can be detected and tracked. However, in certain scenes, such as a dynamic scene or static scene with parallax, no simple warp exists that can stabilize everything. Consequently, there is a trade-off in which certain aspects of the scene receive more attention to stabilization and other aspects of the scene receive less attention. Because MIDM is often focused on a particular object of interest, MIDM can be content-weighted so that the object of interest is maximally stabilized in some examples.”). Regarding claim 17, the computer system wherein the one or more intermediate video frames are to be blended based, at least in part, on a first viewpoint location of the first video frame and a second viewpoint location of the second video frame (in addition to discussion above, Holzer et al., paragraph 0110 teaches “According to various methods described herein, any number of image frames may be rendered between two keyframes of the MIDMR based on the location of the user and/or viewing device, which provides a viewpoint at any location around an object of interest, as well as a smooth transition between keyframes.”, paragraph 0124 teaches “In some embodiments Frame X 406 may be an artificially rendered image that is generated for a viewpoint at Location X 426 on a trajectory between Location A 422 and Location B 424. In such example, a single transform for viewpoint interpolation is used along the trajectory between two frames, Frame A 402 and Frame B 404.”). Regarding claim 18, the computer system wherein one or more intermediate video frames are to be blended based, at least in part, on one or more static objects displayed in at least one of the first video frame and the second video frame (in addition to discussion above, Holzer et al., paragraph 0098 teaches “In some examples, filters can also be used during capture or generation of MIDM to enhance the viewing experience of the MIDMR. Just as many popular photo sharing services provide aesthetic filters that can be applied to static, two-dimensional images, aesthetic filters can similarly be applied to surround images. However, because a MIDMR is more expressive than a two-dimensional image, and three-dimensional information is available in MIDM, these filters can be extended to include effects that are ill-defined in two dimensional photos.”, fig. 17, paragraph 0191-0192 teaches “Although not shown, two frames capturing images of the scene 1708 can be obtained by moving a camera along camera motion 1716. Comparing the two images will show parallax motion in the direction 1712. As shown in the figure, the amount of parallax motion 1710 of each of the layers can indicate movement or differences between the frames. Although not indicated in the figure, the different layers 1702, 1704, and 1706 can each have different amounts of parallax motion. In the present embodiment an artificially rendered image is generated using this parallax effect. In particular, a set of control points (not shown) is moved perpendicular to a trajectory between a first frame and a second frame. In this example, the first frame includes a first image captured from a first location and the second frame includes a second image captured from a second location and each control point is moved based on an associated depth of the control point.”). 
Regarding claim 19, the computer system wherein the instructions comprise instructions to generate, in at least one of the intermediate video frames, at least one pixel that lacks a corresponding pixel in at least one of the first video frame and second video frame (in addition to discussion above, Holzer et al., paragraph 0186-0187 teaches “In the present example, a layer is a region of image pixels that share a similar motion, based on factors such as depth and or speed of the corresponding real world object. In the present example, viewpoint-interpolation can be used to generate an artificially rendered image as described above with regard to FIG. 14. In addition, because different motions of the layers can create occlusions and/or regions with missing data, the process can also include methods for filling the regions corresponding to any detected occlusions. Whether data is occluded or missing is based on the ordering of layers. Although the layers are expected to stay in the same order, they can switch order if the objects change their depth location within the scene”).

Regarding claim 20, the computer system wherein the instructions to cause the one or more processors to generate the one or more intermediate video frames, if performed by the one or more processors, cause the one or more processors to use the depth value to calculate one or more pixel values of the one or more intermediate video frames based, at least in part, on a plurality of other pixels of the one or more intermediate video frames (in addition to discussion above, Holzer et al., paragraph 0243-0244 teaches “If the calculated transformation T1 applied at step 2409 does result in the largest group of inlier keypoints, the image intensity difference between the first frame and the second frame is calculated at step 2415. In some embodiments, the calculated transformation T1 may be applied to every pixel in the first frame to warp the first frame onto the second frame, and an image intensity difference is calculated between the transformed first frame and the second frame. For example, an error or difference for each color channel is calculated between each corresponding pixel in the two frames. Then an average error of the RGB value of all color channels is derived for each pixel. Next, the average error for all pixels within the first and second frames is calculated. In some embodiments, the image intensity difference between two frames will be between 0 and 255. A lower image intensity difference value corresponds to a closer image match between frames, and consequently a more accurate transformation.”; Tran et al., fig. 12 as discussed above).

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to NIGAR CHOWDHURY whose telephone number is (571)272-8890. The examiner can normally be reached Monday-Friday 9AM-5PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Thai Tran, can be reached at 571-272-7382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/NIGAR CHOWDHURY/
Primary Examiner, Art Unit 2484
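
For readers working through the claim language above, the sketch below illustrates the kind of depth-guided frame blending recited in claims 1, 4, and 8: identify pixels whose depth is similar across the two input frames, then blend them into an intermediate frame under a blending factor. It is a minimal illustration only; the function, its parameters, the depth-similarity threshold, and the occlusion fallback are assumptions, not the applicant's implementation or the methods of Holzer or Tran.

```python
import numpy as np

def blend_intermediate_frame(frame_a, frame_b, depth_a, depth_b,
                             t=0.5, depth_tol=0.05):
    """Illustrative sketch of depth-guided frame blending.

    frame_a, frame_b : (H, W, 3) float arrays, two consecutive video frames
    depth_a, depth_b : (H, W) float arrays, per-pixel depth for each frame
    t                : blending factor / temporal position of the
                       intermediate frame (0 = frame_a, 1 = frame_b)
    depth_tol        : how close two depth values must be to count as
                       "similar" (assumed threshold; the claims define none)
    """
    # Identify pixels whose depth is similar across the two frames.
    similar = np.abs(depth_a - depth_b) < depth_tol

    # Linear (alpha) blend for pixels with similar depth.
    blended = (1.0 - t) * frame_a + t * frame_b

    # Where depths differ (likely occlusion/disocclusion), fall back to the
    # frame whose content is nearer the camera -- one possible choice.
    nearer_a = depth_a < depth_b
    fallback = np.where(nearer_a[..., None], frame_a, frame_b)

    # Compose the intermediate frame from blended and fallback pixels.
    return np.where(similar[..., None], blended, fallback)
```

Swept over t from 0 to 1, this produces intermediate frames between the two inputs; practical interpolators layer motion compensation, occlusion filling, and (as in Tran) learned CNN predictors on top of this basic blend.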

Prosecution Timeline

Sep 20, 2022: Application Filed
May 16, 2025: Non-Final Rejection — §103
Aug 18, 2025: Applicant Interview (Telephonic)
Aug 18, 2025: Examiner Interview Summary
Nov 21, 2025: Response Filed
Mar 13, 2026: Final Rejection — §103 (current)

Precedent Cases

Applications involving similar technology granted by the same examiner

Patent 12601856
DEVICE AND METHOD FOR MULTI-ANGLE STEREOSCOPIC IMAGING MEASUREMENT OF PRECIPITATION PARTICLES
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12600311
VIDEO RECORD SYSTEM FOR VEHICLE, METHOD OF CONTROLLING THE VIDEO RECORD SYSTEM, AND USER TERMINAL
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12604071
SYSTEM AND METHOD FOR GENERATING A CUSTOM SUMMARY OF UNCONSUMED PORTIONS OF A SERIES OF MEDIA ASSETS
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12592260
VIDEO GENERATION METHOD AND APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM
Granted Mar 31, 2026 (2y 5m to grant)
Patent 12591167
ENCLOSURES FOR ACCOMMODATING BOARD STACKS
Granted Mar 31, 2026 (2y 5m to grant)
Study what changed in these applications to get them past this examiner. Based on the examiner's 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 69%
With Interview: 86% (+17.3%)
Median Time to Grant: 3y 3m
PTA Risk: Moderate

Based on 713 resolved cases by this examiner. Grant probability derived from career allow rate.
