DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The following prior art references are considered pertinent to applicant's disclosure.
HAO CHEN ET AL.: "NeRV: Neural Representations for Videos", arXiv.org, Cornell University Library, 26 October 2021 (Chen)
US 20200356779 A1 (Ye779)
MILDENHALL BEN ET AL.: "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", arXiv.org, pages 405-421 (Ben)
US 9552623 B1 (Cheng)
MARTIN BOHME ET AL.: "Gaze-contingent temporal filtering of video", Eye Tracking Research & Applications, ACM, 27 March 2006 (2006-03-27), pages 109-115 (Bohme)
US 20210142520 A1 (paras. 60-61, 73, and 78-79 teach determining one or more increased spatial resolution areas)
BARRON JONATHAN T. ET AL.: "Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields", 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 10 October 2021 (2021-10-10), pages 5835-5844
Claim Objection (Allowable Subject Matter)
Claims 5-8 and 14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is an examiner’s statement of reasons for allowability:
The primary reason for allowance of claim 5 is that, while US 20210142520 A1 (paras. 60-61, 73, and 78-79) teaches determining one or more increased spatial resolution areas of the image frame to be rendered at a higher spatial resolution than one or more other areas of the image frame, rendering the one or more increased spatial resolution areas at the higher spatial resolution by rendering an increased area density of pixels for the one or more increased spatial resolution areas, and gaze tracking for a region of interest, the prior art fails to teach "rendering the increased area density of pixels comprising determining an increased number of viewing directions for the increased spatial resolution areas, the increased number of viewing directions corresponding to an increased density of rays into the scene compared to the one or more other areas of the image frame, the increased density of rays into the scene defining a reduced angular difference between the corresponding viewing directions." The other objected-to claims depend from claim 5.
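For context only, and not as a characterization of applicant's disclosure or of any cited reference, the following minimal Python sketch (all function names and numerical values are hypothetical) illustrates the kind of increased ray density recited in claim 5, in which an increased spatial resolution area is sampled with more viewing directions, and hence a reduced angular difference between neighboring rays, than the other areas of the image frame:

    import numpy as np

    def ray_directions(fov_deg, n_rays):
        # Evenly spaced viewing directions over a horizontal field of view
        # (1D slice for brevity); the angular difference between neighboring
        # rays is fov_deg / (n_rays - 1).
        angles_rad = np.radians(np.linspace(-fov_deg / 2.0, fov_deg / 2.0, n_rays))
        return np.stack([np.sin(angles_rad),
                         np.zeros(n_rays),
                         np.cos(angles_rad)], axis=-1)

    # Hypothetical numbers: the gazed region of interest spans 10 degrees and is
    # sampled with about 20 rays per degree, while the 90-degree periphery is
    # sampled with about 5 rays per degree, i.e. the region of interest has an
    # increased density of rays into the scene and a reduced angular difference
    # between the corresponding viewing directions.
    roi_rays = ray_directions(fov_deg=10.0, n_rays=200)        # ~0.05 deg between rays
    periphery_rays = ray_directions(fov_deg=90.0, n_rays=450)  # ~0.2 deg between rays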
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 16, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Ye779.
Regarding claim 1: Chen teaches a computer-implemented method of decoding video, comprising [(Chen, section 4.3)]:
obtaining encoded video data for a video comprising a plurality of video frames [(see 1st paragraph of section 4.1, where the UVG sequence consists of 7 videos with 3900 frames in total)], wherein the video frames comprise a sequence of sets of video frames, each set of video frames comprising the video frames [(see 1st paragraph of section 4.1, where each of the 7 UVG videos corresponds to a set of video frames, and see 2nd paragraph of section 4.3, "Compare with state-of-the-arts methods", where the 7 video sequences are concatenated; the first video frame of each video is typically considered a key frame)],
wherein the encoded video data for each set of video frames comprises parameters of a scene representation neural network that encodes the video frames [(see section 3.2, where the model parameters of the NeRV representation of the video of section 3.1 are encoded, and see 2nd paragraph of section 4.3, "Compare with state-of-the-arts methods", where two embodiments are disclosed: a) the NeRV representation is trained on all the frames of the 7 videos, and b) the NeRV representation is trained for each of the 7 videos, i.e. each subset of video frames has its own parameters)], and
wherein the scene representation neural network is configured to receive a representation of a frame time, and to process the representation of the frame time to generate a scene representation output for rendering an image that depicts a scene encoded by the parameters of the scene representation neural network at the frame time [(see "1. Introduction", 1st and 2nd paragraphs: the video depicts a scene, which is represented by the NeRV neural network; the scene/video is represented over a frame duration T, with the frame time t lying between 1 and T; see figure 2b and the 1st paragraph of section 2.1, where the input of the function is a frame index t and the output is the RGB image; and see the sub-section "Network Architecture" in section 3.1)];
and for each set of video frames in the encoded video data:
processing a representation of each of a set of frame times between the respective pair of key frames for the set of video frames, using the scene representation neural network configured with the parameters for the set of video frames, to generate the scene representation output for each of the frame times [(see 2nd paragraph of section 4.3, "Compare with state-of-the-arts methods", where the NeRV representation is trained for each of the 7 videos, i.e. each subset of video frames has its own parameters, and see section 3.1, where for each of the 7 videos, i.e. for each set, the input of the function is a frame index t corresponding to the set and the output is the RGB image)];
and rendering a set of image frames, one for each of the frame times, using the scene representation output for each of the frame times, wherein the set of image frames provides the decoded video [("Video encoding in NeRV is simply fitting a neural network to video frames and decoding process is a simple feedforward operation", Abstract; Fig. 9 shows the reconstructed, i.e. decoded, video)].
Chen does not explicitly show the video frames being between a respective pair of key frames, with the frame time defining a time between the respective pair of key frames; instead, Chen shows 7 videos concatenated.
However, in the same/related field of endeavor, Ye779 teaches that the first frame of each video is a key frame [(Ye779, para. 72)]. Given that Chen teaches the 7 videos are concatenated (section 4.3), the frames of each video lie between a respective pair of key frames, and each frame time defines a time between the respective pair of key frames.
Therefore, in light of the above discussion, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of the prior art references, because such a combination would have provided predictable results with no change to their respective functionalities.
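Purely as an illustration of the mapping above, and not as a representation of Chen's actual network architecture, the following Python sketch (all class, function, and parameter names are hypothetical) shows a NeRV-like decode in which the encoded video data for a set of video frames is simply the parameters of a network that maps a frame time to a rendered image frame:

    import torch

    class SceneRepresentationNet(torch.nn.Module):
        # Hypothetical stand-in for a NeRV-like model: a frame time t in [0, 1]
        # is mapped to a full RGB image frame (cf. Chen, section 3.1).
        def __init__(self, height=72, width=128):
            super().__init__()
            self.height, self.width = height, width
            self.mlp = torch.nn.Sequential(
                torch.nn.Linear(1, 256), torch.nn.GELU(),
                torch.nn.Linear(256, 3 * height * width), torch.nn.Sigmoid())

        def forward(self, t):
            out = self.mlp(torch.tensor([[float(t)]]))
            return out.view(3, self.height, self.width)  # scene representation output / RGB frame

    def decode_set(parameters, frame_times):
        # The "encoded video data" for this set of frames is the parameter dict;
        # decoding is a feedforward pass per frame time (cf. Chen, Abstract).
        net = SceneRepresentationNet()
        net.load_state_dict(parameters)
        with torch.no_grad():
            return [net(t) for t in frame_times]  # one rendered image frame per frame time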
Regarding claim 16: See the analysis of claim 1, and note that Chen, section 4.3, "Compare with state-of-the-arts methods", teaches that the steps are used in training an encoder.
Regarding claim 19: See the analysis of claim 1, and note that Ye779, para. 5, teaches the implementation using one or more computers and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that are executed by the one or more computers.
Claims 2-4, 17, and 21-23 are rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Ye779, further in view of Ben.
Regarding claim 2: Chen in view of Ye779 does not explicitly show that the neural network is further configured to receive a representation of a viewing direction, and to process the representation of the frame time and the representation of the viewing direction to generate the scene representation output for rendering the image;
the method further comprising rendering each of the image frames by:
determining a viewing direction for each of a plurality of pixels in the image frame, wherein the viewing direction for a pixel corresponds to a direction of a ray into the scene from the pixel;
for each pixel in the image frame, processing the representation of the frame time and the representation of the viewing direction for the pixel, using the scene representation neural network, to generate the scene representation output for the pixel and the frame time;
and rendering the image frame for the frame time using the scene representation outputs for the pixels of the image frame.
However, in the same/related field of endeavor, Ben teaches that a neural network is further configured to receive a representation of a viewing direction, and to process the representation of the frame time and the representation of the viewing direction to generate the scene representation output for rendering the image [(Ben, section 1, in combination with Chen)];
the method further comprising rendering each of the image frames by:
determining a viewing direction for each of a plurality of pixels in the image frame, wherein the viewing direction for a pixel corresponds to a direction of a ray into the scene from the pixel [(Ben, section 3 and Fig. 2; see also section 4)];
for each pixel in the image frame, processing the representation of the frame time and the representation of the viewing direction for the pixel, using the scene representation neural network to generate the scene representation output for the pixel and the frame time [(Chen in combination with the above sections of Ben)];
and rendering the image frame for the frame time using the scene representation outputs for the pixels of the image frame [(Ben, section 1)].
Therefore, in light of the above discussion, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of the prior art references, because such a combination would have provided predictable results with no change to their respective functionalities.
Claim 3. The method of claim 2, wherein the parameters of the scene representation neural network encode a representation of the scene over a three-dimensional spatial volume, and wherein the scene representation neural network is further configured to receive a representation of a spatial location in the scene, and to process the representation of the frame time, the representation of the viewing direction, and the representation of the spatial location to generate the scene representation output, and wherein the scene representation output defines a light level emitted from the spatial location along the viewing direction and an opacity at the spatial location [(Ben, section 1)];
and wherein rendering each of the image frames comprises, for each pixel in the image frame:
determining a plurality of spatial locations along the ray into the scene from the pixel [(Ben, sections 3 and 4)];
for each of the spatial locations, processing the representation of the frame time, the representation of the viewing direction, and the representation of the spatial location, using the scene representation neural network, to generate the scene representation output, wherein the scene representation output defines the light level emitted from the spatial location along the viewing direction and the opacity at the spatial location [(Ben, sections 2 and 3)];
and combining, for the spatial locations along the ray, the light level emitted from the spatial location along the viewing direction and the opacity at the spatial location, to determine a pixel value for the pixel in the image frame [(Ben, sections 2 and 3)].
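For illustration only, and without asserting that this is the exact formulation of any cited reference, the following Python sketch (function names and sample values are hypothetical) shows the conventional volumetric compositing that the "combining" step of claim 3 refers to, in which the light level emitted at each sampled spatial location along a ray is weighted by opacity to produce a pixel value (consistent with Ben, sections 2-3):

    import numpy as np

    def composite_along_ray(colors, densities, deltas):
        # colors:    (N, 3) light level emitted at each sampled spatial location
        # densities: (N,)   opacity (volume density) at each sampled location
        # deltas:    (N,)   spacing between consecutive samples along the ray
        alphas = 1.0 - np.exp(-densities * deltas)                  # per-sample opacity
        transmittance = np.cumprod(
            np.concatenate(([1.0], 1.0 - alphas[:-1])))            # light surviving to each sample
        weights = transmittance * alphas
        return (weights[:, None] * colors).sum(axis=0)              # pixel value for the image frame

    # Hypothetical usage: 64 spatial locations along the ray from one pixel.
    rng = np.random.default_rng(0)
    pixel = composite_along_ray(rng.random((64, 3)), rng.random(64), np.full(64, 0.05))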
Claim 4. The method of claim 2, wherein the set of image frames comprise image frames defined on a concave 2D surface, and wherein the viewing direction for a pixel corresponds to a direction of a ray outwards from a point of view for the decoded video that is within the three-dimensional spatial volume [(Ben, section 6.1, on a full sphere)].
Claim 17. The method of claim 16, wherein the encoder scene representation neural network is configured to receive a representation of the source frame time, and to process the representation of the source frame time to generate the encoder scene representation output for rendering an image that depicts a scene encoded by the parameters of the encoder scene representation neural network at the source frame time [(Chen, see section 3.2, where the model parameters of the NeRV representation of the video of section 3.1 are encoded, and see 2nd paragraph of section 4.3, "Compare with state-of-the-arts methods", where two embodiments are disclosed: a) the NeRV representation is trained on all the frames of the 7 videos, and b) the NeRV representation is trained for each of the 7 videos, i.e. each subset of video frames has its own parameters)];
and wherein the training comprises, for each of the source video frames
processing the representation of the source frame time using the encoder scene representation neural network to generate the encoder scene representation output for the source frame time [(Chen, see section 3.2, where the model parameters of the NeRV representation of the video of section 3.1 are encoded, and see 2nd paragraph of section 4.3, "Compare with state-of-the-arts methods", where two embodiments are disclosed: a) the NeRV representation is trained on all the frames of the 7 videos, and b) the NeRV representation is trained for each of the 7 videos, i.e. each subset of video frames has its own parameters)];
rendering the image that depicts the scene at the source frame time using the encoder scene representation output for the source frame time [(Chen, Fig. 2)];
and updating the parameters of the encoder scene representation neural network using an objective function that characterizes an error between the source video frame and the rendered image that depicts the scene at the source frame time [(Ben, "Training Details", page 17, optimization; and page 2: "to optimize this model by minimizing the error between each observed image and the corresponding views rendered from our representation")].
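As a minimal, purely illustrative Python sketch of the training step mapped above (the function and parameter names are hypothetical and the sketch is not taken from any cited reference), the parameters of an encoder scene representation network can be updated with an objective that penalizes the error between each source video frame and the image rendered for that source frame time:

    import torch

    def train_encoder(net, source_frames, frame_times, steps=1000, lr=1e-3):
        # net: an encoder scene representation neural network (torch.nn.Module)
        #      mapping a frame time to a rendered image frame.
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(steps):
            for frame, t in zip(source_frames, frame_times):
                rendered = net(t)                           # image depicting the scene at time t
                loss = torch.mean((rendered - frame) ** 2)  # error between source and rendered frame
                opt.zero_grad()
                loss.backward()
                opt.step()
        return net.state_dict()  # the trained parameters form the encoded video data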
Regarding claims 21-23: See the analysis of claims 2-4.
Claims 9-12 are rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Ye779, further in view of Cheng.
Regarding claim 10: Chen in view of Ye779 does not explicitly show determining the frame times dependent upon a metric of a rate of change of content of the video, such that a time interval between successive image frames of the decoded video is decreased when the metric indicates an increased rate of change.
However, in the same/related field of endeavor, Cheng teaches determining the frame times dependent upon a metric of a rate of change of content of the video, such that a time interval between successive image frames of the decoded video is decreased when the metric indicates an increased rate of change [(Cheng, column 4, lines 49-59; column 5, lines 21-29; column 3, lines 1-10)].
Therefore, in light of the above discussion, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of the prior art references, because such a combination would have provided predictable results with no change to their respective functionalities.
Claim 11. The method of claim 10, wherein determining the frame times comprises
rendering a first set of image frames, one for each of a first set of frame times
in response to the metric, determining one or more additional frame times temporally between successive frame times of the first set of frame times, and rendering one or more additional image frames corresponding to the additional frame times by processing representations of the additional frame times using the scene representation neural network [(Cheng, column 4, lines 49-59; column 5, lines 21-29)].
Claim 12. The method of claim 11, wherein rendering one or more additional image frames corresponding to the additional frame times comprises rendering only part of the additional image frames that is determined by the metric to have an increased rate of change [(Cheng, column 5, lines 35-39)].
Claim 9. The method of claim 1, further comprising
determining a frame rate for the video and determining the number of video frames in each set of video frames dependent upon the frame rate [(Cheng, column 4, lines 49-59; column 5, lines 21-29; column 3, lines 1-10)].
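For illustration only (the function names, the particular metric, and the threshold value are hypothetical and not drawn from Cheng), the following Python sketch shows one way frame times could be refined in response to a rate-of-change metric, decreasing the interval between decoded frames where the content changes faster, as in claims 10-11:

    import numpy as np

    def refine_frame_times(frame_times, rendered_frames, threshold=0.1):
        # frame_times:     sorted list of frame times for the first set of image frames
        # rendered_frames: the image frames rendered at those times (NumPy arrays)
        extra_times = []
        for i in range(len(frame_times) - 1):
            # Simple rate-of-change metric: mean absolute difference between frames.
            change = np.mean(np.abs(rendered_frames[i + 1] - rendered_frames[i]))
            if change > threshold:
                # Add an additional frame time between the successive frame times;
                # the additional image frame would then be rendered by processing a
                # representation of this new time with the scene representation network.
                extra_times.append(0.5 * (frame_times[i] + frame_times[i + 1]))
        return sorted(list(frame_times) + extra_times)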
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Ye779, further in view of Bohme.
Regarding claim 13: Chen in view of Ye779 does not explicitly show obtaining a gaze direction for an observer of the image frames, and rendering part of one or more additional image frames corresponding to additional frame times temporally between successive frame times of the set of frame times, by processing representations of the additional frame times using the scene representation neural network,
wherein the part of the one or more additional image frames comprises at least a part to which the observer is directing their gaze.
However, in the same/related field of endeavor, Bohme teaches obtaining a gaze direction for an observer of the image frames [(Bohme, section 2, 1st paragraph)] and rendering part of one or more additional image frames corresponding to additional frame times temporally between successive frame times of the set of frame times [(Bohme, Figure 2 and section 2.5)],
wherein the part of the one or more additional image frames comprises at least a part to which the observer is directing their gaze [(Bohme, section 2, 1st paragraph)].
Therefore, in light of the above discussion, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of the prior art references, because such a combination would have provided predictable results with no change to their respective functionalities.
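Finally, as a purely illustrative Python sketch of the gaze-contingent rendering discussed for claim 13 (the function signature and region size are hypothetical), only the region around the observer's gaze point is rendered at an additional, temporally intermediate frame time, with the remainder copied from the nearest already-rendered frame:

    def render_gazed_region(render_fn, t_extra, nearest_frame, gaze_xy, radius=64):
        # render_fn(t, y0, y1, x0, x1) stands in for evaluating the scene
        # representation neural network only for the pixels in the gazed region.
        # nearest_frame: already-rendered frame (NumPy array, H x W x 3) nearest in time.
        h, w = nearest_frame.shape[:2]
        x0, x1 = max(gaze_xy[0] - radius, 0), min(gaze_xy[0] + radius, w)
        y0, y1 = max(gaze_xy[1] - radius, 0), min(gaze_xy[1] + radius, h)
        frame = nearest_frame.copy()
        frame[y0:y1, x0:x1] = render_fn(t_extra, y0, y1, x0, x1)  # render only the gazed part
        return frame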
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Shahan Rahaman, whose telephone number is (571) 270-1438. The examiner can normally be reached from 7:00 am to 3:30 pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Nasser Goodarzi can be reached at telephone number (571) 272-4195. The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300.
Information regarding the status of published applications may be obtained from Patent Center; status information for unpublished applications is available through Patent Center for authorized users only. Should you have questions about access to Patent Center, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) Form at https://www.uspto.gov/patents/uspto-automated-interview-request-air-form.
/SHAHAN UR RAHAMAN/Primary Examiner, Art Unit 2426