Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawing(s) filed on 02/21/2023 are accepted by the Examiner.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/12/2025 has been entered.
Response to Amendment
Applicant’s amendments filed on 11/12/2025 have been entered and made of record.
Currently pending Claim(s): 1–12, 14–21
Independent Claim(s): 1, 11, 20 and 21
Amended Claim(s): 1, 11, 20 and 21
Canceled Claim(s): 13
Response to Applicant’s Arguments
This Office action is responsive to Applicant’s Arguments/Remarks made in an amendment received on 11/12/2025.
Applicant’s Arguments/Remarks and amendments filed on 11/12/2025 with respect to the rejections of independent claims 1, 11, 20 and 21 under 35 U.S.C. 103 have been fully considered, but the arguments are not found to be persuasive (see pages 9–12). The rejections under 35 U.S.C. 103 are therefore maintained.
Applicant argues, in summary, that the applied prior art (Andrei) does not disclose or suggest (see pages 9–12):
“performing the second training stage in conjunction with training another machine learning model”
However, the Examiner respectfully disagrees with the Applicant’s line of reasoning. The Examiner has thoroughly reviewed the Applicant’s arguments but respectfully believes that the cited references reasonably and properly meet the claimed limitations.
Under the broadest reasonable interpretation, the phrase “in conjunction with” does not require simultaneity (i.e., it does not require both models to be updated at the exact same time), but rather includes co-training as part of the same training stage. Andrei expressly teaches in ¶ [0065] that, in a second stage, “the discriminator … may also be enabled” and that “the adversarial loss is used in addition to MSE and perceptual losses for training” (i.e., the enhancement model is trained using a combined loss that includes discriminator-derived adversarial loss while the discriminator is part of the training process). Further, Andrei in ¶ [0058] explains that “the discriminator … and generator … are trained in turns by first training one and then the other,” which is a conventional form of adversarial co-training in which both models are trained as part of the same training regimen, even if updates alternate rather than occur simultaneously. Accordingly, Andrei teaches performing a post-first-stage training phase in which the generator is trained with discriminator-related loss while the discriminator is trained as well, meeting the claim’s “in conjunction with training another machine learning model that distinguishes…” limitation. Moreover, Andrei in ¶ [0057] teaches that “the discriminator 613 is to compare the ‘fake’ image to a corresponding higher resolution image from a higher resolution dataset 603 to determine if the image is ‘real’ or ‘fake’” (i.e., it evaluates whether the generated image belongs in the training dataset of higher resolution images or not). The discriminator thus performs the claimed distinguishing function by comparing the generated “fake” image to a higher resolution “real” image to determine whether the image is real or fake.
Therefore, under this broad interpretation, Reda in combination with Andrei teaches, discloses, or suggests the Applicant’s invention: temporally aligning a first video to a second video and producing a visually enhanced version of the first video using a machine learning model, wherein the machine learning model is trained with one or more losses between the output frame and a target frame, including performing a first training stage for the machine learning model using the one or more losses and a second training stage in conjunction with training another model using a combination of the one or more losses and the one or more additional losses. Thus, owing to the Applicant’s broad claim language, the Applicant’s invention is not far removed from the art of record. Accordingly, these limitations do not render the claims patentably distinct over the prior art of record. As a result, it is respectfully submitted that the present application is not in condition for allowance.
Thus, the Examiner maintains that the limitations as presented and as rejected were properly and adequately met. The rejection as presented in the Non-Final rejection is maintained with respect to the above limitation. Additional and/or modified citations may be provided to more concisely address the limitations; however, the grounds of rejection remain the same.
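Note: for purposes of illustration only, the turn-based adversarial co-training discussed above may be sketched in PyTorch-style code. The sketch below is the Examiner’s own illustration under assumed, hypothetical module and optimizer names (generator, discriminator, g_opt, d_opt); it is not code from Andrei or from the instant application, and the perceptual loss term is folded into the reconstruction term for brevity.

```python
# Illustrative sketch of turn-based adversarial co-training
# (cf. Andrei ¶¶ [0058], [0065]). All names are hypothetical.
import torch
import torch.nn.functional as F

def second_stage_step(generator, discriminator, g_opt, d_opt, lr_frame, hr_frame):
    # Turn 1: update the discriminator to score real frames 1 and fakes 0.
    d_opt.zero_grad()
    fake = generator(lr_frame).detach()
    d_real, d_fake = discriminator(hr_frame), discriminator(fake)
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    d_loss.backward()
    d_opt.step()

    # Turn 2: update the generator with a combined loss, reconstruction
    # (MSE) plus the discriminator-derived adversarial term. Both models
    # are trained within the same stage, albeit in alternation.
    g_opt.zero_grad()
    fake = generator(lr_frame)
    d_fake = discriminator(fake)
    adv_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    g_loss = F.mse_loss(fake, hr_frame) + adv_loss
    g_loss.backward()
    g_opt.step()
```

As the sketch shows, neither model’s update is simultaneous with the other’s, yet the generator’s second-stage loss depends on the discriminator at every step.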
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.
Claim(s) 1–3, 6–8, 11–12, 17–18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Reda et al. (US 2021/0067735 A1, hereafter, "Reda") in view of Xie et al. (CN 107465940 A, hereafter, "Xie") further in view of Andrei et al. (US 2020/0349681 A1, hereafter, “Andrei”).
Regarding claim 1, Reda discloses a computer-implemented method for performing remastering of video content, the computer-implemented method comprising (See Reda, [Abstract], Apparatuses, systems, and techniques to enhance video):
[determining a first input frame corresponding to a first frame included in a first video and a first target frame corresponding to a second frame included in a second video based on one or more alignments between the first frame and the second frame];
executing a machine learning model to convert the first input frame into a first output frame, the first output frame having higher perceptual quality than the first input frame (See Reda, ¶ [0055], In at least one embodiment, this fine-tuned model can be used to infer interpolated frames that can be interspersed with corresponding input frames to generate a higher frame rate video 312, or one free of frame drops as discussed previously. In at least one embodiment, such an approach can also be used to infer higher resolution video frames (e.g., HD or 4K from SD), which can be used to produce a higher resolution video stream 314 based on a lower resolution input video stream); and
[training the machine learning model based on one or more losses associated with the first output frame and the first target frame, wherein the machine learning model is trained to enhance visual quality of frames in a video in the absence of a master copy of the video, wherein training the machine learning model comprises:
performing a first training stage that trains the machine learning model based on the one or more losses; and
after the first training stage is complete, performing a second training stage on the machine learning model, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, based on a combination of the one or more losses and the one or more additional losses].
However, Reda fail(s) to teach determining a first input frame corresponding to a first frame included in a first video and a first target frame corresponding to a second frame included in a second video based on one or more alignments between the first frame and the second frame; training the machine learning model based on one or more losses associated with the first output frame and the first target frame, wherein the machine learning model is trained to enhance visual quality of frames in a video in the absence of a master copy of the video, wherein training the machine learning model comprises: performing a first training stage that trains the machine learning model based on the one or more losses; and after the first training stage is complete, performing a second training stage on the machine learning model, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, based on a combination of the one or more losses and the one or more additional losses.
Xie, working in the same field of endeavor, teaches: determining a first input frame corresponding to a first frame included in a first video and a first target frame corresponding to a second frame included in a second video based on one or more alignments between the first frame and the second frame (See Xie, ¶ [0064], Step S101: performing a time alignment operation of a first video signal to a second video signal according to time association information).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to determine a first input frame corresponding to a first frame included in a first video and a first target frame corresponding to a second frame included in a second video based on one or more alignments between the first frame and the second frame, based on the method of Xie’s reference. The suggestion/motivation would have been to achieve accurate time mapping and point-by-point alignment between video images (See Xie, ¶ [0005–0008, 0050]).
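Note: as a non-binding illustration of timestamp-based frame pairing of the kind Xie describes, the following sketch maps each frame of a first video to the temporally nearest frame of a second video. The frame rates and offset are hypothetical assumptions, not values from Xie.

```python
# Illustrative sketch: temporal alignment by nearest timestamp.
def pair_frames(num_frames_a, fps_a, fps_b, offset_s=0.0):
    """Pair each frame index of video A with the nearest index of video B."""
    pairs = []
    for i in range(num_frames_a):
        t = i / fps_a + offset_s      # timestamp of frame i in video A
        j = round(t * fps_b)          # nearest frame index in video B
        pairs.append((i, j))
    return pairs

# Example: a 24 fps source aligned to a 30 fps target, offset by 0.5 s.
print(pair_frames(4, 24.0, 30.0, offset_s=0.5))  # [(0, 15), (1, 16), (2, 18), (3, 19)]
```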
However, Reda and Xie fail(s) to teach training the machine learning model based on one or more losses associated with the first output frame and the first target frame, wherein the machine learning model is trained to enhance visual quality of frames in a video in the absence of a master copy of the video, wherein training the machine learning model comprises: performing a first training stage that trains the machine learning model based on the one or more losses; and after the first training stage is complete, performing a second training stage on the machine learning model, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, based on a combination of the one or more losses and the one or more additional losses.
Andrei, working in the same field of endeavor, teaches: training the machine learning model based on one or more losses associated with the first output frame and the first target frame, wherein the machine learning model is trained to enhance visual quality of frames in a video in the absence of a master copy of the video, wherein training the machine learning model comprises (See Andrei, ¶ [0058], The constructed image of the generator with filters 611 is also subtracted from a corresponding image of the higher resolution image dataset 601 which indicates a perceptual loss. ¶ [0056], In some embodiments, the CNN 310 (or at least the CNN layers) is a component of a generative adversarial network (GAN). In particular, in these embodiments the CNN 310 is a generator component of a GAN. In some embodiments, an artifact removal layer (or layers) 317 removes artifacts from the input image 305. See also [FIG. 6], PERCEPTUAL LOSS. Note: the loss is based on the enhanced output of the GAN and a second, different higher quality dataset. The examiner is interpreting the images as frames).
performing a first training stage that trains the machine learning model based on the one or more losses (See Andrei, ¶ [0058], The constructed image of the generator with filters 611 is also subtracted from a corresponding image of the higher resolution image dataset 601 which indicates a perceptual loss. See also [FIG. 6], PERCEPTUAL LOSS); and
after the first training stage is complete, performing a second training stage on the machine learning model (See Andrei, ¶ [0065], A second stage, …, The adversarial loss is used in addition to MSE and perceptual losses for training the generator with filters 611), in conjunction with training another machine learning model (See Andrei, ¶ [0057], The discriminator 613 is a fully connected neural network. ¶ [0058], Note that in some embodiments the discriminator 613 and generator with filters 611 are trained in turns by first training one and then the other. Note: the examiner is interpreting the other machine learning model as the discriminator) that distinguishes between frames generated by the machine learning model and frames in the second video (See Andrei, ¶ [0057], the discriminator 613 is to compare the “fake” image to a corresponding higher resolution image from a higher resolution dataset 603 to determine if the image is “real” or “fake” (evaluates whether the generated image belongs in the training data set of higher resolution images or not)), based on a combination of the one or more losses and the one or more additional losses (See Andrei, ¶ [0058], This perceptual loss is added to the output of the discriminator 613 to produce a generator loss which is fed back to the generator with filters 611 to update the weights of the generator with filters 611 to help train it).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to train the machine learning model based on one or more losses associated with the first output frame and the first target frame, wherein the machine learning model is trained to enhance visual quality of frames in a video in the absence of a master copy of the video, wherein training the machine learning model comprises: performing a first training stage that trains the machine learning model based on the one or more losses; and after the first training stage is complete, performing a second training stage on the machine learning model, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, based on a combination of the one or more losses and the one or more additional losses, based on the method of Andrei’s reference. The suggestion/motivation would have been to remove artifacts and visually enhance the video (See Andrei, ¶ [0039–0044]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Xie and Andrei with Reda to obtain the invention as specified in claim 1.
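Note: for illustration only, the two-stage regimen mapped above can be sketched as follows, reusing the hypothetical second_stage_step routine from the earlier sketch; the data loader and epoch counts are likewise assumptions, not taken from any cited reference.

```python
# Illustrative sketch of the claimed two-stage structure (cf. Andrei
# ¶ [0065]): stage 1 uses the one or more losses only; stage 2 also
# enables the discriminator and the combined loss.
import torch.nn.functional as F

def train_two_stages(generator, discriminator, g_opt, d_opt, loader,
                     stage1_epochs, stage2_epochs):
    for _ in range(stage1_epochs):            # first training stage
        for lr_frame, hr_frame in loader:
            g_opt.zero_grad()
            F.mse_loss(generator(lr_frame), hr_frame).backward()
            g_opt.step()
    for _ in range(stage2_epochs):            # second stage, after the first
        for lr_frame, hr_frame in loader:
            second_stage_step(generator, discriminator, g_opt, d_opt,
                              lr_frame, hr_frame)
```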
Regarding claim 2, Reda teaches the computer-implemented method of claim 1, further comprising executing the machine learning model to convert a second input frame included in a third video into a second output frame (See Reda, ¶ [0047], In at least one embodiment, a trained network can be used to produce arbitrarily high frame-per-second (FPS) videos (e.g., 60 or 90 FPS) from any of these lower FPS videos (e.g., 30 FPS)).
Regarding claim 3, Reda in view of Xie further in view of Andrei teaches the computer-implemented method of claim 2, further comprising: [executing a discriminator model to generate a prediction associated with the second output frame; and
training the machine learning model based on one or more additional losses associated with the prediction].
However, Reda and Xie fail(s) to teach executing a discriminator model to generate a prediction associated with the second output frame; and training the machine learning model based on one or more additional losses associated with the prediction.
Andrei, working in the same field of endeavor, teaches: executing a discriminator model to generate a prediction associated with the second output frame (See Andrei, ¶ [0057], The output of the discriminator 613 is a probability that the generated image is fake. The discriminator 613 is a fully connected neural network. If the generator 611 is performing well (generating good "fakes") then the discriminator 613 will return a value indicating a higher probability of the generated images being real); and
training the machine learning model based on one or more additional losses associated with the prediction (See Andrei, ¶ [0058], This perceptual loss is added to the output of the discriminator 613 to produce a generator loss which is fed back to the generator with filters 611 to update the weights of the generator with filters 611 to help train it. A discriminator loss is the inverse of the generator loss and is fed back to the discriminator 613 to update its weights).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to execute a discriminator model to generate a prediction associated with the second output frame and to train the machine learning model based on one or more additional losses associated with the prediction, based on the method of Andrei’s reference. The suggestion/motivation would have been to remove artifacts and visually enhance the video (See Andrei, ¶ [0039–0044]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Andrei with Reda and Xie to obtain the invention as specified in claim 3.
Regarding claim 6, Reda in view of Xie further in view of Andrei teaches the computer-implemented method of claim 1, [wherein determining the first input frame and the first target frame comprises determining at least one of the first input frame or the first target frame based on a temporal alignment between the first frame and the second frame].
However, Reda and Andrei fail(s) to teach wherein determining the first input frame and the first target frame comprises determining at least one of the first input frame or the first target frame based on a temporal alignment between the first frame and the second frame.
Xie, working in the same field of endeavor, teaches: wherein determining the first input frame and the first target frame comprises determining at least one of the first input frame or the first target frame based on a temporal alignment between the first frame and the second frame (See Xie, ¶ [0064], Step S101: performing a time alignment operation of a first video signal to a second video signal according to time association information).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference such that determining the first input frame and the first target frame comprises determining at least one of the first input frame or the first target frame based on a temporal alignment between the first frame and the second frame, based on the method of Xie’s reference. The suggestion/motivation would have been to achieve accurate time mapping and point-by-point alignment between video images (See Xie, ¶ [0005–0008, 0050]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Xie with Reda and Andrei to obtain the invention as specified in claim 6.
Regarding claim 7, Reda in view of Xie further in view of Andrei teaches the computer-implemented method of claim 1, [wherein determining the first input frame and the first target frame comprises generating at least one of the first input frame or the first target frame based on a geometric alignment between the first frame and the second frame].
However, Reda and Andrei fail(s) to teach wherein determining the first input frame and the first target frame comprises generating at least one of the first input frame or the first target frame based on a geometric alignment between the first frame and the second frame.
Xie, working in the same field of endeavor, teaches: wherein determining the first input frame and the first target frame comprises generating at least one of the first input frame or the first target frame based on a geometric alignment between the first frame and the second frame (See Xie, ¶ [0064], Step S101: performing a time alignment operation of a first video signal to a second video signal according to time association information. ¶ [0065], Specifically, step S101 reads the time mark from the first video signal to obtain the time-related information and geometric mark information. The geometric marker information may include the shape, size, coordinates in the image and other parameters that describe the geometric marker).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference such that determining the first input frame and the first target frame comprises generating at least one of the first input frame or the first target frame based on a geometric alignment between the first frame and the second frame, based on the method of Xie’s reference. The suggestion/motivation would have been to achieve accurate time mapping and point-by-point alignment between video images (See Xie, ¶ [0005–0008, 0050]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Xie with Reda and Andrei to obtain the invention as specified in claim 7.
Regarding claim 8, Reda in view of Xie further in view of Andrei teaches the computer-implemented method of claim 1, [wherein determining the first input frame comprises resizing the first frame].
However, Reda and Andrei fail(s) to teach wherein determining the first input frame comprises resizing the first frame.
Xie, working in the same field of endeavor, teaches: wherein determining the first input frame comprises resizing the first frame (See Xie, ¶ [0033], Each image in the first video signal is translated and scaled toward the second video signal according to the spatial inverse affine matrix G, and each image in the first video signal is reversely cropped or patched so that the first video signal and the second video signal have the same width and height).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference such that determining the first input frame comprises resizing the first frame, based on the method of Xie’s reference. The suggestion/motivation would have been to achieve accurate time mapping and point-by-point alignment between video images (See Xie, ¶ [0005–0008, 0050]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Xie with Reda and Andrei to obtain the invention as specified in claim 8.
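Note: purely as an illustration of the crop-or-pad resizing Xie describes in ¶ [0033], the following NumPy sketch forces a frame to a target width and height; the array shapes are hypothetical and assume an H x W x C frame layout.

```python
# Illustrative sketch: crop or zero-pad an H x W x C frame so that it
# matches a target height and width (cf. Xie ¶ [0033]).
import numpy as np

def match_size(frame, target_h, target_w):
    frame = frame[:target_h, :target_w]       # crop if larger than target
    pad_h = target_h - frame.shape[0]         # pad if smaller than target
    pad_w = target_w - frame.shape[1]
    return np.pad(frame, ((0, pad_h), (0, pad_w), (0, 0)))

resized = match_size(np.zeros((480, 704, 3)), 486, 720)
print(resized.shape)  # (486, 720, 3)
```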
Regarding claim 11, claim 11 is rejected for the same reasons as claim 1, and arguments similar to those presented above for claim 1 are equally applicable to claim 11; all of the other limitations similar to claim 1 are not repeated herein but are incorporated by reference. Furthermore, Reda teaches one or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method (See Reda, ¶ [0389], In at least one embodiment, a computer-readable storage medium is a non-transitory computer-readable storage medium that excludes transitory signals (e.g., a propagating transient electric or electromagnetic transmission) but includes non-transitory data storage circuitry (e.g., buffers, cache, and queues) within transceivers of transitory signals).
Regarding claim 12, Reda in view of Xie further in view of Andrei teaches the one or more non-transitory computer-readable media of claim 11, wherein the instructions further cause the one or more processors to perform the steps of: executing the machine learning model to convert a second input frame included in a third video into a second output frame (See Reda, ¶ [0047], In at least one embodiment, a trained network can be used to produce arbitrarily high frame-per-second (FPS) videos (e.g., 60 or 90 FPS) from any of these lower FPS videos (e.g., 30 FPS));
[executing a discriminator model to generate a first prediction associated with the second output frame; and
training the machine learning model based on one or more additional losses associated with the first prediction].
However, Reda and Xie fail(s) to teach executing a discriminator model to generate a first prediction associated with the second output frame; and training the machine learning model based on one or more additional losses associated with the first prediction.
Andrei, working in the same field of endeavor, teaches: executing a discriminator model to generate a first prediction associated with the second output frame (See Andrei, ¶ [0057], The output of the discriminator 613 is a probability that the generated image is fake. The discriminator 613 is a fully connected neural network. If the generator 611 is performing well (generating good "fakes") then the discriminator 613 will return a value indicating a higher probability of the generated images being real); and
training the machine learning model based on one or more additional losses associated with the first prediction (See Andrei, ¶ [0058], This perceptual loss is added to the output of the discriminator 613 to produce a generator loss which is fed back to the generator with filters 611 to update the weights of the generator with filters 611 to help train it. A discriminator loss is the inverse of the generator loss and is fed back to the discriminator 613 to update its weights).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to execute a discriminator model to generate a first prediction associated with the second output frame and to train the machine learning model based on one or more additional losses associated with the first prediction, based on the method of Andrei’s reference. The suggestion/motivation would have been to remove artifacts and visually enhance the video (See Andrei, ¶ [0039–0044]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Andrei with Reda and Xie to obtain the invention as specified in claim 12.
Regarding claim 17, Reda in view of Xie further in view of Andrei teaches the one or more non-transitory computer-readable media of claim 11, wherein determining the first input frame and the first target frame comprises:
[determining a temporal alignment between the first frame and the second frame; and
generating at least one of the first input frame or the first target frame based on a geometric alignment between the first frame and the second frame].
However, Reda and Andrei fail(s) to teach determining a temporal alignment between the first frame and the second frame; and generating at least one of the first input frame or the first target frame based on a geometric alignment between the first frame and the second frame.
Xie, working in the same field of endeavor, teaches: determining a temporal alignment between the first frame and the second frame (See Xie, ¶ [0064], Step S101: performing a time alignment operation of a first video signal to a second video signal according to time association information); and
generating at least one of the first input frame or the first target frame based on a geometric alignment between the first frame and the second frame (See Xie, ¶ [0064], Step S101: performing a time alignment operation of a first video signal to a second video signal according to time association information. ¶ [0065], Specifically, step S101 reads the time mark from the first video signal to obtain the time-related information and geometric mark information. The geometric marker information may include the shape, size, coordinates in the image and other parameters that describe the geometric marker).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to determine a temporal alignment between the first frame and the second frame and to generate at least one of the first input frame or the first target frame based on a geometric alignment between the first frame and the second frame, based on the method of Xie’s reference. The suggestion/motivation would have been to achieve accurate time mapping and point-by-point alignment between video images (See Xie, ¶ [0005–0008, 0050]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Xie with Reda and Andrei to obtain the invention as specified in claim 17.
Regarding claim 18, Reda in view of Xie further in view of Andrei teaches the one or more non-transitory computer-readable media of claim 11, wherein determining the first input frame comprises:
[determining an affine transformation based on a first set of spatial correspondences between the first frame and the second frame;
applying the affine transformation to the first frame to generate a transformed frame; and
generating the first input frame based on a second set of spatial correspondences between the transformed frame and the second frame].
However, Reda and Andrei fail(s) to teach determining an affine transformation based on a first set of spatial correspondences between the first frame and the second frame; applying the affine transformation to the first frame to generate a transformed frame; and generating the first input frame based on a second set of spatial correspondences between the transformed frame and the second frame.
Xie, working in the same field of endeavor, teaches: determining an affine transformation based on a first set of spatial correspondences between the first frame and the second frame (See Xie, ¶ [0115], First, a spatial affine matrix of a geometric shape between geometric markers of the first video signal and geometric markers of the second video signal is obtained);
applying the affine transformation to the first frame to generate a transformed frame (See Xie, ¶ [0115], The first video signal is then spatially aligned to the second video signal according to a spatial affine matrix); and
generating the first input frame based on a second set of spatial correspondences between the transformed frame and the second frame (See Xie, ¶ [0148], Each image in the first video signal is translated and scaled toward the second video signal according to the spatial inverse affine matrix G, and each image in the first video signal is reversely cropped or patched so that the first video signal and the second video signal have the same width and height).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to determine an affine transformation based on a first set of spatial correspondences between the first frame and the second frame, apply the affine transformation to the first frame to generate a transformed frame, and generate the first input frame based on a second set of spatial correspondences between the transformed frame and the second frame, based on the method of Xie’s reference. The suggestion/motivation would have been to achieve accurate time mapping and point-by-point alignment between video images (See Xie, ¶ [0005–0008, 0050]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Xie with Reda and Andrei to obtain the invention as specified in claim 18.
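Note: as a non-binding illustration of correspondence-based affine alignment of the kind recited in claim 18, the following OpenCV sketch estimates a 2x3 affine matrix from point correspondences and warps the first frame into the second frame’s coordinate system. The point coordinates and frame size are hypothetical.

```python
# Illustrative sketch: affine alignment from spatial correspondences
# (cf. Xie ¶¶ [0115], [0148]).
import cv2
import numpy as np

src_pts = np.float32([[10, 10], [200, 30], [50, 180]])  # points in first frame
dst_pts = np.float32([[12, 14], [205, 36], [55, 186]])  # matches in second frame

# Estimate the affine transformation from the first set of correspondences...
M, _inliers = cv2.estimateAffinePartial2D(src_pts, dst_pts)

# ...then apply it to the first frame to generate a transformed frame.
first_frame = np.zeros((240, 320, 3), dtype=np.uint8)
transformed = cv2.warpAffine(first_frame, M, (320, 240))
```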
Regarding claim 20, Reda teaches a system, comprising:
one or more memories that store instructions (See Reda, [FIG. 11], MEMORY 1120, PROCESSOR 1102), and
one or more processors that are coupled to the one or more memories (See Reda, [FIG. 11], MEMORY 1120, PROCESSOR 1102) and,
when executing the instructions, are configured to perform the steps of:
determining a first input frame corresponding to a first frame included in a
[first video and a first target frame corresponding to a second frame included in a second video based on a temporal alignment between the first frame and the second frame and a geometric alignment between the first frame and the second frame];
executing a machine learning model to convert the first input frame into a first output frame, the first output frame having higher perceptual quality than the first input frame (See Reda, ¶ [0055], In at least one embodiment, this fine-tuned model can be used to infer interpolated frames that can be interspersed with corresponding input frames to generate a higher frame rate video 312, or one free of frame drops as discussed previously. In at least one embodiment, such an approach can also be used to infer higher resolution video frames (e.g., HD or 4K from SD), which can be used to produce a higher resolution video stream 314 based on a lower resolution input video stream); and
[training the machine learning model based on one or more losses associated with the first output frame and the first target frame, wherein the machine learning model is trained to enhance visual quality of frames in a video in the absence of a master copy of the video, wherein training the machine learning model comprises:
performing a first training stage that trains the machine learning model based on the one or more losses; and
after the first training stage is complete, performing a second training stage on the machine learning model, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, based on a combination of the one or more losses and the one or more additional losses].
However, Reda fail(s) to teach first video and a first target frame corresponding to a second frame included in a second video based on a temporal alignment between the first frame and the second frame and a geometric alignment between the first frame and the second frame; training the machine learning model based on one or more losses associated with the first output frame and the first target frame, wherein the machine learning model is trained to enhance visual quality of frames in a video in the absence of a master copy of the video, wherein training the machine learning model comprises: performing a first training stage that trains the machine learning model based on the one or more losses; and after the first training stage is complete, performing a second training stage on the machine learning model, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, based on a combination of the one or more losses and the one or more additional losses.
Xie, working in the same field of endeavor, teaches: first video and a first target frame corresponding to a second frame included in a second video based on a temporal alignment between the first frame and the second frame (See Xie, ¶ [0064], Step S101: performing a time alignment operation of a first video signal to a second video signal according to time association information) and a geometric alignment between the first frame and the second frame (See Xie, ¶ [0114], After time alignment, execute step S102 to perform spatial alignment operation of the first video signal to the second video signal according to the geometric information of the geometric mark).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to determine a first target frame corresponding to a second frame included in a second video based on a temporal alignment between the first frame and the second frame and a geometric alignment between the first frame and the second frame, based on the method of Xie’s reference. The suggestion/motivation would have been to achieve accurate time mapping and point-by-point alignment between video images (See Xie, ¶ [0005–0008, 0050]).
However, Reda and Xie fail(s) to teach training the machine learning model based on one or more losses associated with the first output frame and the first target frame, wherein the machine learning model is trained to enhance visual quality of frames in a video in the absence of a master copy of the video, wherein training the machine learning model comprises: performing a first training stage that trains the machine learning model based on the one or more losses; and after the first training stage is complete, performing a second training stage on the machine learning model, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, based on a combination of the one or more losses and the one or more additional losses.
Andrei, working in the same field of endeavor, teaches: training the machine learning model based on one or more losses associated with the first output frame and the first target frame, wherein the machine learning model is trained to enhance visual quality of frames in a video in the absence of a master copy of the video, wherein training the machine learning model comprises (See Andrei, ¶ [0058], The constructed image of the generator with filters 611 is also subtracted from a corresponding image of the higher resolution image dataset 601 which indicates a perceptual loss. ¶ [0056], In some embodiments, the CNN 310 (or at least the CNN layers) is a component of a generative adversarial network (GAN). In particular, in these embodiments the CNN 310 is a generator component of a GAN. In some embodiments, an artifact removal layer (or layers) 317 removes artifacts from the input image 305. See also [FIG. 6], PERCEPTUAL LOSS. Note: the loss is based on the enhanced output of the GAN and a second, different higher quality dataset. The examiner is interpreting the images as frames):
performing a first training stage that trains the machine learning model based on the one or more losses (See Andrei, ¶ [0058], The constructed image of the generator with filters 611 is also subtracted from a corresponding image of the higher resolution image dataset 601 which indicates a perceptual loss. See also [FIG. 6], PERCEPTUAL LOSS); and
after the first training stage is complete, performing a second training stage on the machine learning model (See Andrei, ¶ [0065], A second stage, …, The adversarial loss is used in addition to MSE and perceptual losses for training the generator with filters 611), in conjunction with training another machine learning model (See Andrei, ¶ [0057], The discriminator 613 is a fully connected neural network. ¶ [0058], Note that in some embodiments the discriminator 613 and generator with filters 611 are trained in turns by first training one and then the other. Note: the examiner is interpreting the other machine learning model as the discriminator) that distinguishes between frames generated by the machine learning model and frames in the second video (See Andrei, ¶ [0057], the discriminator 613 is to compare the “fake” image to a corresponding higher resolution image from a higher resolution dataset 603 to determine if the image is “real” or “fake” (evaluates whether the generated image belongs in the training data set of higher resolution images or not)), based on a combination of the one or more losses and the one or more additional losses (See Andrei, ¶ [0058], This perceptual loss is added to the output of the discriminator 613 to produce a generator loss which is fed back to the generator with filters 611 to update the weights of the generator with filters 611 to help train it).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to train the machine learning model based on one or more losses associated with the first output frame and the first target frame, wherein the machine learning model is trained to enhance visual quality of frames in a video in the absence of a master copy of the video, wherein training the machine learning model comprises: performing a first training stage that trains the machine learning model based on the one or more losses; and after the first training stage is complete, performing a second training stage on the machine learning model, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, based on a combination of the one or more losses and the one or more additional losses, based on the method of Andrei’s reference. The suggestion/motivation would have been to remove artifacts and visually enhance the video (See Andrei, ¶ [0039–0044]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Xie and Andrei with Reda to obtain the invention as specified in claim 20.
Claim(s) 4 and 15–16 are rejected under 35 U.S.C. 103 as being unpatentable over Reda et al. (US 2021/0067735 A1, hereafter, "Reda") in view of Xie et al. (CN 107465940 A, hereafter, "Xie") further in view of Andrei et al. (US 2020/0349681 A1, hereafter, “Andrei”) and further in view of Schroers et al. (US 2019/0130530 A1, hereafter, “Schroers”).
Regarding claim 4, Reda in view of Xie further in view of Andrei teaches the computer-implemented method of claim 2, further comprising: [generating a first set of feature maps associated with the second output frame; and
training the machine learning model based on a perceptual loss computed between the first set of feature maps and a second set of feature maps associated with a second target frame included in a fourth video].
However, Reda, Xie and Andrei fail(s) to teach generating a first set of feature maps associated with the second output frame; and training the machine learning model based on a perceptual loss computed between the first set of feature maps and a second set of feature maps associated with a second target frame included in a fourth video.
Schroers, working in the same field of endeavor, teaches: generating a first set of feature maps associated with the second output frame (See Schroers, ¶ [0039], The perceptual loss is defined as the squared difference between the feature maps of images, {circumflex over (X)}, included in second, super-resolution video sequence 166/466); and
training the machine learning model based on a perceptual loss computed between the first set of feature maps and a second set of feature maps associated with a second target frame included in a fourth video (See Schroers, ¶ [0039], The perceptual loss is defined as the squared difference between the feature maps of images, {circumflex over (X)}, included in second, super-resolution video sequence 166/466 and their respective “ground truth” counterpart images, X, in first video sequence 160/460).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to generate a first set of feature maps associated with the second output frame and to train the machine learning model based on a perceptual loss computed between the first set of feature maps and a second set of feature maps associated with a second target frame included in a fourth video, based on the method of Schroers’s reference. The suggestion/motivation would have been to enable super-resolution of unconstrained real-world video and progressively improve performance (See Schroers, ¶ [0002–0003, 0009–0013]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Schroers with Reda, Xie and Andrei to obtain the invention as specified in claim 4.
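Note: for illustration only, a feature-map perceptual loss of the kind Schroers describes in ¶ [0039] may be sketched as below; the choice of a pretrained VGG16 feature extractor is an assumption for illustration, not a teaching of Schroers.

```python
# Illustrative sketch: perceptual loss as the squared difference
# between feature maps of the output and target frames.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

extractor = vgg16(weights="DEFAULT").features[:16].eval()  # up to relu3_3
for p in extractor.parameters():
    p.requires_grad_(False)

def perceptual_loss(output_frame, target_frame):
    # Generate the two sets of feature maps and compare them.
    return F.mse_loss(extractor(output_frame), extractor(target_frame))
```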
Regarding claim 15, Reda in view of Xie further in view of Andrei teaches the one or more non-transitory computer-readable media of claim 12, wherein the instructions further cause the one or more processors to [perform the step of training the discriminator model based on a loss that is computed from the first prediction and a second prediction generated by the discriminator model from a second target frame associated with the second input frame].
However, Reda, Xie and Andrei fail(s) to teach perform the step of training the discriminator model based on a loss that is computed from the first prediction and a second prediction generated by the discriminator model from a second target frame associated with the second input frame.
Schroers, working in the same field of endeavor, teaches: perform the step of training the discriminator model based on a loss that is computed from the first prediction and a second prediction generated by the discriminator model from a second target frame associated with the second input frame (See Schroers, ¶ [0030], In contrast to a GAN, which in a super-resolution application typically predicts the probability that an input image X is real, i.e., p(real|X), EBGAN discriminator module 474 learns an energy function that describes the real image manifold, assigning high energy to real images X, and low energy to super-resolution images X̂).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to perform the step of training the discriminator model based on a loss that is computed from the first prediction and a second prediction generated by the discriminator model from a second target frame associated with the second input frame, based on the method of Schroers’s reference. The suggestion/motivation would have been to enable super-resolution of unconstrained real-world video and progressively improve performance (See Schroers, ¶ [0002–0003, 0009–0013]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Schroers with Reda, Xie and Andrei to obtain the invention as specified in claim 15.
Regarding claim 16, Reda in view of Xie further in view of Andrei teaches the one or more non-transitory computer-readable media of claim 11, wherein training the machine learning model comprises:
[generating a first set of feature maps associated with the first output frame and a second set of feature maps associated with the first target frame; and
training the machine learning model based on a perceptual loss computed between the first set of feature maps and the second set of feature maps].
However, Reda, Xie and Andrei fail(s) to teach generating a first set of feature maps associated with the first output frame and a second set of feature maps associated with the first target frame; and training the machine learning model based on a perceptual loss computed between the first set of feature maps and the second set of feature maps.
Schroers, working in the same field of endeavor, teaches: generating a first set of feature maps associated with the first output frame and a second set of feature maps associated with the first target frame (See Schroers, ¶ [0039], The perceptual loss is defined as the squared difference between the feature maps of images, {circumflex over (X)}, included in second, super-resolution video sequence 166/466); and
training the machine learning model based on a perceptual loss computed between the first set of feature maps and the second set of feature maps (See Schroers, ¶ [0039], The perceptual loss is defined as the squared difference between the feature maps of images, {circumflex over (X)}, included in second, super-resolution video sequence 166/466 and their respective “ground truth” counterpart images, X, in first video sequence 160/460).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to generate a first set of feature maps associated with the first output frame and a second set of feature maps associated with the first target frame and to train the machine learning model based on a perceptual loss computed between the first set of feature maps and the second set of feature maps, based on the method of Schroers’s reference. The suggestion/motivation would have been to enable super-resolution of unconstrained real-world video and progressively improve performance (See Schroers, ¶ [0002–0003, 0009–0013]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Schroers with Reda, Xie and Andrei to obtain the invention as specified in claim 16.
Claim(s) 5 is rejected under 35 U.S.C. 103 as being unpatentable over Reda et al. (US 2021/0067735 A1, hereafter, "Reda") in view of Xie et al. (CN 107465940 A, hereafter, "Xie") further in view of Andrei et al. (US 2020/0349681 A1, hereafter, “Andrei”) and further in view of Shirani et al. (US 9,294,711 B2, hereafter, “Shirani”).
Regarding claim 5, Reda in view of Xie further in view of Andrei teaches the computer-implemented method of claim 1, wherein determining the first input frame comprises: [separating the first frame into a first set of scan lines and a second set of scan lines; and
generating the first input frame from the first set of scan lines].
However, Reda, Xie and Andrei fail(s) to teach separating the first frame into a first set of scan lines and a second set of scan lines; and generating the first input frame from the first set of scan lines.
Shirani, working in the same field of endeavor, teaches: separating the first frame into a first set of scan lines and a second set of scan lines (See Shirani, [Col. 5, ln. 8–11], For example, the de-interlacing tool may receive as input the interlaced digital video frame, apply the interlaced digital video frame to the NLM filter to determine the missing lines of resolution (e.g., either the even lines or the odd lines)); and
generating the first input frame from the first set of scan lines (See Shirani, [Col. 5, ln. 12–14], output the de-interlaced digital video. As a result, a de-interlaced digital video frame including both even and odd lines of resolution may be generated. See also [FIG. 2], 220 Interlaced Digital Video, 210 De-Interlaced Digital Video).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda’s reference to separate the first frame into a first set of scan lines and a second set of scan lines and to generate the first input frame from the first set of scan lines, based on the method of Shirani’s reference. The suggestion/motivation would have been to provide compatibility with displays that cannot render interlaced video (See Shirani, Col. 1, ln. 17–44, Col. 4, ln. 37–51).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Shirani with Reda, Xie and Andrei to obtain the invention as specified in claim 5.
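Note: purely as an illustration of the scan-line separation Shirani describes, the following NumPy sketch splits an interlaced frame into its even and odd fields; the toy array is hypothetical.

```python
# Illustrative sketch: separate a frame into two sets of scan lines
# (even/odd fields), as in de-interlacing (cf. Shirani).
import numpy as np

interlaced = np.arange(6 * 4).reshape(6, 4)  # toy 6-line frame
even_field = interlaced[0::2]                # first set of scan lines
odd_field  = interlaced[1::2]                # second set of scan lines
# An input frame may then be generated from one field alone, with the
# missing lines estimated separately (e.g., by Shirani's NLM filter).
print(even_field.shape, odd_field.shape)     # (3, 4) (3, 4)
```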
Claim(s) 9, 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Reda et al. (US 2021/0067735 A1, hereafter, "Reda") in view of Xie et al. (CN 107465940 A, hereafter, "Xie") further in view of Andrei et al. (US 2020/0349681 A1, hereafter, “Andrei”) and further in view of Smolic et al. (EP 3913572 A1, hereafter, “Smolic”).
Regarding claim 9, Reda in view of Xie further in view of Andrei teaches the computer-implemented method of claim 1, [wherein the one or more losses are computed based on a first Fast Fourier Transform (FFT) decomposition of the first output frame and a second FFT decomposition of the first target frame].
However, Reda, Xie and Andrei fail(s) to teach wherein the one or more losses are computed based on a first Fast Fourier Transform (FFT) decomposition of the first output frame and a second FFT decomposition of the first target frame.
Smolic, working in the same field of endeavor, teaches: wherein the one or more losses are computed based on a first Fast Fourier Transform (FFT) decomposition of the first output frame and a second FFT decomposition of the first target frame (See Smolic, ¶ [0034], According to the present disclosure, to improve this translation, image data is translated into the frequency domain before the frequency loss factor is calculated. As a non-limiting example, such a loss function can be based on a discrete cosine transformation (DCT) or a fast fourier transformation (FFT). ¶ [0036], Similarly, we can also define FFTFinal(I1, I2). Essentially during training, the frequency loss function (e.g. DCT or the FFT) of ground truth (i.e. a correctly exposed image) and predictions (i.e. a translation derived from an underexposed version of the correctly exposed image) are computed and the mean of absolute difference between the two is then calculated).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda such that the one or more losses are computed based on a first Fast Fourier Transform (FFT) decomposition of the first output frame and a second FFT decomposition of the first target frame, based on the method of Smolic. The suggestion/motivation would have been to improve image quality and to correct image data errors (See Smolic, ¶ [0002–0004, 0010–0012]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Smolic with Reda, Xie and Andrei to obtain the invention as specified in claim 9.
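As a point of reference for the frequency loss Smolic describes (computing the frequency transform of the ground truth and of the prediction, then taking the mean of the absolute difference between the two), a minimal sketch follows. It assumes grayscale frames as numpy arrays; it is not Smolic's implementation, and the names are hypothetical.

    import numpy as np

    def fft_loss(output_frame, target_frame):
        # Transform both frames to the frequency domain and take the
        # mean absolute difference, per Smolic's described loss; a DCT
        # could be substituted for the FFT.
        out_fft = np.fft.fft2(output_frame)
        tgt_fft = np.fft.fft2(target_frame)
        return float(np.mean(np.abs(out_fft - tgt_fft)))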
Regarding claim 10, Reda in view of Xie further in view of Andrei teaches the computer-implemented method of claim 1, [wherein the one or more losses comprise an L1 loss].
However, Reda, Xie, and Andrei fail to teach wherein the one or more losses comprise an L1 loss.
Smolic, working in the same field of endeavor, teaches: wherein the one or more losses comprise an L1 loss (See Smolic, ¶ [0044], The model was trained three times, once using only the L1 loss function, twice with the L1 loss and a frequency loss function (once with a FFT loss function and once with a DCT loss function)).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda such that the one or more losses comprise an L1 loss, based on the method of Smolic. The suggestion/motivation would have been to improve image quality and to correct image data errors (See Smolic, ¶ [0002–0004, 0010–0012]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Smolic with Reda, Xie and Andrei to obtain the invention as specified in claim 10.
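Smolic's cited training runs use the L1 loss alone and the L1 loss together with a frequency loss. A minimal sketch of such a combined objective appears below; the relative weight is a hypothetical choice and is not taken from Smolic.

    import numpy as np

    def l1_loss(output_frame, target_frame):
        # Pixel-domain L1 loss: mean absolute error.
        return float(np.mean(np.abs(output_frame - target_frame)))

    def l1_plus_frequency_loss(output_frame, target_frame, freq_weight=1.0):
        # L1 loss combined with an FFT-domain term, mirroring Smolic's
        # L1-plus-frequency training runs; freq_weight is hypothetical.
        freq = np.mean(np.abs(np.fft.fft2(output_frame) -
                              np.fft.fft2(target_frame)))
        return l1_loss(output_frame, target_frame) + freq_weight * float(freq)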
Regarding claim 19, Reda in view of Xie further in view of Andrei teaches the one or more non-transitory computer-readable media of claim 11, [wherein the one or more losses comprise an L1 loss that is computed between a first Fast Fourier Transform (FFT) decomposition of the first output frame and a second FFT decomposition of the first target frame].
However, Reda, Xie, and Andrei fail to teach wherein the one or more losses comprise an L1 loss that is computed between a first Fast Fourier Transform (FFT) decomposition of the first output frame and a second FFT decomposition of the first target frame.
Smolic, working in the same field of endeavor, teaches: wherein the one or more losses comprise an L1 loss that is computed between a first Fast Fourier Transform (FFT) decomposition of the first output frame and a second FFT decomposition of the first target frame (See Smolic, ¶ [0034], According to the present disclosure, to improve this translation, image data is translated into the frequency domain before the frequency loss factor is calculated. As a non-limiting example, such a loss function can be based on a discrete cosine transformation (DCT) or a fast fourier transformation (FFT). ¶ [0036], Similarly, we can also define FFTFinal(I1, I2). Essentially during training, the frequency loss function (e.g. DCT or the FFT) of ground truth (i.e. a correctly exposed image) and predictions (i.e. a translation derived from an underexposed version of the correctly exposed image) are computed and the mean of absolute difference between the two is then calculated. ¶ [0044], The model was trained three times, once using only the L1 loss function, twice with the L1 loss and a frequency loss function (once with a FFT loss function and once with a DCT loss function)).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda such that the one or more losses comprise an L1 loss that is computed between a first Fast Fourier Transform (FFT) decomposition of the first output frame and a second FFT decomposition of the first target frame, based on the method of Smolic. The suggestion/motivation would have been to improve image quality and to correct image data errors (See Smolic, ¶ [0002–0004, 0010–0012]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Smolic with Reda, Xie and Andrei to obtain the invention as specified in claim 19.
Claim(s) 14 is rejected under 35 U.S.C. 103 as being unpatentable over Reda et al. (US 2021/0067735 A1, hereafter, "Reda") in view of Xie et al. (CN 107465940 A, hereafter, "Xie") further in view of Andrei et al. (US 2020/0349681 A1, hereafter, "Andrei") and further in view of Nayak et al. (US 2022/0374714 A1, hereafter, "Nayak").
Regarding claim 14, Reda in view of Xie further in view of Andrei teaches the one or more non-transitory computer-readable media of claim 12, [wherein the one or more additional losses comprise a weighted combination of a first loss that is computed based on the second output frame and a second loss that is computed based on the first prediction].
However, Reda, Xie, and Andrei fail to teach wherein the one or more additional losses comprise a weighted combination of a first loss that is computed based on the second output frame and a second loss that is computed based on the first prediction.
Nayak, working in the same field of endeavor, teaches: wherein the one or more additional losses comprise a weighted combination of a first loss that is computed based on the second output frame and a second loss that is computed based on the first prediction (See Nayak, ¶ [0027], In at least one embodiment, the loss function used to train such an enhancement network can be a weighted combination of, for example, an L1 loss and a multi-scale structural similarity (MS-SSIM) loss, …, In one example, a three-layer pyramid can be utilized with window sizes ranging from, for example, eleven to forty, which can be consistent with the receptive field of the network. By computing and comparing the mean and standard deviation of different windows between the ground truth and corresponding output image, the network will gradually learn how to recognize and remove artifacts while keeping the original details and textures intact).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda such that the one or more additional losses comprise a weighted combination of a first loss that is computed based on the second output frame and a second loss that is computed based on the first prediction, based on the method of Nayak. The suggestion/motivation would have been to improve content quality, such as by reducing visual artifacts or increasing resolution (See Nayak, ¶ [0001, 0019–0021]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Nayak with Reda, Xie and Andrei to obtain the invention as specified in claim 14.
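Nayak's cited loss is a weighted combination of an L1 term and an MS-SSIM term computed over multi-scale windows. The sketch below is a simplified, single-scale stand-in (global means and standard deviations rather than windowed, multi-scale statistics); the weight alpha and all names are hypothetical, and this is not Nayak's implementation.

    import numpy as np

    def weighted_enhancement_loss(output_frame, target_frame, alpha=0.8):
        # L1 term: mean absolute error.
        l1 = np.mean(np.abs(output_frame - target_frame))

        # Simplified structural term built from means and standard
        # deviations, standing in for Nayak's windowed MS-SSIM.
        mu_o, mu_t = output_frame.mean(), target_frame.mean()
        sd_o, sd_t = output_frame.std(), target_frame.std()
        c1, c2 = 0.01 ** 2, 0.03 ** 2  # stabilizing constants, as in SSIM
        structural = (((2 * mu_o * mu_t + c1) * (2 * sd_o * sd_t + c2)) /
                      ((mu_o**2 + mu_t**2 + c1) * (sd_o**2 + sd_t**2 + c2)))

        # Weighted combination of the two losses.
        return float(alpha * (1.0 - structural) + (1.0 - alpha) * l1)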
Claim(s) 21 is rejected under 35 U.S.C. 103 as being unpatentable over Reda et al. (US 2021/0067735 A1, hereafter, "Reda") in view of Teng et al. (CN 113077385 A, hereafter, "Teng") and further in view of Andrei et al. (US 2020/0349681 A1, hereafter, "Andrei").
Regarding claim 21, Reda discloses a computer-implemented method for performing remastering of video content, the method comprising (See Reda, [Abstract], Apparatuses, systems, and techniques to enhance video):
determining a first input frame corresponding to a first frame included in a first video, wherein the first input frame is associated with a first level of quality (See Reda, ¶ [0047], In at least one embodiment, a trained network can be used to produce arbitrarily high frame-per-second (FPS) videos (e.g., 60 or 90 FPS) from any of these lower FPS videos (e.g., 30 FPS));
executing a machine learning model to convert the first input frame into a first output frame, wherein the first output frame is associated with a second level of quality that is higher than the first level of quality (See Reda, ¶ [0055], In at least one embodiment, this fine-tuned model can be used to infer interpolated frames that can be interspersed with corresponding input frames to generate a higher frame rate video 312, or one free of frame drops as discussed previously. In at least one embodiment, such an approach can also be used to infer higher resolution video frames (e.g., HD or 4K from SD), which can be used to produce a higher resolution video stream 314 based on a lower resolution input video stream), and [wherein the machine learning model is trained using a set of input frames that is associated with the first level of quality and a set of target frames that is temporally aligned with the set of input frames and associated with the second level of quality; and wherein the machine learning model comprises a model trained using:
a first training stage that trains the machine learning model based on the one or more losses, and
after the first training stage is complete, a second training stage, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, that trains the machine learning model based on a combination of the one or more losses and the one or more additional losses]; and
generating a second video that includes the first output frame (See Reda, ¶ [0055], In at least one embodiment, this fine-tuned model can be used to infer interpolated frames that can be interspersed with corresponding input frames to generate a higher frame rate video 312, or one free of frame drops as discussed previously. In at least one embodiment, such an approach can also be used to infer higher resolution video frames (e.g., HD or 4K from SD), which can be used to produce a higher resolution video stream 314 based on a lower resolution input video stream).
However, Reda fails to teach wherein the machine learning model is trained using a set of input frames that is associated with the first level of quality and a set of target frames that is temporally aligned with the set of input frames and associated with the second level of quality; and wherein the machine learning model comprises a model trained using: a first training stage that trains the machine learning model based on the one or more losses, and after the first training stage is complete, a second training stage, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, that trains the machine learning model based on a combination of the one or more losses and the one or more additional losses.
Teng, working in the same field of endeavor, teaches: wherein the machine learning model is trained using a set of input frames that is associated with the first level of quality and a set of target frames that is temporally aligned with the set of input frames and associated with the second level of quality (Teng, ¶ [n0079], For specific film and television scenes, the present invention directly uses the original continuous frames of old films and televisions as the input of the network, and the corresponding high-definition restored version continuous frames as the target of the network. ¶ [n0080], Second, align the time axis of the original video and the corresponding high-definition restored version, select videos within the same time period of the original video and the high-definition restored version, and there is no scene switching within the selected time period, and then convert the video into continuous frames. Next, the high-resolution continuous frames and low-resolution continuous frames are rotated and cropped (the operation parameters must correspond one to one) to expand the training set).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda such that the machine learning model is trained using a set of input frames that is associated with the first level of quality and a set of target frames that is temporally aligned with the set of input frames and associated with the second level of quality, based on the method of Teng. The suggestion/motivation would have been to improve the accuracy and quality of the reconstruction of old movies and TV shows (See Teng, ¶ [n0002–n0008]).
However, Reda and Teng fail to teach wherein the machine learning model comprises a model trained using: a first training stage that trains the machine learning model based on the one or more losses, and after the first training stage is complete, a second training stage, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, that trains the machine learning model based on a combination of the one or more losses and the one or more additional losses.
Andrei, working in the same field of endeavor, teaches: wherein the machine learning model comprises a model trained using: a first training stage that trains the machine learning model based on the one or more losses (See Andrei, ¶ [0058], The constructed image of the generator with filters 611 is also subtracted from a corresponding image of the higher resolution image dataset 601 which indicates a perceptual loss. See also [FIG. 6], PERCEPTUAL LOSS), and
after the first training stage is complete, a second training stage (See Andrei, ¶ [0065], A second stage, …, The adversarial loss is used in addition to MSE and perceptual losses for training the generator with filters 611), in conjunction with training another machine learning model (See Andrei, ¶ [0057], The discriminator 613 is a fully connected neural network. ¶ [0058], Note that in some embodiments the discriminator 613 and generator with filters 611 are trained in turns by first training one and then the other. Note: the examiner is interpreting the other machine learning model as the discriminator) that distinguishes between frames generated by the machine learning model and frames in the second video (See Andrei, ¶ [0057], the discriminator 613 is to compare the “fake” image to a corresponding higher resolution image from a higher resolution dataset 603 to determine if the image is “real” or “fake” (evaluates whether the generated image belongs in the training data set of higher resolution images or not)), that trains the machine learning model based on a combination of the one or more losses and the one or more additional losses (See Andrei, ¶ [0058], This perceptual loss is added to the output of the discriminator 613 to produce a generator loss which is fed back to the generator with filters 611 to update the weights of the generator with filters 611 to help train it).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Reda such that the machine learning model comprises a model trained using: a first training stage that trains the machine learning model based on the one or more losses, and after the first training stage is complete, a second training stage, in conjunction with training another machine learning model that distinguishes between frames generated by the machine learning model and frames in the second video, that trains the machine learning model based on a combination of the one or more losses and the one or more additional losses, based on the method of Andrei. The suggestion/motivation would have been to remove artifacts and visually enhance the video (See Andrei, ¶ [0039–0044]).
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Therefore, it would have been obvious to combine Teng and Andrei with Reda to obtain the invention as specified in claim 21.
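The two-stage regimen the Examiner reads onto Andrei (a first stage using content losses only, then a second stage in which the discriminator is enabled, its adversarial loss is added to the generator's objective, and the two networks are trained in turns) can be sketched as follows. This is a minimal, hypothetical PyTorch sketch; the toy modules, step counts, and loss weight are assumptions and do not reproduce Andrei's architecture or training code.

    import torch
    import torch.nn as nn

    # Toy stand-ins for the generator and discriminator; Andrei's
    # generator "with filters" and fully connected discriminator are
    # far richer models.
    generator = nn.Sequential(nn.Linear(64, 64))
    discriminator = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    mse, bce = nn.MSELoss(), nn.BCELoss()

    def train(batches, first_stage_steps=1000, adv_weight=1e-3):
        for step, (low_res, high_res) in enumerate(batches):
            fake = generator(low_res)
            if step < first_stage_steps:
                # First training stage: content losses only (MSE here
                # stands in for MSE plus perceptual); the discriminator
                # is not yet enabled.
                g_loss = mse(fake, high_res)
                g_opt.zero_grad(); g_loss.backward(); g_opt.step()
            else:
                # Second stage: train discriminator and generator in
                # turns (alternating updates, not simultaneous).
                ones = torch.ones(high_res.size(0), 1)
                zeros = torch.zeros(high_res.size(0), 1)
                d_loss = (bce(discriminator(high_res), ones) +
                          bce(discriminator(fake.detach()), zeros))
                d_opt.zero_grad(); d_loss.backward(); d_opt.step()

                # Generator update: content losses plus the
                # discriminator-derived adversarial loss.
                adv = bce(discriminator(fake), ones)
                g_loss = mse(fake, high_res) + adv_weight * adv
                g_opt.zero_grad(); g_loss.backward(); g_opt.step()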
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Peng et al. (See NPL attached, “A Progressive Fusion Generative Adversarial Network for Realistic and Consistent Video Super-Resolution”) teaches: “To this end, we propose a novel progressive fusion network for video SR, in which frames are processed in a way of progressive separation and fusion for the thorough utilization of spatio-temporal information. We particularly incorporate multi-scale structure and hybrid convolutions into the network to capture a wide range of dependencies. We further propose a non-local operation to extract long-range spatio-temporal correlations directly, taking place of traditional motion estimation and motion compensation (ME&MC).”
Shi et al. (US 2018/0075581 A1) teaches: “The first network is trained to generate super-resolved image data from low-resolution image data and the training includes modifying first weights and biases to optimize processed visual data based on the comparison between the characteristics of the training data and the characteristics of the reference dataset.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DION J SATCHER whose telephone number is (703)756-5849. The examiner can normally be reached Monday - Thursday 5:30 am - 2:30 pm, Friday 5:30 am - 9:30 am PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Henok Shiferaw can be reached at (571) 272-4637. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DION J SATCHER/Patent Examiner, Art Unit 2676
/Henok Shiferaw/Supervisory Patent Examiner, Art Unit 2676