DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statement(s) (IDS) submitted on 12/15/2025 have been considered and placed in the application file.
Response to Arguments
Applicant’s arguments (see remarks), filed 12/22/2025, with respect to claims 1-7, 9-17 and 19-20 have been fully considered but, respectfully, are not persuasive.
The applicant argues on page 12, “Fleishman, however, does not disclose or suggest an image reconstruction machine learning model input based on a previous reconstruction image having a second different resolution. Instead, Fleishman discloses inputting current semantic features and historically-influenced semantic features into a neural network where the features are in the form of tensors of matrices. But a disclosure of current and historical features in the form of tensors of matrices are not a disclosure of, or suggestion of, "a previous reconstruction image of a previous frame having a different second resolution," as recited by amended claim 1.”
In response, the Office agrees. As discussed during a telephonic interview with the applicant's representative on 10/27/2025, the Office explained that the prior art by FLEISHMAN (US 20190043203 A1) does not explicitly teach "a previous reconstruction image of a previous frame having a different second resolution." FLEISHMAN states at paragraph [0045], “process 400 may include “obtain a video sequence of frames of image data and comprising a current frame” 402. This operation may include obtaining pre-processed raw image data. The pre-processing could include resolution reduction.”
However, the combination of FLEISHMAN (US 20190043203 A1) and BAE (US 20210352307 A1) explicitly teaches this aspect of the claimed invention as well as the remaining limitations of independent claims 1, 12 and 19.
Specifically, based on the breadth of the claim language, FLEISHMAN (US 20190043203 A1) explicitly teaches generating a reconstruction image using an image reconstruction machine learning model provided input based on the previous reconstruction image, the obtained frame image, and the semantic map (Fig. 4. Paragraph [0048]-FLEISHMAN discloses process 400 may include “extract historically-influenced semantically semantic features of the semantic segmentation map” 406. The result of such extraction may be considered historically-influenced features that represent the semantic labeling in the segmentation map. In paragraph [0049]-FLEISHMAN discloses process 400 may include “extract current semantic features of the current frame” 408 (wherein the extraction of both historical and current semantic features is performed by a neural network). In paragraph [0050]-FLEISHMAN discloses process 400 may include “generate a current and historical semantically segmented frame comprising using both the current semantic features and the historically-influenced semantic features as input to a neural network that indicates semantic labels for areas of the current historical semantically segmented frame” 410 (wherein the model may be generated by concatenating current semantic features and historically-influenced semantic features and inputting them into a neural network, such as a CNN). Further in paragraph [0052]-FLEISHMAN discloses process 400 may include “semantically update the 3D semantic segmentation model comprising using the current and historical semantically segmented frame” 412, which refers to registering the semantic labels or probabilities of the segmentation frame to the 3D semantic model), and including a second object having a visualization property indicated by the semantic map (Fig. 5. Paragraph [0065]-FLEISHMAN discloses referring to FIG. 7, process 500 may include “render segmentation map from 3D semantic model” 516, and this may involve obtaining the k-pose of the current frame being analyzed, and then projecting the 3D semantic model to an image plane formed by a camera at the k-pose. An example segmentation map 700 is provided at the current pose (or k-pose) of the current frame, and where the walls 702, chairs 704, and floor 706 shown in the map 700 are segmented from each other and each have an initial, historically-based (or influenced or based on information of previous frames) semantic label (wall, chair, floor for example). Objects that are adjacent each other and have the same label may not show as separate components on the segmentation map).
Although FLEISHMAN explicitly teaches generating a reconstruction image using an image reconstruction machine learning model provided input based on the previous reconstruction image, the obtained frame image, and the semantic map, and including a second object having a visualization property indicated by the semantic map, FLEISHMAN fails to explicitly teach obtaining a previous reconstruction image of a previous frame having a different second resolution; and generating a reconstruction image using an image reconstruction machine learning model provided input based on the previous reconstruction image, the obtained frame image, and the semantic map, having the different second resolution.
However, BAE (US 20210352307 A1) explicitly teaches obtaining a previous reconstruction image of a previous frame (Fig. 8. Paragraph [0029]-BAE discloses the video encoding process can identify and keep useful information that can be used to reconstruct a picture and disregard less important information for the reconstruction. In paragraph [0030]-BAE discloses the useful information of a picture being encoded (referred to as a “current picture”) include changes with respect to a reference picture (e.g., a picture previously encoded and reconstructed) (wherein changes can include position changes, luminosity changes, or color changes of the pixels, etc.). In paragraph [0101]-BAE discloses process 800 illustrates an example hierarchical approach of enhancing a video based on temporal importance levels and spatial importance levels. Input picture 802 is fed to stage 804, where the temporal importance of input picture 802 can be determined. In paragraph [0039]-BAE discloses a “spatial importance” herein refers to a level of importance or a level of interest associated with visual contents in a picture (e.g., a region of interest in the picture). A “temporal importance” herein refers to a level of importance or a level of interest associated with a picture (e.g., I-, P-, or B-picture) among a set of consecutive pictures (wherein I-picture uses no reference picture, P-picture uses the previous picture as a reference picture and B-picture uses both the previous and subsequent pictures as reference pictures, and each type has a level of temporal importance). Please also see Fig. 7) having a different second resolution (Fig. 7. Paragraph [0105]-BAE discloses after applying the resolution techniques, enhanced picture 810 can be generated. Process 700 can be performed on enhanced picture 810 region by region for further enhancing resolutions of its regions (e.g., including input region 702) based on spatial importance levels.
In paragraph [0093]-BAE discloses input region 702 of a picture is fed to stage 704, where the spatial importance of input region 702 can be determined (wherein spatial importance can be determined by semantic segmentation, instance segmentation and/or mask RNN); and
generating a reconstruction image (Fig. 8. Paragraph [0029]-BAE discloses the video encoding process can identify and keep useful information that can be used to reconstruct a picture and disregard less important information for the reconstruction. In paragraph [0034]-BAE discloses the reconstruction-based techniques can simulate an image formation process (e.g., a linear mapping process for simulating blurring and noises) of a low-resolution image from a high-resolution image and solve an inversion of the image formation process to generate a high-resolution version of the image. In paragraph [0101]-BAE discloses process 800 illustrates an example hierarchical approach of enhancing a video based on temporal importance levels and spatial importance levels. In paragraph [0039]-BAE discloses a “spatial importance” herein refers to a level of importance or a level of interest associated with visual contents in a picture (e.g., a region of interest in the picture). A “temporal importance” herein refers to a level of importance or a level of interest associated with a picture (e.g., I-, P-, or B-picture) among a set of consecutive pictures. Please also see Fig. 7) using an image reconstruction machine learning model (Fig. 8. Paragraph [0035]-BAE discloses the learning-based techniques (also referred to as “example-based techniques”) can predict details of a low-resolution image based on knowledge obtained by training based on a set of examples (referred to as a “training set”). The learning-based techniques can analyze statistical relationships between example low-resolution images and corresponding example high-resolution images to obtain such knowledge (wherein a machine learning, deep learning, and/or heuristic model of visual saliency may be used for determining semantic importance and constructing enhanced pictures)) provided input based on the previous reconstruction image, the obtained frame image (Fig. 8.
Paragraph [0041]-BAE discloses video sequence 100 can be a real-life video, a computer-generated video (e.g., computer game video), or a combination thereof (e.g., a real-life video with augmented-reality effects). In paragraph [0042]-BAE discloses video sequence 100 can include a series of pictures arranged temporally along a timeline, including pictures 102, 104, 106, and 108. Picture 102 is an I-picture. Picture 104 is a P-picture. Picture 106 is a B-picture (wherein each type has a level of temporal importance, I-picture uses no reference picture, P-picture uses the previous picture as a reference picture and B-picture uses both the previous and subsequent pictures as reference pictures)). In paragraph [0101]-BAE discloses FIG. 8, input picture 802 is fed to stage 804, where the temporal importance of input picture 802 can be determined (wherein input pictures may be I-, P-, or B- pictures)), and the semantic map (Fig. 7. Paragraph [0105]-BAE discloses after applying the resolution techniques, enhanced picture 810 can be generated. Process 700 can be performed on enhanced picture 810 region by region for further enhancing resolutions of its regions (e.g., including input region 702) based on spatial importance levels. In paragraph [0093]-BAE discloses input region 702 of a picture is fed to stage 704, where the spatial importance of input region 702 can be determined (wherein spatial importance can be determined by semantic segmentation, instance segmentation and/or mask RNN). In paragraph [0096]-BAE discloses if the semantic segmentation technique is used at stage 704, each pixel of the picture can be labeled with a class or label (e.g., a vehicle, an individual, a building, a tree, or any classification of visual contents) of what is represented. 
In paragraph [0097]-BAE discloses if the instance segmentation technique is used at stage 704, each pixel of an image can be further associated with a label of an instance of objects of the same class (wherein each class or label is assigned a level of spatial importance. In paragraph [0099]-BAE discloses if the spatial importance level of input region 702 is determined to be a second level, a DNN-based resolution-enhancement technique can be applied to it at stage 708. Please also read paragraph [0118-0119]), having the different second resolution (Fig. 8. Paragraph [0106]-BAE discloses after all regions of enhanced picture 810 are upscaled by process 700, at stage 812, the upscaled regions (e.g., including upscaled region 710) can be stitched to form a stitched picture. In paragraph [0107]-BAE discloses after stage 814, upscaled picture 816 can be outputted. By applying the hierarchical approach based on temporal importance levels and spatial importance levels, process 800 can further improve the enhancement quality for pictures of a video. Please also see Fig. 9 and read paragraph [0108-0113]).
The applicant argues on page 12, “Accordingly, Fleishman fails to disclose or suggest "generating a reconstruction image using an image reconstruction machine learning model provided input based on the previous reconstruction image, the obtained frame image, and the semantic map," as recited in claim 1.”
In response, the Office respectfully disagrees for the reasons stated above and below.
The applicant argues on page 12, “None of the other cited references cure his deficiency of Fleishman.”
In response, the Office respectfully disagrees for the reasons stated above and below.
The applicant argues on page 12, “Based on at least the foregoing, Applicant respectfully submits that independent claim 1 is patentable over the applied references.”
In response, the Office respectfully disagrees for the reasons stated above and below.
The applicant argues on page 12, “Applicant respectfully submits that independent claims 12 and 20 recites similar features as noted above with respect to claim 1, and is patentable for similar reasons.”
In response, the Office respectfully disagrees for the reasons stated above and below.
The applicant argues on page 13, “In view of the above, reconsideration and allowance of this application are now believed to be in order, and such actions are hereby solicited.”
In response, the Office respectfully disagrees for the reasons stated above and below.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 6-7, 11-15, 17 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over FLEISHMAN et al. (US 20190043203 A1), hereinafter referenced as FLEISHMAN in view of BAE et al. (US 20210352307 A1), hereinafter referenced as BAE.
Regarding claim 1, FLEISHMAN explicitly teaches a processor-implemented method (Fig. 4. Paragraph [0034]-a system and method is disclosed herein that recurrently uses historical semantic data to perform semantic segmentation of a current frame and to be used to update a 3D semantic model. In paragraph [0105]-FLEISHMAN discloses in various implementations, content services device(s) 1030 may include a cable television box, personal computer, network, telephone, Internet enabled devices or appliance capable of delivering digital information and/or content, and any other similar device capable of unidirectionally or bidirectionally communicating content between content providers and platform 1002 and/display 1020, via network 1060 or directly. Examples of content may include any media information including gaming information. Please also see Fig. 6), the method comprising:
generating a semantic map indicating a visualization property assigned to a first object of an obtained frame image having a first resolution (Fig. 4. Paragraph [0045]-FLEISHMAN discloses process 400 may include “obtain a video sequence of frames of image data and comprising a current frame” 402. This operation may include obtaining pre-processed raw image data. The pre-processing could include resolution reduction), wherein the obtained frame image is a current frame image and the semantic map is a current semantic map of the current frame image (Fig. 4. Paragraph [0044]-FLEISHMAN discloses referring to FIG. 4, a process 400 is provided for a method and system of recurrent semantic segmentation for imaging processing. In paragraph [0047]-FLEISHMAN discloses process 400 may optionally include “recurrently generate a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the semantic segmentation map from a 3D semantic segmentation model, wherein individual semantic segmentation maps are each associated with a different current frame from the video sequence” 404. Once established, the 3D semantic segmentation model may be projected to an image plane to form a segmentation map with the semantic labels from the model that have pixels or voxels on that plane. In paragraph [0047]-FLEISHMAN discloses the 3D semantic segmentation model may be updated with semantic segment labels each current frame being semantically analyzed so that the 3D semantic model reflects or represents the history of the semantic segmentation of the 3D space represented by the 3D semantic model up to a current point in time);
Although FLEISHMAN explicitly teaches generating a reconstruction image using an image reconstruction machine learning model provided input based on the previous reconstruction image, the obtained frame image, and the semantic map (Fig. 4. Paragraph [0048]-FLEISHMAN discloses process 400 may include “extract historically-influenced semantically semantic features of the semantic segmentation map” 406. The result of such extraction may be considered historically-influenced features that represent the semantic labeling in the segmentation map. In paragraph [0049]-FLEISHMAN discloses process 400 may include “extract current semantic features of the current frame” 408 (wherein the extraction of both historical and current semantic features is performed by a neural network). In paragraph [0050]-FLEISHMAN discloses process 400 may include “generate a current and historical semantically segmented frame comprising using both the current semantic features and the historically-influenced semantic features as input to a neural network that indicates semantic labels for areas of the current historical semantically segmented frame” 410 (wherein the model may be generated by concatenating current semantic features and historically-influenced semantic features and inputting them into a neural network, such as a CNN). Further in paragraph [0052]-FLEISHMAN discloses process 400 may include “semantically update the 3D semantic segmentation model comprising using the current and historical semantically segmented frame” 412, which refers to registering the semantic labels or probabilities of the segmentation frame to the 3D semantic model), and including a second object having a visualization property indicated by the semantic map (Fig. 5. Paragraph [0065]-FLEISHMAN discloses referring to FIG. 7, process 500 may include “render segmentation map from 3D semantic model” 516, and this may involve obtaining the k-pose of the current frame being analyzed, and then projecting the 3D semantic model to an image plane formed by a camera at the k-pose. An example segmentation map 700 is provided at the current pose (or k-pose) of the current frame, and where the walls 702, chairs 704, and floor 706 shown in the map 700 are segmented from each other and each have an initial, historically-based (or influenced or based on information of previous frames) semantic label (wall, chair, floor for example). Objects that are adjacent each other and have the same label may not show as separate components on the segmentation map), FLEISHMAN fails to explicitly teach obtaining a previous reconstruction image of a previous frame having a different second resolution; and generating a reconstruction image using an image reconstruction machine learning model provided input based on the previous reconstruction image, the obtained frame image, and the semantic map, having the different second resolution.
However, BAE explicitly teaches obtaining a previous reconstruction image of a previous frame (Fig. 8. Paragraph [0029]-BAE discloses the video encoding process can identify and keep useful information that can be used to reconstruct a picture and disregard less important information for the reconstruction. In paragraph [0030]-BAE discloses the useful information of a picture being encoded (referred to as a “current picture”) include changes with respect to a reference picture (e.g., a picture previously encoded and reconstructed) (wherein changes can include position changes, luminosity changes, or color changes of the pixels, etc.). In paragraph [0101]-BAE discloses process 800 illustrates an example hierarchical approach of enhancing a video based on temporal importance levels and spatial importance levels. Input picture 802 is fed to stage 804, where the temporal importance of input picture 802 can be determined. In paragraph [0039]-BAE discloses a “spatial importance” herein refers to a level of importance or a level of interest associated with visual contents in a picture (e.g., a region of interest in the picture). A “temporal importance” herein refers to a level of importance or a level of interest associated with a picture (e.g., I-, P-, or B-picture) among a set of consecutive pictures (wherein I-picture uses no reference picture, P-picture uses the previous picture as a reference picture and B-picture uses both the previous and subsequent pictures as reference pictures, and each type has a level of temporal importance). Please also see Fig. 7) having a different second resolution (Fig. 7. Paragraph [0105]-BAE discloses after applying the resolution techniques, enhanced picture 810 can be generated. Process 700 can be performed on enhanced picture 810 region by region for further enhancing resolutions of its regions (e.g., including input region 702) based on spatial importance levels.
In paragraph [0093]-BAE discloses input region 702 of a picture is fed to stage 704, where the spatial importance of input region 702 can be determined (wherein spatial importance can be determined by semantic segmentation, instance segmentation and/or mask RNN); and
generating a reconstruction image (Fig. 8. Paragraph [0029]-BAE discloses the video encoding process can identify and keep useful information that can be used to reconstruct a picture and disregard less important information for the reconstruction. In paragraph [0034]-BAE discloses the reconstruction-based techniques can simulate an image formation process (e.g., a linear mapping process for simulating blurring and noises) of a low-resolution image from a high-resolution image and solve an inversion of the image formation process to generate a high-resolution version of the image. In paragraph [0101]-BAE discloses process 800 illustrates an example hierarchical approach of enhancing a video based on temporal importance levels and spatial importance levels. In paragraph [0039]-BAE discloses a “spatial importance” herein refers to a level of importance or a level of interest associated with visual contents in a picture (e.g., a region of interest in the picture). A “temporal importance” herein refers to a level of importance or a level of interest associated with a picture (e.g., I-, P-, or B-picture) among a set of consecutive pictures. Please also see Fig. 7) using an image reconstruction machine learning model (Fig. 8. Paragraph [0035]-BAE discloses the learning-based techniques (also referred to as “example-based techniques”) can predict details of a low-resolution image based on knowledge obtained by training based on a set of examples (referred to as a “training set”). The learning-based techniques can analyze statistical relationships between example low-resolution images and corresponding example high-resolution images to obtain such knowledge (wherein a machine learning, deep learning, and/or heuristic model of visual saliency may be used for determining semantic importance and constructing enhanced pictures)) provided input based on the previous reconstruction image, the obtained frame image (Fig. 8.
Paragraph [0041]-BAE discloses video sequence 100 can be a real-life video, a computer-generated video (e.g., computer game video), or a combination thereof (e.g., a real-life video with augmented-reality effects). In paragraph [0042]-BAE discloses video sequence 100 can include a series of pictures arranged temporally along a timeline, including pictures 102, 104, 106, and 108. Picture 102 is an I-picture. Picture 104 is a P-picture. Picture 106 is a B-picture (wherein each type has a level of temporal importance, I-picture uses no reference picture, P-picture uses the previous picture as a reference picture and B-picture uses both the previous and subsequent pictures as reference pictures)). In paragraph [0101]-BAE discloses FIG. 8, input picture 802 is fed to stage 804, where the temporal importance of input picture 802 can be determined (wherein input pictures may be I-, P-, or B- pictures)), and the semantic map (Fig. 7. Paragraph [0105]-BAE discloses after applying the resolution techniques, enhanced picture 810 can be generated. Process 700 can be performed on enhanced picture 810 region by region for further enhancing resolutions of its regions (e.g., including input region 702) based on spatial importance levels. In paragraph [0093]-BAE discloses input region 702 of a picture is fed to stage 704, where the spatial importance of input region 702 can be determined (wherein spatial importance can be determined by semantic segmentation, instance segmentation and/or mask RNN). In paragraph [0096]-BAE discloses if the semantic segmentation technique is used at stage 704, each pixel of the picture can be labeled with a class or label (e.g., a vehicle, an individual, a building, a tree, or any classification of visual contents) of what is represented. 
In paragraph [0097]-BAE discloses if the instance segmentation technique is used at stage 704, each pixel of an image can be further associated with a label of an instance of objects of the same class (wherein each class or label is assigned a level of spatial importance. In paragraph [0099]-BAE discloses if the spatial importance level of input region 702 is determined to be a second level, a DNN-based resolution-enhancement technique can be applied to it at stage 708. Please also read paragraph [0118-0119]), having the different second resolution (Fig. 8. Paragraph [0106]-BAE discloses after all regions of enhanced picture 810 are upscaled by process 700, at stage 812, the upscaled regions (e.g., including upscaled region 710) can be stitched to form a stitched picture. In paragraph [0107]-BAE discloses after stage 814, upscaled picture 816 can be outputted. By applying the hierarchical approach based on temporal importance levels and spatial importance levels, process 800 can further improve the enhancement quality for pictures of a video. Please also see Fig. 9 and read paragraph [0108-0113]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of FLEISHMAN of a processor-implemented method with the teachings of BAE of obtaining a previous reconstruction image of a previous frame having a different second resolution; and generating a reconstruction image using an image reconstruction machine learning model provided input based on the previous reconstruction image, the obtained frame image, and the semantic map, having the different second resolution.
In the combination, FLEISHMAN’s method would include obtaining a previous reconstruction image of a previous frame having a different second resolution; and generating a reconstruction image using an image reconstruction machine learning model provided input based on the previous reconstruction image, the obtained frame image, and the semantic map, having the different second resolution and including a second object having a visualization property indicated by the semantic map.
The motivation behind the modification would have been to obtain a method that improves the efficiency of semantic segmentation as well as the image quality for important regions, since both FLEISHMAN and BAE concern semantic segmentation and image analysis. FLEISHMAN improves the accuracy of the semantic labels by using historical data and improves the efficiency of semantic segmentation, which, in turn, permits the semantic segmentation to be performed on smaller devices, while BAE’s systems and methods greatly improve the image quality for the more important portions after upscaling without greatly increasing overall computational costs for resolution enhancement and transcoding. Please see FLEISHMAN et al. (US 20190043203 A1), Paragraphs [0032], [0040], and [0056] and BAE et al. (US 20210352307 A1), Abstract and Paragraph [0040].
Regarding claim 2, FLEISHMAN in view of BAE explicitly teaches the method of claim 1, and FLEISHMAN further teaches wherein the generating of the semantic map (Fig. 4. Paragraph [0044]-FLEISHMAN discloses referring to FIG. 4, a process 400 is provided for a method and system of recurrent semantic segmentation for imaging processing) comprises:
obtaining semantic data for the visualization property (Fig. 4. Paragraph [0047]-FLEISHMAN discloses process 400 may include “recurrently generate a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the semantic segmentation map from a 3D semantic segmentation model, wherein individual semantic segmentation maps are each associated with a different current frame from the video sequence” 404. Once established, the 3D semantic segmentation model may be projected to an image plane to form a segmentation map with the semantic labels from the model that have pixels or voxels on that plane); and
generating, based on an object identifier map comprising the semantic data and regions classified by plural objects of the obtained frame image, a semantic map indicating a corresponding visualization property of a corresponding object of the plural objects (Fig. 5. Paragraph [0065]-FLEISHMAN discloses referring to FIG. 7, process 500 may include “render segmentation map from 3D semantic model” 516, and this may involve obtaining the k-pose of the current frame being analyzed, and then projecting the 3D semantic model to an image plane formed by a camera at the k-pose. An example segmentation map 700 is provided at the current pose (or k-pose) of the current frame, and where the walls 702, chairs 704, and floor 706 shown in the map 700 are segmented from each other and each have an initial, historically-based (or influenced or based on information of previous frames) semantic label (wall, chair, floor for example). Objects that are adjacent each other and have the same label may not show as separate components on the segmentation map).
Regarding claim 3, FLEISHMAN in view of BAE explicitly teaches the method of claim 2, but FLEISHMAN fails to explicitly teach wherein the obtaining of the semantic data comprises receiving a user input to assign, for the plural objects of the obtained frame image, a corresponding visualization property for one or more corresponding objects.
However, BAE explicitly teaches wherein the obtaining of the semantic data (Fig. 7. Paragraph [0093]-BAE discloses FIG. 7 illustrates a schematic diagram illustrating an example process 700 of video processing. In FIG. 7, input region 702 of a picture is fed to stage 704, where the spatial importance of input region 702 can be determined. In paragraph [0095]-BAE discloses the object detection technique can identify a bounding region (e.g., a rectangular box) in the picture, which encloses an identified object. Based on whether input region 702 is in the bounding region, a spatial importance level can be assigned to input region 702. In paragraph [0096]-BAE discloses if the semantic segmentation technique is used at stage 704, each pixel of the picture can be labeled with a class or label (e.g., a vehicle, an individual, a building, a tree, or any classification of visual contents) of what is represented. In paragraph [0097]-BAE discloses if the instance segmentation technique is used at stage 704, each pixel of an image can be further associated with a label of an instance of objects of the same class. For a class of “individuals,” the instance segmentation technique can differentiate and associate each pixel in the class with labels of “person 1,” “person 2,” and so on (wherein the semantic segmentation technique can be used to determine spatial importance levels of different classes, and the instance segmentation technique can be applied to each class to determine spatial importance levels of different instances in the same class)) comprises receiving a user input to assign, for the plural objects of the obtained frame image, a corresponding visualization property for one or more corresponding objects (Fig. 7. Paragraph [0096]-BAE discloses different classes can be predetermined with different spatial importance levels based on how interested a viewer can be of each class. 
The higher the value of the spatial importance level of a class, the more interested the viewer can be of the class. For example, a class of “background” can be associated with a spatial importance level of 0, a class of “buildings” can be associated with a spatial importance level of 1, a class of “vehicle” can be associated with a spatial importance level of 2, a class of “individuals” can be associated with a spatial importance level of 3, or the like. In paragraph [0098]-BAE discloses the associations between the classes (or objects) and spatial importance levels can be assigned by a user before performing stage 704. Further in paragraph [0103]-BAE discloses scenes of fast actions (e.g., fighting scenes), close-up shots, or stunning visual effects can have higher temporal importance levels than other scenes. The associations between the pictures and the temporal importance levels can be assigned by a user before performing stage 804).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the processor-implemented method taught by FLEISHMAN in view of BAE with the teaching of BAE wherein the obtaining of the semantic data comprises receiving a user input to assign, for the plural objects of the obtained frame image, a corresponding visualization property for one or more corresponding objects.
Thus modified, FLEISHMAN’s method would be one wherein the obtaining of the semantic data comprises receiving a user input to assign, for the plural objects of the obtained frame image, a corresponding visualization property for one or more corresponding objects.
The motivation behind the modification would have been to obtain a method that improves the efficiency of semantic segmentation as well as the image quality for important regions, since both FLEISHMAN and BAE concern semantic segmentation and image analysis. FLEISHMAN improves both the accuracy of the semantic labels, by using historical data, and the efficiency of semantic segmentation, which in turn permits the semantic segmentation to be performed on smaller devices, while BAE’s systems and methods greatly improve the image quality of the more important portions after upscaling without greatly increasing the overall computational cost of resolution enhancement and transcoding. Please see FLEISHMAN et al. (US 20190043203 A1), paragraphs [0032], [0040], and [0056], and BAE et al. (US 20210352307 A1), Abstract and paragraph [0040].
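For illustration only, the user-assigned association cited above from BAE paragraphs [0096] and [0098] can be sketched as a lookup from semantic class labels to spatial importance levels. The example levels (background 0, buildings 1, vehicle 2, individuals 3) are those quoted from BAE; the function name importance_map, the user_overrides parameter, and the nested-list label map are assumptions made for this sketch and do not appear in either reference.

```python
# Example class-to-level associations quoted from BAE paragraph [0096].
DEFAULT_LEVELS = {"background": 0, "buildings": 1, "vehicle": 2, "individuals": 3}


def importance_map(label_map, levels=DEFAULT_LEVELS, user_overrides=None):
    """Convert a per-pixel label map into a per-pixel importance map.

    label_map is a list of rows, each row a list of class labels
    (a hypothetical representation chosen for this sketch).
    user_overrides models a user assigning levels before processing.
    """
    table = dict(levels)
    if user_overrides:
        table.update(user_overrides)  # user-assigned levels take precedence
    return [[table[label] for label in row] for row in label_map]


labels = [["background", "vehicle"], ["individuals", "buildings"]]
default_result = importance_map(labels)
override_result = importance_map(labels, user_overrides={"vehicle": 5})
```

Here the override stands in for the user input; a label absent from the table raises a KeyError rather than being given a guessed level.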
Regarding claim 6, FLEISHMAN in view of BAE explicitly teach the method of claim 1. FLEISHMAN further teaches wherein the generating of the semantic map comprises indicating, for a plurality of objects in the obtained frame image, a visualization property for each object of the plurality of objects (Fig. 4. Paragraph [0057]-FLEISHMAN discloses the segmentation output unit 822 outputs semantic labels, or class or probabilities for the labels or classes, and provides them as part of the current-historical (C-H) semantically segmented (or just segmented) frame 824. Further in paragraph [0065]-FLEISHMAN discloses referring to FIG. 7, process 500 may include “render segmentation map from 3D semantic model” 516, and this may involve obtaining the k-pose of the current frame being analyzed, and then projecting the 3D semantic model to an image plane formed by a camera at the k-pose. An example segmentation map 700 is provided at the current pose (or k-pose) of the current frame, and where the walls 702, chairs 704, and floor 706 shown in the map 700 are segmented from each other and each have an initial, historically-based (or influenced or based on information of previous frames) semantic label (wall, chair, floor for example)).
Regarding claim 7, FLEISHMAN in view of BAE explicitly teach the method of claim 1. FLEISHMAN further teaches wherein the image reconstruction machine learning model (Fig. 4. Paragraph [0037]-FLEISHMAN discloses in the present solution, the recurrent segmentation operation (usage of the semantic information from previous frames as reflected in the 3D semantic model) is learned from the data and tailored to specific scenarios. In paragraph [0044]-FLEISHMAN discloses referring to FIG. 4, a process 400 is provided for a method and system of recurrent semantic segmentation for imaging processing (wherein the operations may be performed by a neural network). Please also read paragraph [0036 and 0051]) is a machine learning model trained using an objective function (Fig. 4. Paragraph [0078]-FLEISHMAN discloses the training of the architecture in a supervised-learning setting may include a training set of RGBD video-sequences, where the frames in each sequence have semantic information. Such a video can be obtained using either (i) a labor intensive method manually segmenting each frame, (ii) segmenting a reconstructed 3D model, or (iii) using synthetic data. See, Dai et al., “Richly-annotated 3D Reconstructions of Indoor Scenes”, Computer Vision and Pattern Recognition (CVPR) (2017). In paragraph [0079]-FLEISHMAN discloses training a recurrent network requires rendered semantic maps of the 3D semantic model. The training may be performed in several operations) calculated based on a second visualization property of a third object of a temporary output image and a third visualization property indicated by a training semantic map together with a difference between a temporary output image and a true value output image obtained from a training input image and the training semantic map (Fig. 4. Paragraph [0080]-FLEISHMAN discloses the first training operation may involve initialization by training a standard semantic-segmentation network. 
First, a standard single frame CNN-based semantic segmentation algorithm is trained. This resulting initial network may be denoted as n.sub.1 for example. In paragraph [0081]-FLEISHMAN discloses the next training operation may involve data preparation, which refers to generating training data for the recurrent architecture. Given the current network, training data was generated for the next recurrent phase in the form of a triplet (RGBD frame, rendered semantic map of the 3D semantic model, ground truth semantic segmentation). The system runs as shown in FIGS. 6 and 8 with the current network on short sequences of N frames, where N is a tunable parameter. A matching semantic map was rendered for the last frame from the last camera pose in each sequence, and then saved with the frame as training data for next stage. The semantic map was represented as an image of H*W pixels (the size of the frame) with C (the number of classes that the system supports) channels. Since only X<C probabilities are remembered in each voxel, lower C-X probabilities are truncated to zero, and the remaining X probabilities are renormalized to be a proper distribution).
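For illustration only, the truncation step cited above from FLEISHMAN paragraph [0081] (retaining X of C class probabilities, setting the lower C-X to zero, and renormalizing the remainder) can be sketched as follows. The function name truncate_and_renormalize, the plain-list representation of a per-voxel distribution, and the tie-breaking by index order are assumptions made for this sketch; the reference does not specify an implementation.

```python
def truncate_and_renormalize(probs, keep):
    """Keep the `keep` largest probabilities, zero the rest, renormalize.

    probs is a list of C class probabilities for one voxel or pixel;
    keep plays the role of X in the cited passage.
    """
    if not 0 < keep <= len(probs):
        raise ValueError("keep must be between 1 and the number of classes")
    # Indices of the `keep` largest entries (ties broken by lower index).
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:keep]
    kept = set(top)
    truncated = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(truncated)
    if total <= 0:
        raise ValueError("retained probabilities must sum to a positive value")
    # Rescale so the retained entries form a proper distribution again.
    return [p / total for p in truncated]


result = truncate_and_renormalize([0.5, 0.3, 0.15, 0.05], keep=2)
```

With C = 4 and X = 2, the two smallest entries are zeroed and the retained pair 0.5 and 0.3 is rescaled by their sum of 0.8.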
Regarding claim 11, FLEISHMAN in view of BAE explicitly teach the method of claim 1. FLEISHMAN fails to explicitly teach wherein the different second resolution is higher than the first resolution.
However, BAE explicitly teaches wherein the different second resolution is higher than the first resolution (Fig. 8. Paragraph [0101]-BAE discloses FIG. 8 illustrates a schematic diagram illustrating an example process 800 of video processing. Process 800 illustrates an example hierarchical approach of enhancing a video based on temporal importance levels and spatial importance levels. In paragraph [0105]-BAE discloses after applying the resolution techniques, enhanced picture 810 can be generated. Process 700 can be performed on enhanced picture 810 region by region for further enhancing resolutions of its regions (e.g., including input region 702) based on spatial importance levels. In process 700, different regions of different spatial importance levels can be upscaled using techniques).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the processor-implemented method taught by FLEISHMAN in view of BAE with the teaching of BAE wherein the different second resolution is higher than the first resolution.
Thus modified, FLEISHMAN’s method would be one wherein the different second resolution is higher than the first resolution.
The motivation behind the modification would have been to obtain a method that improves the efficiency of semantic segmentation as well as the image quality for important regions, since both FLEISHMAN and BAE concern semantic segmentation and image analysis. FLEISHMAN improves both the accuracy of the semantic labels, by using historical data, and the efficiency of semantic segmentation, which in turn permits the semantic segmentation to be performed on smaller devices, while BAE’s systems and methods greatly improve the image quality of the more important portions after upscaling without greatly increasing the overall computational cost of resolution enhancement and transcoding. Please see FLEISHMAN et al. (US 20190043203 A1), paragraphs [0032], [0040], and [0056], and BAE et al. (US 20210352307 A1), Abstract and paragraph [0040].
Regarding claim 12, FLEISHMAN explicitly teaches an apparatus (Fig. 9, #900 called an image processing system. Paragraph [0087]. Please also see Fig. 1 and 10), comprising:
a processor (Fig. 9, #920 called processors. Paragraph [0093]-FLEISHMAN discloses the image processing system 900 may have one or more processors 920. Please also read paragraph [0083]) configured to:
generate a semantic map indicating a first visualization property assigned to a first object within a frame image having a first resolution (Fig. 4. Paragraph [0045]-FLEISHMAN discloses process 400 may include “obtain a video sequence of frames of image data and comprising a current frame” 402. This operation may include obtaining pre-processed raw image data. The pre-processing could include resolution reduction), wherein the frame image is a current frame image and the semantic map is a current semantic map of the current frame image (Fig. 4. Paragraph [0047]-FLEISHMAN discloses process 400 may optionally include “recurrently generate a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the semantic segmentation map from a 3D semantic segmentation model, wherein individual semantic segmentation maps are each associated with a different current frame from the video sequence” 404. Once established, the 3D semantic segmentation model may be projected to an image plane to form a segmentation map with the semantic labels from the model that have pixels or voxels on that plane. The 3D semantic segmentation model may be updated with semantic segment labels each current frame being semantically analyzed so that the 3D semantic model reflects or represents the history of the semantic segmentation of the 3D space represented by the 3D semantic model up to a current point in time. Please also read paragraph [0048, 0050, 0052 and 0065]); and
Although FLEISHMAN explicitly teaches generating a reconstruction image, by using an image reconstruction machine learning model provided the previous reconstruction image, the frame image and the semantic map (Fig. 4. Paragraph [0048]-FLEISHMAN discloses process 400 may include “extract historically-influenced semantic features of the semantic segmentation map” 406. The result of such extraction may be considered historically-influenced features that represent the semantic labeling in the segmentation map. In paragraph [0049]-FLEISHMAN discloses process 400 may include “extract current semantic features of the current frame” 408 (wherein the extraction of both historical and current semantic features is performed by a neural network). In paragraph [0050]-FLEISHMAN discloses process 400 may include “generate a current and historical semantically segmented frame comprising using both the current semantic features and the historically-influenced semantic features as input to a neural network that indicates semantic labels for areas of the current historical semantically segmented frame” 410 (wherein the model may be generated by concatenating current semantic features and historically-influenced semantic features and inputting them into a neural network, such as a CNN). Further in paragraph [0052]-FLEISHMAN discloses process 400 may include “semantically update the 3D semantic segmentation model comprising using the current and historical semantically segmented frame” 412, which refers to registering the semantic labels or probabilities of the segmentation frame to the 3D semantic model) and including a second object having a second visualization property indicated by the semantic map (Fig. 5. Paragraph [0065]-FLEISHMAN discloses referring to FIG. 7, process 500 may include “render segmentation map from 3D semantic model” 516, and this may involve obtaining the k-pose of the current frame being analyzed, and then projecting the 3D semantic model to an image plane formed by a camera at the k-pose. An example segmentation map 700 is provided at the current pose (or k-pose) of the current frame, and where the walls 702, chairs 704, and floor 706 shown in the map 700 are segmented from each other and each have an initial, historically-based (or influenced or based on information of previous frames) semantic label (wall, chair, floor for example). Objects that are adjacent each other and have the same label may not show as separate components on the segmentation map).
FLEISHMAN fails to explicitly teach obtaining a previous reconstruction image of a previous frame having a different second resolution; and generating a reconstruction image, by using an image reconstruction machine learning model provided the previous reconstruction image, the frame image and the semantic map, having the different second resolution and including a second object having a second visualization property indicated by the semantic map.
However, BAE explicitly teaches obtaining a previous reconstruction image of a previous frame (Fig. 8. Paragraph [0029]-BAE discloses the video encoding process can identify and keep useful information that can be used to reconstruct a picture and disregard less important information for the reconstruction. In paragraph [0030]-BAE discloses the useful information of a picture being encoded (referred to as a “current picture”) include changes with respect to a reference picture (e.g., a picture previously encoded and reconstructed) (wherein changes can include position changes, luminosity changes, or color changes of the pixels, etc.). In paragraph [0101]-BAE discloses process 800 illustrates an example hierarchical approach of enhancing a video based on temporal importance levels and spatial importance levels. Input picture 802 is fed to stage 804, where the temporal importance of input picture 802 can be determined. In paragraph [0039]-BAE discloses a “spatial importance” herein refers to a level of importance or a level of interest associated with visual contents in a picture (e.g., a region of interest in the picture). A “temporal importance” herein refers to a level of importance or a level of interest associated with a picture (e.g., I-, P-, or B-picture) among a set of consecutive pictures (wherein I-picture uses no reference picture, P-picture uses the previous picture as a reference picture and B-picture uses both the previous and subsequent pictures as reference pictures, and each type has a level of temporal importance). Please also see Fig. 7) having a different second resolution (Fig. 7. Paragraph [0105]-BAE discloses after applying the resolution techniques, enhanced picture 810 can be generated. Process 700 can be performed on enhanced picture 810 region by region for further enhancing resolutions of its regions (e.g., including input region 702) based on spatial importance levels. 
In paragraph [0093]-BAE discloses input region 702 of a picture is fed to stage 704, where the spatial importance of input region 702 can be determined (wherein spatial importance can be determined by semantic segmentation, instance segmentation and/or mask RNN). Please also see Fig. 8-9);
and generating a reconstruction image (Fig. 8. Paragraph [0029]-BAE discloses the video encoding process can identify and keep useful information that can be used to reconstruct a picture and disregard less important information for the reconstruction. In paragraph [0034]-BAE discloses the reconstruction-based techniques can simulate an image formation process (e.g., a linear mapping process for simulating blurring and noises) of a low-resolution image from a high-resolution image and solve an inversion of the image formation process to generate a high-resolution version of the image. In paragraph [0101]-BAE discloses process 800 illustrates an example hierarchical approach of enhancing a video based on temporal importance levels and spatial importance levels. In paragraph [0039]-BAE discloses a “spatial importance” herein refers to a level of importance or a level of interest associated with visual contents in a picture (e.g., a region of interest in the picture). A “temporal importance” herein refers to a level of importance or a level of interest associated with a picture (e.g., I-, P-, or B-picture) among a set of consecutive pictures. Please also see Fig. 7), by using an image reconstruction machine learning model (Fig. 8. Paragraph [0035]-BAE discloses the learning-based techniques (also referred to as “example-based techniques”) can predict details of a low-resolution image based on knowledge obtained by training based on a set of examples (referred to as a “training set”). The learning-based techniques can analyze statistical relationships between example low-resolution images and corresponding example high-resolution images to obtain such knowledge (wherein a machine learning, deep learning, and/or heuristic model of visual saliency may be used for determining semantic importance and constructing enhanced pictures)) provided the previous reconstruction image, the frame image (Fig. 8. 
Paragraph [0041]-BAE discloses video sequence 100 can be a real-life video, a computer-generated video (e.g., computer game video), or a combination thereof (e.g., a real-life video with augmented-reality effects). In paragraph [0042]-BAE discloses video sequence 100 can include a series of pictures arranged temporally along a timeline, including pictures 102, 104, 106, and 108. Picture 102 is an I-picture. Picture 104 is a P-picture. Picture 106 is a B-picture (wherein each type has a level of temporal importance, I-picture uses no reference picture, P-picture uses the previous picture as a reference picture and B-picture uses both the previous and subsequent pictures as reference pictures). In paragraph [0101]-BAE discloses FIG. 8, input picture 802 is fed to stage 804, where the temporal importance of input picture 802 can be determined (wherein input pictures may be I-, P-, or B-pictures)) and the semantic map (Fig. 7. Paragraph [0105]-BAE discloses after applying the resolution techniques, enhanced picture 810 can be generated. Process 700 can be performed on enhanced picture 810 region by region for further enhancing resolutions of its regions (e.g., including input region 702) based on spatial importance levels. In paragraph [0093]-BAE discloses input region 702 of a picture is fed to stage 704, where the spatial importance of input region 702 can be determined (wherein spatial importance can be determined by semantic segmentation, instance segmentation and/or mask RNN). In paragraph [0096]-BAE discloses if the semantic segmentation technique is used at stage 704, each pixel of the picture can be labeled with a class or label (e.g., a vehicle, an individual, a building, a tree, or any classification of visual contents) of what is represented. 
In paragraph [0097]-BAE discloses if the instance segmentation technique is used at stage 704, each pixel of an image can be further associated with a label of an instance of objects of the same class (wherein each class or label is assigned a level of spatial importance). In paragraph [0099]-BAE discloses if the spatial importance level of input region 702 is determined to be a second level, a DNN-based resolution-enhancement technique can be applied to it at stage 708. Please also read paragraph [0118-0119]) having the different second resolution (Fig. 8. Paragraph [0106]-BAE discloses after all regions of enhanced picture 810 are upscaled by process 700, at stage 812, the upscaled regions (e.g., including upscaled region 710) can be stitched to form a stitched picture. In paragraph [0107]-BAE discloses after stage 814, upscaled picture 816 can be outputted. By applying the hierarchical approach based on temporal importance levels and spatial importance levels, process 800 can further improve the enhancement quality for pictures of a video. Please also see Fig. 9 and read paragraph [0108-0113]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the apparatus taught by FLEISHMAN with the teachings of BAE of obtaining a previous reconstruction image of a previous frame having a different second resolution; and generating a reconstruction image, by using an image reconstruction machine learning model provided the previous reconstruction image, the frame image and the semantic map, having the different second resolution.
Thus modified, FLEISHMAN’s apparatus would obtain a previous reconstruction image of a previous frame having a different second resolution; and generate a reconstruction image, by using an image reconstruction machine learning model provided the previous reconstruction image, the frame image and the semantic map, having the different second resolution and including a second object having a second visualization property indicated by the semantic map.
The motivation behind the modification would have been to obtain an apparatus that improves the efficiency of semantic segmentation as well as the image quality for important regions, since both FLEISHMAN and BAE concern semantic segmentation and image analysis. FLEISHMAN improves both the accuracy of the semantic labels, by using historical data, and the efficiency of semantic segmentation, which in turn permits the semantic segmentation to be performed on smaller devices, while BAE’s systems and methods greatly improve the image quality of the more important portions after upscaling without greatly increasing the overall computational cost of resolution enhancement and transcoding. Please see FLEISHMAN et al. (US 20190043203 A1), paragraphs [0032], [0040], and [0056], and BAE et al. (US 20210352307 A1), Abstract and paragraph [0040].
Regarding claim 13, FLEISHMAN in view of BAE explicitly teach the apparatus of claim 12, wherein the processor (Fig. 9, #920 called processors. Paragraph [0093]-FLEISHMAN discloses the image processing system 900 may have one or more processors 920. Please also read paragraph [0083]) is further configured to:
obtain semantic data for the first visualization property (Fig. 4. Paragraph [0044]-FLEISHMAN discloses referring to FIG. 4, a process 400 is provided for a method and system of recurrent semantic segmentation for imaging processing. In paragraph [0047]-FLEISHMAN discloses process 400 may include “recurrently generate a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the semantic segmentation map from a 3D semantic segmentation model, wherein individual semantic segmentation maps are each associated with a different current frame from the video sequence” 404. Once established, the 3D semantic segmentation model may be projected to an image plane to form a segmentation map with the semantic labels from the model that have pixels or voxels on that plane); and
generate, based on an object identifier map comprising the obtained semantic data and regions classified by plural objects within the frame image, a semantic map indicating a corresponding visualization property assigned to a corresponding object of the plural objects through a region corresponding to each object (Fig. 5. Paragraph [0065]-FLEISHMAN discloses referring to FIG. 7, process 500 may include “render segmentation map from 3D semantic model” 516, and this may involve obtaining the k-pose of the current frame being analyzed, and then projecting the 3D semantic model to an image plane formed by a camera at the k-pose. An example segmentation map 700 is provided at the current pose (or k-pose) of the current frame, and where the walls 702, chairs 704, and floor 706 shown in the map 700 are segmented from each other and each have an initial, historically-based (or influenced or based on information of previous frames) semantic label (wall, chair, floor for example). Objects that are adjacent each other and have the same label may not show as separate components on the segmentation map).
Regarding claim 14, FLEISHMAN in view of BAE explicitly teach the apparatus of claim 13. FLEISHMAN fails to explicitly teach wherein the processor is further configured to obtain the semantic data by receiving, for each object of the frame image, an input visualization property based on a user input as the corresponding visualization property of the corresponding object.
However, BAE explicitly teaches wherein the processor is further configured to obtain the semantic data by receiving, for each object of the frame image (Fig. 7. Paragraph [0093]-BAE discloses FIG. 7 illustrates a schematic diagram illustrating an example process 700 of video processing. In FIG. 7, input region 702 of a picture is fed to stage 704, where the spatial importance of input region 702 can be determined. In paragraph [0095]-BAE discloses the object detection technique can identify a bounding region (e.g., a rectangular box) in the picture, which encloses an identified object. Based on whether input region 702 is in the bounding region, a spatial importance level can be assigned to input region 702. In paragraph [0096]-BAE discloses if the semantic segmentation technique is used at stage 704, each pixel of the picture can be labeled with a class or label (e.g., a vehicle, an individual, a building, a tree, or any classification of visual contents) of what is represented. In paragraph [0097]-BAE discloses if the instance segmentation technique is used at stage 704, each pixel of an image can be further associated with a label of an instance of objects of the same class. For a class of “individuals,” the instance segmentation technique can differentiate and associate each pixel in the class with labels of “person 1,” “person 2,” and so on (wherein the semantic segmentation technique can be used to determine spatial importance levels of different classes, and the instance segmentation technique can be applied to each class to determine spatial importance levels of different instances in the same class)), an input visualization property based on a user input as the corresponding visualization property of the corresponding object (Fig. 7. Paragraph [0096]-BAE discloses different classes can be predetermined with different spatial importance levels based on how interested a viewer can be of each class. 
The higher the value of the spatial importance level of a class, the more interested the viewer can be of the class. For example, a class of “background” can be associated with a spatial importance level of 0, a class of “buildings” can be associated with a spatial importance level of 1, a class of “vehicle” can be associated with a spatial importance level of 2, a class of “individuals” can be associated with a spatial importance level of 3, or the like. In paragraph [0098]-BAE discloses the associations between the classes (or objects) and spatial importance levels can be assigned by a user before performing stage 704. Further in paragraph [0103]-BAE discloses scenes of fast actions (e.g., fighting scenes), close-up shots, or stunning visual effects can have higher temporal importance levels than other scenes. The associations between the pictures and the temporal importance levels can be assigned by a user before performing stage 804).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the apparatus taught by FLEISHMAN in view of BAE with the teaching of BAE wherein the processor is further configured to obtain the semantic data by receiving, for each object of the frame image, an input visualization property based on a user input as the corresponding visualization property of the corresponding object.
Thus modified, FLEISHMAN’s apparatus would be one wherein the processor is further configured to obtain the semantic data by receiving, for each object of the frame image, an input visualization property based on a user input as the corresponding visualization property of the corresponding object.
The motivation behind the modification would have been to obtain an apparatus that improves the efficiency of semantic segmentation as well as the image quality for important regions, since both FLEISHMAN and BAE concern semantic segmentation and image analysis. FLEISHMAN improves both the accuracy of the semantic labels, by using historical data, and the efficiency of semantic segmentation, which in turn permits the semantic segmentation to be performed on smaller devices, while BAE’s systems and methods greatly improve the image quality of the more important portions after upscaling without greatly increasing the overall computational cost of resolution enhancement and transcoding. Please see FLEISHMAN et al. (US 20190043203 A1), paragraphs [0032], [0040], and [0056], and BAE et al. (US 20210352307 A1), Abstract and paragraph [0040].
Regarding claim 15, FLEISHMAN in view of BAE explicitly teach the apparatus of claim 12. FLEISHMAN fails to explicitly teach wherein the processor is further configured to generate the semantic map to indicate one or more of a type, pattern, material, or shape of the first object, and wherein a value of the different second resolution is greater than a value of the first resolution.
However, BAE explicitly teaches wherein the processor is further configured to generate the semantic map to indicate one or more of a type, pattern, material, or shape of the first object (Fig. 7. Paragraph [0093]-BAE discloses FIG. 7 illustrates a schematic diagram illustrating an example process 700 of video processing. In FIG. 7, input region 702 of a picture is fed to stage 704, where the spatial importance of input region 702 can be determined. In paragraph [0095]-BAE discloses the object detection technique can identify a bounding region (e.g., a rectangular box) in the picture, which encloses an identified object. Based on whether input region 702 is in the bounding region, a spatial importance level can be assigned to input region 702. In paragraph [0096]-BAE discloses if the semantic segmentation technique is used at stage 704, each pixel of the picture can be labeled with a class or label (e.g., a vehicle, an individual, a building, a tree, or any classification of visual contents) of what is represented. In paragraph [0097]-BAE discloses if the instance segmentation technique is used at stage 704, each pixel of an image can be further associated with a label of an instance of objects of the same class. For a class of “individuals,” the instance segmentation technique can differentiate and associate each pixel in the class with labels of “person 1,” “person 2,” and so on (wherein the semantic segmentation technique can be used to determine spatial importance levels of different classes, and the instance segmentation technique can be applied to each class to determine spatial importance levels of different instances in the same class)), and wherein a value of the different second resolution is greater than a value of the first resolution (Fig. 8. Paragraph [0101]-BAE discloses FIG. 8 illustrates a schematic diagram illustrating an example process 800 of video processing. 
Process 800 illustrates an example hierarchical approach of enhancing a video based on temporal importance levels and spatial importance levels. In paragraph [0105]-BAE discloses after applying the resolution techniques, enhanced picture 810 can be generated. Process 700 can be performed on enhanced picture 810 region by region for further enhancing resolutions of its regions (e.g., including input region 702) based on spatial importance levels. In process 700, different regions of different spatial importance levels can be upscaled using techniques).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of FLEISHMAN in view of BAE of having an apparatus with the teachings of BAE wherein the processor is further configured to generate the semantic map to indicate one or more of a type, pattern, material, or shape of the first object, and wherein a value of the different second resolution is greater than a value of the first resolution.
The combination would yield FLEISHMAN’s apparatus wherein the processor is further configured to generate the semantic map to indicate one or more of a type, pattern, material, or shape of the first object, and wherein a value of the different second resolution is greater than a value of the first resolution.
The motivation behind the modification would have been to obtain an apparatus that improves the efficiency of semantic segmentation as well as the image quality for important regions, since both FLEISHMAN and BAE concern semantic segmentation and image analysis. FLEISHMAN improves the accuracy of the semantic labels by using historical data and improves the efficiency of semantic segmentation, which, in turn, permits the semantic segmentation to be performed on smaller devices, while BAE’s systems and methods greatly improve the image quality for the more important portions after upscaling without greatly increasing overall computational costs for resolution enhancement and transcoding. Please see FLEISHMAN et al. (US 20190043203 A1), Paragraphs [0032, 0040, and 0056] and BAE et al. (US 20210352307 A1), Abstract and Paragraph [0040].
Regarding claim 17, FLEISHMAN in view of BAE explicitly teaches the apparatus of claim 12. FLEISHMAN further teaches wherein the image reconstruction machine learning model (Fig. 4. Paragraph [0037]-FLEISHMAN discloses in the present solution, the recurrent segmentation operation (usage of the semantic information from previous frames as reflected in the 3D semantic model) is learned from the data and tailored to specific scenarios. In paragraph [0044]-FLEISHMAN discloses referring to FIG. 4, a process 400 is provided for a method and system of recurrent semantic segmentation for imaging processing (wherein the operations may be performed by a neural network). Please also read paragraphs [0036 and 0051]) is a machine learning model trained using an objective function (Fig. 4. Paragraph [0078]-FLEISHMAN discloses the training of the architecture in a supervised-learning setting may include a training set of RGBD video-sequences, where the frames in each sequence have semantic information. Such a video can be obtained using either (i) a labor intensive method manually segmenting each frame, (ii) segmenting a reconstructed 3D model, or (iii) using synthetic data. See, Dai et al., “Richly-annotated 3D Reconstructions of Indoor Scenes”, Computer Vision and Pattern Recognition (CVPR) (2017)) calculated based on a second visualization property of a third object of a temporary output image and a third visualization property indicated by a training semantic map together with a difference between a temporary output image obtained from a training input image and the training semantic map and a true value output image (Fig. 4. Paragraph [0079]-FLEISHMAN discloses training a recurrent network requires rendered semantic maps of the 3D semantic model. The training may be performed in several operations. In paragraph [0080]-FLEISHMAN discloses the first training operation may involve initialization by training a standard semantic-segmentation network. 
First, a standard single frame CNN-based semantic segmentation algorithm is trained. This resulting initial network may be denoted as n.sub.1 for example. In paragraph [0081]-FLEISHMAN discloses the next training operation may involve data preparation, which refers to generating training data for the recurrent architecture. Given the current network, training data was generated for the next recurrent phase in the form of a triplet (RGBD frame, rendered semantic map of the 3D semantic model, ground truth semantic segmentation). The system runs as shown in FIGS. 6 and 8 with the current network on short sequences of N frames, where N is a tunable parameter. A matching semantic map was rendered for the last frame from the last camera pose in each sequence, and then saved with the frame as training data for next stage. The semantic map was represented as an image of H*W pixels (the size of the frame) with C (the number of classes that the system supports) channels. Since only X<C probabilities are remembered in each voxel, lower C-X probabilities are truncated to zero, and the remaining X probabilities are renormalized to be a proper distribution).
Regarding claim 20, FLEISHMAN explicitly teaches a processor-implemented method (Fig. 4. Paragraph [0034]-FLEISHMAN discloses a system and method is disclosed herein that recurrently uses historical semantic data to perform semantic segmentation of a current frame and to be used to update a 3D semantic model. In paragraph [0044]-FLEISHMAN discloses referring to FIG. 4, a process 400 is provided for a method and system of recurrent semantic segmentation for imaging processing. In paragraph [0083]-FLEISHMAN discloses the operations of FIGS. 4, and 5A-5B may be undertaken in response to instructions provided by one or more computer program products. Such program products may include signal bearing media providing instructions that, when executed by a processor, may provide the functionality described herein. Please also see Fig. 9-10 and read paragraph [0105]), the method comprising:
identifying objects within a frame image (Fig. 4. Paragraph [0035]-FLEISHMAN discloses a recurrent 3D semantic segmentation algorithm is used by taking as input both the rendered semantic segmentation map of the model in addition to an input image of the current frame. The recurrent 3D semantic segmentation algorithm may include CNN-based architecture that receives this paired input to synergistically analyze the distribution of the image data on the input together. Using the whole frame enables the system to learn what classes appear together. The output of the system is an updated 3D representation (model) of the world with individual voxels of the model being semantically classified);
generating a semantic map of the frame image, the semantic map including regions for corresponding objects, each object having a visualization property assigned thereto (Fig. 5. Paragraph [0065]-FLEISHMAN discloses referring to FIG. 7, process 500 may include “render segmentation map from 3D semantic model” 516, and this may involve obtaining the k-pose of the current frame being analyzed, and then projecting the 3D semantic model to an image plane formed by a camera at the k-pose. An example segmentation map 700 is provided at the current pose (or k-pose) of the current frame, and where the walls 702, chairs 704, and floor 706 shown in the map 700 are segmented from each other and each have an initial, historically-based (or influenced or based on information of previous frames) semantic label (wall, chair, floor for example). Objects that are adjacent each other and have the same label may not show as separate components on the segmentation map), wherein the frame image is a current frame image and the semantic map is a current semantic map of the current frame image (Fig. 4. Paragraph [0048]-FLEISHMAN discloses process 400 may include “extract historically-influenced semantically semantic features of the semantic segmentation map” 406. The result of such extraction may be considered historically-influenced features that represent the semantic labeling in the segmentation map. In paragraph [0049]-FLEISHMAN discloses process 400 may include “extract current semantic features of the current frame” 408 (wherein the extraction of both historical and current semantic features is performed by a neural network). 
In paragraph [0050]-FLEISHMAN discloses process 400 may include “generate a current and historical semantically segmented frame comprising using both the current semantic features and the historically-influenced semantic features as input to a neural network that indicates semantic labels for areas of the current historical semantically segmented frame” 410 (wherein the model may be generated by concatenating current semantic features and historically-influenced semantic features and inputting them into a neural network, such as a CNN). Further in paragraph [0052]-FLEISHMAN discloses process 400 may include “semantically update the 3D semantic segmentation model comprising using the current and historical semantically segmented frame” 412, which refers to registering the semantic labels or probabilities of the segmentation frame to the 3D semantic model);
FLEISHMAN explicitly teaches generating, using an image reconstruction machine learning model, a reconstruction image from the previous reconstruction image, the frame image, and the semantic map (Fig. 4. Paragraph [0048]-FLEISHMAN discloses process 400 may include “extract historically-influenced semantically semantic features of the semantic segmentation map” 406. The result of such extraction may be considered historically-influenced features that represent the semantic labeling in the segmentation map. In paragraph [0049]-FLEISHMAN discloses process 400 may include “extract current semantic features of the current frame” 408 (wherein the extraction of both historical and current semantic features is performed by a neural network). In paragraph [0050]-FLEISHMAN discloses process 400 may include “generate a current and historical semantically segmented frame comprising using both the current semantic features and the historically-influenced semantic features as input to a neural network that indicates semantic labels for areas of the current historical semantically segmented frame” 410 (wherein the model may be generated by concatenating current semantic features and historically-influenced semantic features and inputting them into a neural network, such as a CNN). Further in paragraph [0052]-FLEISHMAN discloses process 400 may include “semantically update the 3D semantic segmentation model comprising using the current and historical semantically segmented frame” 412, which refers to registering the semantic labels or probabilities of the segmentation frame to the 3D semantic model), wherein the frame image has a first resolution (Fig. 4. Paragraph [0045]-FLEISHMAN discloses process 400 may include “obtain a video sequence of frames of image data and comprising a current frame” 402. This operation may include obtaining pre-processed raw image data. The pre-processing could include resolution reduction).
FLEISHMAN fails to explicitly teach obtaining a previous reconstruction image of a previous frame having a second resolution; and generating, using an image reconstruction machine learning model, a reconstruction image from the previous reconstruction image, the frame image, and the semantic map, wherein the frame image has a first resolution, and wherein the reconstruction image has the second resolution which is greater than the first resolution.
However, BAE explicitly teaches obtaining a previous reconstruction image of a previous frame (Fig. 8. Paragraph [0029]-BAE discloses the video encoding process can identify and keep useful information that can be used to reconstruct a picture and disregard less important information for the reconstruction. In paragraph [0030]-BAE discloses the useful information of a picture being encoded (referred to as a “current picture”) include changes with respect to a reference picture (e.g., a picture previously encoded and reconstructed) (wherein changes can include position changes, luminosity changes, or color changes of the pixels, etc.). In paragraph [0101]-BAE discloses process 800 illustrates an example hierarchical approach of enhancing a video based on temporal importance levels and spatial importance levels. Input picture 802 is fed to stage 804, where the temporal importance of input picture 802 can be determined. In paragraph [0039]-BAE discloses a “spatial importance” herein refers to a level of importance or a level of interest associated with visual contents in a picture (e.g., a region of interest in the picture). A “temporal importance” herein refers to a level of importance or a level of interest associated with a picture (e.g., I-, P-, or B-picture) among a set of consecutive pictures (wherein I-picture uses no reference picture, P-picture uses the previous picture as a reference picture and B-picture uses both the previous and subsequent pictures as reference pictures, and each type has a level of temporal importance). Please also see Fig. 7) having a second resolution (Fig. 7. Paragraph [0105]-BAE discloses after applying the resolution techniques, enhanced picture 810 can be generated. Process 700 can be performed on enhanced picture 810 region by region for further enhancing resolutions of its regions (e.g., including input region 702) based on spatial importance levels. 
In paragraph [0093]-BAE discloses input region 702 of a picture is fed to stage 704, where the spatial importance of input region 702 can be determined (wherein spatial importance can be determined by semantic segmentation, instance segmentation and/or mask RNN). Please also see Fig. 8-9);
and generating, using an image reconstruction machine learning model (Fig. 8. Paragraph [0035]-BAE discloses the learning-based techniques (also referred to as “example-based techniques”) can predict details of a low-resolution image based on knowledge obtained by training based on a set of examples (referred to as a “training set”). The learning-based techniques can analyze statistical relationships between example low-resolution images and corresponding example high-resolution images to obtain such knowledge (wherein a machine learning, deep learning, and/or heuristic model of visual saliency may be used for determining semantic importance and constructing enhanced pictures)), a reconstruction image (Fig. 8. Paragraph [0029]-BAE discloses the video encoding process can identify and keep useful information that can be used to reconstruct a picture and disregard less important information for the reconstruction. In paragraph [0034]-BAE discloses the reconstruction-based techniques can simulate an image formation process (e.g., a linear mapping process for simulating blurring and noises) of a low-resolution image from a high-resolution image and solve an inversion of the image formation process to generate a high-resolution version of the image. In paragraph [0101]-BAE discloses process 800 illustrates an example hierarchical approach of enhancing a video based on temporal importance levels and spatial importance levels. In paragraph [0039]-BAE discloses a “spatial importance” herein refers to a level of importance or a level of interest associated with visual contents in a picture (e.g., a region of interest in the picture). A “temporal importance” herein refers to a level of importance or a level of interest associated with a picture (e.g., I-, P-, or B-picture) among a set of consecutive pictures. Please also see Fig. 7) from the previous reconstruction image, the frame image (Fig. 8. 
Paragraph [0041]-BAE discloses video sequence 100 can be a real-life video, a computer-generated video (e.g., computer game video), or a combination thereof (e.g., a real-life video with augmented-reality effects). In paragraph [0042]-BAE discloses video sequence 100 can include a series of pictures arranged temporally along a timeline, including pictures 102, 104, 106, and 108. Picture 102 is an I-picture. Picture 104 is a P-picture. Picture 106 is a B-picture (wherein each type has a level of temporal importance, I-picture uses no reference picture, P-picture uses the previous picture as a reference picture and B-picture uses both the previous and subsequent pictures as reference pictures). In paragraph [0101]-BAE discloses FIG. 8, input picture 802 is fed to stage 804, where the temporal importance of input picture 802 can be determined (wherein input pictures may be I-, P-, or B- pictures)), and the semantic map (Fig. 7. Paragraph [0105]-BAE discloses after applying the resolution techniques, enhanced picture 810 can be generated. Process 700 can be performed on enhanced picture 810 region by region for further enhancing resolutions of its regions (e.g., including input region 702) based on spatial importance levels. In paragraph [0093]-BAE discloses input region 702 of a picture is fed to stage 704, where the spatial importance of input region 702 can be determined (wherein spatial importance can be determined by semantic segmentation, instance segmentation and/or mask RNN). In paragraph [0096]-BAE discloses if the semantic segmentation technique is used at stage 704, each pixel of the picture can be labeled with a class or label (e.g., a vehicle, an individual, a building, a tree, or any classification of visual contents) of what is represented. 
In paragraph [0097]-BAE discloses if the instance segmentation technique is used at stage 704, each pixel of an image can be further associated with a label of an instance of objects of the same class (wherein each class or label is assigned a level of spatial importance). In paragraph [0099]-BAE discloses if the spatial importance level of input region 702 is determined to be a second level, a DNN-based resolution-enhancement technique can be applied to it at stage 708. Please also read paragraph [0118-0119]), wherein the frame image has a first resolution, and wherein the reconstruction image has the second resolution which is greater than the first resolution (Fig. 8. Paragraph [0106]-BAE discloses after all regions of enhanced picture 810 are upscaled by process 700, at stage 812, the upscaled regions (e.g., including upscaled region 710) can be stitched to form a stitched picture. In paragraph [0107]-BAE discloses after stage 814, upscaled picture 816 can be outputted. By applying the hierarchical approach based on temporal importance levels and spatial importance levels, process 800 can further improve the enhancement quality for pictures of a video. Please also see Fig. 9 and read paragraph [0108-0113]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of FLEISHMAN of having a processor-implemented method with the teachings of BAE of obtaining a previous reconstruction image of a previous frame having a second resolution; and generating, using an image reconstruction machine learning model, a reconstruction image from the previous reconstruction image, the frame image, and the semantic map, wherein the frame image has a first resolution, and wherein the reconstruction image has the second resolution which is greater than the first resolution.
The combination would yield FLEISHMAN’s method comprising obtaining a previous reconstruction image of a previous frame having a second resolution; and generating, using an image reconstruction machine learning model, a reconstruction image from the previous reconstruction image, the frame image, and the semantic map, wherein the frame image has a first resolution, and wherein the reconstruction image has the second resolution which is greater than the first resolution.
The motivation behind the modification would have been to obtain a method that improves the efficiency of semantic segmentation as well as the image quality for important regions, since both FLEISHMAN and BAE concern semantic segmentation and image analysis. FLEISHMAN improves the accuracy of the semantic labels by using historical data and improves the efficiency of semantic segmentation, which, in turn, permits the semantic segmentation to be performed on smaller devices, while BAE’s systems and methods greatly improve the image quality for the more important portions after upscaling without greatly increasing overall computational costs for resolution enhancement and transcoding. Please see FLEISHMAN et al. (US 20190043203 A1), Paragraphs [0032, 0040, and 0056] and BAE et al. (US 20210352307 A1), Abstract and Paragraph [0040].
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over FLEISHMAN et al. (US 20190043203 A1), hereinafter referenced as FLEISHMAN, in view of BAE et al. (US 20210352307 A1), hereinafter referenced as BAE, and in further view of PALMARO et al. (US 20210327112 A1), hereinafter referenced as PALMARO.
Regarding claim 4, FLEISHMAN in view of BAE explicitly teaches the method of claim 1. FLEISHMAN explicitly teaches wherein the generating of the semantic map comprises generating a semantic map indicating a type of the first object (Fig. 2A. Paragraph [0057]-FLEISHMAN discloses the segmentation output unit 822 outputs semantic labels, or class or probabilities for the labels or classes, and provides them as part of the current-historical (C-H) semantically segmented (or just segmented) frame 824. In paragraph [0065]-FLEISHMAN discloses referring to FIG. 7, process 500 may include “render segmentation map from 3D semantic model” 516. An example segmentation map 700 is provided at the current pose (or k-pose) of the current frame, and where the walls 702, chairs 704, and floor 706 shown in the map 700 are segmented from each other and each have an initial, historically-based (or influenced or based on information of previous frames) semantic label (wall, chair, floor for example). Please also read paragraph [0068]).
FLEISHMAN in view of BAE fails to explicitly teach wherein the generating of the semantic map comprises generating a semantic map indicating a pattern, material, or shape of the first object.
However, PALMARO explicitly teaches wherein the generating of the semantic map (Fig. 2A. Paragraph [0033]-PALMARO discloses the memory 122 may also include a semantic map compositor module 128 configured to perform a series of tasks or operations with reference to FIG. 2-6. In paragraph [0035]-PALMARO discloses the populator module 126 can be configured to create and modify a semantic map of an environment, and help populate the environment with digital objects using the semantic map, wherein the semantic map includes contextual information of the environment and the digital objects (wherein semantic maps include a local semantic map, which may represent an area or volume and/or topology associated with the object and be similar to a floor plan of a 3D room, and/or a composite semantic map, which may be a plurality of semantic map regions and/or semantic maps blended and composited (e.g., layered) into a single composited semantic map)) comprises generating a semantic map indicating a pattern, material, or shape of the first object (Fig. 2A. Paragraph [0051]-PALMARO discloses the object placeholder including data describing the determined object type (e.g., including one or more traits, a volume size, a position, and an orientation for a potential object that may occupy the cell (e.g., as determined by the populator module 126 in operation 308). The one or more properties or traits providing semantic information regarding the potential object. Based on the local semantic map being of a “bedroom” type associated with a bedroom populator object, the populator module 126 may determine to insert an object placeholder with trait “desk” at a specified location (e.g., up against a wall by a window within the bedroom populator object), and with a specified size (e.g., a desk size that fits under the window). 
In paragraph [0061]-PALMARO discloses the specific digital asset including data describing a 3D model (e.g., including mesh data, material data, animation data, and the like) of a digital object (e.g., such as a digital asset representing a digital object of a chair, or a bed, or a lamp, or any other digital object). The cabin populator object includes traits and additional semantic data associated with a cabin (e.g., data describing a cabin as a small shelter, structure or house, typically made of wood, typically in a forested or remote area, and more)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of FLEISHMAN in view of BAE of having a processor-implemented method with the teachings of PALMARO wherein the generating of the semantic map comprises generating a semantic map indicating a pattern, material, or shape of the first object.
The combination would yield FLEISHMAN’s method wherein the generating of the semantic map comprises generating a semantic map indicating one or more of a type, pattern, material, or shape of the first object.
The motivation behind the modification would have been to obtain a method that improves environment modeling and semantic map generation, since both FLEISHMAN and PALMARO concern image analysis and 3D modeling. FLEISHMAN improves the accuracy of the semantic labels by using historical data and improves the efficiency of semantic segmentation, which, in turn, permits the semantic segmentation to be performed on smaller devices, while PALMARO’s systems and methods improve the generation of semantic maps and environment models. Please see FLEISHMAN et al. (US 20190043203 A1), Paragraphs [0032, 0040, and 0056] and PALMARO et al. (US 20210327112 A1), Abstract and Paragraphs [0026-0027].
Claims 5 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over FLEISHMAN et al. (US 20190043203 A1), hereinafter referenced as FLEISHMAN, in view of BAE et al. (US 20210352307 A1), hereinafter referenced as BAE, and in further view of KANAMORI et al. (US 20100290713 A1), hereinafter referenced as KANAMORI.
Regarding claim 5, FLEISHMAN in view of BAE explicitly teaches the method of claim 1. FLEISHMAN explicitly teaches wherein the generating of the semantic map (Fig. 3. Paragraph [0044]-FLEISHMAN discloses referring to FIG. 4, a process 400 is provided for a method and system of recurrent semantic segmentation for imaging processing. In paragraph [0047]-FLEISHMAN discloses process 400 may optionally include “recurrently generate a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the semantic segmentation map from a 3D semantic segmentation model, wherein individual semantic segmentation maps are each associated with a different current frame from the video sequence” 404. Please also see Fig. 5A-B) comprises indicating rendering information including one or more of a color or a depth of the obtained frame image together with the visualization property (Fig. 3. Paragraph [0039]-FLEISHMAN discloses referring to FIG. 3, an image 300 shows the room 202 (now 301) from image 200 except now with the disclosed example 3D semantic segmentation applied. Each voxel color or shade represents a class of an object that it belongs to. With such semantic segmentation, actions can be taken depending on the semantic label of the segment whether for computer vision or other applications such as with virtual or augmented reality for example. In paragraph [0045]-FLEISHMAN discloses process 400 may include “obtain a video sequence of frames of image data and comprising a current frame” 402. This operation may include obtaining pre-processed raw image data with RGB, YUV, or other color space values in addition to luminance values for a number of frames of a video sequence. The color and luminance values may be provided in many different additional forms such as gradients, histograms, and so forth. 
In paragraph [0046]-FLEISHMAN discloses this operation also may include obtaining depth data when the depth data is used for segmentation analysis. In paragraph [0060]-FLEISHMAN discloses process 500 may include “generate depth map” 504, where a depth map for the current image may be formed to establish a 3D space for the video sequence being analyzed, and eventually used to generate a 3D geometric model. Please also read paragraph [0056]).
FLEISHMAN in view of BAE fails to explicitly teach rendering information including a diffuse color, a normal line, a specular reflection, or an albedo.
However, KANAMORI explicitly teaches rendering information including a diffuse color (Fig. 5. Paragraph [0002]-KANAMORI discloses the present invention relates to an image processing method and apparatus for reconstructing the surface shape of an object based on a polarized image. In paragraph [0004]-KANAMORI discloses even when the high-resolution trend hits a plateau, the image quality can still be improved by adding numerous pieces of information about various physical properties, which can be used to generate an image with the aid of computer graphics, to the image information of the object itself. In paragraph [0229]-KANAMORI discloses the object information is entered in two separate forms as a monochrome specular reflected image 2204 and a color diffuse reflected image 2205. Further in paragraph [0233]-KANAMORI discloses a diffuse reflected image resolution increasing section 2804 performs an inverse conversion from the high-resolution pseudo-albedo image that has been obtained by the pseudo-albedo image resolution increasing section 2802 into a high-resolution color diffuse reflected image), a normal line (Fig. 5. Paragraph [0109]-KANAMORI discloses the normal estimating section 110 performs the processing steps S403, S406 and S407. The normal updating section 108 performs the processing step S408. And the shape output section 109 performs the processing step S409. Further in paragraph [0235]-KANAMORI discloses the parameter estimating section 2805 estimates parameters representing the object based on the object's normal information and the specular reflected image), a specular reflection (Fig. 5. Paragraph [0108]-KANAMORI discloses the segmentation section 103 performs the processing steps S402 and S411 shown in FIG. 5. The specular/diffuse component separating section 104 performs the processing steps S404 and S410 shown in FIG. 5. 
In paragraph [0175]-KANAMORI discloses using the outputs of the segmentation section 103 and the specular/diffuse component separating section 104, the specular/diffuse reflected image output section 1502 separates the object's image captured with light intensity into an image consisting of specular reflection components and an image consisting of diffuse reflection components. Further in paragraph [0240]-KANAMORI discloses the rendering section 2809 synthesizes together the color diffuse reflected image and the color specular reflected image with the increased resolutions, thereby generating color high-resolution image), or an albedo (Fig. 5. Paragraph [0231]-KANAMORI discloses the pseudo-albedo image estimating section 2801 estimates a pseudo-albedo image of the object based on the color diffuse reflected image 2205, the light source information 2202 and the surface normal image 2206. In paragraph [0233]-KANAMORI discloses a diffuse reflected image resolution increasing section 2804 performs an inverse conversion from the high-resolution pseudo-albedo image that has been obtained by the pseudo-albedo image resolution increasing section 2802 into a high-resolution color diffuse reflected image).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of FLEISHMAN in view of BAE of having a processor-implemented method, with the teachings of KANAMORI of having rendering information including a diffuse color, a normal line, a specular reflection, or an albedo.
In the combination, FLEISHMAN’s method would thus have the generating of the semantic map comprise indicating rendering information including one or more of a color, a diffuse color, a depth, a normal line, a specular reflection, or an albedo of the obtained frame image together with the visualization property.
The motivation behind the modification would have been to obtain a method that improves the semantic segmentation and compensates for the lack of object information, since both FLEISHMAN and KANAMORI concern image analysis. FLEISHMAN improves the accuracy of the semantic labels by using historical data and improves the efficiency of semantic segmentation, which, in turn, permits the semantic segmentation to be performed on smaller devices, while KANAMORI’s systems and methods can obtain the object's shape in a general environment or scene and light source information in a shooting scene by a completely passive method using polarization information as long as the refractive index data of the object is known, and can compensate for the lack of information by performing high-resolution processing. Please see FLEISHMAN et al. (US 20190043203 A1), Paragraph [0032, 0040, and 0056] and KANAMORI et al. (US 20100290713 A1), Abstract and Paragraph [0003-0005 and 0246-0247].
Regarding claim 16, FLEISHMAN in view of BAE explicitly teach the apparatus of claim 12, although FLEISHMAN explicitly teaches wherein the processor is further configured to generate the semantic map (Fig. 3. Paragraph [0044]-FLEISHMAN discloses referring to FIG. 4, a process 400 is provided for a method and system of recurrent semantic segmentation for imaging processing. In paragraph [0047]-FLEISHMAN discloses process 400 may optionally include “recurrently generate a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the semantic segmentation map from a 3D semantic segmentation model, wherein individual semantic segmentation maps are each associated with a different current frame from the video sequence” 404. Please also see Fig. 5A-B) to indicate rendering information including one or more of a color, a depth of the frame image together with the first visualization property (Fig. 3. Paragraph [0039]-FLEISHMAN discloses referring to FIG. 3, an image 300 shows the room 202 (now 301) from image 200 except now with the disclosed example 3D semantic segmentation applied. Each voxel color or shade represents a class of an object that it belongs to. With such semantic segmentation, actions can be taken depending on the semantic label of the segment whether for computer vision or other applications such as with virtual or augmented reality for example. In paragraph [0045]-FLEISHMAN discloses process 400 may include “obtain a video sequence of frames of image data and comprising a current frame” 402. This operation may include obtaining pre-processed raw image data with RGB, YUV, or other color space values in addition to luminance values for a number of frames of a video sequence. The color and luminance values may be provided in many different additional forms such as gradients, histograms, and so forth. 
In paragraph [0046]-FLEISHMAN discloses this operation also may include obtaining depth data when the depth data is used for segmentation analysis. In paragraph [0060]-FLEISHMAN discloses process 500 may include “generate depth map” 504, where a depth map for the current image may be formed to establish a 3D space for the video sequence being analyzed, and eventually used to generate a 3D geometric model. Please also read paragraph [0056]).
FLEISHMAN in view of BAE fail to explicitly teach rendering information including a diffuse color, a normal line, a specular reflection, or an albedo.
However, KANAMORI explicitly teaches rendering information including a diffuse color (Fig. 5. Paragraph [0002]-KANAMORI discloses the present invention relates to an image processing method and apparatus for reconstructing the surface shape of an object based on a polarized image. In paragraph [0004]-KANAMORI discloses even when the high-resolution trend hits a plateau, the image quality can still be improved by adding numerous pieces of information about various physical properties, which can be used to generate an image with the aid of computer graphics, to the image information of the object itself. In paragraph [0229]-KANAMORI discloses the object information is entered in two separate forms as a monochrome specular reflected image 2204 and a color diffuse reflected image 2205. Further in paragraph [0233]-KANAMORI discloses a diffuse reflected image resolution increasing section 2804 performs an inverse conversion from the high-resolution pseudo-albedo image that has been obtained by the pseudo-albedo image resolution increasing section 2802 into a high-resolution color diffuse reflected image), a normal line (Fig. 5. Paragraph [0109]-KANAMORI discloses the normal estimating section 110 performs the processing steps S403, S406 and S407. The normal updating section 108 performs the processing step S408. And the shape output section 109 performs the processing step S409. Further in paragraph [0235]-KANAMORI discloses the parameter estimating section 2805 estimates parameters representing the object based on the object's normal information and the specular reflected image), a specular reflection (Fig. 5. Paragraph [0108]-KANAMORI discloses the segmentation section 103 performs the processing steps S402 and S411 shown in FIG. 5. The specular/diffuse component separating section 104 performs the processing steps S404 and S410 shown in FIG. 5. 
In paragraph [0175]-KANAMORI discloses using the outputs of the segmentation section 103 and the specular/diffuse component separating section 104, the specular/diffuse reflected image output section 1502 separates the object's image captured with light intensity into an image consisting of specular reflection components and an image consisting of diffuse reflection components. Further in paragraph [0240]-KANAMORI discloses the rendering section 2809 synthesizes together the color diffuse reflected image and the color specular reflected image with the increased resolutions, thereby generating color high-resolution image), or an albedo (Fig. 5. Paragraph [0231]-KANAMORI discloses the pseudo-albedo image estimating section 2801 estimates a pseudo-albedo image of the object based on the color diffuse reflected image 2205, the light source information 2202 and the surface normal image 2206. In paragraph [0233]-KANAMORI discloses a diffuse reflected image resolution increasing section 2804 performs an inverse conversion from the high-resolution pseudo-albedo image that has been obtained by the pseudo-albedo image resolution increasing section 2802 into a high-resolution color diffuse reflected image).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of FLEISHMAN in view of BAE of having an apparatus, with the teachings of KANAMORI of having rendering information including a diffuse color, a normal line, a specular reflection, or an albedo.
In the combination, FLEISHMAN’s apparatus would thus have the processor further configured to generate the semantic map to indicate rendering information including one or more of a color, a diffuse color, a depth, a normal line, a specular reflection, or an albedo of the frame image together with the first visualization property.
The motivation behind the modification would have been to obtain an apparatus that improves the semantic segmentation and compensates for the lack of object information, since both FLEISHMAN and KANAMORI concern image analysis. FLEISHMAN improves the accuracy of the semantic labels by using historical data and improves the efficiency of semantic segmentation, which, in turn, permits the semantic segmentation to be performed on smaller devices, while KANAMORI’s systems and methods can obtain the object's shape in a general environment or scene and light source information in a shooting scene by a completely passive method using polarization information as long as the refractive index data of the object is known, and can compensate for the lack of information by performing high-resolution processing. Please see FLEISHMAN et al. (US 20190043203 A1), Paragraph [0032, 0040, and 0056] and KANAMORI et al. (US 20100290713 A1), Abstract and Paragraph [0003-0005 and 0246-0247].
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over FLEISHMAN et al. (US 20190043203 A1), hereinafter referenced as FLEISHMAN in view of BAE et al. (US 20210352307 A1), hereinafter referenced as BAE and in further view of GOLINSKI et al. (US 20210281867 A1), hereinafter referenced as GOLINSKI.
Regarding claim 9, FLEISHMAN in view of BAE explicitly teach the method of claim 1, although FLEISHMAN further teaches wherein the reconstructing of the current frame image into the reconstruction image of the current frame comprises:
reconstructing the current frame image into the reconstruction image of the current frame by implementing a machine learning model provided input based on the obtained image together with the current frame image and the semantic map (Fig. 4. Paragraph [0048]-FLEISHMAN discloses process 400 may include “extract historically-influenced semantically semantic features of the semantic segmentation map” 406. The result of such extraction may be considered historically-influenced features that represent the semantic labeling in the segmentation map. In paragraph [0049]-FLEISHMAN discloses process 400 may include “extract current semantic features of the current frame” 408 (wherein the extraction of both historical and current semantic features is performed by a neural network). In paragraph [0050]-FLEISHMAN discloses process 400 may include “generate a current and historical semantically segmented frame comprising using both the current semantic features and the historically-influenced semantic features as input to a neural network that indicates semantic labels for areas of the current historical semantically segmented frame” 410 (wherein the model may be generated by concatenating current semantic features and historically-influenced semantic features and inputting them into a neural network, such as a CNN). Further in paragraph [0052]-FLEISHMAN discloses process 400 may include “semantically update the 3D semantic segmentation model comprising using the current and historical semantically segmented frame” 412, which refers to registering the semantic labels or probabilities of the segmentation frame to the 3D semantic model).
FLEISHMAN in view of BAE fail to explicitly teach obtaining a warped image by warping the reconstruction image of the previous frame to the current frame based on a motion vector map between the current frame image and the reconstruction image of the previous frame; and reconstructing the current frame image into the reconstruction image of the current frame by implementing a machine learning model provided input based on the obtained warped image together with the current frame image, wherein the input provided to the image reconstruction machine learning model is further based on the obtained warped image.
However, GOLINSKI explicitly teaches obtaining a warped image by warping the reconstruction image of the previous frame to the current frame based on a motion vector map between the current frame image and the reconstruction image of the previous frame (Fig. 5A. Paragraph [0144]-GOLINSKI discloses based on processing of the output data, the one or more post-recurrent layers 572 of the decoder 538 output reconstructed motion estimation data f̂_t and reconstructed residual data r̂_t (wherein the reconstructed motion estimation data f̂_t can include optical flow data (e.g., a displacement vector) or block-based motion data (e.g., a motion vector)). The reconstructed motion estimation data f̂_t is used by a warping engine 567 of the decoder 538 to warp (e.g., by performing motion compensation) the previously reconstructed frame x̂_I-frame, resulting in the generation of a warped frame. The pixels of the previously reconstructed frame x̂_I-frame can be moved to new locations or otherwise modified according to the reconstructed motion estimation data f̂_t. The warped frame output by the warping engine 567 is output to an adder engine 569. The adder engine 569 adds (or combines) the warped frame to the reconstructed residual data r̂_t to generate a reconstructed frame x̂_t for the current time step t. Please also see Fig. 5B and 6-7); and
reconstructing the current frame image into the reconstruction image of the current frame by implementing a machine learning model provided input (Fig. 5A. Paragraph [0145]-GOLINSKI discloses in addition to the recurrent state information (e.g., recurrent state h_t), the reconstructed frame x̂_t can also be fed back to the neural network system 511 for a next time step t+1 of operation. As shown in FIG. 5A, the reconstructed frame x̂_t and the recurrent state h_t are provided as input to the encoder 533. The previously reconstructed motion estimation data f̂_t and the previously reconstructed residual data r̂_t can also be fed back as input to the encoder 533 at the next time step t+1 of operation. At time step t+1, the encoder 533 can process a current input video frame x̂_{t+1}, the previously reconstructed frame x̂_t, the recurrent state h_t, the previously reconstructed motion estimation data f̂_t, and the previously reconstructed residual data r̂_t to generate a coded version of the current input video frame x̂_{t+1}) based on the obtained warped image together with the current frame image, wherein the input provided to the image reconstruction machine learning model is further based on the obtained warped image (Fig. 5A. Paragraph [0146]-GOLINSKI discloses the one or more pre-recurrent decoder layers 570, the one or more recurrent layers 571, and the one or more post-recurrent layers 572 of the decoder 538 can process the output data for the current time step t+1 and can output reconstructed motion estimation data f̂_{t+1} and reconstructed residual data r̂_{t+1}. 
The warping engine 567 of the decoder 538 can warp (e.g., by performing motion compensation) the previously reconstructed frame x̂_t to generate a warped frame for the current time step t+1. The adder engine 569 can add the warped frame to the reconstructed residual data r̂_{t+1} to generate a reconstructed frame x̂_{t+1} for the current time step t+1. Please also see Fig. 5B).
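For illustration only, the warp-then-add operation described in the cited GOLINSKI paragraphs (a warping engine moves pixels of the previously reconstructed frame according to motion data, and an adder engine adds residual data) can be sketched as follows. This is a minimal sketch under stated assumptions and is not GOLINSKI's implementation: the function names `warp_frame` and `reconstruct`, the integer per-pixel motion vectors, the nearest-pixel lookup, and the border clamping are all hypothetical choices made for the example.

```python
def warp_frame(prev_frame, motion):
    """Warp prev_frame with a per-pixel motion vector map.

    Assumption: motion[y][x] is an integer (dx, dy) pair, and the output
    pixel at (x, y) is read from prev_frame at (x + dx, y + dy), with
    coordinates clamped to the frame borders.
    """
    height, width = len(prev_frame), len(prev_frame[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = motion[y][x]
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            warped[y][x] = prev_frame[src_y][src_x]
    return warped


def reconstruct(prev_frame, motion, residual):
    """Add residual data to the warped previous frame, pixel by pixel."""
    warped = warp_frame(prev_frame, motion)
    return [
        [w + r for w, r in zip(warped_row, residual_row)]
        for warped_row, residual_row in zip(warped, residual)
    ]


# Tiny 2x2 example: every pixel reads from one column to its right.
prev = [[1.0, 2.0], [3.0, 4.0]]
flow = [[(1, 0), (1, 0)], [(1, 0), (1, 0)]]
resid = [[0.5, 0.0], [0.0, 0.0]]
result = reconstruct(prev, flow, resid)
```

In this sketch the warped frame plays the role of the motion-compensated previous reconstruction, and the residual supplies whatever the warp alone cannot predict.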
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of FLEISHMAN in view of BAE of having a processor-implemented method, with the teachings of GOLINSKI of obtaining a warped image by warping the reconstruction image of the previous frame to the current frame based on a motion vector map between the current frame image and the reconstruction image of the previous frame; and reconstructing the current frame image into the reconstruction image of the current frame by implementing a machine learning model provided input based on the obtained warped image together with the current frame image, wherein the input provided to the image reconstruction machine learning model is further based on the obtained warped image.
In the combination, FLEISHMAN’s method would thus include obtaining a warped image by warping the reconstruction image of the previous frame to the current frame based on a motion vector map between the current frame image and the reconstruction image of the previous frame; and reconstructing the current frame image into the reconstruction image of the current frame by implementing a machine learning model provided input based on the obtained warped image together with the current frame image and the semantic map, wherein the input provided to the image reconstruction machine learning model is further based on the obtained warped image.
The motivation behind the modification would have been to obtain a method that improves semantic segmentation and the quality and efficiency of image reconstruction, since both FLEISHMAN and GOLINSKI concern image analysis and image reconstruction. FLEISHMAN improves the accuracy of the semantic labels by using historical data and improves the efficiency of semantic segmentation, which, in turn, permits the semantic segmentation to be performed on smaller devices, while GOLINSKI’s systems and methods improve the quality and efficiency of frame reconstruction. Please see FLEISHMAN et al. (US 20190043203 A1), Paragraph [0032, 0040, and 0056] and GOLINSKI et al. (US 20210281867 A1), Abstract and Paragraph [0065, 0173 and 0193].
Claims 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over FLEISHMAN et al. (US 20190043203 A1), hereinafter referenced as FLEISHMAN in view of BAE et al. (US 20210352307 A1), hereinafter referenced as BAE and in further view of TOVEY et al. (US 20230298133 A1), hereinafter referenced as TOVEY.
Regarding claim 10, FLEISHMAN in view of BAE explicitly teach the method of claim 1, but fail to explicitly teach wherein the reconstructing of the current frame image into the reconstruction image of the current frame comprises: generating a disocclusion map indicating whether a corresponding object is in a previous frame image through a region corresponding to each object of the current frame image; and reconstructing the current frame image into the reconstruction image of the current frame based on a previous frame image being masked based on the generated disocclusion map.
However, TOVEY explicitly teaches wherein the reconstructing of the current frame image into the reconstruction image of the current frame (Fig. 1. Paragraph [0024]-TOVEY discloses FIG. 1 illustrates an example device 100 in which one or more features described herein, such as a super resolution upscaler 332 (FIG. 3), can be implemented (wherein device 100 includes, for example, a gaming device). In paragraph [0037]-TOVEY discloses the Accelerated Processing Device 116 is configured to implement features of the present disclosure by executing a plurality of functions. The APD 116 is configured to implement a super resolution upscaler 332 that receives a low-resolution rendered frame 502 of video stream. The super resolution upscaler 332 spatially upscales the low-resolution rendered frame 502 by using temporal feedback (e.g., a previously upscaled frame(s) of the video stream) to reconstruct a high-resolution frame 508 representing the rendered frame. In paragraph [0062]-TOVEY discloses the upscaler 332 uses the current frame input and the previous frame input to generate a super resolution upscaled (and anti-aliased) frame 508 (also referred to herein as “upscaled frame 508” for brevity), which corresponds to the rendered frame 502, at the target presentation resolution) comprises:
generating a disocclusion map indicating whether a corresponding object is in a previous frame image through a region corresponding to each object of the current frame image (Fig. 6. Paragraph [0074]-TOVEY discloses the depth clip component 518 processes this input to produce a disocclusion mask/map 618 indicating disoccluded areas of the current rendered frame 502. As the camera moves from an initial position (previous frame) to a new position (current frame), a pixel that was initially occluded from the viewpoint of the camera's previous position can become visible (disoccluded) from the viewpoint of the camera's current position. The disocclusion mask 618 is a texture including a value indicating how much a corresponding pixel of the current frame 502 has been disoccluded. A value of 0 indicates that the pixel was entirely occluded in the previous frame and is now disoccluded, and a value of 1 indicates the pixel was fully visible in the previous frame and is fully visible in the current frame 502. Values between 0 and 1 indicate that the pixel was visible in the previous frame to an extent proportional to the value. Please also read paragraph [0071 and 0076]); and
reconstructing the current frame image into the reconstruction image of the current frame based on a previous frame image being masked based on the generated disocclusion map (Fig. 6. Paragraph [0082]-TOVEY discloses in the reproject and accumulate stage 611, the reproject and accumulate component 522 takes as input the disocclusion mask 618, the dilated motion vector buffer 614, the reactivity mask 602, the output buffer 506-1 of the previous frame. The reproject and accumulate component 522 processes this input to generate an output buffer (texture) 506-2 for the current frame 502 at the target presentation resolution/size and to also generate reprojected pixel locks (texture) 622 from the previous frame that are mappable to the current frame 502).
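For illustration only, the use of a disocclusion mask with the convention quoted above from TOVEY (0 for a pixel that was entirely occluded in the previous frame, 1 for a pixel that was fully visible, and intermediate values in between) can be sketched as follows. This is a minimal sketch under stated assumptions and is not TOVEY's reproject and accumulate stage: the function names `mask_previous` and `accumulate`, the `history_weight` parameter, and the linear blend are hypothetical and chosen only to show how a mask can gate the contribution of a previous frame.

```python
def mask_previous(prev_frame, disocclusion):
    """Scale each previous-frame pixel by its mask value in [0, 1].

    A mask value of 0 (newly disoccluded pixel) removes the history
    sample entirely; a value of 1 keeps it unchanged.
    """
    return [
        [p * m for p, m in zip(prev_row, mask_row)]
        for prev_row, mask_row in zip(prev_frame, disocclusion)
    ]


def accumulate(current, prev_frame, disocclusion, history_weight=0.5):
    """Blend current and previous pixels, trusting history only where
    the mask says the pixel was visible in the previous frame.

    Assumption: a simple linear blend with weight history_weight * mask.
    """
    output = []
    for cur_row, prev_row, mask_row in zip(current, prev_frame, disocclusion):
        row = []
        for c, p, m in zip(cur_row, prev_row, mask_row):
            w = history_weight * m
            row.append(w * p + (1.0 - w) * c)
        output.append(row)
    return output


# One-row example: the left pixel is newly disoccluded (mask 0),
# the right pixel was fully visible in the previous frame (mask 1).
current_frame = [[10.0, 10.0]]
previous_frame = [[20.0, 20.0]]
mask = [[0.0, 1.0]]
masked = mask_previous(previous_frame, mask)
blended = accumulate(current_frame, previous_frame, mask)
```

With this weighting, a newly disoccluded pixel takes its value from the current frame alone, while a previously visible pixel mixes in the history sample.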
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of FLEISHMAN in view of BAE of having a processor-implemented method, with the teachings of TOVEY of having wherein the reconstructing of the current frame image into the reconstruction image of the current frame comprises: generating a disocclusion map indicating whether a corresponding object is in a previous frame image through a region corresponding to each object of the current frame image; and reconstructing the current frame image into the reconstruction image of the current frame based on a previous frame image being masked based on the generated disocclusion map.
In the combination, FLEISHMAN’s method would thus have the reconstructing of the current frame image into the reconstruction image of the current frame comprise: generating a disocclusion map indicating whether a corresponding object is in a previous frame image through a region corresponding to each object of the current frame image; and reconstructing the current frame image into the reconstruction image of the current frame based on a previous frame image being masked based on the generated disocclusion map.
The motivation behind the modification would have been to obtain a method that improves semantic segmentation and upscaling of images, since both FLEISHMAN and TOVEY concern image analysis and image reconstruction. FLEISHMAN improves the accuracy of the semantic labels by using historical data and improves the efficiency of semantic segmentation, which, in turn, permits the semantic segmentation to be performed on smaller devices, while TOVEY’s systems and methods improve upscaling by using temporal feedback to reconstruct high-resolution images while maintaining and improving image quality compared to native rendering. Please see FLEISHMAN et al. (US 20190043203 A1), Paragraph [0032, 0040, and 0056] and TOVEY et al. (US 20230298133 A1), Abstract and Paragraph [0023].
Regarding claim 19, FLEISHMAN in view of BAE explicitly teach the apparatus of claim 12, although FLEISHMAN explicitly teaches the image reconstruction machine learning model (Fig. 4. Paragraph [0036]-FLEISHMAN discloses in the method and system disclosed herein, the recurrent 3D semantic segmentation algorithm may merge efficient geometric segmentation with the high performance 3D semantic segmentation. Using both dense SLAM (simultaneous localization and mapping) based on RGB-D data (data from RGB and depth cameras, e.g., Intel RealSense depth sensors) and the semantic segmentation with convolutional neural networks (CNN) in a recurrent way as described herein. Thus, the 3D semantic segmentation may include: (i) dense RGBD-SLAM for 3D reconstruction of geometry; (ii) CNN-based recurrent segmentation which receives as an input the current frame and 3D semantic information from previous frames; and (iii) a copy of the results of (ii) to the 3D semantic model. It can be stated that operation (ii) uses the past frames and performs both segmentation and update of the model).
FLEISHMAN in view of BAE fail to explicitly teach wherein the processor is further configured to: generate a disocclusion map indicating whether a corresponding object is in a previous frame image through a region corresponding to each object of the current frame image, wherein the input provided to the image reconstruction machine learning model is further based on a masking of the previous frame image based on the generated disocclusion map.
However, TOVEY explicitly teaches wherein the processor is further configured to:
generate a disocclusion map indicating whether a corresponding object is in a previous frame image through a region corresponding to each object of the current frame image (Fig. 6. Paragraph [0074]-TOVEY discloses the depth clip component 518 processes this input to produce a disocclusion mask/map 618 indicating disoccluded areas of the current rendered frame 502. As the camera moves from an initial position (previous frame) to a new position (current frame), a pixel that was initially occluded from the viewpoint of the camera's previous position can become visible (disoccluded) from the viewpoint of the camera's current position. The disocclusion mask 618 is a texture including a value indicating how much a corresponding pixel of the current frame 502 has been disoccluded. A value of 0 indicates that the pixel was entirely occluded in the previous frame and is now disoccluded, and a value of 1 indicates the pixel was fully visible in the previous frame and is fully visible in the current frame 502. Values between 0 and 1 indicate that the pixel was visible in the previous frame to an extent proportional to the value. Please also read paragraph [0071 and 0076]), wherein the input provided to the image reconstruction machine learning model is further based on a masking of the previous frame image based on the generated disocclusion map (Fig. 6. Paragraph [0082]-TOVEY discloses in the reproject and accumulate stage 611, the reproject and accumulate component 522 takes as input the disocclusion mask 618, the dilated motion vector buffer 614, the reactivity mask 602, the output buffer 506-1 of the previous frame. The reproject and accumulate component 522 processes this input to generate an output buffer (texture) 506-2 for the current frame 502 at the target presentation resolution/size and to also generate reprojected pixel locks (texture) 622 from the previous frame that are mappable to the current frame 502).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of FLEISHMAN in view of BAE of having an apparatus, with the teachings of TOVEY of having wherein the processor is further configured to: generate a disocclusion map indicating whether a corresponding object is in a previous frame image through a region corresponding to each object of the current frame image, wherein the input provided to the image reconstruction machine learning model is further based on a masking of the previous frame image based on the generated disocclusion map.
In the combination, FLEISHMAN’s apparatus would thus have the processor further configured to: generate a disocclusion map indicating whether a corresponding object is in a previous frame image through a region corresponding to each object of the current frame image, wherein the input provided to the image reconstruction machine learning model is further based on a masking of the previous frame image based on the generated disocclusion map.
The motivation behind the modification would have been to obtain an apparatus that improves semantic segmentation and upscaling of images, since both FLEISHMAN and TOVEY concern image analysis and image reconstruction. FLEISHMAN improves the accuracy of the semantic labels by using historical data and improves the efficiency of semantic segmentation, which, in turn, permits the semantic segmentation to be performed on smaller devices, while TOVEY’s systems and methods improve upscaling by using temporal feedback to reconstruct high-resolution images while maintaining and improving image quality compared to native rendering. Please see FLEISHMAN et al. (US 20190043203 A1), Paragraph [0032, 0040, and 0056] and TOVEY et al. (US 20230298133 A1), Abstract and Paragraph [0023].
Conclusion
The prior art listed below is made of record and not relied upon, but is considered pertinent to applicant's disclosure.
LIU et al. (US 20200202128 A1)- A method for computing a dominant class of a scene includes: receiving an input image of a scene; generating a segmentation map of the input image, the segmentation map including a plurality of pixels, each of the pixels being labeled with a corresponding class of a plurality of classes; computing a plurality of area ratios based on the segmentation map, each of the area ratios corresponding to a different class of the plurality of classes of the segmentation map; applying inference to generate a plurality of ranked labels based on the area ratios; and outputting a detected dominant class of the scene based on the plurality of ranked labels. Please see Fig. 2-4 and 6-7 and the Abstract.
LIU et al. (CN 117557714 A)- The invention relates to the technical field of three-dimensional modeling, in particular to a three-dimensional reconstruction method, electronic equipment and a readable storage medium. The method comprises the following steps: acquiring a multi-view image of a first object under natural illumination; performing three-dimensional geometric reconstruction based on the multi-view image to obtain a first model of the first object in a three-dimensional space coordinate system, wherein the first model is presented in a three-dimensional geometric model mode; performing UV expansion on the first model to obtain a UV map of the first model; and analyzing and processing the multi-view image, the first model and the UV map to generate a geometric material map and a PBR material map of the UV space of the first model. According to the three-dimensional reconstruction method provided by the embodiment of the invention, feature processing is performed on the multi-view image, the first model and the UV map of the first model, so that various material maps, namely geometric material maps and PBR material maps, of the UV space can be generated, and a more realistic and visually pleasing material appearance is produced. Please see Paragraph [0082-0083, 0148, 0181-0182, 0185, 0187, 0189, 0197 and 0320] and the Abstract.
LIN et al. (US 20220398747 A1)- Systems and techniques are described herein for performing optical flow estimation for one or more frames. For example, a process can include determining an optical flow prediction associated with a plurality of frames. The process can include determining a position of at least one feature associated with a first frame and determining, based on the position of the at least one feature in the first frame and the optical flow prediction, a position estimate of a search area for searching for the at least one feature in a second frame. The process can include determining, from within the search area, a position of the at least one feature in the second frame. Please see Fig. 1-2 and 5 and the Abstract.
ZHANG et al. (US 20220058452 A1)- Systems, methods, and non-transitory media are provided for providing spatiotemporal recycling networks (e.g., for video segmentation). For example, a method can include obtaining video data including a current frame and one or more reference frames. The method can include determining, based on a comparison of the current frame and the one or more reference frames, a difference between the current frame and the one or more reference frames. Based on the difference being below a threshold, the method can include performing semantic segmentation of the current frame using a first neural network. The semantic segmentation can be performed based on higher-spatial resolution features extracted from the current frame by the first neural network and lower-resolution features extracted from the one or more reference frames by a second neural network. The first neural network has a smaller structure and/or a lower processing cost than the second neural network. Please see Fig. 1-5. Abstract.
Weinzaepfel (US 20200160065 A1)- A method for training a convolutional recurrent neural network for semantic segmentation in videos, includes (a) training, using a set of semantically segmented training images, a first convolutional neural network; (b) training, using a set of semantically segmented training videos, a convolutional recurrent neural network, corresponding to the first convolutional neural network, wherein a convolutional layer has been replaced by a recurrent module having a hidden state. The training of the convolutional recurrent neural network, for each pair of successive frames (t−1, t ∈ ⟦1; T⟧²) of a video of the set of semantically segmented training videos includes warping an internal state of a recurrent layer according to an estimated optical flow between the frames of the pair of successive frames, so as to adapt the internal state to the motion of pixels between the frames of the pair and learning parameters of at least the recurrent module. Please see Fig. 4-6. Abstract.
WANG et al. (US 20220012536 A1)- A method, computer readable medium, and system are disclosed for creating an image utilizing a map representing different classes of specific pixels within a scene. One or more computing systems use the map to create a preliminary image. This preliminary image is then compared to an original image that was used to create the map. A determination is made whether the preliminary image matches the original image, and results of the determination are used to adjust the computing systems that created the preliminary image, which improves a performance of such computing systems. The adjusted computing systems are then used to create images based on different input maps representing various object classes of specific pixels within a scene. Please see Fig. 1 and 5-6. Abstract.
RATNER et al. (US 20170337711 A1)- Embodiments of the present disclosure relate to image processing. In at least one embodiment, a method comprises: receiving the video file; segmenting the video file, determining foreground in the video file, estimating motion in the video file, determining objects in the video file, partitioning the video file and encoding the video file. Please see Fig. 2 and 5 and para. [0111-0117, 0132-0140 and 0356]. Abstract.
MASSAL et al. (US 20230177649 A1)- Apparatuses, systems, and techniques are presented to reconstruct one or more images. In at least one embodiment, one or more objects in an image are caused to be generated based, at least in part, on applying one or more offsets to a motion of the one or more objects relative to one or more prior images. Please see Fig. 2-4. Abstract.
KULKARNI et al. (US 20240005587 A1)- Systems and methods for machine learning based controllable animation of still images is provided. In one embodiment, a still image including a fluid element is obtained. Using a flow refinement machine learning model, a refined dense optical flow is generated for the still image based on a selection mask that includes the fluid element and a dense optical flow generated from a motion hint that indicates a direction of animation. The refined dense optical flow indicates a pattern of apparent motion for the at least one fluid element. Thereafter, a plurality of video frames is generated by projecting a plurality of pixels of the still image using the refined dense optical flow. Please see Fig. 5-8. Abstract.
MCCORMAC et al. (US 20190147220 A1)- Certain examples described herein enable semantically-labelled representations of a three-dimensional (3D) space to be generated from video data. In described examples, a 3D representation is a surface element or ‘surfel’ representation, where the geometry of the space is modelled using a plurality of surfaces that are defined within a 3D co-ordinate system. Object-label probability values for spatial elements of frames of video data may be determined using a two-dimensional image classifier. Surface elements that correspond to the spatial elements are identified based on a projection of the surface element representation using an estimated pose for a frame. Object-label probability values for the surface elements are then updated based on the object-label probability values for corresponding spatial elements. This results in a semantically-labelled 3D surface element representation of objects present in the video data. This data enables computer vision and/or robotic applications to make better use of the 3D representation. Please see Fig. 5-8. Abstract.
Shacklett et al. (US 20220108421 A1)- Apparatuses, systems, and techniques are presented to generate images. In at least one embodiment, at least a first optical flow network (OFN) and at least a first reconstruction network (RN) can be used to generate one or more images based, at least in part, upon the OFN and the RN using a shared loss function. Please see Fig. 1-4. Abstract.
MALLYA et al. (US 20210374552 A1)- Apparatuses, systems, and techniques are presented to synthesize consistent images or video. In at least one embodiment, one or more neural networks are used to generate one or more second images based, at least in part, on one or more point cloud representations of one or more first images. Please see Fig. 4. Abstract.
LIU et al. (US 20200394752 A1)- Systems, methods, and computer-readable media for context-aware synthesis for video frame interpolation are provided. Bidirectional flow may be used in combination with flexible frame synthesis neural network to handle occlusions and the like, and to accommodate inaccuracies in motion estimation. Contextual information may be used to enable frame synthesis neural network to perform informative interpolation. Optical flow may be used to provide initialization for interpolation. Other embodiments may be described and/or claimed. Please see Fig. 1-2. Abstract.
Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Aaron Bonansinga, whose telephone number is (703) 756-5380. The examiner can normally be reached Monday-Friday, 9:00 a.m. - 6:00 p.m. ET.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chineyere Wills-Burns, can be reached by phone at (571) 272-9752. The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/AARON TIMOTHY BONANSINGA/Examiner, Art Unit 2673
/CHINEYERE WILLS-BURNS/Supervisory Patent Examiner, Art Unit 2673