DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are currently pending in U.S. Patent Application No. 18/634,316 and an Office action on the merits follows.
Claim Objections
Claims 10 and 15 are objected to because of the following informalities:
Claim 10 recites the language “wherein the second machine learning is,” which appears to omit the word “model” after the term “learning”. Claim 15 is the system claim corresponding to method claim 10 and features similar language (“the… learning is trained”) that apparently should read “the… learning model is trained”.
Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
1. Claims 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Marie et al. “Evaluation of Image Quality Assessment Metrics for Semantic Segmentation in a Machine-to-Machine Communication Scenario” in view of Zhang et al. “Quality Assessment of Screen Content Images via Convolutional-Neural-Network-Based Synthetic/Natural Segmentation”.
As to claim 16, Marie discloses a method of training a neural network, comprising:
generating an input image that includes content overlaying other media (generating Î based on Ci, Section III-A, in view of “The #D = 500 validation images from the Cityscapes [5] dataset are considered in this study, which contains urban landscapes as seen from a driving car perspective”; wherein ‘content’ may be, e.g., pedestrians, vehicles (car/bus/truck), buildings, traffic signs, etc., overlaying media as permissibly interpreted to include background scene elements, e.g., sky, road, terrain, etc.; see note below);
generating a segmentation mask based on the content included in the input image (generating segmentation masks P/P̂, Fig. 1); and
training the neural network to reproduce the segmentation mask based on the input image (progressive training as applied to segmentation models SCi to reproduce P̂ corresponding to P for each coding configuration Ci, page 4 section A Built dataset “DNN model DeepLabV3+ [2] with a ResNet50 [11] backbone is employed for this purpose, using MMSegmentation [4] implementation. We denote the model trained with pristine image I and GT labels as S0. Once trained, pseudo GT predictions P can be obtained by inputting the #D validation images I to the DNN model S0. As shown by the literature, a DNN trained on losslessly compressed images I such as original Cityscapes dataset generalizes poorly to compressed images Î, as DNN would encounter artifacts that were not present at training time [7], [17], [18]. To mitigate this bias, progressive training [18] is employed to obtain segmentation models SCi that are resilient to artifacts generated from coding configuration Ci, i ∈ {1, 2, 3 . . . ,#C}. In a nutshell, progressive training allows training one DNN model on multiple coding configurations at once by progressively strengthening the degradation level as training progresses. Once trained, predictions on compressed images P̂ are obtained by inputting the #D validation images I to SCi.”).
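By way of illustration only, the progressive training procedure quoted above (training one model on multiple coding configurations by progressively strengthening the degradation level) may be sketched as follows; the helper names train_step and encode_decode are hypothetical placeholders, not drawn from Marie or the MMSegmentation implementation:

```python
from typing import Callable, Sequence

def progressive_training(
    train_step: Callable,      # one supervised update on an (image, mask) pair
    encode_decode: Callable,   # simulates a codec round trip at a degradation level
    images: Sequence,
    masks: Sequence,
    num_levels: int = 4,
    epochs_per_level: int = 5,
) -> None:
    """Train one segmentation model across coding configurations, weakest first."""
    for level in range(1, num_levels + 1):  # progressively strengthen degradation
        for _ in range(epochs_per_level):
            for image, mask in zip(images, masks):
                # the model only ever sees compressed images, so it becomes
                # resilient to the artifacts of each coding configuration Ci
                train_step(encode_decode(image, level), mask)
```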
Applicant may assert that permissible interpretation of “content overlaying other media”, particularly in the context of claim 18, does not include what might be characterized as ‘natural images’ (or primarily natural image regions) similar to those of the cityscapes dataset, but instead ‘screen content images’.
Zhang evidences the obvious nature of generating training samples, also for use in the context of a quality assessment machine learning model (generating segmentation masks corresponding to each of a pristine/reference/original and a distorted/encoded version), comprising screen content images characterized by content overlaying other media (page 5113 Abs “The recent popularity of remote desktop software and live streaming of composited video has given rise to a growing number of applications that make use of the so-called screen content images that contain a mixture of text, graphics, and photographic imagery. Automatic quality assessment (QA) of screen-content images is necessary to enable tasks, such as quality monitoring, parameter adaptation, and other optimizations”, Section I “Consequently, the visual content presented to consumers is not limited to natural images, but can contain a mixture of sources which include natural images, computer-generated graphics, text, charts, maps, users’ hand-writings and drawings, and even some special symbols or patterns (e.g., logos, bar codes, QR codes, etc.) [1]. Such visual content is commonly referred to as screen content images (SCIs) … However, for many other types of SCIs, the compositing is performed at the sender side, and thus the entire SCI is subjected to distortion due to processing, transmission, and compression”, “To investigate QA of SCIs, three screen content image databases have been developed (i.e., SIQAD [7], [8], SCD [9], SCID [1]), and several screen content image quality assessment (SC-IQA) algorithms have been proposed”, etc.). Zhang further suggests that such imagery was increasingly popular (as of 2018), and that an IQA model benefits from being trained on datasets whose samples resemble whatever images are most and/or preferentially expected in a given use scenario (particularly because different elements thereof may be subject to differing levels of distortion without significant changes in perceived quality, page 5114). Zhang further sets forth the manner in which, particularly for those instances wherein the SCI is composited ‘sender side’, natural-image regions as well as plain text and computer graphics will all be subjected to distortion.
It would have been obvious to a person of ordinary skill in the art, before the effective filing date, to modify the system and method of Marie such that the training samples/data set relied upon is further modified to include screen content image datasets as taught/suggested by Zhang, the motivation being, as similarly taught/suggested therein and as readily recognized by a POSITA, that the use of such training samples better ensures a model that has learned the most correct/relevant patterns/weights by inference time, avoiding what might otherwise be a high degree of inaccuracy, skewed predictions, etc., caused by distribution shifts associated with changes in domain (NI vs. SCI and sub-classes therein), thereby better adapting the model of Marie for frequently encountered use scenarios given the prevalence of such imagery surrounding user smartphone use, remote desktop sharing, and collaborative video conferencing (Zhang Section I).
As to claim 17, Marie in view of Zhang teaches/suggests the method of claim 16.
Marie in view of Zhang further teaches/suggests the method wherein the content includes text, geometric shapes, or icons (Marie cityscapes dataset wherein content includes e.g. geometric shapes for various traffic signs, buildings, cross-walks, etc., but more preferably for the case of Marie as modified by Zhang, see Zhang SCI page 5113, Abs, Section I, “synthetic content (text/graphics)”, “text, charts, maps, users’ hand-writings and drawings, and even some special symbols or patterns (e.g., logos, bar codes, QR codes, etc.) [1] … but also computer-generated content such as text, charts, graphics, and icons, which have significantly different properties compared to natural content”, etc., in view of e.g. the SIQAD, SCD, SCID datasets).
As to claim 18, Marie in view of Zhang teaches/suggests the method of claim 16.
Marie in view of Zhang further teaches/suggests the method wherein the content comprises screen content (see Zhang SCI above).
As to claim 19, Marie in view of Zhang teaches/suggests the method of claim 16.
Marie in view of Zhang further teaches/suggests the method wherein the segmentation mask includes the content and excludes the other media (Zhang page 5114 left column “Another potential limitation is that the current segmentation scheme does not filter out flat/solid-color regions, while common sense suggests that these regions play little or less important roles in guiding the HVS in judging SCI quality”, Zhang Fig. 2, wherein not only are flat/solid-color areas excluded from the segmentation masking at Stage 1, but at Stage 2 QA of the SCI synthetic region is determined distinct from that for the natural region and the synthetic region includes only those patches belonging to (1)-(2), for an adaptive combination/fusion at Stage 3, Zhang page 5115 – Examiner notes for clarity that the Reference SCI of Zhang Fig. 2 corresponds to image I of Marie, and to Applicant’s received image prior to encoding, and the distorted/compressed SCI of Zhang Fig. 2 corresponds to Î of Marie, and to Applicant’s encoded image; Zhang discloses a two-mask-based IQA analogous to that of Marie and e.g. that of claim 1).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date, to further modify the system and method of Marie in view of Zhang such that the segmentation masks exclude one or more media elements as taught/suggested by Zhang, the motivation as similarly taught/suggested therein being that such an exclusion ensures those elements do not adversely contribute to any finally calculated quality measure.
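For illustration only, one plausible (hypothetical) reading of excluding flat/solid-color regions from a mask, consistent with Zhang’s observation quoted above; the patch size and variance threshold are arbitrary illustrative values, not taken from Zhang:

```python
import numpy as np

def exclude_flat_regions(mask: np.ndarray, image: np.ndarray,
                         patch: int = 8, var_threshold: float = 1.0) -> np.ndarray:
    """Zero out mask entries over flat/solid-color patches so those regions do
    not contribute to a downstream quality measure."""
    out = mask.copy()
    h, w = image.shape[:2]
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            # a near-zero intensity variance marks a flat/solid-color patch
            if image[y:y + patch, x:x + patch].var() < var_threshold:
                out[y:y + patch, x:x + patch] = 0
    return out
```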
As to claim 20, Marie in view of Zhang teaches/suggests the method of claim 16.
Marie in view of Zhang further teaches/suggests the method wherein the segmentation mask is associated with an alpha channel of the input image (Zhang page 5115 text and graphics regions (1) and (2), in view of the permissible interpretation of the language ‘associated with’, in further view of the manner in which SCI content typically includes content generated via alpha compositing/blending – such as text and computer graphics images/layers as blended with background elements (or optionally background portions that are entirely transparent); stated differently, the Examiner understands at least the SCID and SIQAD datasets to comprise such compound/composite images – see also Zhang Fig. 3, Fig. 7, etc., and accordingly segmentation masks based thereon will be ‘associated with’ an alpha channel controlling transparency/opacity at least for instances of such imagery).
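By way of illustration of the alpha compositing/blending relationship relied upon above, a minimal sketch; the thresholding rule deriving a mask from the alpha channel is the Examiner’s illustrative assumption, not a disclosure of Zhang:

```python
import numpy as np

def alpha_composite(fg: np.ndarray, bg: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Blend overlaid content over background media: C = alpha*F + (1 - alpha)*B."""
    a = alpha[..., None]                    # broadcast (H, W) alpha over channels
    return a * fg + (1.0 - a) * bg

def mask_from_alpha(alpha: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """A segmentation mask 'associated with' the alpha channel: nonzero wherever
    the overlaid content is opaque enough to be visible."""
    return (alpha > threshold).astype(np.uint8)
```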
2. Claims 1-5 and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (US 12,342,011 B1) in view of Marie et al. “Evaluation of Image Quality Assessment Metrics for Semantic Segmentation in a Machine-to-Machine Communication Scenario”.
As to claim 1, Wang discloses a method of image compression (Abs “A codec system dynamically selects a codec from multiple available codecs with a highest level of encoding quality at a computing device based at least in part on the available computing resources at a particular computing device”), comprising:
receiving an image for transmission over a communication channel (Fig. 1, receiving at e.g. computing device 102A/201 video (comprising a plurality of frames/images) to be encoded for transmission over channel/network 106 to e.g. device 102B, 102C, Fig. 5 lines 20-45, etc.);
encoding the image as a first encoded image based on a first image compression scheme (col 3 lines 1-15 “Multiple codecs available at a computing device 102A, 102B, 102C can include, but are not limited to, H264, H265, VP9, VP8, and/or AV1. Instead of the codec used being statically configured, a codec selection application (which can execute at the computing device 102A in some embodiments) can dynamically select a codec from multiple codecs. Moreover, a codec monitoring application (which can execute at the computing device 102A in some embodiments) can monitor the encoding and switch codecs based at least in part on the encoding performance”, etc.);
calculating a first visual quality metric for the first encoded image (Fig. 1, Encoding Score of 122, col 3 lines 30-40 “For example, the codec selection application can select the codec with a highest level of encoding quality that does not exceed eighty percent usage of computing resources of the computing device 102A. As another example, the codec selection application can select the codec with a highest level of encoding quality with an encoding time that does not drop more than five frames per second”, etc., in further view of the manner in which, as suggested in col 3 lines 10-20, delays perceptible to a user are accounted for in score embodiments based at least in part on (i) bit rate, (ii) total encoding time, and (iii) computing resource usage, col 9 lines 50-65, col 12 lines 1-10 “The codec monitoring application 224 can determine a slowness indicator based at least in part on the encoding time, the encoding frame rate, and the hardware processor load”, etc.); and
selectively transmitting the first encoded image over the communication channel based at least in part on the first visual quality metric (col 3 lines 40-50 “The first computing device 102A can use the selected codec to generate the encoded data 110A and transmit it over the network 106”, col 6 lines 20-40 “During encoding, the codec monitoring application 224 can monitor encoding performance at the computing device 102A, 102B, 102C. If the codec monitoring application 224 determines that a threshold is not satisfied, then the codec monitoring application 224 can switch the codec used for encoding at the computing device 102A, 102B, 102C”, col 11 lines 5-10 “the codec selection application 222 can select the codec with the highest encoding score, which can be customized for the particular computing device 102A”, etc.).
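For illustration only, Wang’s constrained selection of the highest-quality codec might be sketched as follows; the 0.80 cap mirrors the ‘eighty percent’ example quoted above, the frame-rate floor is an illustrative stand-in for the frame-drop example, and none of the identifiers are Wang’s:

```python
from dataclasses import dataclass

@dataclass
class CodecTrial:
    name: str           # e.g. "H264", "H265", "VP9"
    quality: float      # encoding-quality component of the encoding score
    cpu_usage: float    # fraction of computing resources consumed
    fps: float          # achieved encoding frame rate

def select_codec(trials: list[CodecTrial],
                 max_cpu: float = 0.80, min_fps: float = 25.0) -> str:
    """Pick the highest-quality codec whose resource usage and frame rate stay
    within the configured bounds."""
    feasible = [t for t in trials if t.cpu_usage <= max_cpu and t.fps >= min_fps]
    return max(feasible, key=lambda t: t.quality).name
```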
Wang fails to disclose the two inferring steps recited, and accordingly fails to disclose any equivalent fidelity metric based thereon.
Marie, however, evidences the obvious nature of a fidelity IQA comprising inferring a first segmentation mask from the image based on a first machine learning model (Marie page 3 Fig. 1, the first segmentation mask being the pseudo Ground Truth (GT) P, inferred from Image I based on model(s) S0, “S0 denotes a segmentation algorithm where model weights were trained using pristine images”);
inferring a second segmentation mask from the first encoded image based on the first machine learning model (Fig. 1, encoder/compression specific masks P̂ obtained from encoded image(s) Î);
calculating a first visual fidelity metric for the first encoded image based on the first segmentation mask and the second segmentation mask (Fig. 1 Image Quality Assessment (IQA) based on the green “Accuracy” comparison between the first mask P and the second mask P̂, page 4 Section B Considered IQA Metrics “Multiple FR IQA metrics are considered in this study, namely PSNR, SSIM [29] and its multiscale variant MultiScale Structural SIMilarity (MS-SSIM) [28], FSIM [34], SR-SIM [32], Gradient Magnitude Similarity Deviation (GMSD) [30] and its multiscale variant MultiScale Gradient Magnitude Similarity Deviation (MS-GMSD) [31], Visual Saliency based Index (VSI) [33], Haar Perceptual Similarity Index (HaarPSI) [21], Mean Deviation Similarity Index (MDSI) [37], LPIPS [35] and DISTS [6]. For LPIPS, the VGG [25] network is used”, page 6 V. Conclusion “In this paper, we evaluated correlation of FR IQA metrics with DNN prediction accuracy for the semantic segmentation vision task on various coding configurations including multiple image resolutions and multiple encoders”). Marie further suggests that such a two-mask-based fidelity metric may more accurately indicate which encoding schemes produce encoded images that may be successfully relied upon for various tasks/post-processing after e.g. transmission (Fig. 5, page 6 Section IV), and accordingly further suggests selectively utilizing one or more encoded images on the basis of such an IQA. See also Marie page 6 Section IV “Nowadays, most encoders are based on AVC, HEVC or VVC standards, which imply the use of a RDO algorithm to determine appropriate encoding decisions. RDO aims at minimizing blocks degradation according to a FR metric under a given rate constraint … Since all considered metrics fails to achieve high level of correlation with DNN accuracy, minimizing such metric may not translate in better DNN performance. … Based on the observation that existing metrics are not correlated with machine perception, it is desirable to propose a suitable metric in the VCM context”.
It would have been obvious to a person of ordinary skill in the art, before the effective filing date, to modify the system and method of Wang to further comprise those mask inferring steps recited and a quality metric based thereon as taught/suggested by Marie, further applicable in conjunction with those of Wang for encoder selection, the motivation as similarly taught/suggested in Marie that such a metric may provide better predictions (including DNN accuracy) on a local scale and further ensure the compressed stream/frames are better suited for any subsequent/downstream processing tasks (particularly if those tasks are also associated with machine perception).
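By way of illustration of the two-mask comparison relied upon from Marie, a minimal sketch; Marie reports an “Accuracy” comparison, and class-averaged intersection-over-union is assumed here only as a representative segmentation-accuracy measure:

```python
import numpy as np

def mask_fidelity(p: np.ndarray, p_hat: np.ndarray, num_classes: int) -> float:
    """Class-averaged IoU between the mask inferred from the original image (P)
    and the mask inferred from the encoded image (P-hat); a higher score
    indicates the encoding better preserved machine-perceptible structure."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(p == c, p_hat == c).sum()
        union = np.logical_or(p == c, p_hat == c).sum()
        if union:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```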
As to claim 2, Wang in view of Marie teaches/suggests the method of claim 1.
Wang in view of Marie further teaches/suggests the method wherein the first visual fidelity metric comprises a peak signal-to-noise ratio (PSNR), a PSNR based on properties of the human visual system (PSNR-HVS), a PSNR-HVS with visual masking (PSNR-HVS-M), a video multimethod assessment fusion (VMAF) metric, or a learned perceptual image patch similarity (LPIPS) metric (see Marie as applied, Marie page 4 Section B identified above for the case of claim 1).
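For completeness, the PSNR member of the recited list follows directly from its standard definition, sketched below for illustration only:

```python
import numpy as np

def psnr(reference: np.ndarray, encoded: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((reference.astype(np.float64) - encoded.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))
```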
As to claim 3, Wang in view of Marie teaches/suggests the method of claim 1.
Wang in view of Marie teaches/suggests the method further comprising:
encoding the image as a second encoded image based on a second image compression scheme different than the first image compression scheme (Wang any of those additional encoders different from a ‘first’, e.g. 108B-C if 108A is the first; see also Marie as applied Ci for a different/second i/scheme, Marie page 3 Section III-A “JPEG, JM, x265 and VVenC Lossy compression algorithms”);
inferring a third segmentation mask from the second encoded image based on the first machine learning model (Marie, P̂ for a Ci different from the first, corresponding to that identified above, Ci, i ∈ {1, 2, 3, . . . , #C});
calculating a second visual fidelity metric for the second encoded image based on the first segmentation mask and the third segmentation mask (Marie’s IQA for a different/second Ci, as applied to encoder outputs of 108B/C of Wang under that same rationale as identified above for the case of claim 1 – see Wang Fig. 1 122 each codec as a respective encoding score); and
determining whether the first visual fidelity metric or the second visual fidelity metric indicates a higher image quality, the first encoded image being selectively transmitted over the communication channel based at least in part on whether the first visual fidelity metric or the second visual fidelity metric indicates a higher image quality (Wang as modified, in view of Wang col 10 line 65 through col 11 line 10 “At sub-block 414, a codec can be selected. The codec selection application 222 can select the codec based at least in part on the respective encoding scores for each codec. In particular, the codec selection application 222 can select the first codec instead of the second codec based at least in part on the first encoding score and the seconding encoding score”, etc.).
As to claim 4, Wang in view of Marie teaches/suggests the method of claim 3.
Wang in view of Marie teaches/suggests the method wherein the selective transmitting of the first encoded image comprises:
refraining from transmitting the first encoded image over the communication channel responsive to determining that the second visual fidelity metric indicates a higher image quality (Wang refrains from further reliance on a previously selected encoder, in view of an updated/newly acquired score, col 12 lines 60-65 “Similar to the codec selection application 222 that calculates an encoding score, the codec monitoring application 224 can calculate an encoding score for the current codec over the sliding window”, col 13 lines 25-35 “At sub-block 426, a different codec can be selected instead of the current codec and the different codec can be used for encoding. The codec monitoring application 224”, etc.).
As to claim 5, Wang in view of Marie teaches/suggests the method of claim 3.
Wang in view of Marie teaches/suggests the method further comprising:
transmitting the second encoded image, in lieu of the first encoded image, over the communication channel responsive to determining that the second visual fidelity metric indicates a higher image quality (Wang col 11 lines 1-10 for an instance in which the ‘second’ encoder is associated with the higher/otherwise selected score, col 10 line 65 through col 11 line 10 “At sub-block 414, a codec can be selected. The codec selection application 222 can select the codec based at least in part on the respective encoding scores for each codec”, etc.).
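By way of illustration of the combined Wang/Marie selection flow addressed in claims 3-5, a minimal sketch in which every argument is a hypothetical callable standing in for the modified system, not an element of either reference:

```python
from typing import Callable, Sequence

def transmit_best(image,
                  encoders: Sequence[Callable],  # each maps image -> encoded image
                  seg_model: Callable,           # first ML model: image -> mask
                  fidelity: Callable,            # (mask, mask) -> fidelity score
                  send: Callable) -> None:
    """Encode under each scheme, score each result against the mask inferred
    from the original image, and transmit only the highest-scoring encoding
    (refraining from transmitting the rest)."""
    reference_mask = seg_model(image)                    # first segmentation mask
    candidates = [encode(image) for encode in encoders]  # first/second encoded images
    scores = [fidelity(reference_mask, seg_model(c)) for c in candidates]
    send(candidates[scores.index(max(scores))])
```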
As to claim 11, this claim is the system claim (comprising generic processor and memory structure) corresponding to the method of claim 1 and is rejected accordingly.
As to claim 12, this claim is the system claim corresponding to the method of claim 3 and is rejected accordingly.
3. Claims 6-10 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (US 12,342,011 B1) in view of Marie et al. “Evaluation of Image Quality Assessment Metrics for Semantic Segmentation in a Machine-to-Machine Communication Scenario” and Zhang et al. “Quality Assessment of Screen Content Images via Convolutional-Neural-Network-Based Synthetic/Natural Segmentation”.
As to claim 6, Wang in view of Marie teaches/suggests the method of claim 1.
Wang in view of Marie further teaches/suggests the method wherein the first machine learning model is trained to extract a first type of content from one or more input images (Marie Fig. 1, encoder/compression specific masks P̂ obtained from encoded image(s) Î, and P from I, featuring those various object classes illustrated – in further view of MMSegmentation [4] implementation disclosure in Marie, which allows users to define specific classes).
Zhang further discloses a model trained to extract types/classes of content in a SCI context (see Zhang as applied in the rejection of claim(s) 16-17 above, Zhang Fig. 2 CNN-SOE model extracting those content-specific masks of e.g. Fig. 3, Fig. 7, etc.).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date, to further modify the system and method of Wang in view of Marie, in view of Zhang as identified above for the case of claim 16, under that same rationale previously presented.
As to claim 7, Wang in view of Marie and Zhang teaches/suggests the method of claim 6.
Wang in view of Marie and Zhang further teaches/suggests the limitations of claim 7 (see Zhang as applied in the rejection of claim 18 above).
As to claim 8, Wang in view of Marie and Zhang teaches/suggests the method of claim 6.
Wang in view of Marie and Zhang further teaches/suggests the limitations of claim 8 (see Zhang as applied in the rejection of claim 17 above).
As to claim 13, this claim is the system claim corresponding to the method of claim 6 and is rejected accordingly.
Additional References
Prior art made of record and not relied upon that is considered pertinent to applicant's disclosure:
Additionally cited references (see attached PTO-892) otherwise not relied upon above have been made of record in view of the manner in which they evidence the general state of the art.
Allowable Subject Matter
Claim(s) 9-10 and 14-15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
As to claim 9, Wang in view of Marie and Zhang further teaches/suggests inferring a third segmentation mask from the image based on a second machine learning model different than the first machine learning model (a mask inferred based on Zhang’s model, different from Marie’s model, for the same/corresponding reference SCI (corresponding to I) (uncompressed/pristine/non-distorted));
inferring a fourth segmentation mask from the first encoded image based on the second machine learning model (a mask for distorted SCI (corresponding to Î of Marie) as determined by Zhang’s model); and
calculating a second visual fidelity metric for the first encoded image based on the third segmentation mask and the fourth segmentation mask (Zhang’s IQA metric as distinguished from Marie’s). Literature of record broadly features various methods/models for performing image quality assessment (IQA), in both the context of natural images and screen content images (SCI). Examiner suspects that it might be possible to integrate any number of these various algorithms (i.e., implement both Zhang’s model and Marie’s model, without modification to either, to determine two distinct IQA scores pertinent to a single encoder/compression/distortion scheme, and the same for a plurality of encoders/schemes as each of Zhang and Marie disclose), thereby deriving a multitude of such metrics, and to weight/combine them in any manner as a matter of design choice, to ultimately derive a combined metric guiding selection of an optimal encoder/encoding scheme for any of a broad array of potential uses post transmission. Even Zhang suggests a fused/combined score on the basis of at least two metrics, i.e., synthetic-region and natural-region specific metrics. So too does Yang’s SPQA method, “Perceptual quality assessment of screen content images”, as described at Zhang page 5116. It is arguably unclear what threshold number of such scores/models would be non-obvious in the proposed combination above; however, the references of record fail to explicitly disclose such a multi-fidelity-score basis (at most, Wang discloses in e.g. col 13 considering a quality metric in conjunction with various slowness indicator embodiments) when selectively determining an encoded image for transmission, and claim(s) 9-10 (and corresponding 14-15) are identified as Allowable Subject Matter accordingly.
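By way of illustration of the multi-score fusion discussed above, a minimal sketch; the weighting is purely a hypothetical design choice, and the score names are placeholders rather than anything of record:

```python
def fused_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Normalized weighted combination of per-model IQA scores, e.g. one score
    from a natural-image pipeline and one from a screen-content pipeline."""
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total

# e.g., equal weighting of two hypothetical per-model scores:
# fused_score({"model_a": 0.82, "model_b": 0.74}, {"model_a": 0.5, "model_b": 0.5})
```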
Inquiry
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IAN L LEMIEUX whose telephone number is (571)270-5796. The examiner can normally be reached Mon - Fri 9:00 - 6:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chan Park can be reached on 571-272-7409. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/IAN L LEMIEUX/Primary Examiner, Art Unit 2669