DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 7 April 2026 has been entered.
Response to Amendment
The amendment filed on 7 April 2026 has been entered.
The amendment of claims 1 and 8 has been acknowledged.
Response to Arguments
Applicant’s arguments filed on 7 April 2026, with respect to the pending claims, have been fully considered but are moot because the arguments rely on newly added and/or amended claim limitations (e.g., depth and segmentation results). The examiner has revised the rejections to match the new claim limitations.
Claim Rejections - 35 USC § 103
Claims 1, 2, 6, 7, 8, 9, 13, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Besenbruch et al. (US 2023/0154055 A1), in view of Choi et al. (“RobustNet: Improving Domain Generalization in Urban-Scene Segmentation via Instance Selective Whitening,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11580-11590), hereinafter referred to as Besenbruch and Choi, respectively.
Regarding claim 1, Besenbruch teaches a multi-task learning device comprising:
a feature extraction layer configured to generate a first feature corresponding to a first image and generate a second feature corresponding to a second image by applying a feature extraction operation to the first image and the second image, wherein the first image is included in a first training data set corresponding to a first task, and the second image is included in a second training data set corresponding to a second task (Besenbruch Abstract: “encoding the input image using a first trained neural network … using a second trained neural network to produce an output image”; Besenbruch ¶¶0598: “Scalars are 0-dimensional …Vectors are 1-dimensional and denoted in boldface … typically comprise of inputs, biases, feature maps, latent, eigenvectors, and other quantities”; Besenbruch Fig. 10; Besenbruch ¶¶1384: “patch-wise features, feature-point extraction data such as FAST/SIFT … the feature-extraction reduces the problem’s dimensionality, from image space to feature space”);
a first decoding layer configured to generate a first task inference result corresponding to the first image by applying a first decoding operation to the first feature (Besenbruch Figs. 10, 12);
a second decoding layer configured to generate a second task inference result corresponding to the second image by applying a second decoding operation to the second feature (Besenbruch Figs. 10, 12);
a first loss layer configured to generate a first task loss with reference to the first task inference result and a first task ground truth (GT) result corresponding to the first task inference result (Besenbruch ¶¶0268: “ground-truth dependencies between x1, x2 are used as additional input”; Besenbruch ¶¶0271: “evaluating a loss function based on differences between the output pair”; Besenbruch Fig. 9 & ¶¶0383: “LS is the loss. The segmentation ground truth label xs may be of any type”; Besenbruch ¶¶0404: “FIG. 30 shows an example in which x represents the ground truth images, x̂ represents the distorted images and s represents the visual loss score. This figure represents a possible architecture to learn visual loss score”; Besenbruch ¶¶0578: “End-to-end training of the segmentation network together with the compression network may require ground truth labels for the desired segmentation output, or some type of ground truth label that can guide the segmentation module, whilst the compression network is training simultaneously. The training follows the bi-level principle, meaning that gradients from the compression network do not affect the segmentation module training, and the segmentation network gradients do not affect the compression network gradients”);
a second loss layer configured to generate a second task loss with reference to the second task inference result and a second task GT result corresponding to the second task inference result (Besenbruch Fig. 9, ¶¶0268, ¶¶0271, ¶¶0383, ¶¶0404, ¶¶0578 discussed above);
a feature loss layer configured to generate a feature loss with reference to the first feature and the second feature (Besenbruch ¶¶0864: “Perceptual losses, including the feature loss as described in existing literature between all intermediate layers of, but not limited to, any layers of a pre-trained classification networks”); and
a parameter updater configured to update parameters of at least some of the feature extraction layer, the first decoding layer, or the second decoding layer by using at least some of the first task loss, the second task loss, or the feature loss (Besenbruch ¶¶0510: “update the weight parameters of θ and Ω of the neural networks using the gradients ∂L/∂w”; Besenbruch ¶¶0563: “This method allows us to update the parameters of our model to a user-specific, desired goal. The goal is controlled by defining a loss-function that the network uses for backpropagation. Every parameter in the network is updated such that the loss is decreased as the network trains”; Besenbruch ¶¶1012: “This method relies on stochastic updates of the parameters of ƒGM based on an acceptance criterion”),
wherein the feature loss layer is further configured to:
generate a covariance matrix with reference to the first feature and the second feature; and generate the feature loss (Besenbruch ¶¶1239: “a multi-scale approach may be used, this is naturally the case when the transformation above provides multiple different scales, such as the case given the wavelet transform, where mutual information for each scale is computed and then aggregated. This approach may also be further generalised to a multivariate distribution where the tensor to be modelled is split into blocks (in spatial and or channel dimensions) of variable sizes and modeled using a multivariate normal distribution with a mean vector and co-variance matrix per block of elements”);
generate a first covariance matrix corresponding to the first feature of the first image and a second covariance matrix corresponding to the second feature of the second image (Besenbruch ¶¶0598: “consist of a mean vector and a covariance matrix”; Besenbruch ¶¶0668: “the quantity expressing the intervariable dependencies”; Choi Fig. 4(b); Choi Eq. (14)-(15) & pg. 11584 left column: “V consists of elements of the variance of each covariance element across various photometric transformations”);
generate the covariance matrix with reference to the first covariance matrix and the second covariance matrix (Besenbruch ¶¶0598, ¶¶0668; Choi Fig. 4(b), Eq. (14)-(15) & pg. 11584 left column discussed above);
generate an output matrix by applying an element-wise multiplication operation to the covariance matrix and a mask matrix (Besenbruch ¶¶0608: “the point-wise (Hadamard) multiplication operator ⊙”; Besenbruch ¶¶0672: “obtain L=A⊙M”; Besenbruch ¶¶1161: “zb=xb⊙s(xa)+m(xa)”; Besenbruch ¶¶1632: “a binary mask M”; Besenbruch ¶¶1634: “the mask can be optimized on a per-input basis”; Besenbruch ¶¶1686: “we can think of it as an element-wise multiplication with a tensor that is dependent on its input. For instance, without loss of generality, the ReLU function in Equation (15.9) can be thought of as an element-wise multiplication between the input x and a mask R, consisting of 1s and 0s, that has been conditioned on the input x (15.10). Thus, ReLU can be restated as (15.11), where ⊙ is the element-wise multiplication operation”); and
generate the feature loss with reference to the output matrix and a GT matrix corresponding to the output matrix (Besenbruch Fig. 9 & ¶¶0383 discussed above – LS is the loss and the xs is the ground truth),
wherein the first task inference result includes depth prediction result (Besenbruch ¶¶0275: “The method may be one wherein the loss function includes using a single image depth-map estimation … and then measuring the distortion”; Besenbruch ¶¶0817: “The auxiliary function can be trained to minimize a loss such as MSE or any other distance metric”; Besenbruch ¶¶0859: “we can also employ a method by which we feed in all three images of the 2FAC into a network, asking the network to predict distances for each, which we send into a fully connected network to predict the result of the 2FAC. FIG. 33 shows a possible configuration for this method”; Besenbruch ¶¶1090-¶¶1091: “To model this constraint, the AI-based Compression pipeline requires an additional (differentiable) loss term … Possible loss terms are: 1. Single image depth-map estimation of x1, x2, x̂1, x̂2, and then measuring the distortion between the depths maps”), and
wherein the second task inference result includes segmentation prediction (Besenbruch ¶¶0225: “The method may be one wherein the segmentation algorithm is used in a bi-level fashion”; Besenbruch Fig. 9 & ¶¶0383: “an example pipeline of the training of the Segmentation Module … if the module is parameterized as a neural network, where LS is the loss. The segmentation ground truth label xs may be of any type required by the segmentation algorithm”).
However, Besenbruch does not appear to explicitly teach generating a variance matrix.
Pertaining to the same field of endeavor, Choi teaches generating a variance matrix (Choi Fig. 4: see “Variance” & “LISW”; note that Choi also teaches covariance matrix, see Choi Fig. 4: “The covariance matrix Σs is masked by the matrix M̃ to selectively suppress style-sensitive covariance by LISW”; Choi pg. 11584 right column: “we propose an instance selective whitening (ISW) loss that selectively suppresses only to the style-encoded covariances. Let the mask matrix M in Eq. (11) change to M̃ … The networks continue training for the remaining epochs incorporating the proposed ISW loss”; Choi further teaches performing element-wise multiplication, see Choi Fig. 3(c) & pg. 11583 left column discussed above in the response to arguments).
Besenbruch and Choi are considered to be analogous art because they are directed to image processing using neural networks. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method and system for lossy image or video compression (as taught by Besenbruch) to generate a variance matrix between the features to determine the loss (as taught by Choi) because the combination can identify covariance elements with high variances that contain domain-specific style (Choi pg. 11584 left column).
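For reference, the feature loss limitations of claim 1 as mapped above (per-feature covariance matrices, a combined covariance matrix, element-wise masking, and comparison against a GT matrix) can be read as the following minimal NumPy sketch. It is an illustration under stated assumptions and is not taken from the application, Besenbruch, or Choi: the function names `channel_covariance` and `feature_loss` are hypothetical, the two covariance matrices are combined by simple averaging, the mask is a strict upper triangular matrix, the GT matrix is all zeros, and the loss is a mean absolute difference.

```python
import numpy as np


def channel_covariance(feature):
    """Return the (C, C) channel covariance of a (C, H, W) feature map."""
    channels = feature.shape[0]
    flat = feature.reshape(channels, -1)                # (C, H*W)
    centered = flat - flat.mean(axis=1, keepdims=True)  # zero-mean per channel
    return centered @ centered.T / flat.shape[1]


def feature_loss(first_feature, second_feature, mask, gt_matrix):
    """Hypothetical feature loss following the order of the claim limitations."""
    cov_first = channel_covariance(first_feature)
    cov_second = channel_covariance(second_feature)
    # Assumption: the combined covariance matrix is the average of the two.
    covariance = 0.5 * (cov_first + cov_second)
    # Element-wise (Hadamard) multiplication with the mask matrix.
    output = covariance * mask
    # Assumption: mean absolute difference between output and GT matrix.
    return float(np.abs(output - gt_matrix).mean())


rng = np.random.default_rng(0)
first_feature = rng.normal(size=(4, 8, 8))   # stand-in for the first feature
second_feature = rng.normal(size=(4, 8, 8))  # stand-in for the second feature
mask = np.triu(np.ones((4, 4)), k=1)         # assumed strict upper triangular mask
gt_matrix = np.zeros((4, 4))                 # assumed all-zero GT matrix
loss = feature_loss(first_feature, second_feature, mask, gt_matrix)
```

Under these assumptions, an all-zero mask yields a loss of zero, and the loss grows as the masked off-diagonal covariance entries move away from the GT matrix.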
Regarding claim 8, Besenbruch, in view of Choi, teaches that the multi-task learning device performs a multi-task learning method comprising the processes described in claim 1 (Besenbruch ¶¶0002: “computer-implemented methods and systems”). Therefore, claim 8 is rejected using the same rationale as applied to claim 1 discussed above.
Regarding claims 2 and 9, Besenbruch, in view of Choi, teaches the multi-task learning device and method of claims 1 and 8, wherein the parameter updater is configured to:
update parameters of the feature extraction layer and the first decoding layer by using the first task loss; and update parameters of the feature extraction layer and the second decoding layer by using the second task loss; and update parameters of the feature extraction layer by using the feature loss (Besenbruch ¶¶0510, ¶¶0563, ¶¶1012 discussed above & Algorithm 5.1).
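For reference, the routing of losses to layers recited in claims 2 and 9 can be read as the following minimal sketch. It is an illustration only: the names `ROUTING` and `apply_updates`, the plain gradient-descent step, the learning rate, and the gradient values are all hypothetical and are not drawn from the application or the cited references.

```python
import numpy as np

# Which layers each loss is allowed to update, per the limitation above.
ROUTING = {
    "first_task_loss": ("extractor", "decoder_1"),
    "second_task_loss": ("extractor", "decoder_2"),
    "feature_loss": ("extractor",),
}


def apply_updates(params, grads_by_loss, lr):
    """Apply one gradient-descent step per loss, restricted to its routed layers."""
    updated = {name: value.copy() for name, value in params.items()}
    for loss_name, layer_names in ROUTING.items():
        for layer in layer_names:
            updated[layer] -= lr * grads_by_loss[loss_name][layer]
    return updated


params = {
    "extractor": np.ones(3),
    "decoder_1": np.ones(3),
    "decoder_2": np.ones(3),
}
# Hypothetical gradients of each loss with respect to the layers it reaches.
grads_by_loss = {
    "first_task_loss": {"extractor": np.full(3, 0.1), "decoder_1": np.full(3, 0.2)},
    "second_task_loss": {"extractor": np.full(3, 0.3), "decoder_2": np.full(3, 0.5)},
    "feature_loss": {"extractor": np.full(3, 0.4)},
}
new_params = apply_updates(params, grads_by_loss, lr=0.5)
```

In this sketch the feature extraction layer receives contributions from all three losses, while each decoding layer is moved only by its own task loss.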
Regarding claims 6 and 13, Besenbruch, in view of Choi, teaches the multi-task learning device and method of claims 1 and 8, wherein the first training data set includes a label corresponding to the first task, and wherein the second training data set includes a label corresponding to the second task (Besenbruch ¶¶0578: “End-to-end training of the segmentation network together with the compression network may require ground truth labels for the desired segmentation output, or some type of ground truth label that can guide the segmentation module, whilst the compression network is training simultaneously”; Besenbruch ¶¶0852: “We provide an automated method to generate labelled data, which is used to pre-train our DVL network before it is trained on HLD … we generate the labels for our distorted data using the bit-rate”; Besenbruch ¶¶0853: “This method provides us with a plethora of labelled data, without the need for human evaluators. This labelled data can be used to train and pre-train our DVL network”).
Regarding claims 7 and 14, Besenbruch, in view of Choi, teaches the multi-task learning device and method of claims 1 and 8, wherein:
the feature extraction layer is further configured to generate a test feature corresponding to a test image by applying a feature extraction operation to the test image (Besenbruch Abstract: “the second computer system using a second trained neural network to produce an output image from the quantized latent, wherein the output image is an approximation of the input image”; Besenbruch ¶¶0487: “FIG. 113 shows an example high-level overview of a neural compression pipeline with encoder-decoder modules. Given the input data, the encoder spends encoding time producing a bitstream”);
the first decoding layer is further configured to generate a first task inference result corresponding to the test image by applying a first decoding operation to the test feature (Besenbruch Abstract, ¶¶0487, Besenbruch Figs. 10, 12, 31, 32); and
the second decoding layer is further configured to generate a second task inference result corresponding to the test image by applying a second decoding operation to the test feature (Besenbruch Abstract, ¶¶0487, Besenbruch Figs. 10, 12, 31, 32).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SOO J SHIN whose telephone number is (571)272-9753. The examiner can normally be reached M-F; 10-6.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella, can be reached at (571)272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Soo Shin/Primary Examiner, Art Unit 2667 571-272-9753
soo.shin@uspto.gov