DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. The Amendment filed 23 October 2025 has been entered and considered. Claims 1, 7, and 10-11 have been amended. Claims 12-15 have been added. Thus, claims 1-15 are all the claims pending in the application. Claims 1-4, 6-10, and 12-13 are rejected. Claims 5, 11, and 14-15 are objected to.
Response to Amendment
Claim Objections
In view of the amendments to claims 10 and 11, the claim objections are withdrawn.
Prior Art Rejections
In view of the amendments to independent claims 1 and 7, the previously applied prior art rejections are withdrawn. Applicant’s arguments are rendered moot in view of the new grounds of rejection set forth below.
Claim Rejections - 35 USC § 103 (Part 1)
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1 and 6-7 are rejected under 35 U.S.C. 103 as being obvious over “ERIC: Extracting Relations Inferred from Convolutions” by Townsend et al. (cited in the IDS filed 4/5/22; hereinafter “ERIC”) in view of “Visualizing and Understanding Convolutional Networks” by Zeiler et al. (hereinafter “Zeiler”) and further in view of “Explaining Image Classifiers by Removing Input Features Using Generative Models” by Agarwal et al. (cited in the IDS filed 4/5/22; hereinafter “Agarwal”).
According to MPEP § 2153.01(a), “If…the application names fewer joint inventors than a publication (e.g., the application names as joint inventors A and B, and the publication names as authors A, B and C), it would not be readily apparent from the publication that it is an inventor-originated disclosure and the publication would be treated as prior art under AIA 35 U.S.C. 102(a)(1)”.
The Examiner notes that ERIC appears to have a common author/inventor (Joe/Joseph Townsend) with the instant application. However, two additional people (Theodoros Kasioumis and Hiroya Inakoshi) are listed as co-authors of the ERIC publication. Therefore, it is not readily apparent from the ERIC publication that it is an inventor-originated disclosure. Based upon the earlier publication date of the ERIC reference (10/19/20) relative to the effective filing date of the subject application (8/2/21), according to MPEP 2153.01(a), it constitutes prior art under 35 U.S.C. 102(a)(1).
Further, the publication date of the reference falls within the one-year grace period before the effective filing date of the subject application (8/2/21). Accordingly, this rejection under 35 U.S.C. 103 might be overcome by: (1) a showing under 37 CFR 1.130(a) that the subject matter disclosed in the reference was obtained directly or indirectly from the inventor of this application and is thus not prior art in accordance with 35 U.S.C. 102(b)(1)(A); or (2) a showing under 37 CFR 1.130(b) of a prior public disclosure under 35 U.S.C. 102(b)(1)(B). See generally MPEP § 717.01.
As to independent claim 1, ERIC discloses a computer-implemented image classification method (Abstract and Section 4.1 and Fig. 2 disclose that ERIC is directed to approximating the behavior of kernels across multiple layers of a convolutional neural network using a logic program, wherein the CNN is trained to perform an image classification task) comprising: obtaining a convolutional neural network, CNN, trained to classify features in images using a training image dataset (Section 3 discloses obtaining a CNN M which is trained for an image classification task, which presupposes the use of a training image dataset); extracting a logic program from the CNN, the logic program being a symbolic approximation of outputs of kernels at an extraction layer of the CNN, and deriving from the logic program rules which use the kernels to explain the classification of images by the CNN (Section 3 discloses an “extracted program” or “logic program M*” which approximates kernel activations at the convolutional layers of the CNN M in order to “extract rules which describe the relationships between kernels”; See Table 2 which shows the extracted rules in which the kernels are represented symbolically by character combinations such as “LW” and “SG”; Section 4.5 further discloses that the rules “approximate the [classification] CNN’s behavior”; Section 1 further discloses the overall goal of “explainable AI”); obtaining a feature-labeled image dataset, and a record of each feature associated with each feature-labeled image in the feature-labeled image dataset, wherein the images in the feature-labeled image dataset include feature-labeled images, a first feature-labeled image being of a scene containing the feature associated with the first feature-labeled image and a second feature-labeled image of a scene without the feature (Sections 4.5-4.6 disclose a visualization and labeling process in which the kernels are labeled with feature names in order to translate the extracted rules from symbolic form to one understandable to a human; the images are manually annotated based on the features therein, e.g., “wall”, “car”, “cliff”, “crowd”, etc.; some of the images can be seen in Fig. 6; note that many of the “car” images are absent a “cliff” and vice-versa); forward-propagating the feature-labeled images through the logic program to obtain kernel activations at the extraction layer for features in the images; and calculating a correlation between each kernel in the logic program and each feature in the feature-labeled images using the obtained kernel activations and the features associated with the feature-labeled images; assigning to each kernel in the logic program the label of the feature with which the kernel has the highest correlation (Sections 4.5-4.6 disclose that ERIC “choose[s] a kernel’s label based on the 10 images that activate it most strongly with respect to l1 norm values obtained from a forward pass” through the logic program); and applying the assigned kernel labels to the kernels in the derived rules to obtain kernel-labeled rules (Section 4.6 discloses “assigning the labels in fig. 6 to the rules in table 2”, thus translating the symbolic rules into labeled rules that can be understood by a person; See Fig. 1 in which “Labels” are applied to the “Logic program M*”).
ERIC discloses that the rules are “conditioned on positive and negative instances” (Section 3), and Fig. 6 shows that the images used to perform the labeling are selected such that images containing a feature corresponding to one label do not have features corresponding to the others. However, ERIC does not expressly disclose that these images are pairs of a same scene, one with and one without the feature. That is, ERIC does not expressly disclose that the feature-labeled image dataset comprises pairs of feature-labeled images, the first feature-labeled image of the pair being of a scene containing the feature and the second feature-labeled image of the pair being of the same scene without the feature and including, instead of the feature, a portion of the scene which is occluded by the feature in the first feature-labelled image, wherein the pairs of feature-labeled images are forward-propagated through the logic program.
Zeiler, like ERIC, is directed to understanding convolutional neural networks by analyzing activations of the kernels therein and by manual visualization (Abstract and Section 2). Specifically, Zeiler contemplates occluding different portions of an input image with a grey square and monitoring the output of a classification CNN (Section 4.2 and Fig. 7). For example, Zeiler discloses that the activation of the feature map having the strongest response in an unoccluded image of a dog with the classification “Pomeranian” shows the strongest drop in activation when the grey square covers the face of the dog (Section 4.2 and Fig. 7). In that scenario, the classifier also shows the largest drop in probability of the correct class (Section 4.2 and Fig. 7).
That is, Zeiler discloses pairs of feature-labeled images, one feature-labeled image of the pair being of a scene containing a feature (unoccluded image of the dog including the face “feature” of the dog) and the other feature-labeled image of the pair being of the same scene without the feature (occluded image of the dog in which its face “feature” is covered by the grey square), the pairs of feature-labeled images being forward-propagated through the network (both images are fed forward through the CNN, and the activation of the kernel of interest decreases strongly, as does the probability of the correct class, when the face of the dog is occluded), as claimed.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify ERIC to use pairs of images – one including a feature and the other occluding the feature – in the process of visualizing and labeling the kernel activations, as taught by Zeiler, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. More specifically, ERIC’s use of images having non-overlapping features for classification CNN visualization and labeling as modified by Zeiler’s use of a pair of images of a same scene – one including a feature and the other occluding the feature – in the process of visualizing effects on a classification CNN can yield a predictable result of visualizing the feature which has a strongest effect on a particular kernel of the CNN since both references teach precisely that. Thus, a person of ordinary skill would have appreciated including Zeiler’s pairs of images in ERIC’s dataset since the claimed invention is merely a combination of old elements, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
Agarwal, like ERIC and Zeiler, is directed to explaining the outputs of a convolutional neural network image classifier (Abstract). Agarwal references known techniques for doing so, including a technique similar to that taught by Zeiler which “slide[s] a gray, occlusion patch across the image and record[s] the probability changes as attribution values in corresponding locations in the heatmap” (Section 3.3 and Fig. 2(b-c)). Agarwal notes that such techniques and other similar known techniques “often produce unrealistic, out-of-distribution images” that yield heatmaps that are “unreliable…and not faithful” (Section 1). Accordingly, Agarwal et al. “propose to harness a state-of-the-art generative inpainting model…to remove pixels from an input image and fill in with content that is plausible” (Section 1). For example, the inpainting model is used to “fill in the background region” of an image from which the classification target object is removed, thus arriving at an image pair including a “real” image with the target object and a “delete” image in which the target object has been removed and replaced with background from the scene (See Fig. 2a and 2d which form a real and delete image pair, and Fig. 2e and 2h which form another real and delete image pair). Agarwal discloses testing such image pairs with the CNN image classifier to evaluate how the “in-filled image x^ (where the bird has been removed) is perceptually dissimilar to the original image”, and finding that “the inpainted images are consistently more dissimilar from the real images compared to the blurred and grayed-out images” (Section 4.2). 
That is, Agarwal discloses an image dataset comprising pairs of images, the first image of the pair being of a scene containing the feature and the second image of the pair being of the same scene without the feature and including, instead of the feature, a portion of the scene which is occluded by the feature in the first feature-labelled image (in-filled image x^ in which the target object has been replaced with inpainted background and original image x in which the target object is present are evaluated by the CNN classifier in order to explain the classifier’s outputs).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of ERIC and Zeiler to replace Zeiler’s grey-square-based occlusion method with Agarwal’s background inpainting method to arrive at the claimed invention discussed above. Such a modification is the result of simple substitution of one known element for another producing a predictable result. More specifically, Zeiler’s grey-square-based occlusion method and Agarwal’s background inpainting method perform the same general and predictable function, the predictable function being removing salient aspects of a target object in an image in order to explain how a CNN-based classifier generates its outputs. Since each individual element and its function are shown in the prior art, albeit in separate references, the difference between the claimed subject matter and the prior art rests not on any individual element or function but in the combination itself, that is, in the substitution of Agarwal’s background inpainting method for Zeiler’s grey-square-based occlusion method. Thus, the simple substitution of one known element for another producing a predictable result renders the claim obvious. It is predictable that such a modification would have resulted in “improving the accuracy and reliability of explanation methods” such as the one taught by Zeiler (See Section 1 of Agarwal).
As to claim 6, ERIC as modified above further teaches a non-transitory computer-readable medium including instructions which, when executed by a computer, cause the computer to carry out the method of claim 1 (ERIC’s disclosed algorithm is necessarily performed by a computer running software which is necessarily stored on a computer-readable medium).
Independent claim 7 recites an image classification apparatus comprising: at least one memory; and at least one processor, connected to the memory (ERIC’s disclosed algorithm is necessarily performed by a computer which necessarily includes a processor and memory), to perform the steps recited in the method of independent claim 1. Accordingly, claim 7 is rejected for reasons analogous to those discussed above in conjunction with claim 1.
Claims 2 and 8 are rejected under 35 U.S.C. 103 as being obvious over ERIC in view of Zeiler and Agarwal and further in view of U.S. Patent Application Publication No. 2021/0232915 to Dalli et al. (hereinafter “Dalli”).
As to claim 2, ERIC as modified by Zeiler and Agarwal does not expressly disclose that images in the feature-labeled image dataset include still frames from at least one video recording.
Dalli, like ERIC, is directed to explaining convolutional neural networks by identifying rules using kernels that have been activated (Abstract and [0049, 0052-0053, 0063, 0066-0067, 0076, 0079]) and by then labeling the kernels using a “Kernel Labeling method” ([0097-0111]). Dalli discloses that the Kernel Labeler can use video ([0107]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of ERIC, Zeiler and Agarwal such that the kernel labeling process is performed using video frames, as taught by Dalli, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. It is predictable that the proposed modification would have provided explainable “video understanding” ([0113] of Dalli).
Claim 8 recites features nearly identical to those recited in claim 2. Accordingly, claim 8 is rejected for reasons analogous to those discussed above in conjunction with claim 2.
Claims 3 and 9 are rejected under 35 U.S.C. 103 as being obvious over ERIC in view of Zeiler, Agarwal and Dalli and further in view of “Obstruction Level Detection of Sewer Videos Using Convolutional Neural Networks” by Gutierrez-Mondragon et al. (hereinafter “Gutierrez-Mondragon”).
As to claim 3, the proposed combination of ERIC, Zeiler, Agarwal and Dalli does not expressly disclose that the at least one video recording was captured by a closed-circuit television, CCTV, camera. However, Gutierrez-Mondragon discloses a CNN classifier (similar to that of ERIC) which classifies the state of sewage pipes using images from a Closed-Circuit Television system (Abstract). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of ERIC, Zeiler, Agarwal and Dalli to capture the images using a CCTV camera, as taught by Gutierrez-Mondragon, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. It is predictable that the proposed modification would have extended the explainability of CNNs to sewer videos.
Claim 9 recites features nearly identical to those recited in claim 3. Accordingly, claim 9 is rejected for reasons analogous to those discussed above in conjunction with claim 3.
Claims 12-13 are rejected under 35 U.S.C. 103 as being obvious over ERIC in view of Zeiler, Agarwal, Dalli, and Gutierrez-Mondragon and further in view of U.S. Patent Application Publication No. 2019/0370587 to Burachas et al. (hereinafter “Burachas”).
As to claim 12, the proposed combination of ERIC, Zeiler, Agarwal, Dalli, and Gutierrez-Mondragon does not expressly disclose that the feature-labeled image dataset includes images annotated for semantic segmentation, and the record of each feature associated with each feature-labeled image in the feature-labeled image dataset includes a value corresponding to a total area occupied by the feature in the image.
Burachas, like ERIC, is directed to explaining the results of AI models and image classification CNNs, in particular (Abstract, [0004, 0023, 0046, 0054, 0069]). Burachas discloses that semantic segmentation masks for the input images are generated along with annotated attention weights for meaningful sub-regions, and the semantic segmentation masks are used to visualize each object, part, sub-part, etc. ([0066, 0080-0088]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of ERIC, Zeiler, Agarwal, Dalli, and Gutierrez-Mondragon to additionally include semantic segmentation masks for the images in the dataset along with annotated attention weights for each feature therein, as taught by Burachas, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. It is predictable that the proposed modification would “provide a more comprehensive and/or informative user experience that facilitates user trust and understanding of the VQA model” ([0006] of Burachas).
Claim 13 recites features nearly identical to those recited in claim 12. Accordingly, claim 13 is rejected for reasons analogous to those discussed above in conjunction with claim 12.
Claims 4 and 10 are rejected under 35 U.S.C. 103 as being obvious over ERIC in view of Zeiler and Agarwal and further in view of Burachas.
As to claim 4, the proposed combination of ERIC, Zeiler and Agarwal does not expressly disclose that the feature-labeled image dataset includes images annotated for semantic segmentation, and the record of each feature associated with each feature-labeled image in the feature-labeled image dataset includes a value corresponding to a total area occupied by the feature in the image.
Burachas, like ERIC, is directed to explaining the results of AI models and image classification CNNs, in particular (Abstract, [0004, 0023, 0046, 0054, 0069]). Burachas discloses that semantic segmentation masks for the input images are generated along with annotated attention weights for meaningful sub-regions, and the semantic segmentation masks are used to visualize each object, part, sub-part, etc. ([0066, 0080-0088]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of ERIC, Zeiler and Agarwal to additionally include semantic segmentation masks for the images in the dataset along with annotated attention weights for each feature therein, as taught by Burachas, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. It is predictable that the proposed modification would “provide a more comprehensive and/or informative user experience that facilitates user trust and understanding of the VQA model” ([0006] of Burachas).
Claim 10 recites features nearly identical to those recited in claim 4. Accordingly, claim 10 is rejected for reasons analogous to those discussed above in conjunction with claim 4.
Claim Rejections - 35 USC § 103 (Part 2)
Claims 1, 6, and 7 are rejected under 35 U.S.C. 103 as being obvious over European Patent Application Publication No. 3291146 to Townsend (cited in the IDS filed 4/5/22; hereinafter “Townsend”) in view of “Interpreting CNN Knowledge via an Explanatory Graph” by Zhang et al. (cited in the IDS filed 4/5/22; hereinafter “Zhang”), further in view of “Visualizing and Understanding Convolutional Networks” by Zeiler et al. (hereinafter “Zeiler”), and further in view of “Explaining Image Classifiers by Removing Input Features Using Generative Models” by Agarwal et al. (cited in the IDS filed 4/5/22; hereinafter “Agarwal”).
The Examiner notes that Townsend, like ERIC, has a common inventor (Joseph Townsend) with the instant application. However, the publication date of Townsend (3/7/18) is outside of the one-year grace period before the effective filing date of the subject application (8/2/21). Thus, this rejection cannot be overcome by the means discussed above with respect to ERIC. This rejection and the additional set of rejections that follow (Part 2) are made in view of the high likelihood that ERIC will be disqualified as a prior art reference by way of affidavit or declaration.
As to independent claim 1, Townsend discloses a computer-implemented image classification method (Abstract discloses that Townsend is directed to assigning labels to features represented by convolutional filters of a CNN, wherein the CNN performs an image classification task; Fig. 13 and [0043] disclose that the method is implemented by a computer) comprising: obtaining a convolutional neural network, CNN, trained to classify features in images using a training image dataset ([0023-0024] discloses obtaining a CNN 200 which outputs a classification for input images; [0004, 0007] disclose that the CNN is trained which presupposes the use of a training image dataset); extracting a logic program from the CNN, the logic program being a symbolic approximation of outputs of kernels at an extraction layer of the CNN ([0024, 0030] discloses translating the CNN into a neural-symbolic network 300 representative of the convolutional filters 20 (which read on the claimed kernels) of the CNN 200 which contributed to classification of the input images; See Fig. 8), and deriving from the logic program rules which use the kernels to explain the classification of images by the CNN ([0023-0024, 0030] discloses that knowledge extraction methods are applied to the neural-symbolic network to “extract rules which provide an explanation for the classification of the input data”; See Fig. 9); forward-propagating feature-labeled images through the logic program to obtain kernel activations at the extraction layer for features in the images; assigning to each kernel in the logic program the label ([0005-0006, 0027-0032] discloses that input images are input to the CNN 200 for classification, and the active filters 20 which contribute to the classification of the image are labeled, thus resulting in the labeled filters of Fig. 10 which include, for example, “Circle”, “Triangle”, “Cat’s head”, “tail”, and “unknown”); and applying the assigned kernel labels to the kernels in the derived rules to obtain kernel-labeled rules ([0033] discloses that the labeled filters are mapped to the extracted rules to “describe why the classification of ‘cat’ was assigned to the image input into the CNN”; see Figs. 11-12).
Townsend does not expressly disclose obtaining a feature-labeled image dataset, and a record of each feature associated with each feature-labeled image in the feature-labeled image dataset; calculating a correlation between each kernel in the logic program and each feature in the feature-labeled images using the obtained kernel activations and the features associated with the feature-labeled images; or that each kernel is assigned the label of the feature with which the kernel has the highest correlation.
Zhang, like Townsend, is directed to explaining a CNN based on kernel/filter activation relationships (Abstract). Zhang represents patterns in a CNN as an explanatory graph (similar to Townsend’s neural-symbolic network), wherein each node is tested to see whether it consistently represents the same object part across different images (Section entitled “Experiment 2: Semantic interpretability of patterns”). In particular, based on an image set I input to the CNN, N activation peaks are detected, some of which represent common parts of the object – or peak patterns – while some correspond to noise (Section entitled “Algorithm”). Image regions are drawn around each peak pattern, and human users evaluate and annotate the semantic purity of each pattern – i.e., whether each peak pattern described the same object part (Fig. 8). The top-D nodes are assigned the corresponding semantic labels for the filters, and as such, the explanatory graph is semantically labeled (Section entitled “Experiment 2: Semantic interpretability of patterns” and Fig. 10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Townsend to obtain an image set representing common object parts and to feed the image dataset through the CNN to calculate a semantic purity (claimed “correlation”) between each kernel D and the corresponding object parts (claimed “features”) such that the labels in the explanatory graph are assigned according to the top scores, as taught by Zhang, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. More specifically, Townsend’s kernel labeling as modified by Zhang’s kernel/pattern labeling based on a calculated correlation between the kernel and a semantic part can yield a predictable result of semantically labeling kernels in an explanation graph of a CNN since both references teach the same. Thus, a person of ordinary skill would have appreciated including in Townsend’s kernel labeling the ability to select a part label having a highest correlation with the kernel, as taught by Zhang, since the claimed invention is merely a combination of old elements, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
Townsend and Zhang do not specify that the images and their part labels used for the kernel labeling process are pairs of images – one including the part, and the other not including the part. That is, the proposed combination of Townsend and Zhang does not expressly disclose wherein the images in the feature-labeled image dataset include pairs of feature-labeled images, a first feature-labeled image of the pair being of a scene containing the feature associated with the first feature-labeled image and a second feature-labeled image of the pair being of the same scene without the feature and including, instead of the feature, a portion of the scene which is occluded by the feature in the first feature-labelled image, the pairs of feature-labeled images being forward-propagated through the logic program.
Zeiler, like Townsend and Zhang, is directed to understanding convolutional neural networks by analyzing activations of the kernels therein (Abstract and Section 2). Specifically, Zeiler contemplates occluding different portions of an input image with a grey square and monitoring the output of a classification CNN (Section 4.2 and Fig. 7). For example, Zeiler discloses that the activation of the feature map having the strongest response in an unoccluded image of a dog with the classification “Pomeranian” shows the strongest drop in activation when the grey square covers the face of the dog (Section 4.2 and Fig. 7). In that scenario, the classifier also shows the largest drop in probability of the correct class (Section 4.2 and Fig. 7).
That is, Zeiler discloses obtaining a feature-labeled image dataset, and a record of each feature associated with each feature-labeled image in the feature-labeled image dataset (Fig. 7 shows a dataset of test samples along with the ground-truth classification label and known features therein), wherein the images in the feature-labeled image dataset comprise pairs of feature-labeled images, a first feature-labeled image of the pair being of a scene containing the feature associated with the first feature-labeled image (unoccluded image of the dog including the face “feature” of the dog) and a second feature-labeled image of the pair being of the same scene without the feature (occluded image of the same scene of the dog in which its face “feature” is covered by the grey square), the pairs of feature-labeled images being forward-propagated through the logic program (both images are fed forward through the CNN, and the activation of the kernel of interest decreases strongly, as does the probability of the correct class, when the face of the dog is occluded), as claimed.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of Townsend and Zhang to use pairs of images – one including a feature and the other occluding the feature – in the process of labeling the filter activations, as taught by Zeiler, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. More specifically, Townsend and Zhang’s use of input images for analyzing filter activations of a classification CNN as modified by Zeiler’s use of a pair of images of a same scene – one including a feature and the other occluding the feature – in the process of visualizing effects on a feature map output by a convolutional filter of a classification CNN can yield a predictable result of identifying the feature which has a strongest effect on a particular kernel of the CNN since both references teach precisely that. Thus, a person of ordinary skill would have appreciated including Zeiler’s pairs of images in Townsend and Zhang’s dataset of images for filter labeling since the claimed invention is merely a combination of old elements, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
The proposed combination of Townsend, Zhang and Zeiler does not expressly disclose that the second image of the pair is of the same scene without the feature and including, instead of the feature, a portion of the scene which is occluded by the feature in the first feature-labelled image.
Agarwal, like Townsend and Zeiler, is directed to explaining the outputs of a convolutional neural network image classifier (Abstract). Agarwal references known techniques for doing so, including a technique similar to that taught by Zeiler which “slide[s] a gray, occlusion patch across the image and record[s] the probability changes as attribution values in corresponding locations in the heatmap” (Section 3.3 and Fig. 2(b-c)). Agarwal notes that such techniques and other similar known techniques “often produce unrealistic, out-of-distribution images” that yield heatmaps that are “unreliable…and not faithful” (Section 1). Accordingly, Agarwal et al. “propose to harness a state-of-the-art generative inpainting model…to remove pixels from an input image and fill in with content that is plausible” (Section 1). For example, the inpainting model is used to “fill in the background region” of an image from which the classification target object is removed, thus arriving at an image pair including a “real” image with the target object and a “delete” image in which the target object has been removed and replaced with background from the scene (See Fig. 2a and 2d which form a real and delete image pair, and Fig. 2e and 2h which form another real and delete image pair). Agarwal discloses testing such image pairs with the CNN image classifier to evaluate how the “in-filled image x^ (where the bird has been removed) is perceptually dissimilar to the original image x”, and finding that “the inpainted images are consistently more dissimilar from the real images compared to the blurred and grayed-out images” (Section 4.2). 
That is, Agarwal discloses an image dataset comprising pairs of images, the first image of the pair being of a scene containing the feature and the second image of the pair being of the same scene without the feature and including, instead of the feature, a portion of the scene which is occluded by the feature in the first feature-labelled image (in-filled image x^ in which the target object has been replaced with inpainted background and original image x in which the target object is present are evaluated by the CNN classifier in order to explain the classifier’s outputs).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of Townsend, Zhang and Zeiler to replace Zeiler’s grey-square-based occlusion method with Agarwal’s background inpainting method to arrive at the claimed invention discussed above. Such a modification is the result of simple substitution of one known element for another producing a predictable result. More specifically, Zeiler’s grey-square-based occlusion method and Agarwal’s background inpainting method perform the same general and predictable function, the predictable function being removing salient aspects of a target object in an image in order to explain how a CNN-based classifier generates its outputs. Since each individual element and its function are shown in the prior art, albeit in separate references, the difference between the claimed subject matter and the prior art rests not on any individual element or function but in the very combination itself - that is, in the substitution of Agarwal’s background inpainting method for Zeiler’s grey-square-based occlusion method. Thus, the simple substitution of one known element for another producing a predictable result renders the claim obvious. It is predictable that such a modification would have resulted in “improving the accuracy and reliability of explanation methods” such as the one taught by Zeiler (See Section 1 of Agarwal).
As to claim 6, Townsend as modified above further teaches a non-transitory computer-readable medium including instructions which, when executed by a computer, cause the computer to carry out the method of claim 1 ([0042, 0046] of Townsend discloses “The invention may also be embodied as one or more device or apparatus programs (e.g. computer programs and computer program products) for carrying out part or all of the methods described herein. Such programs embodying the present invention may be stored on computer-readable media” and “such computer-readable media may include non-transitory computer-readable storage media”).
Independent claim 7 recites an image classification apparatus comprising: at least one memory ([0044-0047] and Fig. 13 of Townsend discloses memory 994); and at least one processor, connected to the memory ([0044-0047] and Fig. 13 of Townsend discloses processor 994), to perform the steps recited in the method of independent claim 1. Accordingly, claim 7 is rejected for reasons analogous to those discussed above in conjunction with claim 1.
Claims 2 and 8 are rejected under 35 U.S.C. 103 as being obvious over Townsend in view of Zhang, Zeiler and Agarwal and further in view of U.S. Patent Application Publication No. 2021/0232915 to Dalli et al. (hereinafter “Dalli”).
As to claim 2, Townsend as modified by Zhang, Zeiler and Agarwal does not expressly disclose that images in the feature-labeled image dataset include still frames from at least one video recording.
Dalli, like Townsend, is directed to explaining convolutional neural networks by identifying rules using kernels that have been activated (Abstract and [0049, 0052-0053, 0063, 0066-0067, 0076, 0079]) and by then labeling the kernels using a “Kernel Labeling method” ([0097-0111]). Dalli discloses that the Kernel Labeler can use video ([0107]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of Townsend, Zhang, Zeiler and Agarwal such that the kernel labeling process is performed using video frames, as taught by Dalli, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. It is predictable that the proposed modification would have provided explainable “video understanding” ([0113] of Dalli).
Claim 8 recites features nearly identical to those recited in claim 2. Accordingly, claim 8 is rejected for reasons analogous to those discussed above in conjunction with claim 2.
Claims 3 and 9 are rejected under 35 U.S.C. 103 as being obvious over Townsend in view of Zhang, Zeiler, Agarwal and Dalli and further in view of “Obstruction Level Detection of Sewer Videos Using Convolutional Neural Networks” by Gutierrez-Mondragon et al. (hereinafter “Gutierrez-Mondragon”).
As to claim 3, the proposed combination of Townsend, Zhang, Zeiler, Agarwal and Dalli does not expressly disclose that the at least one video recording was captured by a closed-circuit television, CCTV, camera. However, Gutierrez-Mondragon discloses a CNN classifier (similar to that of Townsend) which classifies the state of sewage pipes using images from a Closed-Circuit Television system (Abstract). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of Townsend, Zhang, Zeiler, Agarwal and Dalli to capture the images using a CCTV camera, as taught by Gutierrez-Mondragon, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. It is predictable that the proposed modification would have extended the explainability of CNNs to sewer videos.
Claim 9 recites features nearly identical to those recited in claim 3. Accordingly, claim 9 is rejected for reasons analogous to those discussed above in conjunction with claim 3.
Claims 12-13 are rejected under 35 U.S.C. 103 as being obvious over Townsend in view of Zhang, Zeiler, Agarwal, Dalli and Gutierrez-Mondragon and further in view of U.S. Patent Application Publication No. 2019/0370587 to Burachas et al. (hereinafter “Burachas”).
As to claim 12, the proposed combination of Townsend, Zhang, Zeiler, Agarwal, Dalli, and Gutierrez-Mondragon does not expressly disclose that the feature-labeled image dataset includes images annotated for semantic segmentation, and the record of each feature associated with each feature-labeled image in the feature-labeled image dataset includes a value corresponding to a total area occupied by the feature in the image.
Burachas, like Townsend and Zhang, is directed to explaining the results of AI models and image classification CNNs, in particular (Abstract, [0004, 0023, 0046, 0054, 0069]). Burachas discloses that semantic segmentation masks for the input images are generated along with annotated attention weights for meaningful sub-regions, and the semantic segmentation masks are used to visualize each object, part, sub-part, etc. ([0066, 0080-0088]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of Townsend, Zhang, Zeiler, Agarwal, Dalli, and Gutierrez-Mondragon to additionally include semantic segmentation masks for the images in the dataset along with annotated attention weights for each feature therein, as taught by Burachas, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. It is predictable that the proposed modification would “provide a more comprehensive and/or informative user experience that facilitates user trust and understanding of the VQA model” ([0006] of Burachas).
Claim 13 recites features nearly identical to those recited in claim 12. Accordingly, claim 13 is rejected for reasons analogous to those discussed above in conjunction with claim 12.
Claims 4 and 10 are rejected under 35 U.S.C. 103 as being obvious over Townsend in view of Zhang, Zeiler and Agarwal and further in view of U.S. Patent Application Publication No. 2019/0370587 to Burachas et al. (hereinafter “Burachas”).
As to claim 4, the proposed combination of Townsend, Zhang, Zeiler and Agarwal does not expressly disclose that the feature-labeled image dataset includes images annotated for semantic segmentation, and the record of each feature associated with each feature-labeled image in the feature-labeled image dataset includes a value corresponding to a total area occupied by the feature in the image.
Burachas, like Townsend and Zhang, is directed to explaining the results of AI models and image classification CNNs, in particular (Abstract, [0004, 0023, 0046, 0054, 0069]). Burachas discloses that semantic segmentation masks for the input images are generated along with annotated attention weights for meaningful sub-regions, and the semantic segmentation masks are used to visualize each object, part, sub-part, etc. ([0066, 0080-0088]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of Townsend, Zhang, Zeiler and Agarwal to additionally include semantic segmentation masks for the images in the dataset along with annotated attention weights for each feature therein, as taught by Burachas, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. It is predictable that the proposed modification would “provide a more comprehensive and/or informative user experience that facilitates user trust and understanding of the VQA model” ([0006] of Burachas).
Claim 10 recites features nearly identical to those recited in claim 4. Accordingly, claim 10 is rejected for reasons analogous to those discussed above in conjunction with claim 4.
Allowable Subject Matter
Claims 5, 11, and 14-15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEAN M CONNER whose telephone number is (571)272-1486. The examiner can normally be reached 10 AM - 6 PM Monday through Friday, and some Saturday afternoons.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Greg Morse can be reached at (571) 272-3838. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SEAN M CONNER/Primary Examiner, Art Unit 2663