Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on March 1, 2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Objections
Claim 4 is objected to because of the following informalities:
Claim 4 recites on line 8 “partitioned features of the of the plurality…”, which contains a duplicated “of the”.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 8 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claim 8 recites the limitations “the encoder” and “the image decoder” in line 2, and “a plurality of classes” in lines 2-3. There is insufficient antecedent basis for these limitations in the claim. For “the encoder,” it is unclear whether this is the “image encoder” of claim 2 or a different encoder altogether, as claim 1 only recites “encoding the image” and does not specify multiple encoders.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, 10-13, 16, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Deng et al. (H. Deng, “Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization,” 2023, arXiv:2308.15939, https://arxiv.org/abs/2308.15939, hereinafter “Deng”) in view of Harary et al. (U.S. Patent Publication No. 2025/005727 A1, hereinafter “Harary”).
Regarding claim 1, Deng teaches a computer-implemented method for detecting an anomaly in a patch of an image (Page 1, Abstract: With both training-free adaptation (TFA) and test-time adaptation (TTA), we significantly exploit the potential of Contrastive Language-Image Pre-Training (CLIP) for zero-shot anomaly localization and demonstrate the effectiveness of our proposed methods on various datasets.);
collecting a first text encoding of a first text prompt in a latent space (Page 3, Col. 1, ¶ 1: Anomaly detection involves the semantic concepts of ”normal” and ” anomaly”, so for a test class we can simply define two textual prompts i.e. ”a photo of a normal [CLASS]” and ”a photo of an abnormal [CLASS]”, and extract the corresponding text tokens [t+, t-].);
collecting a second text encoding of a second text prompt in the latent space (Page 3, Col. 1, ¶ 1: Anomaly detection involves the semantic concepts of ”normal” and ” anomaly”, so for a test class we can simply define two textual prompts i.e. ”a photo of a normal [CLASS]” and ”a photo of an abnormal [CLASS]”, and extract the corresponding text tokens [t+, t-].);
encoding the image to produce features of the image (Figure 2; Page 2, Col. 2, ¶ 2: Specifically, given an unknown image and a pre-defined text, the visual encoder and the text encoder output the visual class token v ϵ RC and the text token t ϵ RC, respectively. Here, C represents the feature dimension.);
partitioning the features of the image into feature patches (Page 3, Col. 1, ¶ 2: Although CLIP is only trained to match the global content of an image with text, the final layer of the visual encoder has a set of patch tokens P = {p1, …, pM|pi ϵ RC} that potentially contain image local information in the patch level.);
projecting each of the feature patches into the latent space using a projector operator (Figure 2; Page 3, Col. 1, ¶ 2: Although CLIP is only trained to match the global content of an image with text, the final layer of the visual encoder has a set of patch tokens P = {p1, …, pM|pi ϵ RC} that potentially contain image local information in the patch level. For a patch token pi, the local anomaly score is computed as:
[Equation image (media_image1.png): local anomaly score formula]
), wherein the projector operator is trained to project normal feature patches of normal images closer to the first text encoding than to the second text encoding while projecting noisy feature patches of the normal images closer to the second text encoding than to the first text encoding (Page 3, Col. 1, ¶ 2: Although CLIP is only trained to match the global content of an image with text, the final layer of the visual encoder has a set of patch tokens P = {p1, …, pM|pi ϵ RC} that potentially contain image local information in the patch level. For a patch token pi, the local anomaly score is computed as:
[Equation image (media_image1.png): local anomaly score formula]
; Page 4, Col. 1, Section “Contrastive-State”: Thus, a series of opposing state words, such as “perfect” vs. “imperfect” and “with flaw” vs. “without flaw”, allow visual tokens to be matched with their preferred states.); and
comparing the projection of each of the feature patches with the first text encoding and the second text encoding to detect the anomaly when the projection of a feature patch from the feature patches is closer to the second text encoding than to the first text encoding (Page 2, Col. 2, ¶ 3: The cosine distance between v and t, denoted by < v, t >, quantifies the similarity between the image and the class concept.; Page 3, Col. 1, ¶ 1: Anomaly detection involves the semantic concepts of ”normal” and ”anomaly”, so for a test class we can simply define two textual prompts i.e. ”a photo of a normal [CLASS]” and ”a photo of an abnormal [CLASS]”, and extract the corresponding text tokens [t+, t-].; Page 3, Col. 1, ¶ 2: Although CLIP is only trained to match the global content of an image with text, the final layer of the visual encoder has a set of patch tokens P = {p1, …, pM|pi ϵ RC} that potentially contain image local information in the patch level. For a patch token pi, the local anomaly score is computed as:
[Equation image (media_image1.png): local anomaly score formula]
).
Deng does not explicitly teach wherein the method uses a processor coupled with stored instructions implementing steps of the method.
However, Harary does teach wherein the method uses a processor coupled with stored instructions implementing steps of the method (¶ 0005: Another embodiment in this disclosure provide non-transitory computer-readable mediums containing computer program code that, when executed by operation of one or more computer processors, performs operations, including processing).
Deng and Harary are considered to be analogous art as both pertain to text-based image anomaly detection. Therefore, it would have been obvious to one of ordinary skill in the art to combine the vision-language model adapted for unified zero-shot anomaly localization (as taught by Deng) and the text-based image anomaly detection system (as taught by Harary) before the effective filing date of the claimed invention. The motivation for this combination of references would be that Harary uses a weight to eliminate or reduce the discrepancy between the query image embedding and its nearest neighbors to improve the accuracy of the anomaly detection model (see ¶ 0056).
This motivation for the combination of Deng and Harary is supported by KSR exemplary rationale (G): “Some teaching, suggestion, or motivation in the prior art that would have led one of ordinary skill to modify the prior art reference or to combine prior art reference teachings to arrive at the claimed invention.” MPEP 2141(III).
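Examiner’s note: for purposes of illustration only, the patch-level scoring quoted from Deng above may be sketched as follows (a minimal sketch assuming L2-normalized tokens and a two-way softmax over the cosine similarities; the function and variable names are the examiner’s, not Deng’s):

```python
import numpy as np

def local_anomaly_scores(patch_tokens, t_pos, t_neg):
    """Score each feature patch against the 'normal' text encoding (t_pos)
    and the 'abnormal' text encoding (t_neg); higher score = more anomalous."""
    # Normalize so that dot products equal cosine similarities <pi, t>.
    P = patch_tokens / np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    tp = t_pos / np.linalg.norm(t_pos)
    tn = t_neg / np.linalg.norm(t_neg)
    sim_pos = P @ tp  # <pi, t+> for every patch
    sim_neg = P @ tn  # <pi, t-> for every patch
    # Two-way softmax: score > 0.5 exactly when the patch is closer to t-.
    return np.exp(sim_neg) / (np.exp(sim_pos) + np.exp(sim_neg))

rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))  # M=4 patch tokens, C=8 (hypothetical sizes)
t_pos = rng.normal(size=8)
t_neg = rng.normal(size=8)
scores = local_anomaly_scores(patches, t_pos, t_neg)
anomalous = scores > 0.5  # patches projected closer to the "abnormal" encoding
```

A patch is thus flagged anomalous when its projection is closer to the second text encoding than to the first, consistent with the claim 1 mapping above.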
Regarding claim 2, the Deng and Harary combination teaches the method of claim 1.
Additionally, Deng teaches wherein an image encoder is trained to encode global features of the image into the latent space shared by the image encoder and a text encoder of a visual-language foundation model (Figure 1 and 2; Page 1: The paradigm of zero-shot anomaly localization. Our adapted vision-language models can detect anomalies from any object. The anomaly score is derived from the distance between textual and visual tokens.; Page 2, Col. 2, ¶ 2: Specifically, given an unknown image and a pre-defined text, the visual encoder and the text encoder output the visual class token v ϵ RC and the text token t ϵ RC, respectively. Here, C represents the feature dimension.).
Regarding claim 10, the Deng and Harary combination teaches the method of claim 1.
Additionally, Deng teaches wherein the method further comprises:
determining a dot product between the projection of the feature patch with the first text encoding to produce a first score (Page 2, Col. 2, ¶ 3: The cosine distance between v and t, denoted by < v, t >, quantifies the similarity between the image and the class concept.; Page 3, Col. 1, ¶ 2: For a patch token pi, the local anomaly score is computed as:
[Equation image (media_image1.png): local anomaly score formula]
; Examiner’s note: < pi, t+ > is the cosine distance between the feature patch and the first encoding);
determining a dot product between the projection of the feature patch with the second text encoding to produce a second score (Page 2, Col. 2, ¶ 3: The cosine distance between v and t, denoted by < v, t >, quantifies the similarity between the image and the class concept.; Page 3, Col. 1, ¶ 2: For a patch token pi, the local anomaly score is computed as:
[Equation image (media_image1.png): local anomaly score formula]
; Examiner’s note: < pi, t- > is the cosine distance between the feature patch and the second encoding); and
detecting the anomaly in the feature patch based on the first score and the second score (Page 2, Col. 2, ¶ 3: The cosine distance between v and t, denoted by < v, t >, quantifies the similarity between the image and the class concept.; Page 3, Col. 1, ¶ 2: For a patch token pi, the local anomaly score is computed as:
[Equation image (media_image1.png): local anomaly score formula]
; Examiner’s note: SAL(pi) is an anomaly score that is based on the first and second score.).
Regarding claim 11, the Deng and Harary combination teaches the method of claim 1.
Additionally, Deng teaches wherein the first text prompt is a semantic name of a class of the image, and wherein the second text prompt is a modification of the first text prompt (Page 3, Col. 1, ¶ 1: Anomaly detection involves the semantic concepts of ”normal” and ” anomaly”, so for a test class we can simply define two textual prompts i.e. ”a photo of a normal [CLASS]” and ”a photo of an abnormal [CLASS]”, and extract the corresponding text tokens [t+, t-].).
Regarding claim 12, the Deng and Harary combination teaches the method of claim 1.
Additionally, Deng teaches wherein the first text prompt is a semantic name of a class of the image, and wherein the second text prompt is a concatenation of a modifier word with the semantic name of the class of the image (Page 3, Col. 1, ¶ 1: Anomaly detection involves the semantic concepts of ”normal” and ” anomaly”, so for a test class we can simply define two textual prompts i.e. ”a photo of a normal [CLASS]” and ”a photo of an abnormal [CLASS]”, and extract the corresponding text tokens [t+, t-].; Page 4, Col. 1, Section “Contrastive State”: Thus, a series of opposing state words, such as ”perfect” vs. ”imperfect” and ”with flaw” vs. ”without flaw”, allow visual tokens to be matched with their preferred states… For example, the ”wood” dataset in MVTec (Bergmann et al. 2019) includes the defect type of ”hole”, so the corresponding contrastive-state prompt could be ”with a hole” vs. ”without a hole”.).
Regarding claim 13, the Deng and Harary combination teaches the method of claim 1.
Additionally, Deng teaches wherein the first text prompt is a semantic name of a class of the image learned for generating images of the class of the image with a visual-language foundation model (Examiner is interpreting this claim to mean “wherein the first text prompt is a semantic name of a class of image which the model has been trained on”.; Figure 2; Page 1, Col. 2, ¶ 3: By learning on millions of image-text pairs, CLIP has an excellent zero-shot transfer capability for downstream tasks.; Page 2, Col. 1, ¶ 1: On the other hand, as the zero-shot recognition capability of CLIP depends on the quality of textual prompts (Radford et al. 2021; Jeong et al. 2023), we hypothesize that fine-grained detection can benefit from a more precise prompt design and propose a unified domain-aware contrastive state prompt template, A [domain] photo of a [state] [class], for generating exhaustive prompts.).
Regarding claim 16, claim 16 has been analyzed with regard to claim 1 and is rejected for the same reasons of obviousness as set forth above, as well as in accordance with Harary’s further teaching on:
wherein the system comprises a processor and a memory having instructions stored thereon that cause the processor to (¶ 0005: Another embodiment in this disclosure provide non-transitory computer-readable mediums containing computer program code that, when executed by operation of one or more computer processors, performs operations, including processing):
Regarding claim 17, the Deng and Harary combination teaches the system of claim 16.
Additionally, Deng teaches wherein the system further comprises:
a text encoder trained to encode the first text prompt as the first text encoding and the second text prompt as the second text encoding in the latent space of a visual-language foundation model (Figure 1; Page 2, Col. 2, ¶ 3: CLIP includes a visual encoder and a text encoder to extract visual features and text features respectively… Specifically, given an unknown image and a pre-defined text, the visual encoder and the text encoder output the visual class token v ϵ RC and the text token t ϵ RC, respectively. Here, C represents the feature dimension.); and
an image encoder trained to encode global features of the image into the latent space shared by the image encoder and the text encoder of the visual-language foundation model (Figure 1; Page 2, Col. 2, ¶ 3: CLIP includes a visual encoder and a text encoder to extract visual features and text features respectively… Specifically, given an unknown image and a pre-defined text, the visual encoder and the text encoder output the visual class token v ϵ RC and the text token t ϵ RC, respectively. Here, C represents the feature dimension.; Page 3, Col. 1, ¶ 2: Although CLIP is only trained to match the global content of an image with text, the final layer of the visual encoder has a set of patch tokens P = {p1, …, pM|pi ϵ RC} that potentially contain image local information in the patch level.).
Regarding claim 20, claim 20 has been analyzed with regard to claim 1 and is rejected for the same reasons of obviousness as set forth above, as well as in accordance with Harary’s further teaching on:
A non-transitory computer readable storage medium embodied thereon a program executable by a processor for performing a method (¶ 0005: Another embodiment in this disclosure provide non-transitory computer-readable mediums containing computer program code that, when executed by operation of one or more computer processors, performs operations, including processing), the method comprising:
Claims 3, 4, and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Deng et al. (H. Deng, “Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization,” 2023, arXiv:2308.15939, https://arxiv.org/abs/2308.15939, hereinafter “Deng”) in view of Harary et al. (U.S. Patent Publication No. 2025/005727 A1, hereinafter “Harary”) and further in view of Schulter et al. (U.S. Patent Publication No. 2023/0281826 A1, hereinafter “Schulter”).
Regarding claim 3, the Deng and Harary combination teaches the method of claim 2.
Neither Deng nor Harary explicitly teaches wherein the method further comprises: collecting a plurality of normal images associated with a class; encoding, using the image encoder, the plurality of normal images to produce features of the plurality of normal images; processing, using an image decoder, the features of the plurality of normal images conditioned on a pseudo class name associated with the class, wherein the pseudo class name is the first text prompt; and training the image decoder to learn the pseudo class name as the first text encoding.
However, Schulter does teach wherein the method further comprises:
collecting a plurality of normal images associated with a class (¶ 0033: When training the segmentation model, an image I is sampled from one of K datasets Dk…);
encoding, using the image encoder, the plurality of normal images to produce features of the plurality of normal images (¶ 0027: A panoptic segmentation model takes an image I as input and extracts multi-scale features using a neural network. A transformer encoder-decoder may be used to predict a set of N masks…);
processing, using an image decoder, the features of the plurality of normal images conditioned on a pseudo class name associated with the class, wherein the pseudo class name is the first text prompt (¶ 0031: Instead of directly predicting a probability distribution pi- for an image, the embedding model predicts an embedding vector eiI ϵ Rd for each query i.; ¶ 0032: The text embeddings ecT for the class c may be determined based on the input prompt for the text-encoder.); and
training the image decoder to learn the pseudo class name as the first text encoding (¶ 0033: When training the segmentation model, an image I is sampled from one of K datasets Dk, where k ϵ {1, . . . , K} which also defines the labelspace LK. Text embeddings ec T are computed for c ϵ Lk – the embeddings may be predetermined if prompts are not learned.).
Deng and Schulter are considered to be analogous art as both pertain to text-based image segmentation. Therefore, it would have been obvious to one of ordinary skill in the art to combine the vision-language model adapted for unified zero-shot anomaly localization (as taught by Deng) and the panoptic segmentation system (as taught by Schulter) before the effective filing date of the claimed invention. The motivation for this combination of references would be that Schulter improves generalization by distilling information from the image encoder into the embedding space and rescores unseen categories during training to offset biasing toward “no-object” classifications for segments of the images (see ¶ 0036 and ¶ 0037).
This motivation for the combination of Deng, Harary, and Schulter is supported by KSR exemplary rationale (G): “Some teaching, suggestion, or motivation in the prior art that would have led one of ordinary skill to modify the prior art reference or to combine prior art reference teachings to arrive at the claimed invention.” MPEP 2141(III).
Regarding claim 4, the Deng, Harary, and Schulter combination teaches the method of claim 3.
Additionally, Deng teaches wherein the method further comprises:
obtaining, using the text encoder, encodings of a pair of contradictory class names in the latent space, the pair of contradictory class names comprising the first text prompt and the second text prompt (Page 3, Col. 1, ¶ 1: Anomaly detection involves the semantic concepts of ”normal” and ” anomaly”, so for a test class we can simply define two textual prompts i.e. ”a photo of a normal [CLASS]” and ”a photo of an abnormal [CLASS]”, and extract the corresponding text tokens [t+, t-].);
obtaining, using the image encoder, the features of the plurality of normal images (Figure 2; Page 2, Col. 2, ¶ 2: Specifically, given an unknown image and a pre-defined text, the visual encoder and the text encoder output the visual class token v ϵ RC and the text token t ϵ RC, respectively. Here, C represents the feature dimension.);
partitioning the features of the plurality of normal images (Page 3, Col. 1, ¶ 2: Although CLIP is only trained to match the global content of an image with text, the final layer of the visual encoder has a set of patch tokens P = {p1, …, pM|pi ϵ RC} that potentially contain image local information in the patch level.);
training the projector operator to project the partitioned features of the plurality of normal images and the abnormal features within the latent space, wherein the partitioned features of the plurality of normal images are closer to the first text prompt and the abnormal features are closer to the second text prompt (Page 4, Col. 2, Section “Test-Time Adaptation (TTA)”: Mathematically, we denote the set of text prompt tokens as T ϵ R2N x C, where ti and ti+N correspond to paired normal and abnormal tokens for 0 < i ≤ N… After obtaining the set of patch tokens P ϵ RM x C for a test image, the online-adapted patch token set can be written as:
[Equation image (media_image2.png): online-adapted patch token formula]
where α is the softmax activation function and ω ϵ R2N x C denotes the learnable parameters initialized by the text tokens T.).
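Examiner’s note: one plausible reading of the quoted test-time adaptation step (the exact formula appears in the equation image reproduced from Deng; this sketch and its names are the examiner’s assumptions, not Deng’s notation) is that each patch token is re-expressed as a softmax-weighted combination of the learnable token set ω:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adapt_patch_tokens(P, omega):
    """Re-express each of the M patch tokens (P: M x C) as a convex
    combination of the 2N learnable tokens (omega: 2N x C), which are
    initialized from the text tokens T per the quoted passage."""
    attn = softmax(P @ omega.T)  # M x 2N weights over the token set
    return attn @ omega          # M x C online-adapted patch tokens

P = np.random.default_rng(1).normal(size=(5, 8))      # M=5, C=8 (hypothetical)
omega = np.random.default_rng(2).normal(size=(6, 8))  # 2N=6 (hypothetical)
P_adapted = adapt_patch_tokens(P, omega)
```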
Additionally, You teaches introducing noise to at least some of the partitioned features of the of the plurality of normal images to generate abnormal features (Fig. 3: Synthetic anomalies by adding confetti noise on normal samples.; Page 7, ¶ 1: In anomaly-available case, following [22], we synthesize anomalies by adding confetti noise on normal samples (Fig. 3).).
Regarding claim 7, the Deng, Harary, and Schulter combination teaches the method of claim 3.
Additionally, Schulter teaches wherein the image decoder is trained to learn encodings of a plurality of pseudo class names for a plurality of classes within the latent space (¶ 0033: When training the segmentation model, an image I is sampled from one of K datasets Dk, where k ϵ {1, . . . , K} which also defines the labelspace LK. Text embeddings ec T are computed for c ϵ Lk – the embeddings may be predetermined if prompts are not learned. The predefined embedding space of the vision-and-language model handles the different label spaces, where different categories having different names corresponding to respective locations in the embedding space. Different names of the same semantic category, such as "sofa" and "couch," will be located close to one another due to semantic training on large-scale natural image-text pairs.).
Claims 8, 9, 14, 18, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Deng et al. (H. Deng, “Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization,” 2023, arXiv:2308.15939, https://arxiv.org/abs/2308.15939, hereinafter “Deng”) in view of Harary et al. (U.S. Patent Publication No. 2025/005727 A1, hereinafter “Harary”) and further in view of You et al. (Z. You, “ADTR: Anomaly Detection Transformer with Feature Reconstruction,” 2022, arXiv:2209.01816, https://arxiv.org/abs/2209.01816, hereinafter “You”).
Regarding claim 8, the Deng and Harary combination teaches the method of claim 2.
Additionally, Harary teaches wherein training dataset for the encoder and the text encoder includes a plurality of images of the plurality of classes in a long-tailed description (¶ 0068: In the illustrate example, the training module 550 uses both the anomaly images 505 and normal images 502 to train a machine learning model 555 to distinguish between two categories. In some embodiments, the training module 550 may properly label each of the images ("normal" or "anomalous") and split the labeled dataset into training, validation and testing sets.; Examiner’s note: As there is no indication of what a long-tailed description is in the claim, the examiner is interpreting this under broadest reasonable interpretation to mean a dataset comprising a distribution of images.).
Neither Deng or Harary explicitly teach training a decoder.
However, You teaches wherein training dataset for the encoder, the text encoder and the image decoder includes a plurality of images of the plurality of classes in a long-tailed description (Page 6, Section 4.1 “Dataset”: MVTec-AD [4] is a multi-category, multi-defect, industrial anomaly detection dataset with 15 categories. The ground-truth includes both image labels and anomaly segmentation.; Page 7, ¶ 1: CIFAR-10 [18] is a classical classification dataset with 10 classes. Each class has 5000 images for training and 1000 images for testing. In normal-sample-only case, following [19], the training set of one class is used for training, and the test set contains normal images of the same class and the same number of anomaly images randomly sampled from other classes.).
Deng and You are considered to be analogous art as both pertain to text-based image anomaly detection. Therefore, it would have been obvious to one of ordinary skill in the art to combine the vision-language model adapted for unified zero-shot anomaly localization (as taught by Deng) and the anomaly detection transformer (as taught by You) before the effective filing date of the claimed invention. The motivation for this combination of references would be that You exploits the transformer’s limited ability to reconstruct anomalies, such that anomalies can be detected easily once reconstruction fails (see Abstract).
This motivation for the combination of Deng, Harary, and You is supported by KSR exemplary rationale (G): “Some teaching, suggestion, or motivation in the prior art that would have led one of ordinary skill to modify the prior art reference or to combine prior art reference teachings to arrive at the claimed invention.” MPEP 2141(III).
Regarding claim 9, the Deng and Harary combination teaches the method of claim 1.
Additionally, You teaches wherein the image encoder is a deep neural network including a sequence of layers, wherein each layer of the sequence of layers produces image features, and wherein the features of the image are formed by combining image features of different layers (Figure 2: Overview of our method: Embedding: (a). a pre-trained CNN backbone is applied to extract the multi-scale features. (b) Reconstruction: a transformer is utilized to reconstruct the feature tokens with an auxiliary learnable query embedding.; Page 3, Section 3.1 “Architecture”: A frozen pre-trained CNN backbone is first utilized for feature extraction.; Page 4, Section “Reconstruction”: The transformer encoder embeds the input feature tokens into a latent feature space. Each encoder layer follows the standard architecture [33] with multi-head attention, feed forward network (FFN), residual connection, and normalization.).
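Examiner’s note: the combination of image features from different layers described in the mapping above may be sketched, under the assumption of simple channel-wise concatenation of equally sized per-layer feature maps (the names and shapes are hypothetical, not You’s implementation), as:

```python
import numpy as np

def combine_layer_features(layer_feats):
    """Form the features of the image by combining the image features
    produced by different layers of the encoder (here, channel-wise
    concatenation; any spatial resizing is assumed done upstream)."""
    return np.concatenate(layer_feats, axis=-1)

# Three hypothetical layers, each yielding 16 patch tokens with 4/8/16 channels.
feats = [np.ones((16, 4)), np.ones((16, 8)), np.ones((16, 16))]
combined = combine_layer_features(feats)  # 16 patches x 28 channels
```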
Regarding claim 14, the Deng and Harary combination teaches the method of claim 1.
Additionally, You teaches wherein the method further comprises:
partitioning the image into patches corresponding to the feature patches (Figure 2: Overview of our method: Embedding: (a). a pre-trained CNN backbone is applied to extract the multi-scale features.);
reconstructing, using a reconstruction model, each of the feature patches of the image (Figure 2: Overview of our method: (b) Reconstruction: a transformer is utilized to reconstruct the feature tokens with an auxiliary learnable query embedding.);
comparing the reconstructed feature patches with the corresponding partitions of the feature patches to produce reconstruction scores (Figure 2: Overview of our method: (c) Comparison: our approach is compatible with both normal-sample-only case and anomaly-available case. The anomaly score maps are obtained through the differences between extracted and reconstructed features.); and
detecting the anomaly based on the reconstruction scores (Figure 2: Overview of our method: (c) Comparison: our approach is compatible with both normal-sample-only case and anomaly-available case. The anomaly score maps are obtained through the differences between extracted and reconstructed features.).
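Examiner’s note: the comparison step quoted from You may be sketched as a per-patch distance between extracted and reconstructed features (an illustration only; You’s actual score maps are defined in the cited paper, and the values below are hypothetical):

```python
import numpy as np

def reconstruction_scores(extracted, reconstructed):
    """Per-patch anomaly scores: the larger the reconstruction error
    for a feature patch, the more anomalous that patch is considered."""
    return np.linalg.norm(extracted - reconstructed, axis=-1)

extracted = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
reconstructed = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
scores = reconstruction_scores(extracted, reconstructed)
anomalies = scores > 1.0  # the poorly reconstructed third patch is flagged
```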
Regarding claim 18, the Deng and Harary combination teaches the system of claim 17.
Additionally, You teaches wherein the image encoder is a deep neural network including a sequence of layers, wherein each layer of the sequence of layers produces image features, and wherein the features of the image are formed by combining image features of different layers (Figure 2: Overview of our method: Embedding: (a). a pre-trained CNN backbone is applied to extract the multi-scale features. (b) Reconstruction: a transformer is utilized to reconstruct the feature tokens with an auxiliary learnable query embedding.; Page 3, Section 3.1 “Architecture”: A frozen pre-trained CNN backbone is first utilized for feature extraction.; Page 4, Section “Reconstruction”: The transformer encoder embeds the input feature tokens into a latent feature space. Each encoder layer follows the standard architecture [33] with multi-head attention, feed forward network (FFN), residual connection, and normalization.).
Regarding claim 19, claim 19 has been analyzed with regard to claim 14 and is rejected for the same reasons of obviousness as used above.
Claims 5 and 15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Schulter et al (U.S. Patent Publication No. 2023/0281977 A1) teaches a system for detecting faults in an image by embedding captured images with text embeddings. Semantic information is then generated for a region of the image corresponding to a predetermined static object using the embedded image and camera faults are identified based on the semantic information and the semantic information of the predetermined static object.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANDREW JONES, whose telephone number is (703) 756-4573. The examiner can normally be reached Monday-Friday, 8:00-5:00 EST, off every other Friday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella can be reached at (571) 272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ANDREW B. JONES/Examiner, Art Unit 2667
/MATTHEW C BELLA/Supervisory Patent Examiner, Art Unit 2667