DETAILED ACTION
Claims 1-20 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 06/07/2023, 09/23/2024 and 12/09/2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-8, 11-14 and 17-20 are rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for pre-AIA, the applicant) regards as the invention.
Claims 1 and 17 recite the limitation “the classification of the input image.” There is insufficient antecedent basis for this limitation in the claim. The limitation appears to refer to “an image classification of an input image,” and the inconsistent terminology renders the claims indefinite. For examination purposes, the examiner has interpreted “the classification of the input image” to mean “the image classification of the input image.”
Claims 3 and 11 recite the limitation “the position within the input image of its respective image patch.” There is insufficient antecedent basis for the limitation “the position” in the claim. For examination purposes, the examiner has interpreted “the position” to be “a position.”
Claims 4, 12 and 18 recite the limitation “generate low-frequency tokens and high-frequency tokens by applying a dual-tree complex wavelet transform… mix the low-frequency tokens with a first set of trainable weights to generate a low-frequency representation, and to mix the high-frequency tokens with a second set of trainable weights to generate a high-frequency representation; an inverse scatter network layer configured to combine the low-frequency representation and the high-frequency representation” which comprises relative terms and therefore renders the claim indefinite. The terms “low-frequency tokens, high-frequency tokens, a low-frequency representation, a high-frequency representation” are not defined by the claims, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The specification does not provide examples or teachings of usage within the context of “low-frequency tokens, high-frequency tokens, a low-frequency representation, a high-frequency representation.” See MPEP 2173.05(b).
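For context, the claimed arrangement can be illustrated with a minimal PyTorch-style sketch. This is purely illustrative: all names are hypothetical, and a single-level Haar split stands in for the claimed dual-tree complex wavelet transform; it is not the applicant's disclosed implementation.
```python
import torch
import torch.nn as nn

class ScatterMixer(nn.Module):
    """Illustrative only: split tokens into low/high bands, mix each band
    with its own trainable weights, then recombine (inverse of the split)."""
    def __init__(self, num_tokens: int):
        super().__init__()
        half = num_tokens // 2
        # First and second sets of trainable mixing weights.
        self.low_mix = nn.Parameter(torch.randn(half, half) / half ** 0.5)
        self.high_mix = nn.Parameter(torch.randn(half, half) / half ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim) with an even number of tokens.
        even, odd = x[:, 0::2, :], x[:, 1::2, :]
        low = (even + odd) / 2 ** 0.5    # "low-frequency tokens" (averages)
        high = (even - odd) / 2 ** 0.5   # "high-frequency tokens" (differences)
        # Mix each band across the token axis with its own trainable weights.
        low_rep = torch.einsum('nm,bmd->bnd', self.low_mix, low)
        high_rep = torch.einsum('nm,bmd->bnd', self.high_mix, high)
        # "Inverse scatter": recombine the two representations (inverse Haar).
        even_out = (low_rep + high_rep) / 2 ** 0.5
        odd_out = (low_rep - high_rep) / 2 ** 0.5
        return torch.stack((even_out, odd_out), dim=2).flatten(1, 2)

layer = ScatterMixer(num_tokens=16)
out = layer(torch.randn(2, 16, 64))   # -> (2, 16, 64)
```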
Claims 5, 13 and 19 recite the limitation “mix the low frequency tokens with the first set of trainable weights by performing tensor mixing,” which comprises relative terms and therefore renders the claim indefinite. The term “the low frequency tokens” is not defined by the claims, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The specification does not provide examples or teachings of the usage of “the low frequency tokens” in context. See MPEP 2173.05(b).
Claims 6, 14 and 20 recite the limitation “mix the high frequency tokens with the second set of trainable weights by performing Einstein mixing,” which comprises relative terms and therefore renders the claim indefinite. The term “the high frequency tokens” is not defined by the claims, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. The specification does not provide examples or teachings of the usage of “the high frequency tokens” in context. See MPEP 2173.05(b).
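Both “tensor mixing” (claims 5, 13 and 19) and “Einstein mixing” (claims 6, 14 and 20) can be read as Einstein-summation contractions of the token tensors against trainable weights. The following hypothetical sketch shows one such reading; the choice of which axis is contracted is an assumption, not something the claims specify.
```python
import torch

B, N, D = 2, 16, 64                          # batch, tokens, channels (hypothetical)
low = torch.randn(B, N, D)                   # "low-frequency tokens"
high = torch.randn(B, N, D)                  # "high-frequency tokens"

W1 = torch.randn(N, N, requires_grad=True)   # first set of trainable weights
W2 = torch.randn(D, D, requires_grad=True)   # second set of trainable weights

# One reading of "tensor mixing": contract the token axis against W1.
low_rep = torch.einsum('nm,bmd->bnd', W1, low)

# One reading of "Einstein mixing": an einsum contraction over the channel axis.
high_rep = torch.einsum('de,bnd->bne', W2, high)
```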
Claims 2 and 7-8 are also rejected due to their dependency on a rejected claim.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-3, 7-11 and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Ding (“An enhanced vision transformer with wavelet position embedding for histopathological image classification,” published March 14, 2023) in view of Yang (US 12,198,300 B2, filed February 25, 2022).
In regard to claims 1, 9 and 17, Ding teaches: A method for generating an image classification of an input image, the method comprising: (Ding, p. 3, 3 Methods "we employ two techniques to make it more suitable for histopathological image classification… Finally, the classification result is derived by the FC classifier.")
providing a plurality of image patch vectors corresponding to the input image (Ding, p. 6, 4.1.1. NCT-CRC-HE "NCT-CRC-HE [34] has two sub-datasets. 100,000 non-overlapping image patches, which are cropped from H&E-stained histopathological images..."; p. 3, 3.1. Overall architecture, "The proposed model first feeds the input into the tokenizer module to obtain the low dimensional representations [a plurality of image patch vectors] in the histopathological image. [the input image]"; the tokenizer receives image patches (cropped from an image) and generates an [image patch vector] for each image patch, and therefore generates [image patch vectors] for an image; a set of image tokens are patch vectors (often called patch embeddings or patch-level representations)) to a sequence of neural network layers configured to generate the image classification, wherein the sequence of neural network layers comprises: (Ding, p. 3, 3.1. Overall architecture, "As shown in Fig. 2, the model is an end-to-end trainable architecture that includes tokenizer, wavelet position embedding, transformer encoder, sequence pooling, and a full convolution(FC) classifier module. [a sequence of neural network layers]")
[media_image1.png: greyscale image, 664 × 1254 (Ding, Fig. 2)]
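The mapped sequence of layers can be summarized in a minimal PyTorch sketch. This is illustrative only: all dimensions are hypothetical, and Ding's wavelet position embedding and external attention are replaced by generic stand-ins.
```python
import torch
import torch.nn as nn

class HistoClassifier(nn.Module):
    """Sketch of the Fig. 2 pipeline as mapped above (hypothetical shapes)."""
    def __init__(self, dim: int = 256, num_classes: int = 9):
        super().__init__()
        # Tokenizer: three conv blocks (conv + ReLU + maxpool), per Ding 3.1.
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.ReLU(), nn.MaxPool2d(2))
        self.tokenizer = nn.Sequential(block(3, 64), block(64, 128),
                                       block(128, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
        self.pool_attn = nn.Linear(dim, 1)             # sequence pooling
        self.classifier = nn.Linear(dim, num_classes)  # FC classifier

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feats = self.tokenizer(img)                # (B, dim, H', W')
        tokens = feats.flatten(2).transpose(1, 2)  # (B, N, dim) patch vectors
        tokens = self.encoder(tokens)              # transformer encoder
        w = self.pool_attn(tokens).softmax(dim=1)  # (B, N, 1) pooling weights
        pooled = (w * tokens).sum(dim=1)           # weighted sequence pooling
        return self.classifier(pooled)             # classification logits
```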
at least one... layer configured to receive the plurality of image patch vectors and to process the plurality of image patch vectors to generate a first scatter output; (Ding, p. 3, 3.1. Overall architecture, "Wavelet position embedding has two 1D convolutions and a wave transform function…"; Fig. 2, Wavelet position embedding layer receiving patch vectors from the tokenizer to generate an output)
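A hypothetical sketch of the quoted structure follows, with two 1D convolutions around a Haar-style split that stands in for Ding's wave transform function; all parameters are assumptions.
```python
import torch
import torch.nn as nn

class WaveletPositionEmbedding(nn.Module):
    """Illustrative stand-in for the quoted structure: two 1D convolutions
    around a simple Haar-style split (not Ding's actual wave transform)."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) with N even; Conv1d expects (B, dim, N).
        x = self.conv1(tokens.transpose(1, 2))
        avg = 0.5 * (x[..., 0::2] + x[..., 1::2])   # low-pass half
        diff = 0.5 * (x[..., 0::2] - x[..., 1::2])  # high-pass half
        x = torch.cat((avg, diff), dim=-1)          # Haar-style split
        pos = self.conv2(x).transpose(1, 2)         # positional embedding
        return tokens + pos                         # augment the patch vectors
```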
at least one attention layer configured to receive the first scatter output and to process the first scatter output to generate an attention output; and (Ding, p. 3, 3.1. Overall architecture, "Then it uses the wavelet positional embedding of these feature maps as the input of vision transformer to extract long-distance information of patches... The transformer encoder block is mainly compose of an external multi-head attention module and a MLP block."; Fig. 2, Transformer Encoder layer receiving output from the Wavelet layer to generate an attention output)
a classifier head layer configured to receive the attention output and to process the first scatter output to generate the classification of the input image. (Ding, p. 3, 3.1. Overall architecture, "Finally, the classification result is derived by the FC classifier."; Fig. 2, FC layer receiving output from the Transformer encoder layer to generate the classification)
[media_image2.png: greyscale image, 444 × 418]
Ding does not teach, but Yang teaches: scatter layer (Yang, (56) "At operation 530, a complex wavelet transform (e.g., a dual-tree complex wavelet transform (DTCWT)) 530 is performed on each image of the z-stack 520... At operation 540, the image data for each of the images is separated into regions that contain details of various sizes to determine wavelet coefficients. A relatively large wavelet coefficient in a region may, for example, indicate more pronounced detail than a region with a smaller wavelet coefficient."; 'scatter' in the context of the Wavelet Transform means information spreading into multiple wavelet coefficients across different scales)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Ding to incorporate the teachings of Yang by replacing Ding's wavelet transform with a complex wavelet transform. Doing so would make it possible to detect areas of the images where sharp features and details are present. (Yang, (56) "At operation 530, a complex wavelet transform (e.g., a dual-tree complex wavelet transform (DTCWT)) 530 is performed on each image of the z-stack 520 to be able to detect areas of the images where sharp features and details are present.")
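For illustration, the cited operations 530/540 could be sketched as follows using the third-party pytorch_wavelets package (assumed available); this is not Yang's actual code, and the image size is hypothetical.
```python
import torch
from pytorch_wavelets import DTCWTForward

xfm = DTCWTForward(J=3)               # 3-scale dual-tree complex wavelet transform
img = torch.randn(1, 1, 256, 256)     # one image from the z-stack (hypothetical)
lowpass, highpass = xfm(img)          # highpass: list of complex coeffs per scale

# Operation 540: large coefficient magnitudes flag regions with sharp detail.
for scale, coeffs in enumerate(highpass):
    # coeffs: (B, C, 6 orientations, H', W', 2) with real/imag in the last axis
    magnitude = torch.linalg.norm(coeffs, dim=-1)
    print(f"scale {scale}: peak detail response {magnitude.max().item():.3f}")
```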
Claims 9 and 17 recite substantially the same limitations as claim 1; therefore, the rejection applied to claim 1 also applies to claims 9 and 17. In addition, Ding teaches: (claim 9) An image classification system implemented by one or more computers; (claim 17) One or more non-transitory computer readable media encoded with a computer program comprising instructions that when executed by the one or more computers cause the one or more computers to perform operations for (Ding, p. 6, 4.2. Implementation details "We implement the method in PyTorch and run it on a high performance computer with CPU of 35.4816 TFLOPS and GPU of 18.8 TFLOPS.")
In regard to claims 2 and 10, Ding teaches: further comprising: generating a plurality of image patches from the input image, wherein each image patch vector of the plurality of image patch vectors corresponds to one of the plurality of image patches and (Ding, p. 6, 4.1.1. NCT-CRC-HE "NCT-CRC-HE [34] has two sub-datasets. 100,000 non-overlapping image patches, [a plurality of image patches] which are cropped from H&E-stained histopathological images..."; p. 3, 3.1. Overall architecture, "The proposed model first feeds the input into the tokenizer module to obtain the low dimensional representations [a plurality of image patch vectors] in the histopathological image. [the input image]"; the tokenizer receives image patches (cropped from an image) and generates an [image patch vector] for each image patch, and therefore generates [image patch vectors] for an image) comprises a linear projection of its corresponding image patch. (Ding, p. 3, 3.1. Overall architecture, "The tokenizer module contains three convolution blocks, each block includes a single convolutional layer, ['convolution' includes a weighted sum as its linear operation, i.e., a linear projection, e.g., Σ W(filter) * I(input)] ReLU activation and maxpool operation.")
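A minimal sketch of the noted equivalence: a strided 2D convolution realizes the per-patch weighted sum, i.e., a linear projection of each patch (standard ViT-style patch embedding; all sizes are hypothetical).
```python
import torch
import torch.nn as nn

# A weighted sum over each patch is a linear projection; a strided Conv2d
# realizes it in one call (illustrative, not Ding's tokenizer).
patch, dim = 16, 256
proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # Σ W(filter) * I(input)

img = torch.randn(1, 3, 224, 224)               # input image (hypothetical size)
vectors = proj(img).flatten(2).transpose(1, 2)  # (1, 196, 256) patch vectors
```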
In regard to claims 3 and 11, Ding teaches: further comprising: augmenting each patch vector to include a positional embedding that corresponds to the position within the input image of its respective image patch. (Ding, p. 3, 3.1. Overall architecture, "Then it uses the wavelet positional embedding of these feature maps as the input of vision transformer to extract long-distance information of patches. [the position within the input image of its respective image patch]")
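A hypothetical sketch of the mapped limitation: each patch vector is augmented with a learned embedding indexed by its patch position within the input image.
```python
import torch
import torch.nn as nn

# Illustrative only: one learned embedding per patch position.
num_patches, dim = 196, 256
pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

patch_vectors = torch.randn(2, num_patches, dim)
augmented = patch_vectors + pos_embed   # position-aware patch vectors
```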
In regard to claims 7 and 15, Ding teaches: wherein the at least one attention layer comprises a multi-headed self-attention sublayer coupled to a multi-layer perceptron sublayer. (Ding, p. 5, 3.3. External attention "The differences between the original attention module and the employed attention module are displayed in Fig. 4... It can be inferred from Eq. (4) that the self-attention mechanism only considers the relationship between elements in one sample and neglects the potential relation between elements in different samples, which would limit the power of self-attention. From Eq. 5, we can note that is weight-sharing in the dataset and implicitly learns the characteristics of all training samples. [including itself]"; p. 5, Fig. 4 "Difference between two attention modules… Right is the proposed external attention module."; p. 6, Fig. 5 "Multi-head external attention."; see Fig. 2, Norm + EMSA [external memory self-attention, a multi-headed self-attention sublayer] coupled to Norm + MLP [a multi-layer perceptron sublayer])
In regard to claims 8 and 16, Ding teaches: wherein the multi-headed self-attention sublayer and the multi-layer perceptron sublayer each further include layer normalization sublayers. (Ding, p. 5, 3.3. External attention "It can be normalized in a similar way to self-attention."; see Fig. 2, Norm + EMSA, Norm + MLP, where Norm as a normalization sublayer)
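The mapped Norm + EMSA / Norm + MLP structure corresponds to a standard pre-norm encoder block. A generic PyTorch sketch follows; it uses ordinary multi-head self-attention as a stand-in, not Ding's external attention.
```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Generic pre-norm block matching the Fig. 2 mapping (illustrative)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)   # layer normalization sublayer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)   # layer normalization sublayer
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention sublayer
        x = x + self.mlp(self.norm2(x))                    # MLP sublayer
        return x
```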
Conclusion
Yang teaches (Yang, (56) "At operation 530, a complex wavelet transform (e.g., a dual-tree complex wavelet transform (DTCWT)) 530 is performed on each image of the z-stack 520 to be able to detect areas of the images where sharp features and details are present… At operation 550, the image data from the images is fused together based on the maximum values of the wavelet coefficients.... At operation 570, an inverse complex wavelet transform 570 is applied to generate a final fused image 580.")
Yang teaches the DTCWT at operation 530, image fusion at operation 550, and the inverse DTCWT at operation 570, but does not teach the limitation "mix the low-frequency tokens with a first set of trainable weights to generate a low-frequency representation, and to mix the high-frequency tokens with a second set of trainable weights to generate a high-frequency representation" recited in claim 4. Yang does not teach the two sets of trainable weights used for mixing the low- or high-frequency tokens.
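For reference, the cited 530 → 550 → 570 flow could be sketched as follows, again using the third-party pytorch_wavelets package (assumed available); a two-image "z-stack" stands in for operation 520, and the maximum-magnitude fusion rule is a simplification of Yang's description.
```python
import torch
from pytorch_wavelets import DTCWTForward, DTCWTInverse

xfm, ifm = DTCWTForward(J=3), DTCWTInverse()
imgs = [torch.randn(1, 1, 256, 256) for _ in range(2)]   # hypothetical z-stack
decomps = [xfm(im) for im in imgs]            # operation 530: DTCWT per image

# Operation 550: fuse by keeping coefficients with the larger magnitude.
fused_low = torch.maximum(decomps[0][0], decomps[1][0])
fused_high = []
for ca, cb in zip(decomps[0][1], decomps[1][1]):
    mag_a = torch.linalg.norm(ca, dim=-1, keepdim=True)
    mag_b = torch.linalg.norm(cb, dim=-1, keepdim=True)
    fused_high.append(torch.where(mag_a >= mag_b, ca, cb))

fused_img = ifm((fused_low, fused_high))      # operation 570: inverse DTCWT
```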
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SU-TING CHUANG whose telephone number is (408)918-7519. The examiner can normally be reached Monday - Thursday 8-5 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed can be reached at (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SU-TING CHUANG/Examiner, Art Unit 2146