DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claims 16-20 are objected to because of the following informalities: Claims 16-20 should refer back to “The non-transitory computer readable storage device”. Appropriate correction is required.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 8, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over “Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers” to Long in view of US 20230090941 to Li.
Regarding claim 1, Long discloses a method, comprising:
receiving input data into a transformer model (page 2, second column, section 2 “Vision transformers” (model); page 3, second column, section 3; input image into vision transformers; page 4, first column, section 4.1 “DeiT-S model”);
processing the input data in the transformer model to obtain tokens for a tokenized input (page 3, second column, section 3; page 4, first column, top paragraph; tokens xi are obtained from the input image after processing by the transformer);
selecting a pruning method for the input data from (page 8, section 5.3; different pruning methods can be selected, including attentive token preservation, token packing, and token clustering; Table 3 shows the result of each selected method):
removing (page 1, second column, last paragraph; page 3, bottom of first column; page 3, second column, top; page 8, section 5.3; importance-based pruning discards (removes) inattentive tokens; “attentive token preservation”);
packaging (page 8, second column, top paragraph; inattentive tokens are packed into one token); and
merging (page 5, first column, section 4.4 token merger; second column, top paragraph; merging of inattentive tokens);
pruning the tokens using a token pruning module, which performs the selected token pruning method (page 8, section 5.3; Table 3 shows the different pruning modules, such as the attentive token preservation module (removing), the inattentive token pack module (packaging), and the inattentive token clustering module (merging); one of the modules is used for pruning; the result of pruning using each selected pruning method is shown in Table 3); and
outputting pruned data (page 5, second column, top paragraph; outputting of the pruned token sequence; see also Fig. 3 showing the output of token merging after pruning).
However, Long does not disclose receiving input data into a Swin transformer model;
processing the input data in the Swin transformer model to obtain tokens.
Li discloses receiving input data into a Swin transformer model (paragraph 29, 34; input image into the Swin transformer model);
processing the input data in the Swin transformer model to obtain tokens (paragraph 29, 34-35; tokens (features) extracted by processing the input image via the Swin model).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Li to process the input data using a Swin transformer model.
The motivation to combine the references is to provide a gating method that transfers features from earlier frames to later frames, reducing the amount of redundant computation of features when processing the input data in the Swin transformer model (paragraphs 43-44).
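For illustration of the mapped limitations only, the three pruning strategies compared in Long's Table 3 can be sketched as follows (a minimal sketch, assuming tokens as an (n, d) array with per-token importance scores; the function and parameter names are hypothetical and not from the reference):

```python
# Illustrative sketch of the three mapped pruning strategies:
# attentive token preservation (removing), inattentive token packing
# (packaging), and inattentive token clustering (merging).
import numpy as np

def prune_tokens(tokens, scores, keep, method="remove"):
    """tokens: (n, d) array; scores: (n,) importance; keep: tokens to retain."""
    order = np.argsort(scores)[::-1]              # most important first
    attentive, inattentive = order[:keep], order[keep:]
    if method == "remove":
        # Discard inattentive tokens entirely.
        return tokens[attentive]
    if method == "pack":
        # Fuse all inattentive tokens into a single score-weighted token.
        w = scores[inattentive] / scores[inattentive].sum()
        packed = (w[:, None] * tokens[inattentive]).sum(axis=0)
        return np.vstack([tokens[attentive], packed])
    if method == "merge":
        # Average each inattentive token into its nearest attentive token.
        out = tokens[attentive].copy()
        for i in inattentive:
            j = np.argmin(((out - tokens[i]) ** 2).sum(axis=1))
            out[j] = (out[j] + tokens[i]) / 2
        return out
    raise ValueError(method)
```

Under this sketch, "remove" and "merge" yield `keep` tokens while "pack" yields `keep + 1` (the extra packed token), matching the counts discussed in the mappings above.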
Regarding claim 8, Long discloses a system (see Fig. 3 page 3), comprising:
perform operations that include:
receiving input data into a transformer model (page 2, second column, section 2 “Vision transformers” (model); page 3, second column, section 3; input image into vision transformers; page 4, first column, section 4.1 “DeiT-S model”);
processing the input data in the transformer model to obtain tokens for a tokenized input (page 3, second column, section 3; page 4, first column, top paragraph; tokens xi are obtained from the input image after processing by the transformer);
selecting a pruning method for the input data from (page 8, section 5.3; different pruning methods can be selected, including attentive token preservation, token packing, and token clustering; Table 3 shows the result of each selected method):
removing (page 1, second column, last paragraph; page 3, bottom of first column; page 3, second column, top; page 8, section 5.3; importance-based pruning discards (removes) inattentive tokens; “attentive token preservation”);
packaging (page 8, second column, top paragraph; inattentive tokens are packed into one token); and
merging (page 5, first column, section 4.4 token merger; second column, top paragraph; merging of inattentive tokens);
pruning the tokens using a token pruning module, which performs the selected token pruning method (page 8, section 5.3; Table 3 shows the different pruning modules, such as the attentive token preservation module (removing), the inattentive token pack module (packaging), and the inattentive token clustering module (merging); one of the modules is used for pruning; the result of pruning using each selected pruning method is shown in Table 3); and
outputting pruned data (page 5, second column, top paragraph; outputting of the pruned token sequence; see also Fig. 3 showing the output of token merging after pruning).
However, Long does not disclose a processor; and a memory, storing instructions that, when executed by the processor, perform operations; and receiving input data into a Swin transformer model; processing the input data in the Swin transformer model to obtain tokens.
Li discloses a processor; and a memory, storing instructions that, when executed by the processor, perform operations (paragraph 9, 87; CPU; memory 924 storing a program that, when executed, performs the method); and receiving input data into a Swin transformer model (paragraph 29, 34; input image into the Swin transformer model); processing the input data in the Swin transformer model to obtain tokens (paragraph 29, 34-35; tokens (features) extracted by processing the input image via the Swin model).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Li to process the input data using a Swin transformer model.
The motivation to combine the references is to provide a gating method that transfers features from earlier frames to later frames, reducing the amount of redundant computation of features when processing the input data in the Swin transformer model (paragraphs 43-44).
Regarding claim 15, see the rejection of claim 1. Further, Li discloses a non-transitory computer readable storage device, including instructions that, when executed by a processor, perform operations (paragraph 9, 87; CPU; memory 924 storing a program that, when executed by the processor, performs the method).
Claims 2, 3, 9, 10, 16, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over “Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers” to Long in view of US 20230090941 to Li, further in view of US 20230306600 to Zhang, further in view of US 20220374766 to Thorsley.
Regarding claim 2, Long discloses the method of claim 1, wherein the selected pruning method is removing (page 8, section 5.3; different pruning methods can be selected, including attentive token preservation; page 1, second column, last paragraph; page 3, bottom of first column; page 3, second column, top; importance-based pruning discards (removes) inattentive tokens; “attentive token preservation”).
However, Long does not disclose identifying windows into which the tokenized input is divided and reintegrating each of the windows from remaining tokens therein.
Zhang discloses identifying windows into which the tokenized input is divided (paragraph 55-56, 60; window partitioning of the input image, wherein each window includes tokens) and reintegrating each of the windows from remaining tokens therein (paragraph 61-62; each of the windows 414 in a layer includes the remaining tokens after merging the windows 412 of the previous layer by packing smaller tokens into bigger tokens; these windows 414 are merged with each other (reintegrating) to generate a merged window 416).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Zhang to divide the input data into windows and reintegrate those windows.
The motivation to combine the references is that dividing the input into windows allows self-attention to be calculated locally at each window, yielding more accurate information (paragraph 54).
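The window partitioning and reintegration mapped to Zhang above can be illustrated with a minimal Swin-style sketch (the window size, grid layout, and function names are assumptions for illustration, not from the reference):

```python
# Sketch: partition a tokenized (H, W, d) grid into non-overlapping
# ws x ws windows, and reintegrate the windows back into the grid.
import numpy as np

def window_partition(x, ws):
    """x: (H, W, d) token grid -> (num_windows, ws*ws, d)."""
    H, W, d = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, d)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, d)

def window_reverse(wins, ws, H, W):
    """Reintegrate windows back into the (H, W, d) grid."""
    d = wins.shape[-1]
    x = wins.reshape(H // ws, W // ws, ws, ws, d)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, d)
```

In this sketch the two functions are exact inverses, so reintegration recovers the original token layout when no tokens are pruned in between.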
However, Long does not disclose that removing comprises:
identifying a pruning target; identifying N tokens in each window having a lowest information content, where an area of each window is N greater than an area for each window to meet the pruning target;
removing the N tokens from each window to achieve the pruning target with remaining tokens in each window.
Thorsley discloses that removing comprises:
identifying a pruning target; identifying N tokens in each window having a lowest information content, where an area of each window is N greater than an area for each window to meet the pruning target;
removing the N tokens from each window to achieve the pruning target with remaining tokens in each window (paragraph 53-54; layer 411 includes one of the windows having sequences x1, x2, x3; the x2 sequence includes 4 tokens that have the lowest importance score (lowest information) based on threshold 418; before pruning, the window has a total of 12 tokens, which is 4 greater than the 8 tokens needed to meet the target of 8 tokens after pruning; the 4 tokens (N) are removed to leave the target of 8 tokens remaining in the window).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Thorsley to provide token pruning that removes tokens to achieve a target token amount.
The motivation to combine the references is to provide a hard pruning method that removes inattentive tokens completely, applying a hard mask based on a hard threshold to remove tokens in the sequence and providing the pruning result to the next layer (paragraph 53-54).
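The per-window removal mapped to Thorsley above (e.g., 12 tokens reduced to a target of 8 by dropping the N = 4 lowest-importance tokens) can be sketched as follows (shapes and names are hypothetical, for illustration only):

```python
# Sketch: within each window, drop the N lowest-importance tokens so the
# window shrinks from its original area to the pruning-target area,
# while keeping the surviving tokens in their original relative order.
import numpy as np

def remove_per_window(windows, scores, target):
    """windows: list of (n_i, d) arrays; scores: matching (n_i,) arrays;
    target: number of tokens each window must contain after pruning."""
    pruned = []
    for toks, s in zip(windows, scores):
        n_remove = toks.shape[0] - target          # N = window area - target
        keep = np.sort(np.argsort(s)[n_remove:])   # drop N lowest; keep order
        pruned.append(toks[keep])
    return pruned
```

Sorting the kept indices preserves the relative order of the remaining tokens, which is the property relied on in the claim 3 mapping.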
Regarding claim 3, Zhang discloses the method of claim 2, including the reintegrated window (paragraph 61-62; merged windows), and Thorsley discloses that the reintegrated window maintains a relative order of the remaining tokens in each window from before pruning to after pruning (paragraph 53-54; before pruning, the x3 sequence is after the x1 sequence; after pruning, the x3 sequence remains after the x1 sequence in the window when integrated in layer 413).
Regarding claim 9, see rejection of claim 2.
Regarding claim 16, see rejection of claim 2.
Regarding claim 10, see rejection of claim 3.
Regarding claim 17, see rejection of claim 3.
Claims 4, 5, 11, 12, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over “Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers” to Long in view of US 20230090941 to Li, further in view of US 20230306600 to Zhang, further in view of “NOT ALL PATCHES ARE WHAT YOU NEED: EXPEDITING VISION TRANSFORMERS VIA TOKEN REORGANIZATIONS” to Liang.
Regarding claim 4, Long discloses the method of claim 1, wherein the selected pruning method is packaging (page 8, section 5.3; different pruning methods can be selected, including attentive token preservation, token packing, and token clustering; Table 3 shows the result of each selected method).
However, Long does not disclose identifying windows into which the tokenized input is divided and reintegrating each of the windows from remaining tokens therein and the packaged token.
Zhang discloses identifying windows into which the tokenized input is divided (paragraph 55-56, 60; window partitioning of the input image, wherein each window includes tokens) and reintegrating each of the windows from remaining tokens therein and the packaged token (paragraph 61-62; each of the windows 414 in a layer includes the remaining tokens after merging the windows 412 of the previous layer by packing smaller tokens into bigger tokens; these windows 414 are merged with each other (reintegrating) to generate a merged window 416).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Zhang to divide the input data into windows and reintegrate those windows.
The motivation to combine the references is that dividing the input into windows allows self-attention to be calculated locally at each window, yielding more accurate information (paragraph 54).
However, Long does not disclose the method of claim 1, wherein packaging comprises:
identifying a pruning target; identifying P tokens in each window having a lowest information content, where an area of each window is P-1 greater than an area for each window to meet the pruning target; removing the P tokens from each window; combining the P removed tokens to produce a packaged token for each window from the P removed tokens from that window.
Liang discloses packaging comprises:
identifying a pruning target (page 4-5, section 3.2-3.3; the pruning target is the top-k tokens plus one packed token (k+1); see Fig. 2); identifying P tokens in each window having a lowest information content, where an area of each window is P-1 greater than an area for each window to meet the pruning target; removing the P tokens from each window; combining the P removed tokens to produce a packaged token for each window from the P removed tokens from that window (page 4-5, section 3.2-3.3; in Fig. 2, the 5 inattentive tokens (P tokens) have low information; in the initial window on the left side of Fig. 2 there are 10 image tokens (the area of the window), and the pruning target is 5 attentive tokens plus one packed token, for a total of 6 tokens; the 10-token window area is greater than the 6-token target by P-1 = 5-1 = 4; the P lowest (inattentive) tokens are removed and packed into one token as shown in Fig. 2).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Liang to provide packaging of tokens based on the pruning target.
The motivation to combine the references is to provide improved accuracy in image classification by reorganizing tokens with the packaging method, which fuses inattentive tokens into a single token instead of removing them (page 2, second-to-last paragraph; page 4, last 3 paragraphs).
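The packaging step mapped to Liang above (e.g., 10 tokens reduced to 5 attentive tokens plus one packed token) can be sketched as follows (score-weighted fusion here is an assumption standing in for Liang's attention-based fusion; all names are hypothetical):

```python
# Sketch: in each window, remove the P least-informative tokens, fuse them
# into a single packed token, and reinsert that token at a shared position
# (here, the end of every window).
import numpy as np

def package_per_window(windows, scores, target):
    """target counts the attentive tokens plus the one packed token."""
    out = []
    for toks, s in zip(windows, scores):
        p = toks.shape[0] - (target - 1)             # P tokens to fuse
        order = np.argsort(s)
        low, high = order[:p], np.sort(order[p:])    # keep attentive order
        w = s[low] / s[low].sum()                    # score-weighted fusion
        packed = (w[:, None] * toks[low]).sum(axis=0)
        out.append(np.vstack([toks[high], packed]))  # packed token last
    return out
```

Placing the packed token at the same index in every window mirrors the "shared location" arrangement discussed in the claim 5 mapping.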
Regarding claim 5, Zhang discloses the method of claim 4, wherein the packaged token is reintegrated into each of the windows (paragraph 61-62; merged windows in the upper layer pack tokens from previous layers), and Liang discloses reintegrating in a shared location across the windows (page 4-5, section 3.2-3.3; the P lowest (inattentive) tokens are removed and packed into one token as shown in Fig. 2; the attentive tokens are appended (reintegrated) to the packed token in a new window of tokens; Fig. 2 shows the packed token from the inattentive tokens at a shared location, the bottom of each window, with the attentive tokens situated at the top).
Regarding claim 11, see rejection of claim 4.
Regarding claim 12, see rejection of claim 5.
Regarding claim 18, see rejection of claim 4.
Allowable Subject Matter
Claims 6-7, 13-14, 19-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Other Prior Art Cited
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20240296652 to Wang.
US 20240268700 to Liu.
"HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers" to Dong et al.
"Which Tokens to Use? Investigating Token Reduction in Vision Transformers" to Haurum et al.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENIYAM MENBERU whose telephone number is (571) 272-7465. The examiner can normally be reached on Monday-Friday, 10:00am-6:30pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Akwasi Sarpong can be reached on (571) 270-3438. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Any inquiry of a general nature or relating to the status of this application or proceeding should be directed to the customer service office whose telephone number is (571) 272-2600. The group receptionist number for TC 2600 is (571) 272-2600.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
For more information about the PAIR system, see <http://pair-direct.uspto.gov/>. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Patent Examiner
Beniyam Menberu
/BENIYAM MENBERU/Primary Examiner, Art Unit 2681
03/20/2026