DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claims 16-20 are objected to because of the following informalities: Claims 16-20 should refer back to “The non-transitory computer readable storage device”. Appropriate correction is required.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 8, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over “Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers” to Long in view of US 20230090941 to Li.
Regarding claim 1, Long discloses a method, comprising:
receiving input data into a transformer model (page 2, second column, section 2 “Vision transformers” (model); page 3, second column, section 3; input image into vision transformers; page 4, first column, section 4.1 “DeiT-S model”);
processing the input data in the transformer model to obtain tokens for a tokenized input (page 3, second column, section 3; page 4, first column, top paragraph; tokens xi are obtained from the input image after processing by the transformer);
selecting a pruning method for the input data from (page 8, section 5.3; different pruning methods can be selected, including attentive token preservation, token packing, and token clustering; Table 3 shows the result of each selected method):
removing (page 1, second column, last paragraph; page 3, bottom of first column; page 3, second column, top; page 8, section 5.3; importance-based pruning discards (removes) inattentive tokens; “attentive token preservation”);
packaging (page 8, second column, top paragraph; inattentive tokens are packed into one token); and
merging (page 5, first column, section 4.4 token merger; second column, top paragraph; merging of inattentive tokens);
pruning the tokens using a token pruning module, which performs the selected token pruning method (page 8, section 5.3; Table 3 shows the different pruning modules, such as the attentive token preservation module (removing), the inattentive token pack module (packaging), and the inattentive token clustering module (merging); one of the modules is used for pruning; the result of pruning using each selected pruning method is shown in Table 3); and
outputting pruned data (page 5, second column, top paragraph; outputting of the pruned token sequence; see also Fig. 3 showing the output of token merging after pruning).
However, Long does not disclose receiving input data into a Swin transformer model;
processing the input data in the Swin transformer model to obtain tokens.
Li discloses receiving input data into a Swin transformer model (paragraph 29, 34; input image into the Swin transformer model);
processing the input data in the Swin transformer model to obtain tokens (paragraph 29, 34-35; tokens (features) extracted by processing the input image via the Swin model).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Li to process the input data using a Swin transformer model.
The motivation to combine the references is to provide a gating method that transfers features from earlier frames to later frames, reducing the amount of redundant computation of features when processing the input data in the Swin transformer model (paragraphs 43-44).
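For illustration of the mapped limitations only, the three pruning strategies compared in Long's Table 3 can be sketched as follows (a minimal sketch, assuming tokens as an (n, d) array with per-token importance scores; the function and parameter names are hypothetical and not from the reference):

```python
# Illustrative sketch of the three mapped pruning strategies:
# attentive token preservation (removing), inattentive token packing
# (packaging), and inattentive token clustering (merging).
import numpy as np

def prune_tokens(tokens, scores, keep, method="remove"):
    """tokens: (n, d) array; scores: (n,) importance; keep: tokens to retain."""
    order = np.argsort(scores)[::-1]              # most important first
    attentive, inattentive = order[:keep], order[keep:]
    if method == "remove":
        # Discard inattentive tokens entirely.
        return tokens[attentive]
    if method == "pack":
        # Fuse all inattentive tokens into a single score-weighted token.
        w = scores[inattentive] / scores[inattentive].sum()
        packed = (w[:, None] * tokens[inattentive]).sum(axis=0)
        return np.vstack([tokens[attentive], packed])
    if method == "merge":
        # Average each inattentive token into its nearest attentive token.
        out = tokens[attentive].copy()
        for i in inattentive:
            j = np.argmin(((out - tokens[i]) ** 2).sum(axis=1))
            out[j] = (out[j] + tokens[i]) / 2
        return out
    raise ValueError(method)
```

Under this sketch, "remove" and "merge" yield `keep` tokens while "pack" yields `keep + 1` (the extra packed token), matching the counts discussed in the mappings above.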
Regarding claim 8, Long discloses a system (see Fig. 3 page 3), comprising:
perform operations that include:
receiving input data into a transformer model (page 2, second column, section 2 “Vision transformers” (model); page 3, second column, section 3; input image into vision transformers; page 4, first column, section 4.1 “DeiT-S model”);
processing the input data in the transformer model to obtain tokens for a tokenized input (page 3, second column, section 3; page 4, first column, top paragraph; tokens xi are obtained from the input image after processing by the transformer);
selecting a pruning method for the input data from (page 8, section 5.3; different pruning methods can be selected, including attentive token preservation, token packing, and token clustering; Table 3 shows the result of each selected method):
removing (page 1, second column, last paragraph; page 3, bottom of first column; page 3, second column, top; page 8, section 5.3; importance-based pruning discards (removes) inattentive tokens; “attentive token preservation”);
packaging (page 8, second column, top paragraph; inattentive tokens are packed into one token); and
merging (page 5, first column, section 4.4 token merger; second column, top paragraph; merging of inattentive tokens);
pruning the tokens using a token pruning module, which performs the selected token pruning method (page 8, section 5.3; Table 3 shows the different pruning modules, such as the attentive token preservation module (removing), the inattentive token pack module (packaging), and the inattentive token clustering module (merging); one of the modules is used for pruning; the result of pruning using each selected pruning method is shown in Table 3); and
outputting pruned data (page 5, second column, top paragraph; outputting of the pruned token sequence; see also Fig. 3 showing the output of token merging after pruning).
However, Long does not disclose a processor; and a memory, storing instructions that, when executed by the processor, perform operations; and receiving input data into a Swin transformer model; processing the input data in the Swin transformer model to obtain tokens.
Li discloses a processor; and a memory, storing instructions that, when executed by the processor, perform operations (paragraph 9, 87; CPU; memory 924 storing a program that, when executed, performs the method); and receiving input data into a Swin transformer model (paragraph 29, 34; input image into the Swin transformer model); processing the input data in the Swin transformer model to obtain tokens (paragraph 29, 34-35; tokens (features) extracted by processing the input image via the Swin model).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Li to process the input data using a Swin transformer model.
The motivation to combine the references is to provide a gating method that transfers features from earlier frames to later frames, reducing the amount of redundant computation of features when processing the input data in the Swin transformer model (paragraphs 43-44).
Regarding claim 15, see the rejection of claim 1. Further, Li discloses a non-transitory computer readable storage device, including instructions that, when executed by a processor, perform operations (paragraph 9, 87; CPU; memory 924 storing a program that, when executed by the processor, performs the method).
Claims 2, 3, 9, 10, 16, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over “Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers” to Long in view of US 20230090941 to Li, further in view of US 20230306600 to Zhang, further in view of US 20220374766 to Thorsley.
Regarding claim 2, Long discloses the method of claim 1, wherein the selected pruning method is removing (page 8, section 5.3; different pruning methods can be selected, including attentive token preservation; page 1, second column, last paragraph; page 3, bottom of first column; page 3, second column, top; importance-based pruning discards (removes) inattentive tokens; “attentive token preservation”).
However, Long does not disclose identifying windows into which the tokenized input is divided and reintegrating each of the windows from remaining tokens therein.
Zhang discloses identifying windows into which the tokenized input is divided (paragraph 55-56, 60; window partitioning of the input image, wherein each window includes tokens) and reintegrating each of the windows from remaining tokens therein (paragraph 61-62; each of the windows 414 in a layer includes the remaining tokens after merging the windows 412 of the previous layer by packing smaller tokens into bigger tokens; these windows 414 are merged with each other (reintegrating) to generate a merged window 416).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Zhang to divide the input data into windows and reintegrate those windows.
The motivation to combine the references is that dividing the input into windows allows self-attention to be calculated locally at each window, yielding more accurate information (paragraph 54).
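The window partitioning and reintegration mapped to Zhang above can be illustrated with a minimal Swin-style sketch (the window size, grid layout, and function names are assumptions for illustration, not from the reference):

```python
# Sketch: partition a tokenized (H, W, d) grid into non-overlapping
# ws x ws windows, and reintegrate the windows back into the grid.
import numpy as np

def window_partition(x, ws):
    """x: (H, W, d) token grid -> (num_windows, ws*ws, d)."""
    H, W, d = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, d)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, d)

def window_reverse(wins, ws, H, W):
    """Reintegrate windows back into the (H, W, d) grid."""
    d = wins.shape[-1]
    x = wins.reshape(H // ws, W // ws, ws, ws, d)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, d)
```

In this sketch the two functions are exact inverses, so reintegration recovers the original token layout when no tokens are pruned in between.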
However, Long does not disclose that removing comprises:
identifying a pruning target; identifying N tokens in each window having a lowest information content, where an area of each window is N greater than an area for each window to meet the pruning target;
removing the N tokens from each window to achieve the pruning target with remaining tokens in each window.
Thorsley discloses that removing comprises:
identifying a pruning target; identifying N tokens in each window having a lowest information content, where an area of each window is N greater than an area for each window to meet the pruning target;
removing the N tokens from each window to achieve the pruning target with remaining tokens in each window (paragraph 53-54; layer 411 includes one of the windows having sequences x1, x2, x3; the x2 sequence includes 4 tokens that have the lowest importance score (lowest information) based on threshold 418; before pruning, the window has a total of 12 tokens, which is 4 greater than the 8 tokens needed to meet the target of 8 tokens after pruning; the 4 tokens (N) are removed to leave the target of 8 tokens remaining in the window).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Thorsley to provide token pruning that removes tokens to achieve a target token amount.
The motivation to combine the references is to provide a hard pruning method that removes inattentive tokens completely, applying a hard mask based on a hard threshold to remove tokens in the sequence and providing the pruning result to the next layer (paragraph 53-54).
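The per-window removal mapped to Thorsley above (e.g., 12 tokens reduced to a target of 8 by dropping the N = 4 lowest-importance tokens) can be sketched as follows (shapes and names are hypothetical, for illustration only):

```python
# Sketch: within each window, drop the N lowest-importance tokens so the
# window shrinks from its original area to the pruning-target area,
# while keeping the surviving tokens in their original relative order.
import numpy as np

def remove_per_window(windows, scores, target):
    """windows: list of (n_i, d) arrays; scores: matching (n_i,) arrays;
    target: number of tokens each window must contain after pruning."""
    pruned = []
    for toks, s in zip(windows, scores):
        n_remove = toks.shape[0] - target          # N = window area - target
        keep = np.sort(np.argsort(s)[n_remove:])   # drop N lowest; keep order
        pruned.append(toks[keep])
    return pruned
```

Sorting the kept indices preserves the relative order of the remaining tokens, which is the property relied on in the claim 3 mapping.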
Regarding claim 3, Zhang discloses the method of claim 2, including the reintegrated window (paragraph 61-62; merged windows), and Thorsley discloses that the reintegrated window maintains a relative order of the remaining tokens in each window from before pruning to after pruning (paragraph 53-54; before pruning, the x3 sequence is after the x1 sequence; after pruning, the x3 sequence remains after the x1 sequence in the window when integrated in layer 413).
Regarding claim 9, see rejection of claim 2.
Regarding claim 16, see rejection of claim 2.
Regarding claim 10, see rejection of claim 3.
Regarding claim 17, see rejection of claim 3.
Claims 4, 5, 11, 12, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over “Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers” to Long in view of US 20230090941 to Li, further in view of US 20230306600 to Zhang, further in view of “NOT ALL PATCHES ARE WHAT YOU NEED: EXPEDITING VISION TRANSFORMERS VIA TOKEN REORGANIZATIONS” to Liang.
Regarding claim 4, Long discloses the method of claim 1, wherein the selected pruning method is packaging (page 8, section 5.3; different pruning methods can be selected, including attentive token preservation, token packing, and token clustering; Table 3 shows the result of each selected method).
However, Long does not disclose identifying windows into which the tokenized input is divided and reintegrating each of the windows from remaining tokens therein and the packaged token.
Zhang discloses identifying windows into which the tokenized input is divided (paragraph 55-56, 60; window partitioning of the input image, wherein each window includes tokens) and reintegrating each of the windows from remaining tokens therein and the packaged token (paragraph 61-62; each of the windows 414 in a layer includes the remaining tokens after merging the windows 412 of the previous layer by packing smaller tokens into bigger tokens; these windows 414 are merged with each other (reintegrating) to generate a merged window 416).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Zhang to divide the input data into windows and reintegrate those windows.
The motivation to combine the references is that dividing the input into windows allows self-attention to be calculated locally at each window, yielding more accurate information (paragraph 54).
However, Long does not disclose the method of claim 1, wherein packaging comprises:
identifying a pruning target; identifying P tokens in each window having a lowest information content, where an area of each window is P-1 greater than an area for each window to meet the pruning target; removing the P tokens from each window; combining the P removed tokens to produce a packaged token for each window from the P removed tokens from that window.
Liang discloses packaging comprises:
identifying a pruning target (page 4-5, section 3.2-3.3; the pruning target is the top-k tokens plus one packed token (k+1); see Fig. 2); identifying P tokens in each window having a lowest information content, where an area of each window is P-1 greater than an area for each window to meet the pruning target; removing the P tokens from each window; combining the P removed tokens to produce a packaged token for each window from the P removed tokens from that window (page 4-5, section 3.2-3.3; in Fig. 2, the 5 inattentive tokens (P tokens) have low information; in the initial window on the left side of Fig. 2 there are 10 image tokens (the area of the window), and the pruning target is 5 attentive tokens plus one packed token, for a total of 6 tokens; the 10-token window area is greater than the 6-token target by P-1 = 5-1 = 4; the P lowest (inattentive) tokens are removed and packed into one token as shown in Fig. 2).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Long as taught by Liang to provide packaging of tokens based on the pruning target.
The motivation to combine the references is to provide improved accuracy in image classification by reorganizing tokens with the packaging method, which fuses inattentive tokens into a single token instead of removing them (page 2, second-to-last paragraph; page 4, last 3 paragraphs).
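The packaging step mapped to Liang above (e.g., 10 tokens reduced to 5 attentive tokens plus one packed token) can be sketched as follows (score-weighted fusion here is an assumption standing in for Liang's attention-based fusion; all names are hypothetical):

```python
# Sketch: in each window, remove the P least-informative tokens, fuse them
# into a single packed token, and reinsert that token at a shared position
# (here, the end of every window).
import numpy as np

def package_per_window(windows, scores, target):
    """target counts the attentive tokens plus the one packed token."""
    out = []
    for toks, s in zip(windows, scores):
        p = toks.shape[0] - (target - 1)             # P tokens to fuse
        order = np.argsort(s)
        low, high = order[:p], np.sort(order[p:])    # keep attentive order
        w = s[low] / s[low].sum()                    # score-weighted fusion
        packed = (w[:, None] * toks[low]).sum(axis=0)
        out.append(np.vstack([toks[high], packed]))  # packed token last
    return out
```

Placing the packed token at the same index in every window mirrors the "shared location" arrangement discussed in the claim 5 mapping.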
Regarding claim 5, Zhang discloses the method of claim 4, wherein the packaged token is reintegrated into each of the windows (paragraph 61-62; merged windows in the upper layer pack tokens from previous layers), and Liang discloses reintegrating in a shared location across the windows (page 4-5, section 3.2-3.3; the P lowest (inattentive) tokens are removed and packed into one token as shown in Fig. 2; the attentive tokens are appended (reintegrated) to the packed token in a new window of tokens; Fig. 2 shows the packed token from the inattentive tokens at a shared location, the bottom of each window, with the attentive tokens situated at the top).
Regarding claim 11, see rejection of claim 4.
Regarding claim 12, see rejection of claim 5.
Regarding claim 18, see rejection of claim 4.
Allowable Subject Matter
Claims 6-7, 13-14, 19-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Other Prior Art Cited
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20240296652 to Wang.
US 20240268700 to Liu.
"HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers" to Dong et al.
"Which Tokens to Use? Investigating Token Reduction in Vision Transformers" to Haurum et al.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENIYAM MENBERU whose telephone number is (571) 272-7465. The examiner can normally be reached on Monday-Friday, 10:00am-6:30pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Akwasi Sarpong can be reached on (571) 270-3438. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Any inquiry of a general nature or relating to the status of this application or proceeding should be directed to the customer service office whose telephone number is (571) 272-2600. The group receptionist number for TC 2600 is (571) 272-2600.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
For more information about the PAIR system, see <http://pair-direct.uspto.gov/>. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Patent Examiner
Beniyam Menberu
/BENIYAM MENBERU/Primary Examiner, Art Unit 2681
03/20/2026