DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Response to Amendment
The amendment filed on 1 April 2026 has been entered.
The amendment of claim 10 has been acknowledged.
In view of the amendment, the claim objection has been withdrawn.
Response to Arguments
Applicant's arguments filed on 1 April 2026, with respect to the pending claims, have been fully considered but they are not persuasive.
Applicant’s Representative submits that the prior art (Gani) does not teach the claims because Gani is silent with respect to the idea of the patches being converted to the distillation token.
The examiner respectfully disagrees. The language of claim 1 only requires “converting the patch labels to token labels.” Under the broadest reasonable interpretation in light of the claims and the specification, a patch label can be interpreted as a label for a portion of an image used in a vision transformer (see specification [0006]).
Gani Fig. 4 teaches that patch labels are fed into a vision transformer. Gani further teaches how ViTs work, see Gani ¶¶0005 “ViT typically splits the image into a grid of non-overlapping patches before passing them to a linear projection layer to adjust the token dimensionality. These tokens are then processed by a series of feed-forward and multi-headed self-attention layers” (emphasis added).
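For illustration only, the patch-splitting and linear-projection steps quoted from Gani ¶¶0005 may be sketched as follows. The sketch is not taken from Gani; the function names, the 4x4 single-channel image, the 2x2 patch size, and the projection weights are all assumptions chosen to keep the example small.

```python
# Illustrative sketch only; not Gani's implementation. All names and sizes are assumptions.

def split_into_patches(image, patch_size):
    """Split a 2-D image (list of rows) into non-overlapping square patches.

    Each patch is flattened in row-major order; patches are ordered
    left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


def linear_projection(patch, weights):
    """Project one flattened patch to a token: token[j] = sum_i patch[i] * weights[i][j]."""
    return [sum(p * w_row[j] for p, w_row in zip(patch, weights))
            for j in range(len(weights[0]))]


# A 4x4 image with pixel values 0..15, split into four 2x2 patches.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)

# Hypothetical 4 -> 2 projection that adjusts the token dimensionality.
weights = [[1, 0], [0, 1], [1, 0], [0, 1]]
tokens = [linear_projection(p, weights) for p in patches]
```

Each entry of `tokens` corresponds to exactly one image patch, which is the patch-to-token correspondence relied upon in the interpretation above.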
Further see Gani Figs. 4-5 & ¶¶0048-¶¶0051: “The present training method consists of two stages including Self-supervised View Prediction 410 followed by Supervised Label Prediction 430 tasks … The class token is a trainable vector, appended to the patch tokens before the first layer, that goes through the transformer layers, and is then projected with a linear layer to predict the class. The distillation token 506 is used similarly as the class token 502 … In an embodiment, the full transformer model is used in each of the student 412, teacher 422 and vision transformer 442” (emphasis added).
In view of this reasonable interpretation of the claims and the prior art, the examiner respectfully submits that the rejections set forth below remain proper.
Claim Rejections - 35 USC § 102
Claim(s) 1, 3-6, 10, 13, 14, 16, 19, and 20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Gani et al. (US 2024/0212330 A1), hereinafter referred to as Gani.
Regarding claim 1, Gani teaches a computer-implemented method for training an image classifier, the method comprising:
training a first vision transformer model to generate patch labels for corresponding image patches of images; converting the patch labels to token labels; and training a second vision transformer model to classify images based on the token labels (Gani Abstract: “A deep learning training system and method … receives the generated global views as a first sequence of non-overlapping image patches … trains parameters in a student-teacher network to predict a class of objects … The teacher parameters are updated via exponential moving average of the student network parameters. The parameters in the teacher network are transferred to the vision transformer, and the vision transformer is trained by supervised learning”; Gani ¶¶0050: “The DeiT model includes a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. The DeiT model adds a new token, the distillation token 504, to the initial embeddings (patches 504 and class 502 token). The class token is a trainable vector, appended to the patch tokens before the first layer, that goes through the transformer layers, and is then projected with a linear layer to predict the class. The distillation token 506 is used similarly as the class token 502. It interacts with other embeddings through self-attention 508 and is output by the network after the last layer. To get a full transformer block a Feed-Forward Network (FFN) 510 is added on top of the self-attention layer 508. This FFN 510 is composed of two linear layers separated by a GeLu activation”; further see the Response to Arguments above and Gani ¶¶0005 “ViT typically splits the image into a grid of non-overlapping patches before passing them to a linear projection layer to adjust the token dimensionality. These tokens are then processed by a series of feed-forward and multi-headed self-attention layers”; Gani Figs. 4-5 & ¶¶0048-¶¶0051: “The present training method consists of two stages including Self-supervised View Prediction 410 followed by Supervised Label Prediction 430 tasks … The class token is a trainable vector, appended to the patch tokens before the first layer, that goes through the transformer layers, and is then projected with a linear layer to predict the class. The distillation token 506 is used similarly as the class token 502 … In an embodiment, the full transformer model is used in each of the student 412, teacher 422 and vision transformer 442”).
Regarding claim 3, Gani teaches the computer-implemented method of claim 1, wherein training the first vision transformer model comprises:
dividing a first training image into a plurality of image patches (Gani Abstract discussed above);
presenting the plurality of image patches to the first vision transformer model to generate a first image classification for the first training image and a plurality of first respective patch labels representing classifications for the plurality of image patches (Gani Abstract & ¶¶0050 discussed above; Gani Figs. 4-5);
computing a first loss based on the first image classification and a ground truth classification for the first training image (Gani ¶¶0052: “In the original DeiT, the target objective is given by the distillation component of the loss. The target objective uses a hard-label distillation. Hard-label distillation is a variant of distillation where the hard decision of the teacher is taken as a true label. Let Zs be the logits of the student model. LCE is the cross-entropy 516, y is the softmax function. Let yt=argmaxcZt(c) 518 be the hard decision of the teacher”);
computing a second loss based on the ground truth classification for the first training image and the plurality of first respective patch labels (Gani Abstract & ¶¶0050, ¶¶0052 discussed above); and
updating the first vision transformer model based on the first loss and the second loss (Gani ¶¶0033: “the same ViT network is finetuned on the same target dataset using cross-entropy loss”; Gani Abstract & ¶¶0052 discussed above).
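For illustration only, the hard-label distillation quoted from Gani ¶¶0052 (the teacher's hard decision, the argmax of the teacher logits, is taken as a true label for a cross-entropy term) may be sketched as follows. The equal 1/2 weighting of the two cross-entropy terms follows the published DeiT formulation and is an assumption here, since it is not part of the quoted passage; the function names and the use of a single set of student logits are likewise assumptions.

```python
import math

# Illustrative sketch only; not Gani's implementation. Names and weights are assumptions.

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and a hard class index."""
    return -math.log(softmax(logits)[label])


def hard_label_distillation_loss(student_logits, teacher_logits, true_label):
    """Average of two cross-entropy terms on the student logits:
    one against the ground truth label, one against the teacher's hard decision."""
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_true = cross_entropy(student_logits, true_label)
    loss_teacher = cross_entropy(student_logits, teacher_label)
    return 0.5 * loss_true + 0.5 * loss_teacher


# Example: the teacher prefers class 1 while the ground truth is class 0.
loss = hard_label_distillation_loss([2.0, 0.5, -1.0], [0.1, 3.0, 0.2], true_label=0)
```

The two terms correspond to a loss against the ground truth classification and a loss against labels produced by another model, which is how the first loss and the second loss are mapped in the rejection above.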
Regarding claim 4, Gani teaches the computer-implemented method of claim 3, wherein the plurality of image patches are non-overlapping image patches (Gani Abstract discussed above).
Regarding claim 5, Gani teaches the computer-implemented method of claim 3, wherein computing the second loss comprises computing an average patch label from the plurality of first respective patch labels (Gani ¶¶0014: “receive the generated global views as a first sequence of non-overlapping image patches, receive the generated global views and the generated local views as a second sequence of non-overlapping image patches, train parameters in a student-teacher network that includes a student network and a teacher network to predict a class of objects in the global views and the local views by self-supervised view prediction using the first sequence and the second sequence, wherein the processing circuitry updates the teacher parameters via exponential moving average of the student network parameters”; Gani ¶¶0068: “The teacher parameters are updated via exponential moving average (EMA) 416 of the student weights using: θt ← λθt + (1 − λ)θs where θt and θs denote the parameters of teacher 422 and student 412 network respectively and, λ follows the cosine schedule from 0.996 to 1 during training”).
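For illustration only, the exponential moving average update quoted from Gani ¶¶0068, read as θt ← λθt + (1 − λ)θs with λ following a cosine schedule from 0.996 to 1, may be sketched as follows. The exact form of the cosine schedule is an assumption (Gani is quoted only for its end points), as are the function names and the representation of parameters as flat lists.

```python
import math

# Illustrative sketch only; not Gani's implementation. The schedule form is an assumption.

def momentum_at(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule that rises from `base` at step 0 to `final` at the last step."""
    return final - (final - base) * (math.cos(math.pi * step / total_steps) + 1) / 2


def ema_update(teacher, student, lam):
    """Return updated teacher parameters: lam * teacher + (1 - lam) * student."""
    return [lam * t + (1 - lam) * s for t, s in zip(teacher, student)]


# One update near the start of training: the teacher moves slightly toward the student.
teacher_params = [1.0, -2.0]
student_params = [0.0, 0.0]
teacher_params = ema_update(teacher_params, student_params, momentum_at(0, 100))
```

As λ approaches 1 the teacher parameters stop changing, so the teacher becomes a running average of the student parameters accumulated over training.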
Regarding claim 6, Gani teaches the computer-implemented method of claim 3, wherein:
the first loss is a cross entropy loss; and the second loss is a cross entropy loss (Gani ¶¶0033 & ¶¶0052 discussed above).
Regarding claim 10, Gani teaches the computer-implemented method of claim 1, wherein training the second vision transformer model comprises:
dividing a first training image into a plurality of image patches (Gani Abstract discussed above);
presenting the plurality of image patches to the first vision transformer model to generate a plurality of first patch labels representing respective classifications for each of the plurality of image patches by the first vision transformer model (Gani Abstract, Figs. 4-5, & ¶¶0050 discussed above);
converting the plurality of first patch labels to the second vision transformer model to generate an image classification for the first training image and a plurality of second patch labels representing respective classifications for the plurality of image patches by the second vision transformer model (Gani Abstract, Figs. 4-5, & ¶¶0050 discussed above);
computing a first loss based on the image classification and a ground truth classification for the first training image (Gani ¶¶0052 discussed above);
computing a second loss based on the plurality of token labels and the plurality of second patch labels (Gani Abstract & ¶¶0050, ¶¶0052 discussed above); and
updating the second vision transformer model based on the first loss and the second loss (Gani Abstract, ¶¶0033 & ¶¶0052 discussed above).
Regarding claim 13, Gani teaches one or more non-transitory computer readable media storing instructions that, when executed by one or more processors, cause the one or more processors to perform the processes described in claim 1 (Gani ¶¶0015: “a non-transitory computer readable storage medium storing program instructions for a deep learning training framework, which when executed by processing circuitry of a machine learning engine, perform a method”). Therefore, claim 13 is rejected using the same rationale as applied to claim 1 discussed above.
Regarding claim 14, Gani teaches the one or more non-transitory computer readable media of claim 13, wherein the image processing task is image classification (Gani Fig. 7 & ¶¶0060: “image classification, object detection, and semantic segmentation”).
Claim 16 is rejected using the same rationale as applied to claim 3 discussed above.
Claim 19 is rejected using the same rationale as applied to claim 10 discussed above.
Regarding claim 20, Gani teaches a system comprising:
one or more memories storing instructions, and one or more processors that are coupled to the one or more memories and, when executing the instructions, are configured to perform the processes described in claims 1, 3, and 14 (Gani Fig. 13 & ¶¶0098: “The interfaces, memory and processors may communicate over the system bus 1326. The computer system 1300 includes a power supply 1321, which may be a redundant power supply”).
Therefore, claim 20 is rejected using the same rationale as applied to claims 1, 3, and 14 discussed above.
Claim Rejections - 35 USC § 103
Claim(s) 2 and 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gani et al. (US 2024/0212330 A1), in view of Zhou et al. (“Understanding The Robustness in Vision Transformers,” arXiv:2204.12451v4 [cs.CV] 8 Nov 2022), hereinafter referred to as Gani and Zhou, respectively.
Regarding claim 2, Gani teaches the computer-implemented method of claim 1, wherein the network is a ViT using self-attention tokens and MLP (Gani ¶¶0050 discussed above; Gani ¶¶0067: “The features representation of each view is further processed by a 3-layer self-supervised MLP Projection (MLP 414) of the student 412 and teacher 422 networks”), but Gani does not appear to explicitly teach that the networks are fully attentional networks.
Pertaining to the same field of endeavor, Zhou teaches that the first vision transformer model is a fully attentional network; and the second vision transformer model is a fully attentional network (Zhou Fig. 2 & pg. 2 right column: “we propose a novel attentional channel processing design which promotes channel selection through reweighting … the attentional design is dynamic and content-dependent … The proposed fully attentional design is both efficient and effective, bringing systematically improved robustness with marginal extra costs”).
Gani and Zhou are considered to be analogous art because they are directed to vision transformers. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system and method of training vision transformer on small-scale datasets (as taught by Gani) to use fully attentional networks (as taught by Zhou) because the combination improves robustness with marginal extra costs (Zhou pg. 2 right column).
Claim 15 is rejected using the same rationale as applied to claim 2 discussed above.
Claim(s) 7, 8, 11, 12, and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gani et al. (US 2024/0212330 A1), in view of Li et al. (US 2023/0368494 A1), hereinafter referred to as Gani and Li, respectively.
Regarding claim 7, Gani teaches the computer-implemented method of claim 3, but does not appear to explicitly teach that converting the patch labels to the token labels comprises emphasizing the patch labels based on confidence scores for the patch labels.
Pertaining to the same field of endeavor, Li teaches that converting the patch labels to the token labels comprises emphasizing the patch labels based on confidence scores for the patch labels (Li ¶¶0004: “pruning tokens of the input image may include pruning tokens that are not in a group of a minimum number of highest-weighted tokens having token importance scores that sum to be equal to greater than the predetermined threshold value”; Li ¶¶0029: “The training framework disclosed herein trains the vision transformer 200 so that the early transformer layers (l0 ~ lP−1) learn to identify the importance of each patch token. At a designated pruning layer lP, token importance scores (TISs) are extracted based on the attention weights, and are used for token selection and sparsification. The subsequent layers lP+1 → lL−1 are alternately trained using pruned tokens and fully dense tokens (i.e., without pruning)”).
Gani and Li are considered to be analogous art because they are directed to vision transformers. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system and method of training vision transformer on small-scale datasets (as taught by Gani) to emphasize patch labels based on confidence scores (as taught by Li) because the combination prunes less important tokens (Li ¶¶0004).
Regarding claim 8, Gani, in view of Li, teaches the computer-implemented method of claim 7, wherein emphasizing the patch labels based on the confidence scores comprises converting the patch labels with low confidence scores to token labels indicating a background classification (Li ¶¶0045: “The patches of input images containing target objects receive higher TIS and the background patches have lower TIS. When small target objects occupy fewer number of patches, the corresponding distribution of TIS tends to be more concentrated, whereas large objects have associated TIS values spread over a larger area. As a result, given a sparsification strategy that is based on a mass threshold Mth, TSM is able to adjust the number of selected tokens based on the input image”).
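For illustration only, the token selection quoted from Li ¶¶0004 and ¶¶0045 (keeping the minimum number of highest-weighted tokens whose token importance scores sum to at least a mass threshold) may be sketched as follows. The sketch is not Li's implementation; the function name, the example scores, and the threshold value are assumptions.

```python
# Illustrative sketch only; not Li's implementation. Names and values are assumptions.

def select_tokens_by_mass(scores, mass_threshold):
    """Return indices of the fewest highest-scoring tokens whose scores sum
    to at least `mass_threshold`; all remaining tokens would be pruned."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        if mass >= mass_threshold:
            break
        kept.append(i)
        mass += scores[i]
    return sorted(kept)


# Hypothetical importance scores for four patch tokens: tokens 0 and 2 cover
# the target object, tokens 1 and 3 are background.
scores = [0.5, 0.1, 0.3, 0.1]
kept = select_tokens_by_mass(scores, mass_threshold=0.7)
```

With these scores the two object tokens are kept and the two low-scoring background tokens are pruned, which is the distinction between high-score and low-score patches relied upon for claims 7 and 8.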
Regarding claim 11, Gani teaches the computer-implemented method of claim 10, but does not appear to explicitly teach that the losses are combined.
Pertaining to the same field of endeavor, Li teaches computing a combined loss from the first loss and the second loss (Li ¶¶0039: “Combining the distillation loss with the label loss, the final training loss Ltot”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system and method of training vision transformer on small-scale datasets (as taught by Gani) to combine the losses (as taught by Li) because the combination improves the accuracy (Li ¶¶0037).
Regarding claim 12, Gani teaches the computer-implemented method of claim 10, wherein:
the first loss is a cross entropy loss (Gani ¶¶0033 & ¶¶0052 discussed above).
However, Gani does not appear to explicitly teach that the second loss is an aggregate of respective cross entropy losses between the plurality of token labels and the plurality of second patch labels.
Pertaining to the same field of endeavor, Li teaches that the second loss is an aggregate of respective cross entropy losses between the plurality of token labels and the plurality of second patch labels (Li ¶¶0037: “To improve the accuracy of the early layers in learning TIS, the distillation loss is introduced at lP via knowledge transfer from a teacher model … The distillation loss may be computed using Kullback-Leiber (KL) divergence … CLS is the classification token from the last layer of teacher module backbone averaged across all attention heads”; Li ¶¶0039: “Combining the distillation loss with the label loss, the final training loss Ltot … in which Llabel=CrossEntropy(ŷ, y) as the Cross Entropy loss between the model predictions ŷ and the ground truth labels y”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system and method of training vision transformer on small-scale datasets (as taught by Gani) to aggregate the losses (as taught by Li) because the combination improves the accuracy (Li ¶¶0037).
Claim 17 is rejected using the same rationale as applied to claim 7 discussed above.
Claim(s) 9 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gani et al. (US 2024/0212330 A1), in view of Li et al. (US 2023/0368494 A1), and further in view of Meng et al. (“AdaViT: Adaptive Vision Transformers for Efficient Image Recognition,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)), hereinafter referred to as Gani, Li, and Meng, respectively.
Regarding claim 9, Gani, in view of Li, teaches the computer-implemented method of claim 7, but does not appear to explicitly teach using a Gumbel-SoftMax block.
Pertaining to the same field of endeavor, Meng teaches processing the patch labels and the confidence scores with a Gumbel-SoftMax block (Meng pg. 12310: “Since binary decisions are non-differentiable, we resort to Gumbel-Softmax [26] during training to make the whole framework end-to-end trainable”; Meng pg. 12312 left column: “We relax the sampling process with Gumbel-Softmax trick [26] to make it differentiable during training”; Meng Eq. (10) & pg. 12313 left column: “Given an input image I with a label y, the final prediction is produced by the transformer F with parameters θ, and the cross-entropy loss is computed … A common solution is to resort to reinforcement learning and optimize the network with policy gradient methods … To this end, we use the Gumbel-Softmax trick [26] to relax the sampling”).
Gani, in view of Li, and Meng are considered to be analogous art because they are directed to vision transformers. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system and method of training vision transformer on small-scale datasets (as taught by Gani, in view of Li) to use Gumbel-SoftMax (as taught by Meng) because the combination makes the whole framework end-to-end trainable (Meng pg. 12310).
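For illustration only, the Gumbel-Softmax relaxation referenced in the Meng passages quoted above may be sketched as follows: Gumbel noise is added to the logits and a temperature-scaled softmax replaces the non-differentiable hard choice. The sketch is not Meng's implementation; the function name, the temperature value, and the two-way keep/drop logits are assumptions.

```python
import math
import random

# Illustrative sketch only; not Meng's implementation. Names and values are assumptions.

def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed (soft) one-hot sample over the given logits.

    Gumbel(0, 1) noise is drawn as -log(-log(u)) with u uniform on (0, 1);
    lower temperatures push the output closer to a hard one-hot vector.
    """
    noisy = []
    for z in logits:
        # Clamp u away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((z + gumbel) / temperature)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical keep/drop logits for one patch token.
rng = random.Random(0)
soft_decision = gumbel_softmax([1.0, -1.0], temperature=0.5, rng=rng)
```

Because the output is a smooth function of the logits, gradients can flow through the sampling step during training, which is the end-to-end trainability cited as the motivation to combine.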
Claim 18 is rejected using the same rationale as applied to claim 9 discussed above.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to final Office action, see 37 CFR 1.113(c). A request for reconsideration while not provided for in 37 CFR 1.113(c) may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1-2, 13-15, and 20 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-20 of copending Application No. 18/119,770 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because both applications are directed to training vision transformers using image patches, tokens, and a fully-attentional network.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SOO J SHIN whose telephone number is (571)272-9753. The examiner can normally be reached M-F; 10-6.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella, can be reached at (571)272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Soo Shin/Primary Examiner, Art Unit 2667 571-272-9753
soo.shin@uspto.gov