Last updated: May 29, 2026

Application No. 18/476,033

PROCESSING DATA USING CONVOLUTION AS A TRANSFORMER OPERATION

Final Rejection §102 §103

Filed

Sep 27, 2023

Priority

Oct 06, 2022 — provisional 63/413,903

Examiner

SHIN, SOO JUNG

Art Unit

2667

Tech Center

2600 — Communications

Assignee

Qualcomm Incorporated

OA Round

2 (Final)

Interview Optional

+16.4% interview lift. An interview has already been conducted in this application's prosecution history. This examiner has an 87% grant rate with a +16.4% interview lift. Because an interview has already been tried, a written response with narrowed claims, based on precedent claim-evolution patterns, is recommended.

Based on 610 resolved cases, 2023–2026

Examiner Intelligence

SHIN, SOO JUNG

Grants 87% — above average

Career Allowance Rate

531 granted / 610 resolved

+25.0% vs TC avg

Strong +16% interview lift

+16.4%

Interview Lift

resolved cases with interview

Fast prosecutor

2y 2m

Avg Prosecution

23 currently pending

Career history

636

Total Applications

across all art units

Statute-Specific Performance

§101

3.4%

-36.6% vs TC avg

§103

63.8%

+23.8% vs TC avg

§102

4.1%

-35.9% vs TC avg

§112

19.0%

-21.0% vs TC avg

Based on career data from 610 resolved cases

Office Action

§102 §103

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.

Response to Amendment
The amendment filed on 19 February 2026 has been entered.
The amendment of claim 14 has been acknowledged.
In view of the amendment, the claim objection has been withdrawn.

Response to Arguments
Applicant's arguments filed 19 February 2026, with respect to the pending claims, have been fully considered but they are not persuasive.
Applicant’s representative submits that the prior art of record does not teach the claims because it does not teach applying a depth-wise separable convolution filter to generate the first output and applying a pointwise convolution filter to the first output to generate the second output, in that the prior art is silent regarding generating a second output based on global information from a spatial dimension and a channel dimension associated with the image.
The examiner respectfully disagrees. The limitation “applying, via the convolution engine, a pointwise convolution filter to the first output to generate a second output based on global information from a spatial dimension and a channel dimension associated with the image” requires that the second output be generated from the first output using a pointwise convolution filter. This “applying” is based on global information from spatial and channel dimensions.
The prior art of record (Mehta) teaches that the first output is generated using a depth-wise separable convolution filter (Mehta Fig. 1 & pg. 8: “we replace standard convolutions in the SSD head with separable convolutions” → the standard linear convolution is replaced with depth-wise separable convolution filter, refer to Fig. 1(b) showing the depth d, as opposed to the linear d in Fig. 1(a)).
The prior art further teaches generating a second output from the first output using the point-wise convolution filter (Mehta Fig. 1(b): the “Fusion” step shows a point-wise convolution conv 1x1 from the first output from the “Transformers as Convolutions” step).
The prior art further teaches that the applying is based on global information from spatial and channel dimensions associated with the image (Mehta Fig. 1(b) teaches that the “applying” is based on global information from spatial and channel dimensions, see H, W, d, P, Lx).
In view of this reasonable interpretation of the claims and the prior art, the examiner respectfully submits that the rejections set forth below remain proper.

Claim Rejections - 35 USC § 102
Claim(s) 1-6, 9-11, 13-19, 22-24, and 26-28 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Mehta et al. (“MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer,” arXiv:2110.02178v2 [cs.CV] 4 Mar 2022), hereinafter referred to as Mehta.
Regarding claims 1, 14, and 27, Mehta teaches a processor-implemented method and apparatus for processing image data, the method and apparatus comprising:
at least one memory (Mehta pg. 9: “These optimizations improve latency and memory access”; pg. 18: “Memory footprint. A light-weight network running on mobile devices should be memory efficient”);
at least one processor coupled to the at least one memory (Mehta pg. 18: “iPhone 12 CPU, iPhone 12 neural engine, and NVIDIA V100 GPU”); and
a non-transitory computer-readable memory storing instructions which cause at least one processor coupled to the non-transitory computer-readable memory (Mehta pg. 9 & 18 discussed above) to be configured to:
receive, at a convolution engine of a machine learning system, a first set of features associated with an image, the first set of features being associated with a three-dimensional shape (Mehta Fig. 1 & pg. 4: “Here, C, H, and W represent the channels, height and width of the tensor respectively, and P = wh is number of pixels in the patch with height h and width w … Mobile ViT block. … a given input tensor X”);
apply, via the convolution engine, a depth-wise separable convolutional filter to the first set of features to generate a first output (Mehta pg. 3: “Vision transformers … uses depth-wise separable convolutions”; Mehta pg. 8: “Implementation details … we replace standard convolutions in the SSD head with separable convolutions”);
apply, via the convolution engine, a pointwise convolutional filter to the first output to generate a second output based on global information from a spatial dimension and a channel dimension associated with the image (Mehta Fig. 1 & pg. 5: “XF is then projected to low C-dimensional space using a point-wise convolution and combined with X via concatenation operation”);
modify the second output to the three-dimensional shape to generate a second set of features (Mehta Fig. 1 & pg. 5 discussed above); and
combine the first set of features and the second set of features to generate an output set of features (Mehta Fig. 1 & pg. 5 discussed above).
Refer to the comparison of Applicant’s Fig. 2 & Mehta’s Fig. 1 (see next page). Both use the same architecture and operations.
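For readers less familiar with the operations recited in claims 1, 14, and 27, the recited sequence can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation of Mehta or of the application: the function names (depthwise_conv3x3, pointwise_conv, combine), the toy 2-channel 3 x 3 input, the identity kernels, and the choice of element-wise addition as the combining step are all hypothetical. The sketch keeps the C x H x W layout throughout, so the reshape step is trivial here.

```python
def depthwise_conv3x3(x, kernels):
    """Filter each channel of x (C x H x W nested lists) with its own 3x3 kernel.
    Zero padding, stride 1; channels are never mixed in this step."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < H and 0 <= jj < W:
                            acc += kernels[c][di + 1][dj + 1] * x[c][ii][jj]
                out[c][i][j] = acc
    return out


def pointwise_conv(x, weights):
    """1x1 convolution: weights is C_out x C_in and mixes channels at each pixel."""
    C_in, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(row[c] * x[c][i][j] for c in range(C_in))
              for j in range(W)] for i in range(H)] for row in weights]


def combine(a, b):
    """Element-wise addition of two tensors with the same shape (assumed combiner)."""
    return [[[a[c][i][j] + b[c][i][j] for j in range(len(a[0][0]))]
             for i in range(len(a[0]))] for c in range(len(a))]


# Hypothetical first set of features: channel 0 is all 1.0, channel 1 is all 2.0.
x = [[[1.0] * 3 for _ in range(3)], [[2.0] * 3 for _ in range(3)]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # kernel that passes a channel through

first_output = depthwise_conv3x3(x, [identity, identity])
second_output = pointwise_conv(first_output, [[1.0, 1.0], [1.0, -1.0]])
output_features = combine(x, second_output)
```

The depth-wise step filters each channel independently over its spatial neighborhood, while the pointwise step mixes channels at each pixel location; together they form a depth-wise separable convolution.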

Regarding claims 2, 15 and 28, Mehta teaches the processor-implemented method, apparatus, and non-transitory computer-readable memory of claims 1, 14 and 27, wherein the convolution engine is configured to perform transformer operations (Mehta Fig. 1(b): see ViT (visual transformer) blocks).

Regarding claims 3 and 16, Mehta teaches the processor-implemented method and apparatus of claims 1 and 14, wherein the convolution engine is configured to perform pair-wise self-attention and global feature extraction operations (Mehta Abstract: “To learn global representations, self-attention-based vision transformers (ViTs) have been adopted”; Mehta pg. 4: “The computational cost of self-attention in vision transformers is O(N²d)”).

Regarding claims 4 and 17, Mehta teaches the processor-implemented method and apparatus of claims 1 and 14, further comprising:
performing, via the machine learning system, image classification associated with the image based on the output set of features (Mehta pg. 7: “4.1. Image Classification on the ImageNet-1K Dataset … We train MobileViT models from scratch on the ImageNet-1k classification dataset”; Mehta pg. 8: “We use smooth L1 and cross-entropy losses for object localization and classification, respectively”).





Applicant’s Fig. 2



Mehta Fig. 1(b)



Regarding claims 5 and 18, Mehta teaches the processor-implemented method and apparatus of claims 1 and 14, further comprising: receiving, at a second convolution engine of the machine learning system, a third set of features, the third set of features being generated based on the output set of features; applying, via the second convolution engine, an additional depth-wise separable convolutional filter to the third set of features to generate a third output; applying, via the second convolution engine, an additional pointwise convolutional filter to the third output to generate a fourth output based on the global information from the spatial dimension and the channel dimension associated with the image; modifying the fourth output to the three-dimensional shape to generate a fourth set of features; and combining the third set of features and the fourth set of features to generate an additional output set of features (see Mehta Fig. 1(b) discussed above. The network architecture comprises multiple ViT blocks. Each ViT block comprises transformers as convolutions).

Regarding claims 6 and 19, Mehta teaches the processor-implemented method and apparatus of claims 1 and 14, wherein the first set of features is associated with a local representation of the image and the second set of features is associated with a global representation of the image (Mehta Fig. 1(b) discussed above. See “Local representations” and “Transformers as Convolutions (global representations)”).

Regarding claims 9 and 22, Mehta teaches the processor-implemented method and apparatus of claims 1 and 14, wherein combining the first set of features and the second set of features to generate the output set of features comprises performing an element-wise cross-product between the first set of features and the second set of features (Mehta Fig. 1(b): conv-n × n).

Regarding claims 10 and 23, Mehta teaches the processor-implemented method and apparatus of claims 1 and 14, wherein the first set of features comprises a first tensor, wherein the second set of features comprises a second tensor, and wherein the output set of features comprises an output tensor (Mehta Fig. 1(b) & pg. 2: “we introduce the MobileViT block that encodes both local and global information in a tensor effectively”).

Regarding claims 11 and 24, Mehta teaches the processor-implemented method and apparatus of claims 1 and 14, further comprising:
obtaining first features from a first intermediate feature map based on the output set of features, the first intermediate feature map having a first resolution (Mehta Fig. 1(b): 128 x 128 → 64 x 64 → 32 x 32 → 16 x 16 → 8 x 8 → 1 x 1);
obtaining second features from a second intermediate feature map based on the output set of features, the second intermediate feature map having a second resolution different from the first resolution (Mehta Fig. 1(b) discussed above);
combining, via a self-attention engine, the first features and second features to generate fused features (Mehta Fig. 1(b): “Fusion”; Mehta pg. 4 discussed above; Mehta pg. 5: “vision transformers (ViTs) with multi-head self-attention are shown to be effective for visual recognition tasks”); and
predicting a location of an object in the image based on the fused features (Mehta Fig. 1(b) discussed above).

Regarding claims 13 and 26, Mehta teaches the processor-implemented method and apparatus of claims 11 and 24, further comprising:
performing a first prediction based on the first features; performing a second prediction based on the second features; and performing a third prediction based on the fused features (Mehta Fig. 1(b) discussed above).

Claim Rejections - 35 USC § 103
Claim(s) 7, 8, 12, 20, 21, and 25 is/are rejected under 35 U.S.C. 103 as being unpatentable over Mehta et al. (“MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer,” arXiv:2110.02178v2 [cs.CV] 4 Mar 2022), in view of Jenatton et al. (US 2023/0107409 A1), hereinafter referred to as Mehta and Jenatton, respectively.
Regarding claims 7 and 20, Mehta teaches the processor-implemented method and apparatus of claims 1 and 14, wherein the depth-wise separable convolutional filter extracts spatial information from a spatial domain of the first set of features to generate the first output (Mehta Fig. 1(b): “Output spatial dimensions”; Mehta pg. 4: “The n x n convolutional layer encodes local spatial information while the point-wise convolution projects the tensor to a high-dimensional space (or d-dimensional, where d > C) by learning linear combinations of input channels”; Mehta pg. 8: “we replace standard convolutions in the SSD head with separable convolutions and call the resultant network as SSDLite”).
However, Mehta does not appear to explicitly teach that the convolutional filter comprises a spatial multilayer perceptron (MLP).
Pertaining to the same field of endeavor, Jenatton teaches that the convolutional filter comprises a spatial MLP (Jenatton ¶¶0089: “The first n - 4 blocks are ViT blocks 330. A ViT block is a Vision Transformer self-attention block for processing representations of images”; Jenatton ¶¶0104: “Each expert neural network (which, in the example of FIG. 4, is an MLP) processes the representations that are assigned to the expert to generate an expert output and the block 130 combines the expert outputs for each of the representations so that the block output again includes k=2 representations of each network input”; Jenatton Fig. 4).
Mehta and Jenatton are considered to be analogous art because they are directed to image processing using vision transformers. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the MobileViT (as taught by Mehta) to use MLP (as taught by Jenatton) because the combination can utilize multiple “expert” network blocks that selectively activate subsets of the parameters of the neural network based on the network input, significantly improving the time and computational efficiency of the neural network (Jenatton ¶¶0006).
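The practical effect of replacing a standard convolution with a depth-wise separable one, as in the passage quoted from Mehta pg. 8, can be illustrated by counting weights. The sketch below uses the commonly cited parameter counts with bias terms ignored; the function names and the 3 x 3, 64-to-128-channel layer are hypothetical values chosen for illustration and are not taken from Mehta, Jenatton, or the application.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution: every output channel
    has its own k x k filter over every input channel."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depth-wise separable convolution: one k x k filter per
    input channel, followed by a 1 x 1 (pointwise) channel-mixing layer."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)    # 73728 weights
separable = separable_conv_params(3, 64, 128)  # 8768 weights
print(standard, separable, round(standard / separable, 2))
```

For this assumed layer the separable form needs roughly one eighth of the weights, which is the usual motivation for the substitution in light-weight networks.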

Regarding claims 8 and 21, Mehta teaches the processor-implemented method and apparatus of claims 1 and 14, wherein the pointwise convolutional filter extracts channel information from a channel dimension of the first set of features to generate the second output (Mehta Fig. 1(b), pg. 4, pg. 8 discussed above; further see Mehta pg. 4: “channel height … input channels”; Table 14: “Output channels”).
However, Mehta does not appear to explicitly teach that the convolutional filter comprises a channel multilayer perceptron (MLP).
Pertaining to the same field of endeavor, Jenatton teaches that the convolutional filter comprises a spatial MLP (Jenatton ¶¶0089, ¶¶0104 & Fig. 4 discussed above).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the MobileViT (as taught by Mehta) to use MLP (as taught by Jenatton) because the combination can utilize multiple “expert” network blocks that selectively activate subsets of the parameters of the neural network based on the network input, significantly improving the time and computational efficiency of the neural network (Jenatton ¶¶0006).

Regarding claims 12 and 25, Mehta teaches the processor-implemented method and apparatus of claims 11 and 24, combining the first and second features using the self-attention engine (Mehta Abstract, Fig. 1(b), pg. 4-5 discussed above).
However, Mehta does not appear to explicitly teach using dot-products and applying a Softmax function to the output to combine the features.
Pertaining to the same field of endeavor, Jenatton teaches performing, via the self-attention engine, a first dot-product of each feature of the first features of the first intermediate feature map with each feature of the second features in the second intermediate feature map; performing, via the self-attention engine, a second dot-product of each feature of the second features of the second intermediate feature map with each feature of the first features in the first intermediate feature map; applying, via the self-attention engine, a Softmax function to an output of the first dot-product and an output of the second dot-product; and performing, via the self-attention engine, a weighted summation of an output of the Softmax function to combine the first features and the second features (Jenatton ¶¶0111: “The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function, e.g., a dot product or scaled dot product, of the query with the corresponding key”; Jenatton ¶¶0113: “For example the attention mechanism may be a dot product attention mechanism applied by applying each query vector to each key vector to determine respective weights for each value vector, then combining the value vectors using the respective weights to determine the self-attention layer output for each element of the input sequence. The self-attention layer output may be scaled by a scaling factor, e.g., by the square root of the dimensions of the queries and keys, to implement scaled dot product attention. Thus, for example, an output of the attention mechanism may be determined as softmax”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the MobileViT (as taught by Mehta) to use dot products and softmax function (as taught by Jenatton) because the combination can utilize multiple “expert” network blocks that selectively activate subsets of the parameters of the neural network based on the network input, significantly improving the time and computational efficiency of the neural network (Jenatton ¶¶0006).
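The dot-product, Softmax, and weighted-summation steps discussed for claims 12 and 25 can be sketched as scaled dot-product attention applied in both directions between two feature sets. This is a minimal illustration under stated assumptions, not the method of Jenatton, Mehta, or the application: the function names (softmax, dot, attend) and the toy feature vectors are hypothetical, and the learned query, key, and value projections used in practical attention layers are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """For each query: scaled dot-product with every key, softmax over the
    scores, then a weighted sum of the values."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append([sum(w * v[t] for w, v in zip(weights, values))
                        for t in range(len(values[0]))])
    return outputs


# Hypothetical features taken from two feature maps of different resolution.
first_features = [[1.0, 0.0]]
second_features = [[0.0, 2.0], [0.0, 6.0]]

# First dot-product direction: first features query the second features.
first_to_second = attend(first_features, second_features, second_features)
# Second dot-product direction: second features query the first features.
second_to_first = attend(second_features, first_features, first_features)
```

In this toy case the single query scores both keys equally, so the first direction returns the average of the two value vectors; the second direction has only one key per query, so each output is simply that value vector.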

Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
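The reply-period arithmetic above can be sketched as follows. This is an illustration only: the helper add_months is a hypothetical name, the mailing date is the Mar 27, 2026 Final Rejection entry shown in the Prosecution Timeline, and the sketch computes nominal calendar dates without accounting for weekends, federal holidays, extension fees, or the advisory-action scenario described above.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Return the date `months` calendar months after d, clamping the day
    to the last day of the target month when necessary."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


mailing_date = date(2026, 3, 27)                # assumed from the timeline entry
shortened_period_end = add_months(mailing_date, 3)  # nominal three-month date
statutory_maximum = add_months(mailing_date, 6)     # nominal six-month date

print(shortened_period_end.isoformat())  # 2026-06-27
print(statutory_maximum.isoformat())     # 2026-09-27
```

Any date produced this way should be checked against the actual mailing date on the Office Action and the applicable USPTO rules before it is docketed.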

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SOO J SHIN whose telephone number is (571)272-9753. The examiner can normally be reached M-F; 10-6.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella, can be reached at (571)272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Soo Shin/
Primary Examiner, Art Unit 2667
571-272-9753
soo.shin@uspto.gov

Prosecution Timeline

Sep 27, 2023

Application Filed

Nov 21, 2025

Non-Final Rejection mailed — §102, §103

Feb 17, 2026

Applicant Interview (Telephonic)

Feb 17, 2026

Examiner Interview Summary

Feb 19, 2026

Response Filed

Mar 27, 2026

Final Rejection mailed — §102, §103

May 27, 2026

Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

18/143,306

Patent 12633145

APPARATUS AND METHOD FOR PERFORMING IMAGE AUTHENTICATION

3y 0m to grant Granted May 19, 2026

18/179,443

Patent 12633093

INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND STORAGE MEDIUM FOR RECOGNIZING TASKS

3y 2m to grant Granted May 19, 2026

18/465,502

Patent 12626487

OBJECT DETECTION BASED ON ATROUS CONVOLUTION AND ADAPTIVE PROCESSING

2y 8m to grant Granted May 12, 2026

18/480,082

Patent 12626499

IMAGE PROCESSING DEVICE AND IMAGE PROCESSING METHOD

2y 7m to grant Granted May 12, 2026

18/380,365

Patent 12620073

IMAGE RECOGNITION METHOD, SYSTEM AND MOBILE VEHICLE

2y 6m to grant Granted May 05, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

3-4

Expected OA Rounds

87%

Grant Probability

99%

With Interview (+16.4%)

2y 2m (~0m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 610 resolved cases by this examiner. Grant probability derived from career allowance rate.