Prosecution Insights
Last updated: April 19, 2026
Application No. 18/613,263

TRANSFORMER WITH MULTI-SCALE MULTI-CONTEXT ATTENTIONS

Non-Final OA: §101, §103, §112
Filed
Mar 22, 2024
Examiner
ROBERTS, RACHEL L
Art Unit
2674
Tech Center
2600 — Communications
Assignee
Qualcomm Incorporated
OA Round
1 (Non-Final)
Grant Probability: 90% (Favorable)
OA Rounds: 1-2
To Grant: 2y 10m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 90% (17 granted / 19 resolved; +27.5% vs TC avg; above average)
Interview Lift: +14.3% (moderate; resolved cases with interview)
Avg Prosecution: 2y 10m (typical timeline)
Total Applications: 54 across all art units (35 currently pending)

Statute-Specific Performance

§101: 12.1% (-27.9% vs TC avg)
§103: 65.1% (+25.1% vs TC avg)
§102: 7.9% (-32.1% vs TC avg)
§112: 12.1% (-27.9% vs TC avg)
Tech Center average values are estimates • Based on career data from 19 resolved cases

Office Action

§101 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Applicant claims the benefit of U.S. Provisional Application No. 63/509,590, filed 06/22/2023. Claims 1-20 have been afforded the benefit of this filing date.

Information Disclosure Statement

The IDSs dated 03/22/2024 and 11/01/2024 have been considered and placed in the application file.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. Claim 1's term "size of the transformed version of the image pixels" renders claim 1 indefinite. The specification (US 20240428576) uses "size" to refer to at least four distinct and non-interchangeable quantities: (1) the token sequence length N, which drives the O(N^2) complexity analysis (¶28); (2) the slice hyperparameter L defining the dimension of each slice sub-tensor (¶50); (3) the spatial height-width resolution H×W of a feature map (¶¶157, 162, 168); and (4) the dimensionality of internal attention matrices such as the query matrix (¶179).
The only contextual guidance offered in the dynamic-selection discussion, the parenthetical "(e.g., have smaller resolution)" at ¶171, is explicitly non-limiting by virtue of the "e.g." qualifier and does not constitute a lexicographic definition. Because the specification does not resolve which quantity constitutes the operative "size" for purposes of the selection step, the metes and bounds of claim 1 cannot be determined with reasonable certainty. Nautilus, Inc. v. Biosig Instruments, Inc., 572 U.S. 898, 910 (2014). Claims 2-20 depend, directly or indirectly, from claim 1 and are therefore also rejected. Appropriate correction is required.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-19 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claims recite selecting and applying a variable number of mathematical attention operations to image data based on input size parameters, a mathematical step under MPEP § 2106.04(a)(2)(I). This judicial exception is not integrated into a practical application because the additional elements of generic memory and processors merely implement the mathematical operations without improving computer functionality or producing a physical result. Claims 1-19 do not include additional elements that are sufficient to amount to significantly more than the judicial exception; claim 20, by contrast, recites the specific serial multi-scale attention pipeline that reflects the computational efficiency improvement described in the specification.
Upon review of independent claim 1, and based upon consideration of all relevant factors with respect to the claims as a whole, claims 1-19 are held to claim an abstract idea without reciting elements that amount to significantly more than the abstract idea and are therefore rejected as ineligible subject matter under 35 U.S.C. 101. The Examiner will analyze claim 1. The rationale, under MPEP § 2106, for this finding is explained below. The claimed invention (1) must be directed to one of the four statutory categories, and (2) must not be wholly directed to subject matter encompassing a judicially recognized exception, as defined below. The following two-step analysis is used to evaluate these criteria.

Step 1: Is the claim directed to one of the four patent-eligible subject matter categories: process, machine, manufacture, or composition of matter? When examining the claim under 35 U.S.C. 101, the Examiner interprets the claim as related to a machine, since the claim is directed to a processing system comprising memory and processors.

Step 2A, Prong 1: Does the claim wholly embrace a judicially recognized exception, which includes laws of nature, physical phenomena, and abstract ideas, or is it a particular practical application of a judicial exception? The Examiner interprets that the judicial exception applies, since claim 1 recites selecting a number of local attention operations based on input size (a mathematical relationship) and generating a transformer output by applying those operations plus a global attention operation (matrix multiplications, softmax, weighted summations). Together, these limitations recite a mathematical concept and are directed to an abstract idea. The claim is related to mathematical concepts. Recentive Analytics, Inc. v. Fox Corp., No. 23-2437, 134 F.4th 1205 (Fed. Cir. 2025).
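For orientation, the operations the Examiner characterizes as mathematical (matrix multiplications, softmax, weighted summations, and a size-based operation count) can be sketched in a few lines. This is an illustrative reconstruction only: the function names, the windowing scheme, and the size-based selection rule are assumptions for illustration, not Applicant's claimed implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: one of the operations the rejection cites.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Matrix multiplications, softmax, and a weighted summation: the
    # "mathematical concept" identified in the Step 2A, Prong 1 analysis.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def transformer_layer(tokens, num_local):
    # Apply the selected number of local attention operations over
    # non-overlapping windows, then one global attention operation.
    n, d = tokens.shape
    window = n // num_local  # assumes n divisible by num_local
    local_out = np.vstack([
        attention(tokens[i * window:(i + 1) * window],
                  tokens[i * window:(i + 1) * window],
                  tokens[i * window:(i + 1) * window])
        for i in range(num_local)
    ])
    return attention(local_out, local_out, local_out)

tokens = np.random.default_rng(0).normal(size=(16, 8))
# Hypothetical size-based selection rule (the "mathematical relationship").
num_local = 2 if tokens.shape[0] <= 16 else 4
out = transformer_layer(tokens, num_local)
```

Note that nothing here produces a downstream physical effect; the sketch terminates in the array `out`, mirroring the Examiner's observation that the claim terminates in an intermediate computational result.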
If the claim recites a judicial exception (i.e., an abstract idea enumerated in MPEP § 2106.04(a), a law of nature, or a natural phenomenon), the claim requires further analysis in Prong Two.

Step 2A, Prong 2: Does the claim recite additional elements that integrate the judicial exception into a practical application? The Examiner interprets that the claim 1 limitations do not integrate the judicial exception into a practical application; the additional elements the claim does recite (memory, processors, and "in a device") are generic computer components that implement the abstract idea rather than integrate it, and amount to no more than insignificant extra-solution activity. See Genetic Techs. v. Merial LLC, 818 F.3d 1369, 1376, 118 USPQ2d 1541, 1546 (Fed. Cir. 2016) (eligibility "cannot be furnished by the unpatentable law of nature (or natural phenomenon or abstract idea) itself."). For a claim reciting a judicial exception to be eligible, the additional elements (if any) in the claim must "transform the nature of the claim" into a patent-eligible application of the judicial exception, Alice Corp., 573 U.S. at 217, 110 USPQ2d at 1981, either at Prong Two or in Step 2B. The Examiner further interprets that the claim 1 limitations do not provide a clear improvement to a technology or to computer functionality. The claim terminates in "generate a transformer output," an intermediate computational result with no downstream physical effect, machine control, or real-world application.
Applicant may argue that the dynamic selection mechanism reflects the computational efficiency improvement described in ¶[0028] (reducing O(N^2) complexity to near-linear), but that efficiency improvement is the abstract idea itself: the mathematical relationship between input size and operation count. Using the abstract idea to bootstrap the practical-application analysis is circular and unavailing. For the specification's described improvement to count, the claim must reflect the specific components or steps that provide it; generic processors executing a selection algorithm do not. MPEP § 2106.05(a) is not satisfied. See Genetic Techs. v. Merial LLC, 818 F.3d 1369, 1376, 118 USPQ2d 1541, 1546 (Fed. Cir. 2016) (eligibility "cannot be furnished by the unpatentable law of nature (or natural phenomenon or abstract idea) itself."). If there are no additional elements in the claim, then it cannot be eligible. In such a case, after making the appropriate rejection, it is a best practice for the examiner to recommend an amendment, if possible, that would resolve eligibility of the claim.

Step 2B: If the claim does not integrate the judicial exception into a practical application, the Examiner must determine whether the claim recites additional elements that amount to significantly more than the judicial exception.
The Examiner interprets that the claims do not amount to significantly more, since the claims state the use of memory and processors that, considered individually and as an ordered combination, are well-understood, routine, and conventional (WURC) in the field. Further, there is no unconventional combination of the memory or processors. Simply appending well-understood, routine, and conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception, e.g., a claim to an abstract idea requiring no more than a generic computer to perform generic computer functions that are well-understood, routine, and conventional activities previously known to the industry, does not supply significantly more, as discussed in Alice Corp., 573 U.S. at 225, 110 USPQ2d at 1984. Claims 2-19, which depend from the independent claim, include all the limitations of the independent claim. The Examiner finds that Claim 2 adds saliency map generation and semantic complexity determination, which are further abstract mathematical calculations on image data; no practical application is added. See MPEP 2106.05(g) and MPEP 2106.05(e). The Examiner finds that Claim 3 selects the operation count based on the contextual object count from the saliency map, refining a mathematical selection criterion, which is a further abstract mathematical concept; no practical application is added. See MPEP 2106.05(g) and MPEP 2106.05(e). The Examiner finds that Claim 4 compares the object count against thresholds, a mathematical threshold comparison and a further abstract mathematical concept; no practical application is added. See MPEP 2106.05(g) and MPEP 2106.05(e).
The Examiner finds that Claim 5 makes the operation count directly proportional to the object count, a mathematical proportionality relationship and a further abstract mathematical concept; no practical application is added. See MPEP 2106.05(g) and MPEP 2106.05(e). The Examiner finds that Claim 6 selects fewer than two operations when the object count satisfies a threshold, a mathematical conditional relationship and a further abstract mathematical concept; no practical application is added. See MPEP 2106.05(g) and MPEP 2106.05(e). The Examiner finds that Claims 7-10 obtain display resolutions on which the operation selection and object size are based, a selection criterion that remains mathematical. Although display resolution is a physical parameter, the claims still terminate in a transformer output with no downstream application; therefore, no practical application is added. See MPEP 2106.05(g) and MPEP 2106.05(e). The Examiner finds that Claim 11 details an operation count directly proportional to input size, a mathematical proportionality relationship and a further abstract mathematical concept; no practical application is added. See MPEP 2106.05(g) and MPEP 2106.05(e). The Examiner finds that Claim 12 details selecting fewer than two operations based on a size threshold, a mathematical conditional relationship and a further abstract mathematical concept; no practical application is added. See MPEP 2106.05(g) and MPEP 2106.05(e).
The Examiner finds that Claim 13 details an operation count selected based further on display resolution, in which display resolution links the mathematics to a field of use. This is seen as an abstract idea related to a mathematical process, and no practical application is added. See MPEP 2106.05(h) and MPEP 2106.05(e). The Examiner finds that Claim 14 details an operation count selected based further on display resolution, which is a mathematical proportionality to a physical parameter and a further abstract mathematical concept; no practical application is added. See MPEP 2106.05(g) and MPEP 2106.05(e). The Examiner finds that Claim 15 adds a camera that captures image data and processors that transform the image data to generate input; cameras are well-understood and routine in this field, and gathering the image data constitutes pre-solution data gathering. This is seen as insignificant extra-solution activity, and no practical application is added. See MPEP 2106.05(d) and MPEP 2106.05(g). The Examiner finds that Claim 16 adds a transmitter to transmit the transformer output to a receiver; transmission of a computational result is post-solution activity, and no practical application is added. See MPEP 2106.05(g). The Examiner finds that Claim 17 details processors generating an output prediction from the transformer output; the output prediction is a downstream data result. This limitation recites no physical effect or machine control, and therefore no practical application is added. See MPEP 2106.05(h). The Examiner finds that Claim 18 adds a display configured to display the output prediction. Displays are well-understood and routine in this field, and displaying a data result is post-solution activity. This limitation does not integrate the abstract idea into a practical application, and therefore no practical application is added. See MPEP 2106.05(g).
The Examiner finds that Claim 19 details an output prediction comprising a depth map, a classification, or a segmentation map. This limitation specifies the type of data result and recites no physical output or machine control; it is a field-of-use limitation. See MPEP 2106.05(h). Thus, Claims 2-19 recite the same abstract idea and are not drawn to eligible subject matter, as they are directed to the abstract idea without significantly more. Claims 1-19 are all rejected under 35 U.S.C. § 101.

Claim Interpretation

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification. Under MPEP 2143.03, "All words in a claim must be considered in judging the patentability of that claim against the prior art." In re Wilson, 424 F.2d 1382, 1385, 165 USPQ 494, 496 (CCPA 1970). As a general matter, the grammar and ordinary meaning of claim terms, as understood by one having ordinary skill in the art, dictate whether, and to what extent, the language limits the claim scope. Language that suggests or makes a feature or step optional, but does not require that feature or step, does not limit the scope of a claim under the broadest reasonable claim interpretation. In addition, when a claim requires selection of an element from a list of alternatives, the prior art teaches the element if one of the alternatives is taught by the prior art. See, e.g., Fresenius USA, Inc. v. Baxter Int'l, Inc., 582 F.3d 1288, 1298, 92 USPQ2d 1163, 1171 (Fed. Cir. 2009). Claim 19 recites "at least one of" followed by the list "a depth map, a classification, or a segmentation map."
Since "at least one of" is disjunctive, any one of the listed elements found in the prior art is sufficient to reject the claim. While citations have been provided for completeness and rapid prosecution, only one element is required. Because, on balance, the disjunctive interpretation appears to enjoy the most specification support, the disjunctive interpretation (one of A, B, or C) is adopted for purposes of this Office Action. Applicant's comments and/or amendments relating to this issue are invited to clarify the claim language and the prosecution history.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3.
Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-20 are rejected under 35 U.S.C. 103 as obvious over Lui et al. ("Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; hereafter referred to as Lui) in view of Rao et al. ("DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification," Advances in Neural Information Processing Systems 34 (2021): 13937-13949; hereafter referred to as Rao) in further view of Wang et al. (Wang, Wenxiao, et al., "CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention," arXiv e-prints (2021): arXiv-2108; hereafter referred to as Wang). Regarding Claim 1, Lui teaches a processing system in a device (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method in general hardware, which is well understood to include processing), comprising: a memory configured to store machine learning model parameters (Lui Pg 2, Col 1, ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware); and one or more processors, coupled to the memory (Lui Pg 2, Col 1, ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware), configured to: access a transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme) as input to an attention layer of a machine learning model (Lui Fig 2, Pg 2 Col 1 ¶2, and Table 4 disclose that self-attention is computed within every window and provides connections among the windows in the model); to the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme)
of the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme); and generate a transformer output (Lui Fig 2, Pg 2 Col 1 ¶1 discloses the output of the transformer being new windows to which an additional self-attention layer can be applied) for the attention layer of the machine learning model (Lui Fig 2, Pg 2 Col 1 ¶2, and Table 4 disclose that self-attention is computed within every window and provides connections among the windows in the model) to the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme). Lui does not explicitly teach selecting a number of local attention operations to apply, in one transformer, and based on applying the number of local attention operations and at least one global attention operation. Rao is in the same field of use, shares the same hierarchical vision transformer design, and employs compatible attention mechanisms. Further, Rao teaches selecting a number of local attention operations to apply, in one transformer (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply), based on applying the number of local attention operations and at least one global attention operation (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply; Pg 4 Section 3.2 discloses the local and global attention operations being applied).
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lui by incorporating the runtime dynamic selection mechanism as taught by Rao, to make an invention that optimizes computational efficiency for variable-resolution inputs; thus, one of ordinary skill in the art would be motivated to combine the references, since an object of the present inventions is to reduce the complexity of vision transformers while maintaining accuracy (Rao, Pg 2 ¶02). Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention. Lui and Rao in combination do not explicitly disclose based at least in part on a size. Wang is in the same field of use, shares the same hierarchical vision transformer design, and employs compatible attention mechanisms. Further, Wang teaches based at least in part on a size (Wang Pg 2 ¶03-04 and ¶07 disclose a vision transformer in which the number of local attention operations scales proportionally with the spatial size of the input feature map). Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lui in view of Rao by incorporating size-proportional attention scaling as taught by Wang, to make an invention that optimizes computational efficiency for variable-resolution inputs; thus, one of ordinary skill in the art would be motivated to combine the references, since an object of the present inventions is to reduce the complexity of vision transformers while maintaining accuracy (Wang, Abstract and Introduction). Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
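As a rough illustration of the asserted combination: windowed local attention (attributed to Lui), runtime dynamic selection (attributed to Rao), and size-proportional scaling of the operation count (attributed to Wang) compose into a selection policy like the following sketch. The window size, the cap on the operation count, and the function name are hypothetical assumptions, not teachings of any cited reference.

```python
def select_num_local_ops(height: int, width: int, window: int = 7) -> int:
    """Hypothetical size-based policy for the number of local attention ops.

    The operation count grows with the H x W feature-map size (the
    size-proportional scaling the rejection attributes to Wang), and the
    decision is made at runtime per input (the dynamic selection the
    rejection attributes to Rao) over non-overlapping windows (the
    partitioning the rejection attributes to Lui).
    """
    # Number of non-overlapping windows that tile the feature map.
    windows = max(1, height // window) * max(1, width // window)
    # Clamp to a small compute budget; the cap of 4 is an arbitrary choice.
    return min(windows, 4)
```

A feature map no larger than one window yields a single local operation, while larger maps saturate at the budget; this is the kind of "number of local attention operations selected based at least in part on a size" relationship the rejection maps onto the combined references.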
Regarding Claim 2, Lui in view of Rao, in further view of Wang, teaches the processing system of claim 1, wherein the one or more processors (Lui Pg 2, Col 1, ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware and is well understood to include processors) are configured to: generate a saliency map (Rao Abstract, Section 3.1 ¶1, and Section 4.3 ¶05 disclose an importance score for each token, given the current features, being used to determine the selected pixels) based on the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme); and determine a semantic complexity (Lui Pg 2 Col 1 ¶01 and Pg 3 Col 2 ¶01 disclose that semantic segmentation requires dense prediction at the pixel level) of the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme) based on the saliency map. See Claim 1, its parent claim, for the motivation-to-combine rationale.
Regarding Claim 3, Lui in view of Rao, in further view of Wang, teaches the processing system of claim 2, wherein, to select the number of local attention operations (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply), the one or more processors (Lui Pg 2, Col 1, ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware and is well understood to include processors) are configured to select the number of local attention operations based on a number of contextual objects (Rao Section 3.2 discloses the global features containing the context of the whole image) indicated in the saliency map (Rao Abstract, Section 3.1 ¶1, and Section 4.3 ¶05 disclose an importance score for each token, given the current features, being used to determine the selected pixels). See Claim 1, its parent claim, for the motivation-to-combine rationale.
Regarding Claim 4, Lui in view of Rao, in further view of Wang, teaches the processing system of claim 3, wherein, to select the number of local attention operations (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply), the one or more processors (Lui Pg 2, Col 1, ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware and is well understood to include processors) are configured to compare the number of contextual objects against one or more thresholds (Wang Pg 4-5 Section 3.2.1 discloses the attention operations being proportional to the features, including the contextual relationship of the small-scale embeddings to the large-scale embeddings) to select the number of local attention operations. See Claim 1, its parent claim, for the motivation-to-combine rationale. Regarding Claim 5, Lui in view of Rao, in further view of Wang, teaches the processing system of claim 3, wherein the selected number of local attention operations (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply) is directly proportional to the number of contextual objects (Wang Pg 4-5 Section 3.2.1 discloses the attention operations being proportional to the features, including the contextual relationship of the features). See Claim 1, its parent claim, for the motivation-to-combine rationale.
Regarding Claim 6, Lui in view of Rao, in further view of Wang, teaches the processing system of claim 3, wherein, to select the number of local attention operations (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply), the one or more processors (Lui Pg 2, Col 1, ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware and is well understood to include processors) are configured to select at least two local attention operations (Wang Pg 4-5 Section 3.2.1 discloses the attention operations being proportional to the features, including the contextual relationship of the small-scale embeddings to the large-scale embeddings) based on a determination that the number of contextual objects satisfies a defined threshold (Wang Pg 4-5 Section 3.2.1 discloses the attention operations being proportional to the features, including the contextual relationship of the small-scale embeddings to the large-scale embeddings; therefore, if there are two contextual features, there will be two attention operators). See Claim 1, its parent claim, for the motivation-to-combine rationale.
Regarding Claim 7, Lui in view of Rao, in further view of Wang, teaches the processing system of claim 3, wherein, to select the number of local attention operations (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply), the one or more processors (Lui Pg 2, Col 1, ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware and is well understood to include processors) are configured to: obtain a display resolution of a display device (Lui Pg 4 Col 1 ¶01 discloses outputting a high-resolution image) included in the processing system (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method in general hardware, which is well understood to include processing); and select three local attention operations (Wang Pg 4-5 Section 3.2.1 discloses the attention operations being proportional to the features, including the contextual relationship of the small-scale embeddings to the large-scale embeddings), in the transformer, when a display resolution is set to at least a maximum size (Lui Pg 4 Col 1 ¶01 discloses outputting a high-resolution image) of the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme) and the number of contextual objects is three or more (Wang Pg 4-5 Section 3.2.1 discloses the attention operations being proportional to the features, including the contextual relationship of the small-scale embeddings to the large-scale embeddings; therefore, if there are three contextual features, there will be three attention operators). See Claim 1, its parent claim, for the motivation-to-combine rationale.
Regarding Claim 8, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 3, wherein, to select the number of local attention operations (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply), the one or more processors (Lui Pg 2 Col 1 ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware, which is well understood to include processors) are configured to: obtain a display resolution of a display device (Lui Pg 4 Col 1 ¶01 discloses outputting a high resolution image) included in the processing system (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method in general hardware, which is well understood to include processing); and select two local attention operations, in the transformer (Wang Pg 4-5 Section 3.2.1 discloses attention operations that are proportional to the features, including the contextual relationship between the small-scale embeddings and the large-scale embeddings), when a display resolution is set to less than a maximum size (Lui Pg 4 Col 1 ¶01 discloses outputting a high resolution image) of the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme) and the number of contextual objects is two (Wang Pg 4-5 Section 3.2.1 discloses the attention operations being proportional to the features; therefore, if there are two contextual features there will be two attention operations). See Claim 1, its parent claim, for the rationale.
Regarding Claim 9, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 3, wherein, to select the number of local attention operations (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply), the one or more processors (Lui Pg 2 Col 1 ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware, which is well understood to include processors) are configured to: obtain a display resolution of a display device (Lui Pg 4 Col 1 ¶01 discloses outputting a high resolution image) included in the processing system (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method in general hardware, which is well understood to include processing); and select one local attention operation, in the transformer (Wang Pg 4-5 Section 3.2.1 discloses attention operations that are proportional to the features, including the contextual relationship between the small-scale embeddings and the large-scale embeddings), when a display resolution is set to less than a maximum size (Lui Pg 4 Col 1 ¶01 discloses outputting a high resolution image) of the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme) and the number of contextual objects is one (Wang Pg 4-5 Section 3.2.1 discloses the attention operations being proportional to the features; therefore, if there is one contextual feature there will be one attention operation). See Claim 1, its parent claim, for the rationale.
Regarding Claim 10, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 3, wherein, to select the number of local attention operations (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply), the one or more processors (Lui Pg 2 Col 1 ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware, which is well understood to include processors) are configured to: obtain a display resolution of a display device (Lui Pg 4 Col 1 ¶01 discloses outputting a high resolution image) included in the processing system (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method in general hardware, which is well understood to include processing); and select one local attention operation, in the transformer (Wang Pg 4-5 Section 3.2.1 discloses attention operations that are proportional to the features, including the contextual relationship between the small-scale embeddings and the large-scale embeddings), when a display resolution is set to a smallest size (Lui Pg 4 Col 1 ¶01 discloses outputting a high resolution image) of the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme) and the number of contextual objects is one (Wang Pg 4-5 Section 3.2.1 discloses the attention operations being proportional to the features; therefore, if there is one contextual feature there will be one attention operation). See Claim 1, its parent claim, for the rationale.
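Read together, claims 7 through 10 recite a small decision rule keyed to display resolution and the number of contextual objects. The sketch below is one hypothetical way to express that rule; the function name, parameters, and comparison structure are illustrative assumptions and are not taken from the claims' actual implementation, the cited art, or the record.

```python
def select_ops_for_display(display_res: int, max_size: int,
                           num_contextual_objects: int) -> int:
    """Hypothetical composite of the conditions recited in claims 7-10.

    - Claim 7: resolution at least the maximum transformed-image size and
      three or more contextual objects -> three local attention operations.
    - Claim 8: resolution below the maximum size and two contextual
      objects -> two local attention operations.
    - Claims 9-10: resolution below the maximum size (down to a smallest
      size) and one contextual object -> one local attention operation.
    """
    if display_res >= max_size and num_contextual_objects >= 3:
        return 3
    if display_res < max_size and num_contextual_objects == 2:
        return 2
    # Claims 9-10: a single contextual object yields a single operation
    return 1
```

The point of the sketch is that the claimed selection depends jointly on both inputs, which is why the rejection cites Lui for the resolution limitation and Wang for the object-count limitation.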
Regarding Claim 11, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 1, wherein the selected number of local attention operations is directly proportional (Wang Pg 2 ¶03-04 and ¶07 disclose a vision transformer where the number of local attention operations scales proportionally with the spatial size of the input feature) to the size of the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme, which in this case is the input). See Claim 1, its parent claim, for the rationale.

Regarding Claim 12, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 11, wherein, to select the number of local attention operations (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply), the one or more processors (Lui Pg 2 Col 1 ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware, which is well understood to include processors) are configured to select at least two local attention operations (Wang Pg 4-5 Section 3.2.1 discloses attention operations that are proportional to the features, including the contextual relationship between the small-scale embeddings and the large-scale embeddings) based on a determination that the size satisfies a defined threshold (Wang Pg 2 ¶03-04 and ¶07 disclose a vision transformer where the number of local attention operations scales proportionally with the spatial size of the input feature). See Claim 1, its parent claim, for the rationale.
Regarding Claim 13, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 1, wherein the number of local attention operations is selected (Wang Pg 4-5 Section 3.2.1 discloses attention operations that are proportional to the features, including the contextual relationship between the small-scale embeddings and the large-scale embeddings) based further on a resolution of a display (Lui Pg 4 Col 1 ¶01 discloses outputting a high resolution image) that will be used to display output of the machine learning model (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method in general hardware, which is well understood to include processing and displaying). See Claim 1, its parent claim, for the rationale.

Regarding Claim 14, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 13, wherein the selected number of local attention operations is directly proportional (Wang Pg 4-5 Section 3.2.1 discloses attention operations that are proportional to the features, including the contextual relationship between the small-scale embeddings and the large-scale embeddings) to the resolution (Lui Pg 3 Col 2 ¶01 discloses multiple resolution feature maps). See Claim 1, its parent claim, for the rationale.
Regarding Claim 15, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 1, further comprising a camera coupled to the one or more processors (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method using general hardware, which is well understood to include processing and image capture, since images are the subject of the processing), wherein the camera is configured to capture image data (Lui Fig 2 discloses image pixels), and wherein the one or more processors (Lui Pg 2 Col 1 ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware, which is well understood to include processors) are configured to transform the image data to generate the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme). See Claim 1, its parent claim, for the rationale.

Regarding Claim 16, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 1, further comprising a transmitter coupled to the one or more processors (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method using general hardware, which is well understood to include processing and a transmitter), wherein the transmitter is configured to transmit the transformer output (Lui Fig 2, Pg 2 Col 1 ¶01 discloses the output of the transformer being new windows where an additional self attention layer can be applied) to a receiver (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method using general hardware, which is well understood to include processing and a receiver). See Claim 1, its parent claim, for the rationale.
Regarding Claim 17, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 1, wherein the one or more processors (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method using general hardware, which is well understood to include processors) are configured to generate an output prediction of the machine learning model (Rao Section 3.4 discloses prediction models to determine the most influential tokens on the model) based at least in part on the transformer output (Lui Fig 2, Pg 2 Col 1 ¶01 discloses the output of the transformer being new windows where an additional self attention layer can be applied). See Claim 1, its parent claim, for the rationale.

Regarding Claim 18, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 17, further comprising a display coupled to the one or more processors (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method in general hardware, which is well understood to include processing and displaying), wherein the display is configured to display (Lui Pg 3 Col 1 ¶02 and Pg 2 Col 2 ¶03 disclose implementation of the method in general hardware, which is well understood to include a display) the output prediction (Rao Section 3.4 discloses prediction models to determine the most influential tokens on the model). See Claim 1, its parent claim, for the rationale.

Regarding Claim 19, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 17, wherein the output prediction (Rao Section 3.4 discloses prediction models to determine the most influential tokens on the model) comprises at least one of: a depth map (Rao Fig 5 discloses the result of sparsification of tokens), a classification (Lui Pg 3 Col 1-2 ¶04 discloses image classification), or a segmentation map (Lui Pg 3 Col 1-2 ¶04 discloses segmentation). See Claim 1, its parent claim, for the rationale.
Regarding Claim 20, Lui in view of Rao, and further in view of Wang, teaches the processing system of claim 1, wherein, to generate the transformer output (Lui Fig 2, Pg 2 Col 1 ¶01 discloses the output of the transformer being new windows where an additional self attention layer can be applied), the one or more processors (Lui Pg 2 Col 1 ¶02 discloses that all query patches within a window share the same key set, which facilitates memory access in hardware, which is well understood to include processors) are configured to: generate a first local attention output (Lui Fig 2, Pg 2 Col 1 ¶01 discloses the output of the transformer being new windows where an additional self attention layer can be applied) based on processing the transformed version of image pixels (Lui Fig 1 and Fig 2 disclose the transformed version of image pixels by merging image patches in a window partitioning scheme) using a first sliced (Wang Section 3.2.1 discloses splitting local attention and global attention) local attention operation (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply) at a first scale (Wang Pg 2 ¶02 discloses local attention and global attention split at each scale); generate a second local attention output (Lui Fig 2, Pg 2 Col 1 ¶01 discloses the output of the transformer being new windows where an additional self attention layer can be applied) based on the first local attention output and a second (Wang Section 3.2.1 discloses splitting local attention and global attention) sliced local attention operation (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply) at a second scale (Wang Pg 2 ¶02 discloses local attention and global attention split at each scale, and the scale depends on the input); generate a global attention output (Rao Section 3.2 discloses the global features containing the context of the whole image) based on the second local attention output and a global attention operation (Rao Section 3.1 ¶01 and Figure 2 disclose a vision transformer that dynamically selects, at runtime, which attention operation to apply; Rao Pg 4 Section 3.2 discloses the local and global attention operations being applied); and generate the transformer output based on the first local attention output, the second local attention output, and the global attention output (Rao Section 3.2 discloses the global features containing the context of the whole image). See Claim 1, its parent claim, for the rationale.

References Cited

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. US-20230401716-A1 to Wang discloses a system and method for image segmentation using convolutional self-attention models.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to RACHEL LYNN ROBERTS, whose telephone number is (571) 272-6413. The examiner can normally be reached Monday-Friday, 7:30am-5:00pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Oneal Mistry, can be reached at (313) 446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RACHEL L ROBERTS/
Examiner, Art Unit 2674

/Ross Varndell/
Primary Examiner, Art Unit 2674

Prosecution Timeline

Mar 22, 2024
Application Filed
Mar 17, 2026
Non-Final Rejection — §101, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12581132
LARGE-SCALE POINT CLOUD-ORIENTED TWO-DIMENSIONAL REGULARIZED PLANAR PROJECTION AND ENCODING AND DECODING METHOD
2y 5m to grant · Granted Mar 17, 2026
Patent 12569208
PET APPARATUS, IMAGE PROCESSING METHOD, AND NON-TRANSITORY COMPUTER-READABLE STORAGE MEDIUM
2y 5m to grant · Granted Mar 10, 2026
Patent 12564324
IMAGE PROCESSING APPARATUS AND IMAGE PROCESSING SYSTEM FOR ABNORMALITY DETECTION
2y 5m to grant · Granted Mar 03, 2026
Patent 12561773
METHOD AND APPARATUS FOR PROCESSING IMAGE, ELECTRONIC DEVICE, CHIP AND MEDIUM
2y 5m to grant · Granted Feb 24, 2026
Patent 12525028
CONTACT OBJECT DETECTION APPARATUS AND NON-TRANSITORY RECORDING MEDIUM
2y 5m to grant · Granted Jan 13, 2026
Based on the examiner's 5 most recent grants with similar technology.

Prosecution Projections

1-2
Expected OA Rounds
90%
Grant Probability
99%
With Interview (+14.3%)
2y 10m
Median Time to Grant
Low
PTA Risk
Based on 19 resolved cases by this examiner. Grant probability derived from career allow rate.
