Last updated: April 18, 2026

Application No. 18/305,097

DYNAMIC ADDITIVE ATTENTION ADAPTION FOR MEMORY-EFFICIENT MULTI-DOMAIN ON-DEVICE LEARNING

Final Rejection §101 §103

Filed: Apr 21, 2023

Examiner: HICKS, AUSTIN JAMES

Art Unit: 2142

Tech Center: 2100 (Computer Architecture & Software)

Assignee: Arizona Board of Regents

OA Round: 2 (Final)

Interview Optional

+25.1% interview lift. This examiner has a relatively high allow rate; a written response may suffice.

Based on 403 resolved cases, 2023–2026

Examiner Intelligence

HICKS, AUSTIN JAMES

Grants 76% — above average

Career Allow Rate

308 granted / 403 resolved

+21.4% vs TC avg

Strong +25% interview lift

Interview Lift: +25.1% (resolved cases with interview, compared with those without)

Typical timeline

Avg Prosecution: 3y 4m (54 currently pending)

Career history

Total Applications: 457 (across all art units)
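The career figures above can be cross-checked with simple arithmetic. The following is a minimal sketch, assuming that "resolved" means granted plus all other disposed cases and that "pending" is total applications minus resolved cases; the variable names are illustrative and do not come from the page.

```python
# Figures copied from the examiner dashboard above.
granted = 308
resolved = 403
total_applications = 457

# Career allow rate as a fraction of resolved cases.
allow_rate = granted / resolved

# Assumption: pending = total applications minus resolved cases.
pending = total_applications - resolved

print(f"Career allow rate: {allow_rate:.1%}")
print(f"Pending applications: {pending}")
```

Under these assumptions the pending count agrees with the "54 currently pending" figure shown on the page, and the allow rate rounds to the 76% shown.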

Statute-Specific Performance

§101: 13.9% (-26.1% vs TC avg)

§103: 46.3% (+6.3% vs TC avg)

§102: 17.3% (-22.7% vs TC avg)

§112: 19.2% (-20.8% vs TC avg)

The Tech Center average is an estimate. Based on career data from 403 resolved cases.

Office Action

§101 §103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 3/13/2026 with respect to the 103 rejection have been fully considered but are not persuasive.
Applicant argues, “[Urtasun] is different from the claimed system, which takes place within the machine learning model, and is not merely a pre-processing step for the input data. The claimed input feature map is obtained from the output of either the input layer or one of the hidden layers of the machine learning model, and from it, the basic adaptor creates a basic adaptor output and a binary mask for augmenting the input feature map prior to passing it to the next layer in the machine learning algorithm.” Remarks 8. Urtasun teaches this in fig. 5 below. Urtasun para 171 “the attention generator 406 can use input features 504 as input to a machine-learned model 508 with a plurality of layers.” Urtasun para 172, “The machine-learned model 506 can output an attention mask.”

[Figure: media_image1.png (Urtasun fig. 5)]


Claim Objections
	All claim objections are withdrawn, thank you.

Claim Rejections - 35 USC § 101
	All 101 rejections are withdrawn, thank you.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 6, 9, 11-13 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over US20210278852A1 to Urtasun et al and US 10,803,885 B1 to Kao et al.
Claims 7-8 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over US20210278852A1 to Urtasun et al, US 10,803,885 B1 to Kao et al and https://mlnotebook.github.io/post/CNN1/ (CNN Basics).
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over US20210278852A1 to Urtasun et al, US 10,803,885 B1 to Kao et al and US 20210216871 A1 to Zhang et al.
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over US20210278852A1 to Urtasun et al, US 10,803,885 B1 to Kao et al and US 20210232909 A1 to Tan et al.

Urtasun teaches claim 6. A computing device comprising:
a processor; and
a non-transitory computer-readable medium with instructions stored thereon, which when executed by the processor, perform steps comprising:
providing a dynamic additive attention module for a deep neural network, comprising:
an adaptor configured to accept an input activation from an input layer or a hidden layer of the deep neural network, and provide an output activation to a subsequent layer of the deep neural network, the adaptor comprising a spatial attention module and a basic adaptor module; (Urtasun fig. 5, see below. Hidden and input layers feeding into model 506 508.)
the spatial attention module configured to accept a down-sampled copy of the input activation and to calculate a soft attention value and a binarized soft attention value; (Urtasun para 47 “the mask generation system can down-sample the voxel grid representation.” Voxel grid is the input feature map. Urtasun para 50 “The mask generation system can also employ the Gumbel SoftMax technique to convert each scalar value to either a one or a zero. To do so, the mask generation system can add Gumbel noise to each scalar value. In this example, i and j denote spatial coordinates and z1,1 represent the scalar output from the CNN at i,j coordinate.” The scalar value conversion to a one or zero is turning each scalar weight into a binary weight.)
the basic adaptor module configured to accept the down-sampled copy of the input activation and the soft attention value, and calculate an up-sampled, (Urtasun para 78 “an attention weighted feature map. To do so, the vehicle computing system can generate a sparse feature map by multiplying the voxel grid representation and the attention mask.” Attention mask includes the soft attention value. The input feature map is multiplied by the attention mask. Urtasun para 47 “convolutional neural network itself can include multiple down-sample and up-sample stages.”)
a spatial attention module configured to spatially sample the input activation. (Urtasun fig. 5 below.)

[Figure: media_image2.png (Urtasun fig. 5)]

	Urtasun doesn’t teach upsampled weighted input activation.
	However, Kao teaches up-sampled, weighted input activation.
(Kao 8:40 “the system may weight each frame vector by its score. For example, a first upsampled data frame [F1] may be multiplied by its respective score S1 to obtain a weighted upsampled data frame…” The weighted upsampled data frame is the upsampled weighted input activation.)
	Urtasun, Kao and the claims all upsample and down sample data to process it for a classifier. It would have been obvious to a person having ordinary skill in the art, at the time of filing, to upsample the data after downsampling so that the data fed to the classifier is “effectively smoothed” after having been chopped up by the downsampler. Kao 6:25.
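The Kao passage cited above weights each upsampled frame vector by its score, and the action later quotes Kao as summing the weighted frames into composite frame data. The following is a minimal sketch of that arithmetic, assuming frames are equal-length lists of floats with one scalar score per frame; the function name `weight_and_sum` and the sample values are hypothetical and are not taken from Kao.

```python
def weight_and_sum(frames, scores):
    """Multiply each upsampled frame by its score, then sum element-wise."""
    if len(frames) != len(scores):
        raise ValueError("need exactly one score per frame")
    # Weighted upsampled frames: score_k * frame_k for every frame k.
    weighted = [[score * x for x in frame] for frame, score in zip(frames, scores)]
    # Composite frame: element-wise sum over all weighted frames.
    composite = [sum(column) for column in zip(*weighted)]
    return weighted, composite


# Hypothetical two-frame example.
frames = [[1.0, 2.0], [3.0, 4.0]]
scores = [0.5, 2.0]
weighted, composite = weight_and_sum(frames, scores)
print(weighted)
print(composite)
```

This illustrates only the weighting and summation that the quoted text describes; how the scores are produced is outside what the action quotes.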

Urtasun teaches claim 7. The computing device of claim 6, wherein the dynamic additive attention module comprises a (Urtasun para 47 “the mask generation system can down-sample the voxel grid representation.” Voxel grid is the input activation.)
	Urtasun doesn’t teach a 2x2.
	However, CNN basics teaches 2x2. (CNN basics “The pooling layer is key to making sure that the subsequent layers of the CNN are able to pick up larger-scale detail than just edges and curves. It does this by merging pixel regions in the convolved image together (shrinking the image) before attempting to learn kernels on it. Effectlively, this stage takes another kernel, say [2 x 2] and passes it over the entire image, just like in convolution. It is common to have the stride and kernel size equal i.e. a [2 x 2] kernel has a stride of 2. This example will half the size of the convolved image.”)
	The claims, Urtasun and CNN basics all describe downsampling input. It would have been obvious to a person having ordinary skill in the art, at the time of filing, to use a 2x2 because a 2x2 kernel size is common and it will halve the size of the convolved image. [1]

Urtasun teaches claim 8. The computing device of claim 7, wherein the  (Urtasun para 47 “the mask generation system can down-sample the voxel grid representation.” Voxel grid is the input feature map.)
	Urtasun doesn’t teach a 2x2 average pooling.
	However, Kao teaches average pooling. (Kao 10:40 “An average pooling block may then be used to further reduce the dimensionality…”)
	Urtasun and Kao don’t teach 2x2.
	However, CNN basics teaches 2x2. (CNN basics “The pooling layer is key to making sure that the subsequent layers of the CNN are able to pick up larger-scale detail than just edges and curves. It does this by merging pixel regions in the convolved image together (shrinking the image) before attempting to learn kernels on it. Effectlively, this stage takes another kernel, say [2 x 2] and passes it over the entire image, just like in convolution. It is common to have the stride and kernel size equal i.e. a [2 x 2] kernel has a stride of 2. This example will half the size of the convolved image.”)
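The combination relied on for claim 8 is average pooling (Kao) with a [2 x 2] kernel and a stride of 2 (CNN basics). The following is a minimal sketch of that operation on a small grid, assuming even height and width; the function name `avg_pool_2x2` and the sample values are illustrative and are not drawn from either reference.

```python
def avg_pool_2x2(image):
    """Average-pool a 2D grid with a 2x2 kernel and a stride of 2."""
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("this sketch assumes even height and width")
    # Each output cell is the mean of one non-overlapping 2x2 block.
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = avg_pool_2x2(image)
print(pooled)  # [[3.5, 5.5], [11.5, 13.5]]
```

The 4 x 4 input becomes a 2 x 2 output, which is the halving of each spatial dimension that the quoted passage describes.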

Urtasun teaches claim 9. (Currently Amended) A computing device, comprising: a processor; and a non-transitory computer-readable medium with instructions stored thereon, which when executed by the processor, perform steps comprising: deploying a machine learning model having an input layer, a plurality of hidden layers, and an output layer; (Urtasun fig. 4 and attention generator 406.) 
training the machine learning model for a new domain by performing steps comprising: (Urtasun para 174 “By using the Gumbel SoftMax technique, the attention generator 406 can enable differentiation to be used on this stage of the process while the CNN is being trained.”)
obtaining an output of an input layer or a hidden layer of the machine learning model as an input feature map; (Urtasun fig. 5 shows the attention generator 406 with an input layer and a hidden layer feeding into the machine learning model 506, 508.)

[Figure: media_image3.png (Urtasun fig. 5)]

down-sampling the input feature map to provide a basic adaptor input; (Urtasun para 47 “the mask generation system can down-sample the voxel grid representation.” Voxel grid is the input feature map.)

[Figure: media_image4.png]

calculating a soft attention from the down-sampled input feature map; (Urtasun para 50 “The mask generation system can also employ the Gumbel SoftMax technique to convert each scalar value to either a one or a zero. To do so, the mask generation system can add Gumbel noise to each scalar value. In this example, i and j denote spatial coordinates and z1,1 represent the scalar output from the CNN at i,j coordinate.” The soft attention is the Gumbel noise added to each scalar.)
binarizing the soft attention to obtain a set of binary weighting values selected from 0 and 1; (Urtasun para 50 “The mask generation system can also employ the Gumbel SoftMax technique to convert each scalar value to either a one or a zero. To do so, the mask generation system can add Gumbel noise to each scalar value. In this example, i and j denote spatial coordinates and z1,1 represent the scalar output from the CNN at i,j coordinate.” The scalar value conversion to a one or zero is turning each scalar weight into a binary weight.)
multiplying the basic adaptor input by the soft attention to provide a weighted basic adaptor input; (Urtasun para 78 “an attention weighted feature map. To do so, the vehicle computing system can generate a sparse feature map by multiplying the voxel grid representation and the attention mask.” The product of the voxel and the attention mask is the weighted basic adaptor input.)
up-sampling (Urtasun para 47 “convolutional neural network itself can include multiple down-sample and up-sample stages.”)
adding the (Urtasun para 54 “The attention weighted feature map can then be added back with the voxel grid representation data.” The input feature map is the voxel grid. The adapted feature map “intermediate feature map” (Id.) is the weighted feature map plus the voxel grid.)
multiplying the adapted feature map by the set of binary weighting values; and (Urtasun para 55 “The intermediate feature map can be used as input to a subsequent residual block. If so, it can again be multiplied with the binary attention mask. The cost map generation system can be designed to repeat any number of times and will include that number of residual blocks. Once the total number of residual block cycles have been completed, the output of the cost map generation system can be a final attention weighted feature map.”)
providing the multiplied adapted feature map to a subsequent layer of the machine learning model. (Urtasun para 56 “The final attention weighted feature map can be used by the cost map generation system as input to one or more machine learned models.”) 
Urtasun doesn’t explicitly up-sample the weighted basic adaptor input to provide a basic adaptor output and then add the basic adaptor output.
	However, Kao teaches up-sampling the weighted basic adaptor input to provide a basic adaptor output; (Kao 8:40 “the system may weight each frame vector by its score. For example, a first upsampled data frame [F1] may be multiplied by its respective score S1 to obtain a weighted upsampled data frame…” The weighted upsampled data frame is the basic adaptor output.)
adding the basic adaptor output to the input feature map (Kao 8:40 “The weighted upsampled data frames (for all frames 0 through N of the input audio data 302) may be summed together to determine composite upsampled frame data…” Kao 8-9:65-5 “The combiner 448 may be used to determine the composite upsampled data frame and weighted composite upsampled data frame 444.”)
	Urtasun, Kao and the claims all upsample and down sample data to process it for a classifier. It would have been obvious to a person having ordinary skill in the art, at the time of filing, to upsample the data after downsampling so that the data fed to the classifier is “effectively smoothed” after having been chopped up by the downsampler. Kao 6:25.
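The sequence of steps recited in claim 9 can be sketched end to end on a small feature map. This is a minimal illustration under several stated assumptions, not the applicant's or either reference's implementation: a sigmoid stands in for the soft attention calculation, down-sampling is 2x2 average pooling, up-sampling is nearest-neighbour, and the binary weighting values are up-sampled so that they match the adapted feature map's shape. All function names are hypothetical.

```python
import math


def avg_pool_2x2(x):
    """Down-sample by averaging non-overlapping 2x2 blocks (stride 2)."""
    return [
        [(x[r][c] + x[r][c + 1] + x[r + 1][c] + x[r + 1][c + 1]) / 4.0
         for c in range(0, len(x[0]), 2)]
        for r in range(0, len(x), 2)
    ]


def upsample_2x(x):
    """Nearest-neighbour up-sampling: repeat every value in a 2x2 block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def elementwise(op, a, b):
    """Apply a binary operation cell by cell to two equally sized grids."""
    return [[op(p, q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def adapt(feature_map, threshold=0.5):
    small = avg_pool_2x2(feature_map)                    # basic adaptor input
    # Assumption: a sigmoid stands in for the soft attention calculation.
    soft = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in small]
    binary = [[1 if s > threshold else 0 for s in row] for row in soft]
    weighted = elementwise(lambda v, s: v * s, small, soft)
    adaptor_out = upsample_2x(weighted)                  # basic adaptor output
    adapted = elementwise(lambda a, b: a + b, adaptor_out, feature_map)
    # Assumption: the binary values are up-sampled to the full resolution.
    mask = upsample_2x(binary)
    return elementwise(lambda a, m: a * m, adapted, mask)


feature_map = [
    [2.0, 2.0, -2.0, -2.0],
    [2.0, 2.0, -2.0, -2.0],
    [-2.0, -2.0, -2.0, -2.0],
    [-2.0, -2.0, -2.0, -2.0],
]
output = adapt(feature_map)
for row in output:
    print(row)
```

In this example only the top-left 2x2 region has a soft attention above 0.5, so only that region survives the final multiplication by the binary values; the result would then be passed to the subsequent layer.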

Urtasun teaches claim 10. The computing device of claim 9, wherein the computing device is an (Urtasun fig. 4)
	Urtasun doesn’t teach an edge computing device.
	However, Zhang teaches an edge computing device. (Zhang para 25 “thus effectively “lowering the bar” needed to run a complex CNN on a resource restricted hardware device, such as edge computing device in IoT systems.”)
	Urtasun, Zhang and the claims are all directed to CNNs. It would have been obvious to a person having ordinary skill in the art, at the time of filing, to run on an edge device to expand the use cases for a CNN.

Urtasun teaches claim 11. The computing device of claim 9, wherein the computing device is a resource-limited processor. (Urtasun fig. 4. This claim element covers all processors.)

Urtasun teaches claim 12. The computing device of claim 9, wherein the soft attention is calculated using a Gumbel-softmax function. (Urtasun para 50 “The mask generation system can also employ the Gumbel SoftMax technique…”)
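The Gumbel-softmax function named in claim 12 and in Urtasun para 50 is commonly described as adding Gumbel noise to each score and then applying a temperature-scaled softmax. The following is a minimal sketch of that general technique, assuming two logits per location (one for "keep", one for "drop"); the function name, the temperature value, and the sample logits are assumptions and do not come from Urtasun or the application.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed one-hot sample over the given logits."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 to avoid log(0).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))      # standard Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the noisy, scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)                        # seeded for repeatability
probs = gumbel_softmax([2.0, 0.0], temperature=0.5, rng=rng)
print(probs)
```

The output is a pair of values that sum to one; lower temperatures push the pair closer to a hard one or zero, which is the property that allows gradients to flow during training while approximating a binary choice.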

Urtasun teaches claim 13. The computing device of claim 9, wherein the binarization comprises a thresholding function. (Urtasun para 51 teaches the threshold as αi,j(1), “When the inference is made, the mask generation system can generate the hard attention value (e.g., a binary value) by comparing the values as follows:” 
[Equation image: media_image5.png]
)

Urtasun teaches claim 14. The computing device of claim 9, wherein the method further comprises the step of (Urtasun para 71 “The above factors can be included in the loss calculation and applied to alter the weights and biases of the multi-stage machine-learned model at each stage.”)
	Urtasun doesn’t teach a frozen layer.
	However, Tan teaches freezing. (Tan para 35 “the freeze-out component 108 freezes one or more layers of the neural network selected by the selection component 106 so that weights of output connections from the one or more frozen layers will not be updated for a training run.” The weights are the learned parameters with a multiplicative relationship.)
	The claims, Urtasun and Tan are all neural networks. It would have been obvious to a person having ordinary skill in the art, at the time of filing, to freeze layers of a neural network because freezing layers speeds up training.
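The freezing described in the Tan passage, where weights of frozen layers are not updated during a training run, can be sketched as a gradient step that skips frozen layers. This is a minimal illustration assuming plain gradient descent; the layer names `backbone` and `adaptor`, the learning rate, and the gradient values are hypothetical and are not taken from Tan or the claims.

```python
def sgd_step(layers, grads, lr=0.1):
    """Apply one gradient-descent update, skipping frozen layers."""
    for name, layer in layers.items():
        if layer["frozen"]:
            continue  # frozen weights are left unchanged for this run
        layer["weights"] = [w - lr * g for w, g in zip(layer["weights"], grads[name])]


# Hypothetical two-layer model: one frozen layer, one trainable layer.
layers = {
    "backbone": {"weights": [1.0, -1.0], "frozen": True},
    "adaptor": {"weights": [0.5, 0.5], "frozen": False},
}
grads = {"backbone": [1.0, 1.0], "adaptor": [1.0, -1.0]}

sgd_step(layers, grads)
print(layers["backbone"]["weights"])
print(layers["adaptor"]["weights"])
```

Because the frozen layer is skipped, no update work is spent on it, which is consistent with the stated rationale that freezing layers speeds up training.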

Urtasun teaches claim 15. (New) The computing device of claim 9, wherein the step of down-sampling the input feature map comprises (Urtasun para 47 “the mask generation system can down-sample the voxel grid representation.” Voxel grid is the input activation.)
	Urtasun doesn’t teach a 2x2.
	However, CNN basics teaches 2x2. (CNN basics “The pooling layer is key to making sure that the subsequent layers of the CNN are able to pick up larger-scale detail than just edges and curves. It does this by merging pixel regions in the convolved image together (shrinking the image) before attempting to learn kernels on it. Effectlively, this stage takes another kernel, say [2 x 2] and passes it over the entire image, just like in convolution. It is common to have the stride and kernel size equal i.e. a [2 x 2] kernel has a stride of 2. This example will half the size of the convolved image.”)
	The claims, Urtasun and CNN basics all describe downsampling input. It would have been obvious to a person having ordinary skill in the art, at the time of filing, to use a 2x2 because a 2x2 kernel size is common and it will halve the size of the convolved image. [2]

Urtasun teaches claim 16. (New) The computing device of claim 9, wherein the step of binarization comprises a thresholding function with a threshold of 0.5. (Urtasun para 50 “The mask generation system can also employ the Gumbel SoftMax technique to convert each scalar value to either a one or a zero. To do so, the mask generation system can add Gumbel noise to each scalar value. In this example, i and j denote spatial coordinates and z1,1 represent the scalar output from the CNN at i,j coordinate.” The scalar value conversion to a one or zero is turning each scalar weight into a binary weight. Urtasun para 182 “can determine whether the respective scalar value exceeds a predetermined threshold value. The predetermined threshold value can be, for example, 0.5.”)
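The thresholding discussed for claim 16 can be sketched as a comparison of each soft attention value against 0.5. This is a minimal illustration, assuming a strict "exceeds" comparison in line with the wording quoted from Urtasun para 182; the function name `binarize` and the sample values are hypothetical.

```python
def binarize(soft_attention, threshold=0.5):
    """Map each soft attention value to 1 if it exceeds the threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in soft_attention]


mask = binarize([[0.9, 0.5], [0.49, 0.51]])
print(mask)  # [[1, 0], [0, 1]]
```

Under the strict comparison assumed here, a value exactly equal to 0.5 maps to zero; a greater-than-or-equal comparison would map it to one, and the quoted text does not settle which applies at the boundary.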



Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
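The dates implied by the reply period can be worked out with calendar arithmetic. The following is a minimal sketch, assuming the mailing date is April 1, 2026 (the date the prosecution timeline on this page lists for the final rejection) and ignoring any adjustment for weekends, federal holidays, or the advisory-action contingency described above. The helper `add_months` is hypothetical.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the same day-of-month a number of months later, clamped to month end."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Assumption: mailing date taken from the page's prosecution timeline.
mailing_date = date(2026, 4, 1)

two_month_mark = add_months(mailing_date, 2)        # first-reply window
shortened_period_end = add_months(mailing_date, 3)  # three-month period
statutory_maximum = add_months(mailing_date, 6)     # six-month outer limit

print(two_month_mark, shortened_period_end, statutory_maximum)
```

On the assumed mailing date, this gives June 1, 2026 for the two-month mark, July 1, 2026 for the end of the shortened statutory period, and October 1, 2026 for the six-month limit. The mailing date stated on the office action itself should control over the date assumed here.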
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Austin Hicks whose telephone number is (571)270-3377. The examiner can normally be reached Monday - Thursday 8-4 PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Mariela Reyes, can be reached at (571) 270-1006. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/AUSTIN HICKS/ Primary Examiner, Art Unit 2142

[1] “The pooling layer is key to making sure that the subsequent layers of the CNN are able to pick up larger-scale detail than just edges and curves. It does this by merging pixel regions in the convolved image together (shrinking the image) before attempting to learn kernels on it. Effectlively, this stage takes another kernel, say [2 x 2] and passes it over the entire image, just like in convolution. It is common to have the stride and kernel size equal i.e. a [2 x 2] kernel has a stride of 2. This example will half the size of the convolved image.” https://mlnotebook.github.io/post/CNN1/#:~:text=The%20pooling%20layer%20is%20key,figure%20below%20shows%20the%20principal.
[2] “The pooling layer is key to making sure that the subsequent layers of the CNN are able to pick up larger-scale detail than just edges and curves. It does this by merging pixel regions in the convolved image together (shrinking the image) before attempting to learn kernels on it. Effectlively, this stage takes another kernel, say [2 x 2] and passes it over the entire image, just like in convolution. It is common to have the stride and kernel size equal i.e. a [2 x 2] kernel has a stride of 2. This example will half the size of the convolved image.” https://mlnotebook.github.io/post/CNN1/#:~:text=The%20pooling%20layer%20is%20key,figure%20below%20shows%20the%20principal.


Prosecution Timeline

Apr 21, 2023: Application Filed

Dec 11, 2025: Non-Final Rejection (§101, §103)

Mar 13, 2026: Response Filed

Apr 01, 2026: Final Rejection (§101, §103) (current)

Precedent Cases

Applications granted by this same examiner with similar technology

17/786,650 (Patent 12591767): NEURAL NETWORK ACCELERATION CIRCUIT AND METHOD. 2y 5m to grant; granted Mar 31, 2026.

17/706,298 (Patent 12554795): REDUCING CLASS IMBALANCE IN MACHINE-LEARNING TRAINING DATASET. 2y 5m to grant; granted Feb 17, 2026.

17/805,674 (Patent 12530630): Hierarchical Gradient Averaging For Enforcing Subject Level Privacy. 2y 5m to grant; granted Jan 20, 2026.

17/559,001 (Patent 12524694): OPTIMIZING ROUTE MODIFICATION USING QUANTUM GENERATED ROUTE REPOSITORY. 2y 5m to grant; granted Jan 13, 2026.

18/959,714 (Patent 12524646): VARIABLE CURVATURE BENDING ARC CONTROL METHOD FOR ROLL BENDING MACHINE. 2y 5m to grant; granted Jan 13, 2026.

Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4

Grant Probability: 76%

With Interview (+25.1%): 99%

Median Time to Grant: 3y 4m

PTA Risk: Moderate

Based on 403 resolved cases by this examiner. Grant probability derived from career allow rate.