Last updated: May 29, 2026

Application No. 18/044,842

Modeling Dependencies with Global Self-Attention Neural Networks

Non-Final OA §103

Filed

Mar 10, 2023

Priority

Sep 16, 2020 — nonprovisional of PCTUS2020050995

Examiner

GODO, MORIAM MOSUNMOLA

Art Unit

2148

Tech Center

2100 — Computer Architecture & Software

Assignee

Google LLC

OA Round

1 (Non-Final)

Interview Optional

— +33.7% interview lift. Interview already conducted in this application's prosecution history. This examiner has a 44% grant rate with +33.7% interview lift. Since an interview has already been tried, recommend written response with narrowed claims based on precedent claim evolution patterns.

Based on 69 resolved cases, 2023–2026

Examiner Intelligence

GODO, MORIAM MOSUNMOLA View full profile →

Grants 44% of resolved cases

Career Allowance Rate

30 granted / 69 resolved

-11.5% vs TC avg

Strong +34% interview lift

Without

With

+33.7%

Interview Lift

resolved cases with interview

Typical timeline

4y 7m

Avg Prosecution

27 currently pending

Career history

118

Total Applications

across all art units

Statute-Specific Performance

§101

1.4%

-38.6% vs TC avg

§103

91.8%

+51.8% vs TC avg

§102

0.6%

-39.4% vs TC avg

§112

5.4%

-34.6% vs TC avg

Black line = Tech Center average estimate • Based on career data from 69 resolved cases

Office Action

§103

DETAILED ACTION
1.	This office action is in response to the Application No. 18044842 filed on  03/10/2023. Claims 1-19 are presented for examination and are currently pending.

Notice of Pre-AIA  or AIA  Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


3.	Claims 1, 3, 4, 6, 8, and 9-19 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. ("Dsanet: Dual self-attention network for multivariate time series forecasting." Proceedings of the 28th ACM international conference on information and knowledge management. 2019) in view of Liu et al. ("PiCANet: Pixel-wise contextual attention learning for accurate saliency detection." IEEE Transactions on Image Processing 29 (2020): 6438-6451, Date of Publication: 23 April 2020) and further in view of Cakaloglu et al. ("Text embeddings for retrieval from a large knowledge base." arXiv:1810.10176v2 [cs.IR] 2 May 2019).

Regarding claim 1, Huang teaches a computing system for performing modeling of dependencies using global self-attention (Figure 1 presents an overview of our proposed DSANet. DSANet utilizes two convolutional structures, namely global temporal convolution and local temporal convolution, … Each vector forms a matrix and then enter an elaborate self-attention module to capture the dependencies between multiple series, pg. 2130, left col., section 4 Methodology), comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store (A computer with Intel i7-8700 CPU, GTX1060 GPU, 6 cores, 32 GB RAM is used to conduct all experiments, pg. 2131, right col., second to the last para.): 
a machine-learned model (Figure 1: Dual Self-Attention Network (DSANet), pg. 2131) configured to receive a model input (Input: Multivariate Series: (X(1), X(2), X(D)) Fig. 1, pg. 2131) and process the model input to generate a model output (output XT+h, Fig. 1), 
wherein the machine-learned model (Figure 1: Dual Self-Attention Network (DSANet), pg. 2131) comprises a content attention layer (Global Temporal Convolution, Attention, Feed Forward layer, Fig. 1, pg. 2131) and 
a positional attention layer (Local Temporal Convolution, Attention, Feed Forward layer, Fig. 1, pg. 2131) configured to operate in parallel with each other (DSANet completely dispenses with recurrence and utilizes two parallel convolutional components, called global temporal convolution and local temporal convolution, to capture complex mixtures of global and local temporal patterns, abstract), and 
wherein the machine-learned model (Figure 1: Dual Self-Attention Network (DSANet), pg. 2131) is configured to perform operations comprising: receiving a layer-input comprising input data that comprises a plurality of content values each associated with one or more context positions (Input: Multivariate Series: (X(1) = X1(1), X2(1), … XT(1); X(2) = X1(2), X2(2), … XT(2) …, Fig. 1, pg. 2131.The Examiner notes X1, X2, … XT are context values associated with context positions X(1), X(2), …); 
generating, by the content attention layer (Global Temporal Convolution, Attention, Feed Forward layer, Fig. 1, pg. 2131), 
one or more output features for each context position based on a global attention operation applied to the content values (In general, an attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and the output are all vectors, pg. 2130, right col., last para.) 
and determining a layer-output (Dense Layer, Fig. 1), based at least in part on the one or more output features for each context position generated by the content attention layer and the attention map generated for each context position by the positional attention layer (In the forecasting stage, we first use a dense layer to combine the outputs of two self-attention modules and get the self-attention based prediction X DT+h ∈ RD pg. 2131, left col., second to the last para., Fig. 1).
Huang does not explicitly teach generating, by the positional attention layer, an attention map for each of the context positions based on one or more of the content values associated with the respective context position and a neighborhood of context positions relative to the respective context position, the positional attention layer comprising at least a column-focused attention sublayer that attends to context positions along a column of each respective context position and a row-focused attention sublayer that attends to context positions along a row of each respective context position; content values independent of the context positions;
Liu teaches generating, by the positional attention layer, an attention map for each of the context positions based on one or more of the content values associated with the respective context position (The proposed PiCANet simultaneously generates an attention map at each pixel over its context region, pg. 2, left col., first para. the Examiner notes each pixel is context position that has a content value at each pixel) and 
a neighborhood of context positions relative to the respective context position (Thus, we enable each pixel (w, h) to “see” the local neighbouring region F¯w,h ∈ RW¯×H¯×C centered at it. pg. 4, left col., first para. The Examiner notes the neighboring region of a pixel is a neighborhood of context positions),
 the positional attention layer comprising at least a column- focused attention sublayer that attends to context positions along a column of each respective context position (Next, the ReNet uses another two LSTMs to scan each column of the obtained feature map in both bottom-up and top-down orders, pg. 3, right col., second para. The Examiner notes second top and bottom feature maps in Figure 2a is the column sublayer) and 
a row-focused attention sublayer that attends to context positions along a row of each respective context position (Specifically, two LSTMs along each row of F scan the pixels one-by-one from left to right and from right to left, respectively, pg. 3, right col., second para. The Examiner notes first top and bottom feature maps in Fig. 2a is the row sublayer);
Since Huang as primary reference desires global temporal convolution and local temporal convolution (abstract) and Liu as secondary reference discloses convolution operations with attending to global or local context (abstract), then,
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Huang to incorporate the teachings of Liu for the benefit of constructing a network by hierarchically embedding (pg. 2, third para.) and learning of global attention, which can help to learn global contrast better and improve model performance (Liu, pg. 2, right col., first para.)
Modified Huang does not explicitly teach content values independent of the context positions;
Cakaloglu teaches content values independent of the context positions;
(As expressed in the equation below, a L-layer bILM computes a set of 2L + 1 representations for each token ti 

    PNG
    media_image1.png
    58
    542
    media_image1.png
    Greyscale

where xLMi , a context-independent token representation, is computed via a CNN over characters, pg. 3, section 3.1 Embedding Model)
	Since Huang as primary reference teaches to embed each univariate series in X into two representation vectors (pg. 2130, left col., last para.), while Cakaloglu as secondary reference teaches achieving a good embedding (pg. 4, second para.), then,
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Huang to incorporate the teachings of Cakaloglu for the benefit of developing embedding accuracy (pg. 5, first para.)

Regarding claim 3, Modified Huang teaches the computing system of claim 1, Huang teaches wherein the global attention operation comprises multiplying the queries, a matrix transpose of the keys with softmax normalization applied to each row, and the values (In the self-attention module following the global temporal convolution, a set of queries, keys, and values are packed together into matrices QG, KG, and VG, obtained by applying projections to the input HG, pg. 2130, right col., last para.); Mathematically, the scaled dot product self-attention computation can be expressed as 

    PNG
    media_image2.png
    58
    400
    media_image2.png
    Greyscale

where dk is the dimension of keys, pg. 2131, left col., first para.)

Regarding claim 4, Modified Huang teaches the computing system of claim 1, Liu teaches wherein the column- focused attention sublayer (The GAP network architecture is shown in Figure 2(a) … Next, the ReNet uses another two LSTMs to scan each column of the obtained feature map in both bottom-up and top-down orders, pg. 3 right col., second para.) and 
row-focused attention sublayer (The GAP network architecture is shown in Figure 2(a) … Specifically, two LSTMs along each row of F scan the pixels one-by one from left to right and from right to left, respectively, pg. 3 right col., second para.) are configured to operate in parallel with each other (The first top and bottom feature maps in Fig. 2a is the column sublayer, and both sublayers are parallel with each other).
The same motivation to combine independent claim 1 applies here.

Regarding claim 6, Modified Huang teaches the computer system of claim 1, Liu teaches wherein the column-focused attention sublayer (The GAP network architecture is shown in Figure 2(a) … Next, the ReNet uses another two LSTMs to scan each column of the obtained feature map in both bottom-up and top-down orders, pg. 3, right col., second para.) and the row-focused attention sublayer (he GAP network architecture is shown in Figure 2(a) … Specifically, two LSTMs along each row of F scan the pixels one-by one from left to right and from right to left, respectively, pg. 3, right col., second para.) each are configured to use learned relative positional embeddings for each respective context position (We present three formulations of the PiCANet via embedding the pixel-wise contextual attention mechanism (abstract); PiCANets sample limited context positions by using dilation or adopting local PiCANets, and thus can be applied to feature maps at various scales, pg. 5, right col., second to the last para.).  
The same motivation to combine independent claim 1 applies here.

Regarding claim 8, Modified Huang teaches the computer system of claim 1, Liu teaches wherein the output of the positional attention layer is determined based at least in part on combining output from each of the attention sublayers (output of first top and bottom feature maps in Fig. 2a (as row sublayer) is combined with output of second top and bottom feature maps in Fig. 2a (as column sublayer); As a result, we obtain an attentive contextual feature map FGAP, pg. 3, right col., second to the last para.).  
The same motivation to combine independent claim 1 applies here.

Regarding claim 9, Modified Huang teaches the computer system of claim 1, Liu teaches wherein the machine-learned model has been trained on a set of labeled training data using supervised learning (we add explicit supervision for the learning of global attention, pg. 2, right col., first para.), 
wherein the supervised learning comprises backpropagating a gradient of a loss function through a plurality of parameters (The whole network is trained end-to-end using stochastic gradient descent (SGD) with momentum. For the weight of each loss term in, we empirically set γ6,γ5 ,...,γ1 as 0.5, 0.5, 0.5, 0.8, 0.8, and 1, respectively, without further tuning, pg. 7, first para.).
The same motivation to combine independent claim 1 applies here.

Regarding claim 10, Modified Huang teaches the computer system of claim 1, Huang teaches wherein the input data comprises at least one of image data, video data, sensor data, audio data, or text data (We use a large time series data set provided by a gas station service company. The data set contains the daily revenue of five gas stations ranging from Dec.1, 2015 to Dec.1, 2018. The stations are geographically close, which means a complex mix of revenue promotion and mutual exclusion exists between them, pg. 2131, left col., last para. the Examiner notes revenue data is text data). 
 
Regarding claim 11, Modified Huang teaches the computer system of claim 1, Liu teaches wherein the machine-learned model has been trained to perform image recognition, image classification, image captioning, scene segmentation, object detection, action recognition, action localization, image synthesis, semantic segmentation, panoptic segmentation, or natural language processing (Furthermore, we demonstrate the effectiveness and generalization ability of the PiCANets on semantic segmentation and object detection with improved performance, abstract).  
The same motivation to combine independent claim 1 applies here.

Regarding claim 12, Modified Huang teaches the computer system of claim 1, Liu teaches wherein the machine-learned model has been trained on a set of ImageNet training data (The VGG16 network is used as our encoder to utilize its parameters pre-trained on ImageNet, pg. 6, left col., second para.).  
The same motivation to combine independent claim 1 applies here.

Regarding claim 13, Modified Huang teaches the computer system of claim 1, Liu teaches wherein the machine-learned model is used as part of backbone processing in a neural network (For object detection, we embed the Pi CANets into the SSD network for experiments. SSD uses the VGG [49] 16-layer network as the backbone, pg. 12, left col., last para.).  
The same motivation to combine independent claim 1 applies here.

Regarding claim 14, Modified Huang teaches the computer system of claim 1, Liu teaches wherein the machine-learned model is used to replace convolutions in a neural network (When using AC, we directly replace the vanilla Conv layer of Conv8 2 with an AC module, pg. pg. 12, right col., first para.); the AC (attention convolution) module generate attention over a local neighboring region¯ Fw,h centered at (w,h), pg. 3, left col., last para.; Fig. 2. (f) show detailed operations of AC).  
The same motivation to combine independent claim 1 applies here.

Regarding claim 15, Modified Huang teaches the computer system of claim 1, Liu teaches wherein a sequence of two or more instances of the machine-learned model (In this section, we present three forms of the proposed PiCANet. Suppose we have a convolutional (Conv) feature map F ∈ RW×H×C, with W, H, and C denoting its width, height and number of channels, respectively. For each location (w,h) in F, the GAP module generates global attention over the entire feature map F, while the LAP module and the AC module generate attention over a local neighbouring region ¯ Fw,h centered at (w,h), pg. 3, left col., last para.; All the three models are fully differentiable and can be integrated with convolutional neural networks with joint training, abstract. the Examiner notes the three instances are GAP module, LAP module and AC module) are implemented as part of a neural network (The results demonstrate that PiCANets can be used as general neural network modules for dense prediction tasks, pg. 2, left col., third para.).  
The same motivation to combine independent claim 1 applies here.

Regarding claim 16, Modified Huang teaches the computer system of claim 1, Liu teaches wherein the sequence of the two or more instances of the machine-learned model (In this section, we present three forms of the proposed PiCANet. Suppose we have a convolutional (Conv) feature map F ∈ RW×H×C, with W, H, and C denoting its width, height and number of channels, respectively. For each location (w,h) in F, the GAP module generates global attention over the entire feature map F, while the LAP module and the AC module generate attention over a local neighbouring region ¯ Fw,h centered at (w,h), pg. 3, left col., last para.; All the three models are fully differentiable and can be integrated with convolutional neural networks with joint training, abstract. the Examiner notes the three instances are GAP module, LAP module and AC module) are arranged consecutively (The three instances are GAP module, LAP module and AC module are arranged consecutively in Fig. 3a) as part of the neural network (The results demonstrate that PiCANets can be used as general neural network modules for dense prediction tasks, pg. 2, left col., third para.).  

Regarding claim 17, Modified Huang teaches the computer system of claim 1, Huang teaches wherein determining the layer-output comprises summing the one or more output features for each context position generated by the content attention layer and the attention map generated for each context position by the positional attention layer (In the forecasting stage, we first use a dense layer to combine the outputs of two self-attention modules and get the self-attention based prediction ˆXDT+h ∈ RD. The final prediction of DSANet ˆXT+h is then obtained by summing the self attention based prediction ˆXDT+h and the AR prediction ˆXLT+h, pg. 2131, left col., second to the last para.).

Regarding claim 18, claim 18 is similar to claim 1. It is rejected in the same manner and reasoning applying.

Regarding claim 19, claim 19 is similar to claim 1. It is rejected in the same manner and reasoning applying. Further, Huang teaches one or more non-transitory computer-readable media storing one or both of: instructions that when executed by a computing system cause the computing system to perform operations, the operations comprising (A computer with Intel i7-8700 CPU, GTX1060 GPU, 6 cores, 32 GB RAM is used to conduct all experiments, pg. 2131, right col., second to the last para.):

4.	Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. ("Dsanet: Dual self-attention network for multivariate time series forecasting." Proceedings of the 28th ACM international conference on information and knowledge management. 2019) in view of Liu et al. ("PiCANet: Pixel-wise contextual attention learning for accurate saliency detection." IEEE Transactions on Image Processing 29 (2020): 6438-6451, Date of Publication: 23 April 2020) in view of Cakaloglu et al. ("Text embeddings for retrieval from a large knowledge base."arXiv:1810.10176v2 [cs.IR] 2 May 2019) and further in view of Li et al. ("Ftrans: energy-efficient acceleration of transformers using fpga." Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design. 2020, August 10–12, 2020).

Regarding claim 2, Modified Huang teaches the computing system of claim 1, Modified Huang does not explicitly teach the limitations of claim 2.
Li teaches wherein the machine-learned model further comprises an input processing layer that generates a plurality of keys, queries, and values derived from the input data (The input consists of queries and keys of dimension dk, and values of dimension dv, pg. 3, right col., third para.). 
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Huang to incorporate the teachings of Li for the benefit of significantly reducing the model size of NLP (Natural Language Processing) models by up to 16 times (Li, abstract).

5.	Claims 5 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. ("Dsanet: Dual self-attention network for multivariate time series forecasting." Proceedings of the 28th ACM international conference on information and knowledge management. 2019) in view of Liu et al. ("PiCANet: Pixel-wise contextual attention learning for accurate saliency detection." IEEE Transactions on Image Processing 29 (2020): 6438-6451, Date of Publication: 23 April 2020) in view of Cakaloglu et al. ("Text embeddings for retrieval from a large knowledge base."arXiv:1810.10176v2 [cs.IR] 2 May 2019) and further in view of Luo et al. (US20200257979 filed 04/29/2020)

Regarding claim 5, Modified Huang teaches the computing system of claim 1, Modified Huang does not explicitly teach the limitations of 5.
Luo teaches wherein the positional attention layer comprises the column-focused attention sublayer (A dimension normalization unit 42 configured to normalize a feature map set output by means of a network layer [0260]; the dimension normalization unit 42 is configured to obtain the spatial dimension mean based on at least one feature map by using a height value … of the at least one feature map [0268]. The Examiner notes the height is the column) followed by a batch normalization layer (Moreover, the normalization layer is added behind each layer of neural network to perform the adaptive normalization operation on each layer of feature map [0210]) that is followed by the row-focused attention sublayer (A dimension normalization unit 42 configured to normalize a feature map set output by means of a network layer [0260]; the dimension normalization unit 42 is configured to obtain the spatial dimension mean based on at least one feature map by using … a width value of the at least one feature map [0268]. The Examiner notes the width is the row).  
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Huang to incorporate the teachings of Luo for the benefit of performing normalization along at least one dimension so that statistics information of each dimension of a normalization operation is covered, thereby ensuring good robustness of statistics in each dimension (Luo, abstract)
 
Regarding claim 7, Modified Huang teaches the computing system of claim 1, Modified Huang does not explicitly teach the limitations of claim 7. 
Luo teaches wherein the positional attention layer comprises the column-focused attention sublayer (A dimension normalization unit 42 configured to normalize a feature map set output by means of a network layer [0260]; the dimension normalization unit 42 is configured to obtain the spatial dimension mean based on at least one feature map by using a height value … of the at least one feature map [0268]. The Examiner notes the height is the column) followed by the batch normalization layer (Moreover, the normalization layer is added behind each layer of neural network to perform the adaptive normalization operation on each layer of feature map [0210]) that is followed by the row-focused attention sublayer (A dimension normalization unit 42 configured to normalize a feature map set output by means of a network layer [0260]; the dimension normalization unit 42 is configured to obtain the spatial dimension mean based on at least one feature map by using … a width value of the at least one feature map [0268]. The Examiner notes the width is the row) that is followed by a second batch normalization layer (Moreover, the normalization layer is added behind each layer of neural network to perform the adaptive normalization operation on each layer of feature map [0210], Fig. 3) that is followed by a time or depth attention sublayer (different normalization operation modes are selected in different network depths due to different visual representations [0203]; at least one normalization layer based on the prediction result [0292]; adjusting parameters of the at least one network layer … based on the prediction result [0209]. The Examiner notes the network layer as depth attention sublayer adjust parameters based on the prediction result from the normalization layer).  
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Huang to incorporate the teachings of Luo for the benefit of performing normalization along at least one dimension so that statistics information of each dimension of a normalization operation is covered, thereby ensuring good robustness of statistics in each dimension (Luo, abstract)

Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle T. Bechtold can be reached on (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/M.G./Examiner, Art Unit 2148                                   
/MICHELLE T BECHTOLD/Supervisory Patent Examiner, Art Unit 2148

Read full office action

Prosecution Timeline

Mar 10, 2023

Application Filed

Feb 19, 2026

Non-Final Rejection mailed — §103

May 13, 2026

Applicant Interview (Telephonic)

May 13, 2026

Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

16/927,018

Patent 12639556

Object-Centric Learning with Slot Attention

5y 10m to grant Granted May 26, 2026

18/583,459

Patent 12608609

MACHINE LEARNING BASED FILE RANKING METHODS AND SYSTEMS

2y 2m to grant Granted Apr 21, 2026

18/919,417

Patent 12602586

SUPERVISORY NEURON FOR CONTINUOUSLY ADAPTIVE NEURAL NETWORK

1y 5m to grant Granted Apr 14, 2026

17/096,425

Patent 12530583

VOLUME PRESERVING ARTIFICIAL NEURAL NETWORK AND SYSTEM AND METHOD FOR BUILDING A VOLUME PRESERVING TRAINABLE ARTIFICIAL NEURAL NETWORK

5y 2m to grant Granted Jan 20, 2026

16/249,279

Patent 12511528

NEURAL NETWORK METHOD AND APPARATUS

6y 11m to grant Granted Dec 30, 2025

Study what changed to get past this examiner. Based on 5 most recent grants.

Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.

Typically takes 5-10 seconds — AI-generated, attorney review required before filing

Prosecution Projections

1-2

Expected OA Rounds

44%

Grant Probability

77%

With Interview (+33.7%)

4y 7m (~1y 4m remaining)

Median Time to Grant

Low

PTA Risk

Based on 69 resolved cases by this examiner. Grant probability derived from career allowance rate.