Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 9 is objected to because of the following informality: claim 9 recites “computing a matrix multiplication between a key matrix generated from the suffix matrix and the query matrix representing the queries to generate a second matrix product.” Neither claim 9 nor any claim dependent on claim 9 recites any other matrix product beyond the “second matrix product”; hence, “second” is superfluous.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 10-11 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Regarding claim 10:
Claim 10 recites “concatenating the first and second matrix products along a row dimension.” Neither claim 10 nor any claim upon which claim 10 depends recites a first matrix product; therefore, “the first … matrix products” lacks antecedent basis. For purposes of examination below, “the first … matrix product” will be interpreted as any matrix product.
Regarding claim 11:
Claim 11 depends from claim 10 and is rejected for the same reason. For purposes of examination below, “the first … matrix product” will be interpreted as any matrix product.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-7 and 9-20 are rejected under 35 U.S.C. 103 as being unpatentable over Shazeer et al., US Pre-Grant Publication No. 2022/0051099 (hereafter Shazeer), in view of Savinov et al., US Pre-Grant Publication No. 2024/0412042 (hereafter Savinov).
Regarding claim 1 and analogous claims 13 and 14:
Shazeer teaches:
“A computer-implemented method comprising”: Shazeer, paragraph 0089, “Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them [A computer-implemented method]. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus.”
(bold only) “receiving a request to generate, from an input sequence comprising a plurality of input tokens, a plurality of output sequences each comprising a respective output token at each of a plurality of output positions”: Shazeer, paragraphs 0083-0085, “The system receives an input sequence (step 310) [receiving a request to generate, from an input sequence comprising a plurality of input tokens]. The system processes the input sequence using the encoder neural network to generate a respective encoded representation of each of the network inputs in the input sequence (step 320). In particular, the system processes the input sequence through the embedding layer to generate an embedded representation of each network input and then process the embedded representations through the sequence of encoder subnetworks to generate the encoded representations of the network inputs. The system processes the encoded representations using the decoder neural network to generate an output sequence (step 330) [generate … a plurality of output sequences each comprising a respective output token at each of a plurality of output positions].”
“generating, by using an auto-regressive generative neural network, the plurality of output sequences from the input sequence”: Shazeer, paragraph 0085, “The decoder neural network is configured to generate the output sequence from the encoded representations in an auto-regressive manner [generating, by using an auto-regressive generative neural network, the plurality of output sequences from the input sequence].”
“wherein the auto-regressive generative neural network comprises a plurality of attention layers”: Shazeer, paragraph 0049, “In particular, each decoder subnetwork 170 includes two different attention sub-layers: a decoder self-attention sub-layer 172 and an encoder-decoder attention sub-layer 174.”
“and wherein the generating comprises, at an attention layer and for a particular output position of the plurality of output positions of each output sequence: maintaining context data comprising (i) a respective embedded representation of each of the plurality of input tokens included in the input sequence and (ii) for each output sequence, a respective embedded representation of an output token at each output position that precedes the particular output position of the output sequence”: Shazeer, paragraph 0023, “The input sequence 102 has a respective network input at each of multiple input positions in an input order and the output sequence 152 has a respective network output at each of multiple output positions in an output order. That is, the input sequence 102 has multiple inputs arranged according to an input order and the output sequence 152 has multiple outputs arranged according to an output order [a particular output position of the plurality of output positions of each output sequence]”; Shazeer, paragraph 0037, “Each encoder subnetwork 130 includes an encoder self-attention sub-layer 132 [an attention layer]. The encoder self-attention sublayer 132 is configured to receive the subnetwork input for each of the plurality of input positions and, for each particular input position in the input order, apply an attention mechanism over the encoder subnetwork inputs at the input positions using one or more queries derived from the encoder subnetwork input at the particular input position to generate a respective output for the particular input position”; Shazeer, paragraphs 0083-0085, “The system receives an input sequence (step 310). The system processes the input sequence using the encoder neural network to generate a respective encoded representation of each of the network inputs in the input sequence (step 320). 
In particular, the system processes the input sequence through the embedding layer to generate an embedded representation of each network input and then process the embedded representations through the sequence of encoder subnetworks to generate the encoded representations of the network inputs [maintaining context data comprising (i) a respective embedded representation of each of the plurality of input tokens included in the input sequence, maintaining context data interpreted as making the data available for later processing]. The system processes the encoded representations using the decoder neural network to generate an output sequence (step 330). That is, the decoder neural network generates one output from the output sequence at each generation time step. At a given generation time step at which a given output is being generated, the system processes the outputs before the given output in the output sequence through the embedding layer in the decoder to generate embedded representations [(ii) for each output sequence, a respective embedded representation of an output token at each output position that precedes the particular output position of the output sequence]. The system then processes the embedded representations through the sequence of decoder subnetworks, the linear layer, and the softmax layer to generate the given output. Because the decoder subnetworks include encoder-decoder attention sub-layers as well as decoder self-attention sub-layers, the decoder makes use of both the already generated outputs and the encoded representations when generating the given output.”
“for each output sequence, receiving a respective embedded representation of the output token at the particular output position within the output sequence”: Shazeer, paragraph 0023, “The input sequence 102 has a respective network input at each of multiple input positions in an input order and the output sequence 152 has a respective network output at each of multiple output positions in an output order. That is, the input sequence 102 has multiple inputs arranged according to an input order and the output sequence 152 has multiple outputs arranged according to an output order [receiving a respective embedded representation of the output token at the particular output position within the output sequence].”
“generating a first set of attention logits that includes a plurality of logit values for each of the plurality of input tokens included in the input sequence”: Shazeer, paragraph 0077, “When the encoder self-attention sub-layer implements multi-head attention, each encoder self-attention layer in the encoder self-attention sub-layer [that is, a plurality of self-attention layers] is configured to: apply a learned query linear transformation to each encoder sub-network input at each input position to generate a respective query for each input position, apply a learned key linear transformation to each encoder subnetwork input at each input position to generate a respective key for each input position, apply a learned value linear transformation to each encoder subnetwork input at each input position to generate a respective value for each input position, and then apply the attention mechanism (i.e., the scaled dot-product attention mechanism described above) using the queries, keys, and values to determine an initial encoder self-attention output for each input position [generating a first set of attention logits that includes a plurality of logit values for each of the plurality of input tokens included in the input sequence]. The sub-layer then combines the initial outputs of the attention layers as described above.”
“comprising applying, using one or more queries derived from the respective embedded representation of the output token at the particular output position, a first attention mechanism over the respective embedded representation of each of the plurality of input tokens included in the input sequence”: Shazeer, Fig. 1,
[showing multi-head attention mechanism 174, applied to output from component 134, which in turn processes values ultimately derived from input embedding 120, hence, applying … a first attention mechanism over the respective embedded representation of each of the plurality of input tokens included in the input sequence; also showing multi-head attention mechanism 174 receiving input from component 160 along with positional encoding, hence, using one or more queries derived from the respective embedded representation of the output token at the particular output position]; Shazeer, paragraph 0051, “Each encoder-decoder attention sub-layer 174, on the other hand, is configured to, at each generation time step, receive an input for each output position preceding the corresponding output position and, for each of the output positions, apply an attention mechanism over the encoded representations at the input positions using one or more queries derived from the input for the output position to generate an updated representation for the output position. Thus, the encoder-decoder attention sub-layer 174 applies attention over encoded representations while the encoder self-attention sub-layer 172 applies attention over inputs at output positions.”
“generating a second set of attention logits that includes, for each output sequence, a logit value for the output token at each output position that precedes the particular output position, comprising applying, using the one or more queries, a second attention mechanism over the respective embedded representation of the output token at each output position that precedes the particular output position of the output sequence”: Shazeer, paragraph 0050, “Each decoder self-attention sub-layer 172 is configured to, at each generation time step, receive an input for each output position preceding the corresponding output position [each output position that precedes the particular output position] and, for each of the particular output positions, apply an attention mechanism over the inputs at the output positions preceding the corresponding position using one or more queries derived from the input at the particular output position [generating a second set of attention logits that includes, for each output sequence, a logit value for the output token at each output position] to generate a updated representation for the particular output position. That is, the decoder self-attention sub-layer 172 applies an attention mechanism that is masked so that it does not attend over or otherwise process any data that is not at a position preceding the current output position in the output sequence.”
“generating, from the first and second sets of attention logits, a respective updated embedded representation of the output token at the particular output position”: Shazeer, Fig. 1,
[showing at step 174 a combination of outputs (logits) from attention mechanisms 132 (generating a first set of attention logits) and 172 (generating a second set of attention logits]; Shazeer, paragraph 0051, “Each encoder-decoder attention sub-layer 174, on the other hand, is configured to, at each generation time step, receive an input for each output position preceding the corresponding output position and, for each of the output positions, apply an attention mechanism over the encoded representations at the input positions using one or more queries derived from the input for the output position to generate an updated representation for the output position [generating, from the first and second sets of attention logits, a respective updated embedded representation of the output token at the particular output position]. Thus, the encoder-decoder attention sub-layer 174 applies attention over encoded representations while the encoder self-attention sub-layer 172 applies attention over inputs at output positions.”
Shazeer does not explicitly teach (bold only) “receiving a request to generate, from an input sequence comprising a plurality of input tokens, a plurality of output sequences each comprising a respective output token at each of a plurality of output positions.”
Savinov teaches “receiving a request to generate, from an input sequence comprising a plurality of input tokens, a plurality of output sequences each comprising a respective output token at each of a plurality of output positions”: Savinov, paragraphs 0020-0021, “As one example, the system can receive a context input as part of a request and generate an output sequence that is a response to the request [receiving a request to generate, from an input sequence]. As a particular example, the system can be part of a dialog system and the context data can be a prompt submitted by a user of the dialog system. As another example, if the context input is a sequence of words i.e. text in one, e.g. natural, language, the output sequence generated by the neural network may be a translation of the input text into another, e.g. natural, language, i.e. a sequence of words that is the translation [an input sequence comprising a plurality of input tokens, a plurality of output sequences each comprising a respective output token at each of a plurality of output positions].”
Savinov and Shazeer are analogous arts as they are both related to generative models. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the user requests of Savinov with the teachings of Shazeer to arrive at the present invention, in order to design a system responsive to the user, as stated in Savinov, paragraph 0020, “As a particular example, the system can be part of a dialog system and the context data can be a prompt submitted by a user of the dialog system.”
Regarding claim 2 and analogous claim 15:
Shazeer as modified by Savinov teaches “The computer-implemented method of claim 1.”
Shazeer further teaches “wherein maintaining the respective embedded representation of each of the plurality of input tokens included in the input sequence comprises: maintaining a prefix matrix having numeric values that represent the respective embedded representation of each of the plurality of input tokens included in the input sequence”: Shazeer, Fig. 1,
[showing at step 120 that inputs are transformed to representative embeddings, which are maintained for later use, e.g., at component 132, hence maintaining a prefix matrix having numeric values that represent the respective embedded representation of each of the plurality of input tokens included in the input sequence].
Regarding claim 3 and analogous claim 16:
Shazeer as modified by Savinov teaches “The computer-implemented method of claim 1.”
Shazeer further teaches “wherein maintaining, for each output sequence, the respective embedded representation of the output token at each output position that precedes the particular output position of the output sequence comprises: maintaining a suffix matrix having numeric values that represent the respective embedded representation of the output token at each output position that precedes the particular output position”: Shazeer, Fig. 1,
[showing at step 160 that outputs are transformed to representative embeddings, which are maintained for later use, e.g., at component 170, hence maintaining a suffix matrix having numeric values that represent the respective embedded representation of the output token at each output position that precedes the particular output position].
Regarding claim 4 and analogous claim 17:
Shazeer as modified by Savinov teaches “The computer-implemented method of claim 2.”
Shazeer further teaches “wherein maintaining the prefix matrix comprises storing the prefix matrix in a memory device”: Shazeer, paragraph 0089, “Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device [storing the prefix matrix in a memory device], or a combination of one or more of them.”
Regarding claim 5 and analogous claim 18:
Shazeer as modified by Savinov teaches “The computer-implemented method of claim 3.”
Shazeer further teaches “wherein the suffix matrix has more rows than the prefix matrix”: Shazeer, Fig. 1,
[showing at step 120 that inputs are transformed to representative embeddings, which are maintained for later use, e.g., at component 132, hence outputting values to a prefix matrix; showing at step 160 that outputs are transformed to representative embeddings, which are maintained for later use, e.g., at component 170, hence outputting values to a suffix matrix]; Shazeer, paragraph 0040, “Once the encoder neural network 110 has generated the encoded representations, the decoder neural network 150 is configured to generate the output sequence in an auto-regressive manner [hence, as the encoder neural network 110 executes once, producing the rows of the prefix matrix, the decoder neural network executes iteratively, producing additional rows of the suffix matrix at each generation step; thus, the suffix matrix has more rows than the prefix matrix].”
Regarding claim 6 and analogous claim 19:
Shazeer as modified by Savinov teaches “The computer-implemented method of claim 1.”
Shazeer further teaches “wherein the plurality of attention layers comprise a masked self-attention layer, and wherein the first and second attention mechanisms are both a masked self-attention mechanism applied by the self-attention layer”: Shazeer, paragraph 0051, “Each encoder-decoder attention sub-layer 174, on the other hand, is configured to, at each generation time step, receive an input for each output position preceding the corresponding output position [masked] and, for each of the output positions, apply an attention mechanism over the encoded representations at the input positions using one or more queries derived from the input for the output position to generate an updated representation for the output position [the first … attention mechanisms are … a masked self-attention mechanism applied by the self-attention layer]. Thus, the encoder-decoder attention sub-layer 174 applies attention over encoded representations while the encoder self-attention sub-layer 172 applies attention over inputs at output positions”; Shazeer, paragraph 0050, “Each decoder self-attention sub-layer 172 is configured to, at each generation time step, receive an input for each output position preceding the corresponding output position [masked] and, for each of the particular output positions, apply an attention mechanism over the inputs at the output positions preceding the corresponding position using one or more queries derived from the input at the particular output position [the … second attention mechanisms are … a masked self-attention mechanism applied by the self-attention layer] to generate a updated representation for the particular output position. That is, the decoder self-attention sub-layer 172 applies an attention mechanism that is masked so that it does not attend over or otherwise process any data that is not at a position preceding the current output position in the output sequence.”
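For illustration only, the masking described in Shazeer, paragraph 0050 (attention that does not attend over any position at or after the current output position), can be sketched in NumPy as follows; the shapes and values are hypothetical and are not code from either reference:

```python
import numpy as np

# Hypothetical sketch of masked self-attention: logits for positions at or
# after the current output position are set to -inf before the softmax, so
# attention weights over those positions are exactly zero.
T = 4                                              # output positions (illustrative)
logits = np.zeros((T, T))                          # raw attention logits
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal
masked = np.where(mask, -np.inf, logits)           # block "future" positions
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
# Row i attends only over positions 0..i; weights beyond i are zero.
```

With all-zero logits, row i of `weights` is uniform over positions 0 through i and zero elsewhere, mirroring the masking behavior quoted above.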
Regarding claim 7 and analogous claim 20:
Shazeer as modified by Savinov teaches “The computer-implemented method of claim 2.”
Shazeer further teaches “wherein applying the first attention mechanism comprises: computing a matrix multiplication between a key matrix generated from the prefix matrix and a query matrix representing the queries to generate a first matrix product”: Shazeer, Fig. 1,
[showing at step 120 that inputs are transformed to representative embeddings, which are maintained for later use, hence a prefix matrix, the values later used in multi-head attention component 174, the first attention mechanism]; Shazeer, paragraph 0058, “FIG. 2 is a diagram 200 showing attention mechanisms that are applied by the attention sub-layers in the subnetworks of the encoder neural network 110 and the decoder neural network 150”; Shazeer, paragraphs 0061-0062, “In operation and as shown in the left hand side of FIG. 2, the attention sub-layer computes the attention over a set of queries simultaneously. In particular, the attention sub-layer packs the queries into a matrix Q, packs the keys into a matrix K, and packs the values into a matrix V. To pack a set of vectors into a matrix, the attention sub-layer can generate a matrix that includes the vectors as the rows of the matrix. The attention sub-layer then performs a matrix multiply (MatMul) between the matrix Q and the transpose of the matrix K to generate a matrix of compatibility function outputs [computing a matrix multiplication between a key matrix generated from the prefix matrix and a query matrix representing the queries to generate a first matrix product].”
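For illustration only, the MatMul described in Shazeer, paragraphs 0061-0062 (queries packed into a matrix Q, keys packed into a matrix K, and a matrix multiply between Q and the transpose of K), can be sketched as follows; the shapes and the learned key transformation are hypothetical stand-ins, not code from Shazeer:

```python
import numpy as np

# Illustrative sketch: keys derived from the prefix matrix, multiplied
# against the query matrix to produce the "first matrix product" of logits.
n_in, n_out, d = 3, 5, 8                        # input positions, output positions, head dim
rng = np.random.default_rng(0)
prefix = rng.standard_normal((n_in, d))         # embedded input tokens ("prefix matrix")
queries = rng.standard_normal((n_out, d))       # queries for the output positions
K = prefix @ rng.standard_normal((d, d))        # key matrix generated from the prefix
first_product = queries @ K.T                   # one logit per (output position, input token)
```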
Regarding claim 9:
Shazeer as modified by Savinov teaches “The computer-implemented method of claim 3.”
Shazeer further teaches “wherein applying the second attention mechanism comprises: computing a matrix multiplication between a key matrix generated from the suffix matrix and the query matrix representing the queries to generate a second matrix product”: Shazeer, Fig. 1,
[showing at step 160 that outputs are transformed to representative embeddings, which are maintained for later use, e.g., at component 172, hence a suffix matrix, the values later used in multi-head attention component 172, the second attention mechanism]; Shazeer, paragraph 0058, “FIG. 2 is a diagram 200 showing attention mechanisms that are applied by the attention sub-layers in the subnetworks of the encoder neural network 110 and the decoder neural network 150”; Shazeer, paragraphs 0061-0062, “In operation and as shown in the left hand side of FIG. 2, the attention sub-layer computes the attention over a set of queries simultaneously. In particular, the attention sub-layer packs the queries into a matrix Q, packs the keys into a matrix K, and packs the values into a matrix V. To pack a set of vectors into a matrix, the attention sub-layer can generate a matrix that includes the vectors as the rows of the matrix. The attention sub-layer then performs a matrix multiply (MatMul) between the matrix Q and the transpose of the matrix K to generate a matrix of compatibility function outputs [computing a matrix multiplication between a key matrix generated from the suffix matrix and the query matrix representing the queries to generate a second matrix product].”
Regarding claim 10:
Shazeer as modified by Savinov teaches “The computer-implemented method of claim 9.”
Shazeer further teaches “wherein generating the respective updated embedded representation of the output token at the particular output position comprises: concatenating the first and second matrix products along a row dimension”: Shazeer, paragraph 0070, “As shown in FIG. 2, the attention sub-layer concatenates (concat) the outputs of the attention layers and applies a learned linear transformation to the concatenated output to generate the output of the attention sub-layer [hence, every attention layer in a multi-headed attention embodiment includes a matrix concatenation]”; Shazeer, Fig. 2,
[showing that scaled dot-product attention includes the matmul between Q and K and is followed by the concat operation]; Shazeer, Fig. 1,
[showing at step 160 that outputs are transformed to representative embeddings, which are maintained for later use, hence a suffix matrix, the matrix used in multi-head attention component 172, hence, concatenating the first and second matrix products along a row dimension, first matrix product interpreted as any concatenated matrix (see 112(b) rejection)].
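For illustration only, the following sketch (with hypothetical shapes and names, not code from Shazeer) shows why concatenating the two matrix products is equivalent to first stacking the prefix-derived and suffix-derived key matrices along their row dimension:

```python
import numpy as np

# Illustrative sketch: Q @ K1.T and Q @ K2.T, concatenated, equal
# Q @ concat(K1, K2).T when K1 and K2 are stacked row-wise.
rng = np.random.default_rng(1)
d, n_prefix, n_suffix, n_q = 4, 3, 2, 5
Q = rng.standard_normal((n_q, d))               # query matrix
K_prefix = rng.standard_normal((n_prefix, d))   # keys from the prefix matrix
K_suffix = rng.standard_normal((n_suffix, d))   # keys from the suffix matrix
first_product = Q @ K_prefix.T                  # shape (n_q, n_prefix)
second_product = Q @ K_suffix.T                 # shape (n_q, n_suffix)
combined = np.concatenate([first_product, second_product], axis=1)
stacked_keys = np.concatenate([K_prefix, K_suffix], axis=0)  # rows stacked
assert np.allclose(combined, Q @ stacked_keys.T)
```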
Regarding claim 11:
Shazeer as modified by Savinov teaches “The computer-implemented method of claim 10.”
Shazeer further teaches:
“at each attention layer and for each particular output position of the plurality of output positions of each output sequence: processing the concatenated first and second matrix products using a compatibility function to generate a weight matrix”: Shazeer, Fig. 2,
[showing that each dot-product attention includes softmax]; Shazeer, Fig. 1,
[showing at attention component 174 the processing of the output of attention component 172, which includes concatenating the first and second matrix products along a row dimension]; Shazeer, paragraph 0060, “More specifically, each attention sub-layer applies a scaled dot-product attention mechanism 230. In scaled dot-product attention, for a given query, the attention sub-layer computes the dot products of the query with all of the keys, divides each of the dot products by a scaling factor, e.g., by the square root of the dimensions of the queries and keys, and then applies a softmax function [a compatibility function, as suggested in the present specification, paragraph 0081, “Each attention head processes the concatenated first and second matrix products using a compatibility function, e.g., the softmax function in FIG. 4, to generate a weight matrix”] over the scaled dot products to obtain the weights on the values [processing the concatenated first and second matrix products using a compatibility function to generate a weight matrix]. The attention sub-layer then computes a weighted sum of the values in accordance with these weights. Thus, for scaled dot-product attention the compatibility function is the dot product and the output of the compatibility function is further scaled by the scaling factor.”
“and computing a matrix multiplication between the weight matrix and a value matrix generated from both the prefix matrix and the suffix matrix to generate a weighted value matrix having numeric values that represent the respective updated embedded representation of the output token at the particular output position”: Shazeer, paragraph 0064, “The attention sub-layer then applies a softmax over the scaled output matrix to generate a matrix of weights and performs a matrix multiply (MatMul) between the weight matrix and the matrix V to generate an output matrix that includes the output of the attention mechanism for each of the values [computing a matrix multiplication between the weight matrix and a value matrix generated from both the prefix matrix and the suffix matrix]”; Shazeer, paragraph 0051, “Each encoder-decoder attention sub-layer 174, on the other hand, is configured to, at each generation time step, receive an input for each output position preceding the corresponding output position and, for each of the output positions, apply an attention mechanism over the encoded representations at the input positions using one or more queries derived from the input for the output position to generate an updated representation for the output position [having numeric values that represent the respective updated embedded representation of the output token at the particular output position]. Thus, the encoder-decoder attention sub-layer 174 applies attention over encoded representations while the encoder self-attention sub-layer 172 applies attention over inputs at output positions.”
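For illustration only, the scaled dot-product steps quoted from Shazeer, paragraphs 0060 and 0064 (scaling by the square root of the dimension, softmax as the compatibility-function output, and the MatMul against the value matrix V), can be sketched as follows with hypothetical dimensions:

```python
import numpy as np

# Illustrative sketch: logits -> scale -> softmax (weight matrix) -> MatMul
# with the value matrix to produce weighted values (updated representations).
rng = np.random.default_rng(2)
n_q, n_kv, d = 4, 6, 8
logits = rng.standard_normal((n_q, n_kv))      # e.g., concatenated matrix products
scaled = logits / np.sqrt(d)                   # divide by sqrt of the dimension
weights = np.exp(scaled) / np.exp(scaled).sum(axis=-1, keepdims=True)  # softmax
V = rng.standard_normal((n_kv, d))             # value matrix
weighted_values = weights @ V                  # one updated d-dim vector per query
```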
Regarding claim 12:
Shazeer as modified by Savinov teaches “The computer-implemented method of claim 1.”
Shazeer further teaches “wherein maintaining the context data comprises: updating the context data to include the output token at the particular output position that has been generated based on the respective updated embedded representation of the output token”: Shazeer, paragraph 0085, “At a given generation time step at which a given output is being generated, the system processes the outputs before the given output in the output sequence through the embedding layer in the decoder to generate embedded representations [a respective embedded representation of an output token, hence context data]. The system then processes the embedded representations through the sequence of decoder subnetworks, the linear layer, and the softmax layer to generate the given output. Because the decoder subnetworks include encoder-decoder attention sub-layers as well as decoder self-attention sub-layers, the decoder makes use of both the already generated outputs [i.e., prior outputs, produced using the output tokens, are processed by the embedding layer in subsequent steps, hence updating the context data to include the output token at the particular output position that has been generated based on the respective updated embedded representation of the output token] and the encoded representations when generating the given output.”
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Shazeer as modified by Savinov, in view of SciPy, “Broadcasting,” 2015, https://docs.scipy.org/doc/numpy-1.10.1/user/basics.broadcasting.html (hereafter SciPy).
Shazeer as modified by Savinov teaches “The computer-implemented method of claim 7.”
Shazeer further teaches (bold only) “wherein computing the matrix multiplication comprises broadcasting the prefix matrix along a column dimension to match a row number of the suffix matrix”: Shazeer, Fig. 1,
[Image: Shazeer, Fig. 1 (media_image1.png, greyscale, 909 × 804)]
[showing at step 120 that inputs are transformed to representative embeddings, which are maintained for later use, e.g., at component 132, hence outputting values to a prefix matrix; showing at step 160 that outputs are transformed to representative embeddings, which are maintained for later use, e.g., at component 170, hence outputting values to a suffix matrix].
Shazeer as modified by Savinov does not explicitly teach (bold only) “wherein computing the matrix multiplication comprises broadcasting the prefix matrix along a column dimension to match a row number of the suffix matrix.”
SciPy teaches (bold only) “wherein computing the matrix multiplication comprises broadcasting the prefix matrix along a column dimension to match a row number of the suffix matrix”: SciPy, paragraphs 1-2, “The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is ‘broadcast’ across the larger array so that they have compatible shapes [computing the matrix multiplication comprises broadcasting the prefix matrix along a column dimension to match a row number of the suffix matrix]. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation. NumPy operations are usually done on pairs of arrays on an element-by-element basis.”
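The broadcasting behavior quoted from SciPy can be illustrated with a short NumPy example. The names `prefix` and `suffix` are chosen here to mirror the claim language and are assumptions; the point is that the smaller array is virtually tiled, without copying data, so the two shapes become compatible:

```python
import numpy as np

# A (3, 1) "prefix" column is broadcast along the column dimension so that
# its row count matches the (3, 4) "suffix" array in an elementwise operation.
prefix = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1)
suffix = np.ones((3, 4))                    # shape (3, 4)
result = prefix * suffix                    # prefix virtually expanded to (3, 4)
```

No copy of `prefix` is materialized; NumPy strides over the single column repeatedly, which is the memory-efficiency rationale SciPy states ("without making needless copies of data").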
SciPy and Shazeer are analogous arts as they are both related to matrix operations. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the broadcasting of SciPy with the teachings of Shazeer to arrive at the present invention, in order to avoid data redundancy, as stated in SciPy, paragraph 1, “It does this without making needless copies of data and usually leads to efficient algorithm implementations.”
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Vinyals et al., US Patent No. 11,227,206, discloses methods of generating output sequences from input sequences using auto-regression and iterative application of the softmax function.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VINCENT SPRAUL whose telephone number is (703) 756-1511. The examiner can normally be reached M-F 9:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MICHAEL HUNTLEY can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/VAS/Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129