Prosecution Insights
Last updated: April 19, 2026
Application No. 18/564,859

CHARACTER-LEVEL ATTENTION NEURAL NETWORKS

Final Rejection §103
Filed: Nov 28, 2023
Examiner: SONIFRANK, RICHA MISHRA
Art Unit: 2654
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 2 (Final)
Grant Probability: 66% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 3m
With Interview: 91%

Examiner Intelligence

Career Allow Rate: 66% (above average; +4.0% vs TC avg), 250 granted / 379 resolved
Interview Lift: +24.9% (strong), among resolved cases with interview
Typical Timeline: 3y 3m average prosecution; 29 currently pending
Career History: 408 total applications across all art units
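The headline percentages in this panel can be reproduced from the raw counts it reports. A minimal sketch of that arithmetic (variable names are illustrative, not from any real API):

```python
# Reconstructing the dashboard's headline figures from the raw counts above.
granted = 250
resolved = 379

career_allow_rate = granted / resolved               # ~0.6596, shown as 66%
interview_lift = 0.249                               # +24.9 percentage points
with_interview = career_allow_rate + interview_lift  # ~0.9086, shown as 91%

print(f"{career_allow_rate:.1%}")  # 66.0%
print(f"{with_interview:.1%}")     # 90.9%
```

This confirms the "91% With Interview" figure is simply the career allow rate plus the observed interview lift, not an independently modeled probability.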

Statute-Specific Performance

§101: 16.6% (-23.4% vs TC avg)
§103: 56.1% (+16.1% vs TC avg)
§102: 11.2% (-28.8% vs TC avg)
§112:  8.2% (-31.8% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 379 resolved cases
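Because each statute reports both the examiner's rate and its delta against the Tech Center average, the baseline behind the chart's black line can be backed out as tc_avg = examiner_rate - delta. A quick check, using the figures from the panel above:

```python
# (examiner_rate_pct, delta_vs_tc_avg_pct) per statute, as shown in the panel.
rates = {"101": (16.6, -23.4), "103": (56.1, +16.1),
         "102": (11.2, -28.8), "112": (8.2, -31.8)}

for statute, (examiner_rate, delta) in rates.items():
    tc_avg = round(examiner_rate - delta, 1)
    print(f"§{statute}: examiner {examiner_rate}% vs TC avg {tc_avg}%")
# Every statute backs out the same 40.0% baseline, consistent with a single
# Tech Center average estimate rather than per-statute baselines.
```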

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This office action is sent in response to Applicant's communication received on 11/23/2023 for application number 18/564,859. The office hereby acknowledges receipt of the following items placed of record in the file: Specification, Abstract, Oath/Declaration, and Claims.

Status of the Claims

Claims 4-13 are amended. Claims 17-20 are added. Claims 1-20 are presented for examination.

Examiner's Note Regarding 101

Claims are not rejected under 101 since the claims do not recite an abstract idea. The independent claims relate to using a plurality of specific neural network models to perform the task. Based on the specification, the specific output generated by this combination of neural networks can perform the given task with reduced runtime latency, e.g., in terms of the wall clock time needed to perform an inference on an input. The claims are therefore consistent with Ex Parte Desjardins, which reads: "xiv. Improvements to computer component or system performance based upon adjustments to parameters of a machine learning model associated with tasks or workstreams; Ex Parte Desjardins, Appeal No. 2024-000567 (PTAB September 26, 2025, Appeals Review Panel Decision) (precedential)."

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.
Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 5-10, and 14-19 are rejected under 35 U.S.C. 103 as being unpatentable over Henderson (EP 3819809) in view of Morishita (US 20220229982) and further in view of Malkiel (US 20220318504).

Regarding claim 1, Henderson teaches a system for performing a machine learning task on an input sequence of characters that has a respective character at each of a plurality of character positions to generate a network output (first model 205 represents the unit text embedding, where unit text is a sequence of characters, Para 0063, Fig 5a), the system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform one or more operations to implement: a neural network configured to perform the machine learning task, the neural network comprising a sub-word tokenizer and an output neural network (each unit is converted to a list of units or tokens which includes subwords, Para 0062), the gradient-based sub-word tokenizer configured to: receive a sequence of character embeddings that includes a respective character embedding at each of the plurality of character positions (positional encodings of the subwords which include the character positions, Para 0084-0086, Fig 5a-c); and for each particular character position of the plurality of character positions: generate a plurality of candidate sub-word blocks, each candidate sub-word block comprising the respective character embeddings at each character position in a corresponding set of one or more continuous character positions that includes the particular character position (the subword embedding sequence is augmented with positional encodings. The positional encodings are in the form of vectors of length D, with one positional encoding vector corresponding to each embedding in the input sequence. The positional encoding vector is summed with the corresponding embedding in the sequence, Para 0086-0090; augmenting the positional embeddings of the subwords, Para 0086); generate a respective sub-word block embedding for each of the plurality of candidate sub-word blocks, comprising processing each of the plurality of sub-word block embeddings using a block scoring neural network (scoring, Fig 7 and Fig 8); and the output neural network configured to: receive an output neural network input derived from the latent sub-word representations at the plurality of character positions; and process the output neural network input to generate the network output for the machine learning task (output, Fig 7 and Fig 8; the task could be a translation task, Para 0106).

Henderson does not explicitly teach determining a respective relevance score for each of the plurality of sub-word block embeddings; and generating a latent sub-word representation for the particular character position, comprising determining a weighted combination of the plurality of sub-word block embeddings weighted by the relevance scores.

However, Morishita teaches determining a respective relevance score for each of the plurality of sub-word block embeddings (values of the subwords are determined, Para 0041-0044, Fig 3); and generating a sub-word representation for the particular character position, comprising determining a weighted combination of the plurality of sub-word block embeddings weighted by the relevance scores (all the values are merged and the meaning part of the word is determined, Para 0041-0044).

It would have been obvious having the teachings of Henderson to further include the concept of Morishita before the effective filing date to improve the processing efficiency of neural machine translation (Para 0080, Morishita).

Henderson does not teach a gradient-based sub-word tokenizer and an output neural network; a latent representation for the position.

However, Malkiel teaches a gradient-based sub-word tokenizer and an output neural network (gradient-based tokenization) and generating a latent representation for the position of the word (identify a word-pair from the set of word-pairs having a highest weight, wherein the identified word-pair having the highest weight is selected; [0138] scale at least one gradient map by a multiplication with the corresponding activation maps and summed across the feature dimensions to produce one or more saliency score(s) for every token associated with a selected paragraph; [0139] maximize the similarity score between the aggregated latent representation of a matched word associated with a description of the recommended item and a word associated with a description of the seed item; and [0140] aggregate token saliency scores associated with at least one word in an item description to generate word-scores, Para 0137-0140).

It would have been obvious having the teachings of Henderson and Morishita to further include the concept of Malkiel before the effective filing date to reduce prediction error (Para 0003, 0021, Malkiel).

Regarding claim 4, Henderson as above in claim 1 teaches, wherein the output neural network comprises one or more attention neural network layers that are each configured to: apply an attention mechanism to an attention layer input derived from the output neural network input to generate an attention layer output for the attention neural network layer (attention mechanism, Para 0084, 0089-0090, Henderson).

Regarding claim 5, Henderson as above in claim 1 teaches, wherein, for each particular character position of the plurality of character positions: each candidate sub-word block comprises the respective character embeddings at each of the one or more continuous character positions that begin from the particular character position (character positions, Fig 5a, Para 0084-0088, Henderson).

Regarding claim 7, Henderson as above in claim 1 teaches, wherein the gradient-based sub-word tokenizer is further configured to shift the sequence of character embeddings by one or more character positions prior to generating the plurality of candidate sub-word blocks (augmentation before generating the subword, Para 0086).

Regarding claim 10, Henderson as above in claim 1 teaches, wherein the gradient-based sub-word tokenizer is configured to determine the respective relevance score for each of the plurality of sub-word block embeddings based on applying a parameterized linear transformation function to each of the plurality of sub-word block embeddings (a parameterized linear transformation is applied to the embedding, Fig 6a, 6b, Henderson, and the scores are calculated for the embedding as shown in Fig 3, Morishita).

Regarding claim 12, arguments analogous to claim 1 are applicable. In addition, Henderson teaches one or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to implement: a neural network configured to perform a machine learning task on an input sequence of characters that has a respective character at each of a plurality of character positions to generate a network output (Para 0082-0086, Fig 5).

Regarding claim 13, arguments analogous to claim 1 are applicable.
Regarding claim 14, Henderson as above in claim 13 teaches training the neural network by jointly training the gradient-based sub-word tokenizer and the output neural network based on optimizing a supervised learning objective function (Fig 5b, Para 0111, 0113).

Regarding claim 19, arguments analogous to claim 4 are applicable.

Regarding claim 20, arguments analogous to claim 5 are applicable.

Claims 2, 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Henderson (EP 3819809) in view of Morishita (US 20220229982), further in view of Malkiel (US 20220318504), and further in view of Zhang (US 20210098134).

Regarding claim 2, Henderson modified by Malkiel as above in claim 1 does not explicitly teach wherein the gradient-based sub-word tokenizer is further configured to apply a down-sampling function to the latent sub-word representations at the plurality of character positions to generate the output neural network input.

However, Zhang teaches wherein the gradient-based sub-word tokenizer is further configured to apply a down-sampling function to the latent sub-word representations at the plurality of character positions to generate the output neural network input (downsampling the input through maxpooling, Para 0089-0090).

It would have been obvious having the teachings of Henderson, Morishita, and Malkiel to further include the concept of Zhang before the effective filing date since the downsampling reduces the dimensionality, which reduces complexity (Para 0090, Zhang).

Regarding claim 8, Henderson modified by Morishita and Malkiel as above in claim 1 does not explicitly teach wherein the gradient-based sub-word tokenizer is further configured to apply a 1-D convolution function to the sequence of character embeddings prior to generating the plurality of candidate sub-word blocks.

However, Zhang teaches wherein the gradient-based sub-word tokenizer is further configured to apply a 1-D convolution function to the sequence of character embeddings prior to generating the plurality of candidate sub-word blocks (the embeddings layer in the architecture internally learned the vectorized representations for characters. This was followed by two one-dimensional (1D) convolutions (with different kernel sizes) which tend to capture sub-word information from the input text. Maxpooling is a sample-based discretization process, Para 0089-0090, 0159, Claim 16).

It would have been obvious having the teachings of Henderson, Morishita, and Malkiel to further include the concept of Zhang before the effective filing date since the 1-D convolution will capture sub-word information, hence helping with the downstream task (Para 0159, Zhang).

Regarding claim 17, arguments analogous to claim 2 are applicable.

Allowable Subject Matter

Claims 3, 6, 9, 11, 15-16 and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Richa Sonifrank, whose telephone number is (571) 272-5357. The examiner can normally be reached M-T 7AM - 5:30PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Phan Hai, can be reached at (571) 272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users.
To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Richa Sonifrank/
Primary Examiner, Art Unit 2654
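To make the disputed claim-1 mechanism concrete, here is a minimal NumPy sketch of a tokenizer of the kind the claims recite: candidate sub-word blocks around each character position are embedded, scored by a parameterized linear transformation (per claim 10), and combined via softmax-normalized relevance weights. The mean-pooled block embedding and random weights are illustrative assumptions only; this is not the applicant's actual implementation.

```python
# Sketch of the claimed gradient-based sub-word tokenizer (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
num_chars, dim, max_block = 8, 4, 3

char_embeddings = rng.normal(size=(num_chars, dim))
scoring_weights = rng.normal(size=dim)  # stand-in for the block scoring network

latent = np.zeros((num_chars, dim))
for pos in range(num_chars):
    # Candidate sub-word blocks: contiguous spans of up to max_block
    # characters that include position `pos`.
    blocks = []
    for start in range(max(0, pos - max_block + 1), pos + 1):
        end = min(num_chars, start + max_block)
        blocks.append(char_embeddings[start:end].mean(axis=0))  # block embedding
    blocks = np.stack(blocks)

    # Relevance score per block embedding via a linear transform, normalized
    # with softmax so gradients flow through the (soft) tokenization choice,
    # which is what makes the tokenizer trainable end to end.
    scores = blocks @ scoring_weights
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Latent sub-word representation: weighted combination of block embeddings.
    latent[pos] = weights @ blocks

print(latent.shape)  # (8, 4)
```

The soft weighted combination, rather than a hard pick of one block, is the detail the examiner maps to Morishita: an argmax over blocks would not be differentiable, while the softmax mixture is.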

Prosecution Timeline

Nov 28, 2023: Application Filed
Dec 18, 2025: Non-Final Rejection — §103
Mar 23, 2026: Applicant Interview (Telephonic)
Mar 27, 2026: Response Filed
Apr 02, 2026: Examiner Interview Summary
Apr 13, 2026: Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602552: Machine-Learning-Based OKR Generation (2y 5m to grant; granted Apr 14, 2026)
Patent 12603085: Entity Level Data Augmentation in Chatbots for Robust Named Entity Recognition (2y 5m to grant; granted Apr 14, 2026)
Patent 12585883: Computer Implemented Method for the Automated Analysis or Use of Data (2y 5m to grant; granted Mar 24, 2026)
Patent 12585877: Grouping and Linking Facts from Text to Remove Ambiguity Using Knowledge Graphs (2y 5m to grant; granted Mar 24, 2026)
Patent 12579988: Method and Apparatus for Controlling Audio Frame Loss Concealment (2y 5m to grant; granted Mar 17, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 66%
With Interview: 91% (+24.9%)
Median Time to Grant: 3y 3m
PTA Risk: Moderate
Based on 379 resolved cases by this examiner. Grant probability derived from career allow rate.
