Prosecution Insights
Last updated: April 19, 2026
Application No. 18/587,008

GRADIENT CONTROL DEVICE AND GRADIENT CONTROL METHOD OF LANGUAGE MODEL

Status: Final Rejection (§102)
Filed: Feb 26, 2024
Examiner: PASHA, ATHAR N
Art Unit: 2657
Tech Center: 2600 (Communications)
Assignee: Seoul National University R&DB Foundation
OA Round: 2 (Final)
Grant Probability: 90% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 8m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 90% (above average; 138 granted / 154 resolved; +27.6% vs TC avg)
Interview Lift: strong, +17.0% across resolved cases with interview
Typical Timeline: 2y 8m average prosecution; 18 applications currently pending
Career History: 172 total applications across all art units

Statute-Specific Performance

§101: 21.9% (-18.1% vs TC avg)
§103: 49.4% (+9.4% vs TC avg)
§102: 16.9% (-23.1% vs TC avg)
§112: 5.2% (-34.8% vs TC avg)
Tech Center averages are estimates. Based on career data from 154 resolved cases.

Office Action

§102
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed on 12/23/25 have been received. In light of the amendments, the examiner withdraws the claim objections and the §112 rejections.

Regarding the §102 rejections, the applicant argues beginning on page 7: "The Office Action relies on Formula (8) of… Thus, the ratio ak/ār is always less than 1…increase the degree of pushing relative to an original degree before the scale of the second gradient part is reduced." The gradient is prevented from dropping below a reference value. This is prescribed in Equation 8: where ak > ār, g2k will never drop below the reference value of 1, which is the second argument of the min function. The min function only starts to kick in when vk ∈ Vr; otherwise g2k is constrained to 1 per the second line of Eq. 8, and thus increases the degree of the pushing relative to an original degree, which the examiner maps to the first part of Equation 4 in the reference. In light of this, the examiner respectfully disagrees with the applicant and maintains the rejection.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C.
102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-3, 5-6, 9-11, and 13 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yu (Yu, Sangwon, et al. "Rare tokens degenerate all tokens: Improving neural text generation via adaptive gradient gating for rare token embeddings." arXiv preprint arXiv:2109.03127v2, 16 Mar 2022).

With respect to claim 1, Yu teaches (claim 1) a gradient control device of a language model, the gradient control device comprising: one or more processors; and memory storing instructions that, when executed by the one or more processors, cause the gradient control device to (¶ p5 Sec 4.2 para 1: This can be easily implemented by detach() function of Pytorch… All model and training configurations are the same as in the previous section): (claim 9) a method of controlling a gradient of a language model, the method comprising:

calculate a number of occurrences of each token, of a plurality of tokens, in batch data at each training step of a plurality of training steps ranging from a current training step to a set previous training step (¶ p5 Sec 4.1 para 2: Therefore, it is necessary to dynamically group rare tokens based on token appearances in recent batch samples.);

group rare tokens based on a comparison of the calculated number of occurrences of each token, of the plurality of tokens, with a threshold value (¶ p13 Sec D ll. 1-6: In this section we show how the metrics used on language modeling task change with the hyperparameter in Figure 5.
We observed an interesting phenomenon about the non-rare token group when rare token group size increases over a specific threshold),

wherein the rare tokens are grouped by grouping first rare tokens and second rare tokens according to degrees of rarity (¶ p6 col 1 para 1: We define the two rarity levels based on the average number of appearances of the entire rare tokens: if the token appearance ak is smaller than the mean of ar where r ∈ Vr, the corresponding token is a very rare token);

calculate a gate tensor on embedding vectors of the grouped rare tokens (¶ p5 Sec 4.2 ll. 1-6: With T context feature vectors hi (i ∈ [1, T]) from the training sample, the negative log-likelihood loss gradient for the rare token embedding wr is calculated as follows… where xgated is a new parameter whose value is the same as x, and g ∈ [0, 1] is a gate tensor. When the xgated is fed to the function f(·) as input, the gradient for x is gated by g… where g1k denotes a k-th component of g1. g1 controls the degree to which rare token embeddings move away from non-rare feature vectors whose targets differ from each rare token embedding. Also, each component of g1 is calculated based on the rarity of each rare token, ak, so gradient gating for part (b) of Eq. 4 is adaptive for each rare token.);

scale a gradient part that pushes the embedding vectors of the grouped rare tokens away from feature vectors having relatively non-rare target tokens and feature vectors having relatively rare target tokens, among gradients of a loss function for the embedding vectors of the grouped rare tokens in a training step (¶ p5 Sec 4.2 para 2: As we described in section 3, part (b) of Eq. 4 should mainly be handled to solve the degeneration problem. To address part (b) of Eq. 4, given a context feature vector of the i-th position hi, we introduce a gate vector g1 ∈ R^N as follows: g1k = ak/K if vk ∈ Vr and vk ≠ yi, 1 otherwise (Eq. 7), where g1k denotes a k-th component of g1.
g1 controls the degree to which rare token embeddings move away from non-rare feature vectors whose targets differ from each rare token embedding. Also, each component of g1 is calculated based on the rarity of each rare token, ak, so gradient gating for part (b) of Eq. 4 is adaptive for each rare token.);

calculate, using a second gate tensor application, a second gate tensor on a second gradient part, wherein the second gradient part is configured to push the embedding vectors of the second rare tokens away from feature vectors having the rare target tokens, with a smaller number of occurrences than the non-rare target tokens, when applied to training (¶ p4 col 2 last para: Part (c) [second gradient part] pushes away wr from the feature vectors whose target tokens are rare.); and

control a degree of the pushing (¶ p5 Sec 4.2 para 1: To solely control the gradient for rare token embeddings, we introduce a gradient gating method for a parameter x),

wherein the second gate tensor application is configured to keep a scale of the second gradient part from dropping below a reference value by calculating the second gate tensor on the second gradient part, and configured to increase the degree of the pushing relative to an original degree before the scale of the second gradient part is reduced (¶ p6 col 1 para 1: For the very rare token embeddings, part (c) of the gradient about embeddings pushes them away from the feature vectors whose targets are less rare tokens that are relatively frequent compared to them. This means that part (c) acts like part (b) in the above situation, which becomes the cause of the degeneration problem. Therefore, we need to handle part (c) of Eq. 4 for very rare tokens. To address part (c) of Eq. 4 for the very rare token embeddings, we introduce another gate vector g2: g2k = min(ak/ār, 1) if vk ∈ Vr, 1 otherwise (Eq. 8), where g2k is the k-th component of g2 and ār is the mean of ar where r ∈ Vr.
g2 controls the degree to which very rare token embeddings move away from less rare feature vectors whose targets differ from each very rare token embedding.)

Examiner Note: The min function prevents the gate value from dropping below the reference value of 1, and the ratio ak/ār maps to the pushing relative to an original degree. Where ak > ār, g2k will never drop below 1, which is the second argument of the min function. The min function only starts to kick in when vk ∈ Vr; otherwise g2k is constrained to 1 per the second row of Eq. 8, and thus increases the degree of the pushing relative to an original degree, which the examiner maps to the first part of Equation 4 in the reference.

With respect to claims 2 and 10, Yu teaches wherein the memory further stores the calculated number of occurrences of each token of the plurality of tokens (¶ Sec 4.1 para 2: To consider the token appearances in recent batch samples, we introduce the token counter memory that remembers the number of the appearances of each token during the previous K training).

With respect to claims 3 and 11, Yu teaches calculate an average number of occurrences of each token of the plurality of tokens by summing all numbers, stored in the memory, of occurrences of each token of the plurality of tokens, and wherein the instructions, when executed by the one or more processors, cause the gradient control device to group the rare tokens by determining one or more tokens, having an average number of occurrences less than the threshold value, to be the rare tokens (¶ p6 col 1 para 1: We define the two rarity levels based on the average number of appearances of the entire rare tokens: if the token appearance ak is smaller than the mean of ar where r ∈ Vr, the corresponding token is a very rare token.).
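The token counter memory and average-occurrence grouping cited for claims 2/10 and 3/11 above can be sketched in a few lines of Python. This is a rough illustration only, assuming counts are summed over a sliding window of the last K batches; the class and method names (`TokenCounterMemory`, `appearances`, `rare_tokens`, `very_rare_tokens`) are illustrative, not from the reference or the claims:

```python
from collections import Counter, deque

class TokenCounterMemory:
    """Remembers per-token appearance counts over the previous K training steps."""

    def __init__(self, K):
        self.window = deque(maxlen=K)  # one Counter per recent batch

    def update(self, batch_tokens):
        # Record the tokens of the current batch; the oldest batch falls out
        # automatically once more than K steps have been stored.
        self.window.append(Counter(batch_tokens))

    def appearances(self, token):
        # a_k: total appearances of `token` across the stored steps.
        return sum(counts[token] for counts in self.window)

    def rare_tokens(self, vocab, threshold):
        # Rare group: tokens whose recent appearance count is below the threshold.
        return {t for t in vocab if self.appearances(t) < threshold}

    def very_rare_tokens(self, vocab, threshold):
        # Very rare group: rare tokens below the mean appearance count of the
        # rare group (the "mean of a_r where r in V_r" grouping quoted above).
        rare = self.rare_tokens(vocab, threshold)
        if not rare:
            return set()
        mean_rare = sum(self.appearances(t) for t in rare) / len(rare)
        return {t for t in rare if self.appearances(t) < mean_rare}
```

A token seen often in recent batches stays out of the rare group, while a token absent from the window lands in the very rare group. If the claimed "average number of occurrences" is a per-step mean rather than this sum, dividing both the counts and the threshold by K yields the same grouping.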
With respect to claims 5 and 13, Yu teaches calculate, using a first gate tensor application, a first gate tensor on a first gradient part, wherein the first gradient part is configured to push the embedding vectors of the grouped rare tokens away from the feature vectors having the non-rare target tokens when applied to training, among the gradients of the loss function (¶ p5 col 1 last para: the degeneration problem could be solved to a large extent by mainly addressing the part of the gradient for rare embeddings that pushes away rare token embeddings from non-rare feature vectors; ¶ p4 Sec 3.3 para 1: With T context feature vectors hi (i ∈ [1, T]) from the training sample, the negative log-likelihood loss gradient for the rare token embedding wr is calculated as follows… We divide the gradient for wr into 3 parts in Eq. 4. Part (a) pulls wr close to the feature vectors whose target tokens are vr. Part (b) pushes away wr from the feature vectors whose target tokens are not rare. Part (c) pushes away wr from the feature vectors whose target tokens are rare. As an extension of the analysis in the previous subsection, we freeze these parts of the gradient with various settings during training to identify the key cause of the degeneration problem. In other words, depending on the settings, the specific gradient parts that will not be used for embedding training are detached from the computation graph during the training stage; ¶ p5 Sec 4.2 para 1: where xgated is a new parameter whose value is the same as x, and g is a gate tensor.);

With respect to claim 6, Yu teaches reduce, using the first gate tensor application, a scale of the first gradient part according to a reference value by calculating the first gate tensor on the first gradient part (¶ p5 Sec 4.2 para 1: where xgated is a new parameter whose value is the same as x, and g is a gate tensor.
When the xgated is fed to the function f(·) as input, the gradient for x is gated by g; ¶ p5 Sec 4.2 para 2: To address part (b) of Eq. 4, given a context feature vector of the i-th position hi, we introduce a gate vector g1 ∈ R^N as follows: g1k = ak/K if vk ∈ Vr and vk ≠ yi, 1 otherwise (Eq. 7), where g1k denotes a k-th component of g1. g1 controls the degree to which rare token embeddings move away from non-rare feature vectors whose targets differ from each rare token embedding. Also, each component of g1 is calculated based on the rarity of each rare token, ak, so gradient gating for part (b) of Eq. 4 is adaptive for each rare token); and

reduce the degree of the pushing (¶ p5 Sec 4.2 para 1: where g1k denotes a k-th component of g1. g1 controls the degree to which rare token embeddings move away from non-rare feature vectors whose targets differ from each rare token embedding.).

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA whose telephone number is (408)918-7675.
The examiner can normally be reached Monday-Thursday and alternate Fridays, 7:30-4:30 PT. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Daniel Washburn, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ATHAR N PASHA/
Examiner, Art Unit 2657
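The gating mechanics the rejection relies on, Eq. 7 (g1), Eq. 8 (g2), and the detach()-based gating, reduce to a few lines of arithmetic. A minimal Python sketch follows; the function names are illustrative, the equations are transcribed from the passages quoted above, and the last function mimics numerically what PyTorch's x.detach() achieves inside g * x + (1 - g) * x.detach():

```python
def g1_component(a_k, K, is_rare, is_target):
    # Eq. 7: for a rare, non-target token, scale the push away from
    # non-rare feature vectors by its rarity a_k / K; otherwise leave it at 1.
    return a_k / K if (is_rare and not is_target) else 1.0

def g2_component(a_k, mean_rare, is_rare):
    # Eq. 8: for a rare token, gate by min(a_k / mean_rare, 1). Once a_k
    # exceeds the rare-group mean, the min pins the gate at the reference
    # value 1 (the examiner's point); for non-rare tokens the gate is 1.
    return min(a_k / mean_rare, 1.0) if is_rare else 1.0

def gated(x, g, x_detached):
    # Gradient gating: the forward value equals x (g*x + (1-g)*x == x),
    # but only the g*x term carries gradient, so d(gated)/dx == g when
    # the detached copy is held fixed.
    return g * x + (1.0 - g) * x_detached
```

For example, a rare non-target token seen twice over K = 10 steps gets g1 = 0.2, and a rare token whose count exceeds the rare-group mean gets g2 = 1, matching the examiner's reading that, once ak exceeds ār, g2k is held at the reference value of 1.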

Prosecution Timeline

Feb 26, 2024
Application Filed
Sep 18, 2025
Non-Final Rejection — §102
Dec 23, 2025
Response Filed
Jan 16, 2026
Final Rejection — §102 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12596882
COMPLIANCE DETECTION USING NATURAL LANGUAGE PROCESSING
2y 5m to grant · Granted Apr 07, 2026
Patent 12586563
Method, System and Apparatus for Understanding and Generating Human Conversational Cues
2y 5m to grant · Granted Mar 24, 2026
Patent 12579173
SYSTEMS AND METHODS FOR DYNAMICALLY PROVIDING INTELLIGENT RESPONSES
2y 5m to grant · Granted Mar 17, 2026
Patent 12566921
GAZETTEER INTEGRATION FOR NEURAL NAMED ENTITY RECOGNITION
2y 5m to grant · Granted Mar 03, 2026
Patent 12547844
INTELLIGENT MODEL SELECTION SYSTEM FOR STYLE-SPECIFIC DIGITAL CONTENT GENERATION
2y 5m to grant · Granted Feb 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 90%
With Interview: 99% (+17.0%)
Median Time to Grant: 2y 8m
PTA Risk: Moderate
Based on 154 resolved cases by this examiner. Grant probability derived from career allow rate.

Free tier: 3 strategy analyses per month