Prosecution Insights
Last updated: April 19, 2026
Application No. 18/736,768

ELECTRONIC DEVICE AND METHOD FOR REINFORCEMENT LEARNING

Non-Final OA (§102, §103)
Filed
Jun 07, 2024
Examiner
LERNER, MARTIN
Art Unit
2658
Tech Center
2600 — Communications
Assignee
ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE
OA Round
1 (Non-Final)
78%
Grant Probability
Favorable
1-2
OA Rounds
3y 1m
To Grant
92%
With Interview

Examiner Intelligence

Grants 78% — above average
78%
Career Allow Rate
768 granted / 984 resolved
+16.0% vs TC avg
Moderate +14% lift
+13.5%
Interview Lift
Based on resolved cases with interview
Typical timeline
3y 1m
Avg Prosecution
Career history
1007
Total Applications
984 resolved, 23 currently pending, across all art units

Statute-Specific Performance

§101
12.5%
-27.5% vs TC avg
§103
53.1%
+13.1% vs TC avg
§102
9.6%
-30.4% vs TC avg
§112
16.6%
-23.4% vs TC avg
Black line = Tech Center average estimate • Based on career data from 984 resolved cases
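The “vs TC avg” figures above are consistent with simple percentage-point differences from an estimated 40% Tech Center average (the black line). A minimal Python sketch of that arithmetic, assuming this derivation; the dictionary literal just restates the tiles above:

# Assumed derivation: per-statute rate minus an estimated 40% TC average.
examiner_rate = {"101": 12.5, "103": 53.1, "102": 9.6, "112": 16.6}
TC_AVG = 40.0  # estimated Tech Center average (the black line in the chart)
for statute, rate in examiner_rate.items():
    print(f"§{statute}: {rate:.1f}% ({rate - TC_AVG:+.1f}% vs TC avg)")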

Office Action

§102 §103
DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings

The drawings are objected to because of the following informalities:

In Figure 1, it appears that “Large Pre-Trained Language Model 500” should be “Neural Network Model 233”. That is, Applicant’s Specification, ¶[0025] - ¶[0030], describes Figure 1 in reference to neural network model 233, but does not include any description of a large pre-trained language model 500. Moreover, there is no reference numeral “500” associated with a large pre-trained model anywhere in the Specification; the Specification, ¶[0044] and ¶[0050], refers only to “external server 500”.

In Figure 1, there is a reference numeral “10” that is not described in the Specification. Applicant can delete reference numeral “10” from Figure 1, or add a description of reference numeral “10” to the Specification, if this can be done without introducing new matter.

In Figure 2, “user terminal 100” should be “user terminal 300”. See Specification, ¶[0020] - ¶[0022], ¶[0037] - ¶[0038], and ¶[0041], which describe a database 100 and a user terminal 300.

In Figure 2, there is no illustration of the “similarity evaluation module 235” described at ¶[0037] of the Specification. It appears that similarity evaluation module 235 should be illustrated in Figure 2.

In Figure 10, Step 1020, “Indictor” should be “Indicator”.

Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office Action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, Applicant will be notified and informed of any required corrective action in the next Office Action. The objection to the drawings will not be held in abeyance.

Specification

The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed. The following title is suggested: Reinforcement Learning to Select a Response from Response Candidates.

The abstract of the disclosure is objected to because it is not in narrative form. A corrected abstract of the disclosure is required and must be presented on a separate sheet, apart from any other text. See MPEP § 608.01(b). MPEP § 608.01(b)(I)(C) states that an abstract should be in narrative form, and that the form and legal phraseology of patent claims should be avoided in the abstract. Applicant’s abstract has a format equivalent to a patent claim and is consequently not in narrative form. Applicant should revise the abstract so that it is in narrative form and submit a new abstract on a separate sheet. See 37 CFR 1.72(b).
The disclosure is objected to because of the following informalities: in ¶[0044] and ¶[0050], “external server 500” is not consistent with “large pre-trained model 500” illustrated in Figures 1, 3, 4, and 6. Appropriate correction is required.

Claim Objections

Claim 17 is objected to because of the following informalities: Claim 17 sets forth a limitation of “the specified equation”, which lacks express antecedent basis. Independent claim 12, upon which claim 17 depends, does not include any recitation of “a specified equation”. Applicant can overcome this objection by changing “the specified equation” to “a specified equation”. Appropriate correction is required.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1 to 3, 11 to 13, and 20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Kotikalapudi et al. (U.S. Patent Publication 2024/0378394).

Regarding independent claims 1 and 20, Kotikalapudi et al. discloses a computing device, comprising: “a memory configured to store an artificial intelligence neural network model; and processor functionally connected to the memory, wherein the processor is configured to:” – computing device 120 includes at least one processor 714 and memory 725 (¶[0115] - ¶[0120]: Figure 7); large language model (LLM) database 142A stores large language model engine 142 (¶[0042] - ¶[0043]: Figure 1); implicitly, large language models are neural networks (“an artificial intelligence neural network”); see Wikipedia, “Large language model”.

“obtain a user’s current utterance and generate a plurality of response candidates according to the current utterance using the artificial neural network model” – client device 110 can include user input engine 111 that is configured to detect user input and capture audio corresponding to spoken utterances (“obtain a user’s current utterance”) (¶[0034]: Figure 1); NL based response system 120 can generate a plurality of candidate responses 230 based on processing NL based input 210 (“generate a plurality of response candidates according to the current utterance”); candidate responses 230 can be candidate LLM responses generated based on processing the NL based input 210 using an LLM (“generate a plurality of response candidates . . . using the artificial neural network model”) (¶[0048]: Figure 1).

“perform reinforcement learning on the artificial neural network model by selecting a response according to the current utterance that best matches a specified criterion including a performance indicator from among the plurality of response candidates, using a large pre-trained model” – a ‘critique response’ can provide an indication of an extent to which a corresponding one of the candidate responses complies with the respective response evaluation criteria (“a specified criterion including a performance indicator”) (¶[0007]); selected response 394 for NL based input 300 is selected from candidate responses 384, 386, and 388 based on response evaluation criteria (¶[0078]: Figure 3D); training instances are generated using selected response 250, NL based input 210, and data from evaluation engine 155 (¶[0080]: Figure 2C); a selected response 250 can be selected from candidate responses 230 based on it being determined that the selected response 250 best complies with evaluation criteria 233 (“selecting a response according to the current utterance that best matches a specified criterion”) (¶[0082]: Figure 2C); NL based response system 120 can be fine-tuned once training instances are generated using reinforcement learning (“perform reinforcement learning on the artificial neural network model”) (¶[0084]: Figure 2C); comparison data 416 can be used to train a separate reward model for use in fine-tuning the LLM using reinforcement learning (“perform reinforcement learning . . . using a large pre-trained model”) (¶[0086]: Figure 4); the LLM can be fine-tuned using reinforcement learning (RL) based on a selected response to a NL based input using a reward model (¶[0113]: Figure 6); and training a reward model based on the selected one of the plurality of candidate LLM responses can include fine-tuning the LLM with reinforcement learning (RL) using the reward model (¶[0135]).

Regarding independent claim 12, similar limitations are set forth, but additionally include performing reinforcement learning “based on a reward score”. Kotikalapudi et al. discloses training a reward model including an indication of response evaluation criteria to provide a weighting to a corresponding selected response 250 during fine-tuning of an LLM (¶[0082]: Figure 2C); comparison data 416 can be used to train a reward model for use in fine-tuning the LLM using reinforcement learning (¶[0086]: Figure 4); here, training an LLM with a reward model provides “a reward score”.
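Before turning to the dependent claims, it helps to see the mechanism the examiner is reading onto claims 1, 12, and 20. Here is a minimal Python sketch of that mechanism, with hypothetical names throughout; it illustrates the comparison-measure selection of Kotikalapudi’s ¶[0060] and ¶[0082], and is not code from the application or the reference:

# Hypothetical sketch: score each candidate against evaluation criteria and
# select the best match; the winning score doubles as the RL reward signal.
def comparison_measure(candidate, criteria):
    # Fraction of criteria the candidate complies with (cf. ¶[0060]).
    complied = sum(1 for criterion in criteria if criterion(candidate))
    return complied / len(criteria)

def select_response(candidates, criteria):
    scores = {c: comparison_measure(c, criteria) for c in candidates}
    selected = max(scores, key=scores.get)
    return selected, scores[selected]  # (selected response, reward score)

# Toy predicate criteria; Kotikalapudi instead asks the LLM itself to critique
# compliance, but the resulting scalar has the same shape.
criteria = [lambda r: len(r.split()) <= 20, lambda r: "sorry" not in r.lower()]
selected, reward = select_response(["Sure, here it is.", "Sorry, I cannot."], criteria)

The scalar returned alongside the selected response is the same kind of quantity the examiner equates with claim 12’s “reward score”.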
Regarding claims 2 and 13, Kotikalapudi et al. discloses that a set of response evaluation criteria 374 can be generated based on processing a request 370 to generate a set of response evaluation criteria; request 370 can be automatically generated and can include one or more example pairs of response evaluation criteria (“generate a first message related to response selection according to the current utterance and the performance indicator, and input the first message into the large pre-trained model”); a selected response 394 for NL based input 390 is selected from candidate responses 384, 386, and 388 based on critique responses (“select a response according to the current utterance based on a result of processing the first message by the large pre-trained model”) (¶[0076] - ¶[0078]: Figures 3B to 3D). Figure 3B illustrates “a first message” that includes criteria for selecting a response: ‘here is a set of response evaluation criteria based on the input “Write a story suitable for children”: use simple words, make sure the story has a clear beginning and ending, use vivid descriptions, and use humor’. That is, request 370 is “a message” generated to evaluate candidate responses to obtain a selected response.

Regarding claim 3, Kotikalapudi et al. discloses that some or all aspects of response system 120 can be implemented remotely at remote servers (“an external server related to the large pre-trained model”) (¶[0031]: Figure 1); implementations can be hosted remotely by one or more servers (¶[0040]: Figure 1); computing device 710 can include network interface subsystem 716 (“a communication module”) to outside networks (“the external server acquires the processing result . . . through the communication module”) (¶[0116]: Figure 7); and computing device 710 can be a server (¶[0122]: Figure 7).

Regarding claim 11, Kotikalapudi et al. discloses that a candidate response with the highest corresponding comparison measure can be selected as selected response 250 (¶[0063]); candidate response A receives the most votes and will be selected due to majority voting (¶[0068]); here, a numerical comparison measure is “a quantitative indicator” (“wherein the performance indicator includes at least one of a quantitative indicator and a qualitative indicator . . .”).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4 to 5 are rejected under 35 U.S.C. 103 as being unpatentable over Kotikalapudi et al. (U.S. Patent Publication 2024/0378394) in view of Yu et al. (U.S. Patent Publication 2010/0184019).

Concerning claim 4, Kotikalapudi et al. discloses that a comparison measure can be obtained by dividing the number of response evaluation criteria with which a corresponding candidate response 230 complies by the total number of response evaluation criteria in the set of response evaluation criteria 232 (“a specified equation”) (¶[0060]); candidate responses A, B, and C are evaluated to determine which best complies with a set of response evaluation criteria: candidate response A has a comparison measure of 0.75, candidate response B has a comparison measure of 0.875, and candidate response C has a comparison measure of 0.25 (“receive rankings of the plurality of response candidates according to the performance indicator from the large pre-trained model”) (¶[0064]); candidate response A receives four votes for best complying with criteria Y and candidate C receives one vote, so that candidate A can be selected because it has received the most votes, four out of five (¶[0068]); comparison data 416 can be used to train a separate reward model for use in fine-tuning the LLM using reinforcement learning (“perform reinforcement learning on the artificial neural network model based on at least the first reward score”) (¶[0086]: Figure 4). Kotikalapudi et al. omits the limitation of “calculate a first reward score by scoring the rankings of the plurality of response candidates based on a specified equation”. However, Yu et al. teaches training a neural network model by reinforcement learning with performance evaluated using a reward score that may be assigned to indicate whether a neural network model performed well (high or positive reward) or poorly (low or negative reward) (¶[0123] - ¶[0127]); a reward score may be determined for each step of a training iteration (¶[0131]). Yu et al., then, teaches “calculate a first reward score” and “performing reinforcement learning on the artificial neural network model based on at least the first reward score.” An objective is to train a neural network to select information for a request of a user (¶[0122]). It would have been obvious to one having ordinary skill in the art to score the ranked candidate responses of Kotikalapudi et al. based on a specified equation, training the neural network model by reinforcement learning using a reward score as taught by Yu et al., for the purpose of training a neural network to select information for a request of a user.

Concerning claim 5, Kotikalapudi et al. discloses that a comparison measure can be obtained by dividing the number of response evaluation criteria with which a corresponding candidate response 230 complies by the total number of response evaluation criteria in the set of response evaluation criteria 232 (“wherein the performance indicator includes a plurality of indicators or . . . the processor calculates a final ranking by averaging rankings respectively determined based on the plurality of indicators or . . .”) (¶[0060]); Applicant’s “or” limitation does not require “a plurality of large pre-trained models”.
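The claim 4 dispute reduces to converting the comparison-measure ranking into “a first reward score” via “a specified equation”. The claim does not recite the equation itself, so the rank-to-score mapping below is purely illustrative; the 0.75/0.875/0.25 values are Kotikalapudi’s ¶[0064] example:

measures = {"A": 0.75, "B": 0.875, "C": 0.25}               # Kotikalapudi ¶[0064]
ranking = sorted(measures, key=measures.get, reverse=True)  # ['B', 'A', 'C']
n = len(ranking)
# Illustrative stand-in for the claim's unspecified "specified equation":
first_reward = {cand: (n - i) / n for i, cand in enumerate(ranking)}
# first_reward == {'B': 1.0, 'A': 0.667, 'C': 0.333} (approximately)

Any monotone mapping from rank to score would serve the same role; the examiner’s position is that Yu supplies the general step of assigning such a reward score, whatever the equation.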
Claims 14 to 18 are rejected under 35 U.S.C. 103 as being unpatentable over Kotikalapudi et al. (U.S. Patent Publication 2024/0378394) in view of Beaver (U.S. Patent Publication 2024/0355318).

Concerning claim 14, Kotikalapudi et al. discloses that comparison data 416 can be used to train a separate reward model for use in fine-tuning the LLM using reinforcement learning (¶[0086]: Figure 4). Kotikalapudi et al. discloses scoring according to at least one criterion, but does not expressly disclose performing reinforcement learning on an artificial neural network “to increase the reward score according to the at least one criterion”. However, Beaver teaches training a language model implemented by a neural network (Abstract; ¶[0029]); reinforcement learning is performed and a reward or penalty is applied based on a similarity score; by attempting to maximize a reward, reinforcement learning updates the model, which subsequently provides a response that more closely resembles that of a human customer service agent (¶[0048] - ¶[0049]); reinforcement training component 420 can compare responses and compute a similarity score, and a reward or penalty is determined based on the similarity score and a reward model or strategy to maximize prediction accuracy by reinforcement training component 420 (¶[0055]). Beaver, then, teaches “performing reinforcement learning on the artificial intelligence model to increase the reward score according to the at least one criterion.” An objective is to train an intelligent virtual assistant through observational learning tasks (Abstract). It would have been obvious to one having ordinary skill in the art to perform reinforcement learning according to criteria in Kotikalapudi et al. by increasing a reward score as taught by Beaver, for the purpose of training an intelligent virtual assistant through observational learning tasks.

Concerning claims 15 to 16, Kotikalapudi et al. discloses that context engine 113 determines a current context based on one or more recent inputs provided during the dialog session (“based on a conversation history including the current utterance and the selected response”) (¶[0038]); Beaver teaches that a model response is compared to a customer service agent response to determine a similarity score, and a reward can be assigned if the customer service agent utilizes one of the candidate responses in an actual response to a user (¶[0048] - ¶[0049]); reinforcement training component 420 compares a response generated by language model 120 and a response constructed by an agent to compute a similarity score that represents the similarity or dissimilarity of the two responses (¶[0055]).

Concerning claims 17 to 18, Kotikalapudi et al. discloses that a comparison measure can be obtained by dividing the number of response evaluation criteria with which a corresponding candidate response 230 complies by the total number of response evaluation criteria in the set of response evaluation criteria 232 (¶[0060]); evaluating candidate responses A, B, and C determines which best complies with a set of response evaluation criteria: candidate response A has a comparison measure of 0.75, candidate response B has a comparison measure of 0.875, and candidate response C has a comparison measure of 0.25 (“converting the determined rankings into reward scores based on the specified equation”) (¶[0064]). Broadly, taking an average by dividing the number of complying responses by a total number of responses is “weighted summing”. Beaver teaches maximizing a reward score in reinforcement learning (¶[0048] - ¶[0049]).
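Beaver’s mechanism, as the rejection characterizes it, is a similarity comparison followed by a reward or penalty. A minimal sketch, using Python’s difflib as a stand-in for whatever similarity measure Beaver’s reinforcement training component 420 actually computes; the 0.8 threshold is an assumption, not taken from the reference:

from difflib import SequenceMatcher

def similarity_reward(model_response, agent_response, threshold=0.8):
    # Compare the model's response to the human agent's response (cf. ¶[0055]).
    score = SequenceMatcher(None, model_response, agent_response).ratio()
    # Reward above the threshold, penalize below (cf. ¶[0048] - ¶[0049]); RL
    # then pushes the model toward responses resembling the human agent's.
    return 1.0 if score >= threshold else -1.0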
Claims 7 to 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Kotikalapudi et al. (U.S. Patent Publication 2024/0378394) in view of Yu et al. (U.S. Patent Publication 2010/0184019) as applied to claims 1 and 4 above, and further in view of Beaver (U.S. Patent Publication 2024/0355318).

Concerning claim 7, Kotikalapudi et al. discloses that context engine 113 determines a current context based on one or more recent inputs provided during the dialog session (“predict next utterance of the user based on the current utterance and the selected response”) (¶[0038]); comparison data 416 can be used to train a separate reward model for use in fine-tuning the LLM using reinforcement learning (¶[0086]: Figure 4). Kotikalapudi et al. discloses scoring according to at least one criterion, but does not expressly disclose performing reinforcement learning on an artificial neural network to “determine a second reward score of the selected response based on similarity between next actual utterance of the user and the predicted next utterance”. However, Beaver teaches training a language model implemented by a neural network (Abstract; ¶[0029]); reinforcement learning is performed and a reward or penalty is applied based on a similarity score; by attempting to maximize a reward, reinforcement learning updates the model, which subsequently provides a response that more closely resembles that of a human customer service agent (¶[0048] - ¶[0049]); reinforcement training component 420 can compare responses and compute a similarity score, and a reward or penalty is determined based on the similarity score and a reward model or strategy to maximize prediction accuracy by reinforcement training component 420 (¶[0055]). Beaver, then, teaches “determine a second reward score of the selected response based on similarity between next actual utterance of the user and the predicted next utterance.” An objective is to train an intelligent virtual assistant through observational learning tasks (Abstract). It would have been obvious to one having ordinary skill in the art to perform reinforcement learning according to criteria in Kotikalapudi et al. by increasing a reward score as taught by Beaver, for the purpose of training an intelligent virtual assistant through observational learning tasks.

Concerning claim 8, Beaver teaches that a model response is compared to a customer service agent response to determine a similarity score, and a reward can be assigned if the customer service agent utilizes one of the candidate responses in an actual response to a user (¶[0048] - ¶[0049]); reinforcement training component 420 compares a response generated by language model 120 and a response constructed by an agent to compute a similarity score that represents the similarity or dissimilarity of the two responses (¶[0055]).

Concerning claim 9, Kotikalapudi et al. discloses that a comparison measure can be obtained by dividing the number of response evaluation criteria with which a corresponding candidate response 230 complies by the total number of response evaluation criteria in the set of response evaluation criteria 232 (¶[0060]); evaluating candidate responses A, B, and C determines which best complies with a set of response evaluation criteria: candidate response A has a comparison measure of 0.75, candidate response B has a comparison measure of 0.875, and candidate response C has a comparison measure of 0.25 (“calculates a final reward score of each of the response candidates by weighted summing the first reward scores and the second reward scores”) (¶[0064]). Broadly, taking an average by dividing the number of complying responses by a total number of responses is “weighted summing”. Beaver teaches maximizing a reward score in reinforcement learning (“to maximize a final reward score”) (¶[0048] - ¶[0049]); a reward (“a first reward score”) or a penalty (“a second reward score for the remaining responses other than the selected response among the plurality of response candidates”) is determined based on a similarity score and a reward model (¶[0055]: Figure 10).
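Claim 9’s “weighted summing” of the criteria-based first reward score and the similarity-based second reward score is plain arithmetic. A sketch, with weights that are illustrative assumptions rather than anything recited in the claim or taught by the references:

def final_reward(first_score, second_score, w1=0.7, w2=0.3):
    # Weighted sum of the two reward signals; w1 + w2 = 1 keeps it a convex
    # combination, though the claim does not require that.
    return w1 * first_score + w2 * second_score

final_reward(first_score=0.875, second_score=1.0)  # -> 0.9125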
Concerning claims 10 and 19, Yu et al. teaches reinforcement learning based on a policy gradient descent method to assign a reward score (¶[0117], ¶[0125], and ¶[0127]); Beaver teaches maximizing a reward score in reinforcement learning (“to maximize a final reward score”) (¶[0048] - ¶[0049]).

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Kotikalapudi et al. (U.S. Patent Publication 2024/0378394) in view of Yu et al. (U.S. Patent Publication 2010/0184019) as applied to claims 1 and 4 above, and further in view of Kataoka et al. (U.S. Patent Publication 2025/0348739). Kotikalapudi et al. discloses reinforcement learning, but omits a specified equation based on an Elo rating method. However, Kataoka et al. teaches performing reinforcement learning in a competitive environment (Abstract). Specifically, model competition unit 31 may compare the strengths of competitive models Mc with each other and evaluate whether or not a competitive model is strong by setting a winning rate or an Elo rating of competitive model Mc as a strength evaluation index (¶[0025]). An objective is to appropriately and efficiently execute learning of a learning model, including a case where there are various opponents (¶[0006]). It would have been obvious to one having ordinary skill in the art to apply an Elo rating equation to score competitive models, as taught by Kataoka et al., to the reinforcement learning of Kotikalapudi et al. for the purpose of appropriately and efficiently executing learning where there are competitive opponents.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER, whose telephone number is (571) 272-7608. The examiner can normally be reached Monday-Thursday, 8:30 AM-6:00 PM. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MARTIN LERNER/
Primary Examiner, Art Unit 2658
March 11, 2026
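One technical point worth unpacking from the claim 6 rejection above: the Elo rating that Kataoka uses as a strength evaluation index follows the standard rating update rule. A minimal sketch, with a conventional K-factor of 32 that is an assumption rather than something taken from Kataoka:

def elo_update(rating_a, rating_b, score_a, k=32.0):
    # score_a is 1.0 if model A wins, 0.5 for a draw, 0.0 if A loses.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

elo_update(1500, 1500, 1.0)  # -> (1516.0, 1484.0): the winner's rating rises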

Prosecution Timeline

Jun 07, 2024
Application Filed
Mar 11, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12596880
DETERMINING CAUSALITY BETWEEN FACTORS FOR TARGET OBJECT BY ANALYZING TEXT
2y 5m to grant Granted Apr 07, 2026
Patent 12586592
METHODS AND APPARATUS FOR GENERATING AUDIO FINGERPRINTS FOR CALLS USING POWER SPECTRAL DENSITY VALUES
2y 5m to grant Granted Mar 24, 2026
Patent 12585680
CONTEXTUAL TITLES BASED ON TEMPORAL PROXIMITY AND SHARED TOPICS OF RELATED COMMUNICATION ITEMS WITH SENSITIVITY POLICY
2y 5m to grant Granted Mar 24, 2026
Patent 12579987
METHODS FOR FREQUENCY DOMAIN PACKET LOSS CONCEALMENT AND RELATED DECODER
2y 5m to grant Granted Mar 17, 2026
Patent 12579973
Natural Language Processing With Contextual Data Associated With Content Displayed By a Computing Device
2y 5m to grant Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
78%
Grant Probability
92%
With Interview (+13.5%)
3y 1m
Median Time to Grant
Low
PTA Risk
Based on 984 resolved cases by this examiner. Grant probability derived from career allow rate.
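These tiles are consistent with simple arithmetic on the career record. A sketch of the assumed derivation, treating the interview lift as additive percentage points, which matches the displayed 78% and 92%:

# Assumed derivation of the projection tiles from the examiner's career data.
granted, resolved = 768, 984
grant_prob = granted / resolved               # 0.780 -> shown as 78%
interview_lift = 0.135                        # +13.5 percentage points
with_interview = grant_prob + interview_lift  # 0.916 -> shown as 92%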
