Prosecution Insights
Last updated: April 17, 2026
Application No. 18/531,540

DARWINIAN ELO FRAMEWORKS FOR CHATBOT EVALUATION

Non-Final OA: §101, §103, §112
Filed: Dec 06, 2023
Examiner: AGUILERA, TODD
Art Unit: 2192
Tech Center: 2100 — Computer Architecture & Software
Assignee: unknown
OA Round: 1 (Non-Final)
Grant Probability: 57% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 3y 8m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 57% (282 granted / 493 resolved; +2.2% vs TC avg)
Interview Lift: +57.1% allowance for resolved cases with an interview vs. without
Avg Prosecution: 3y 8m (37 applications currently pending)
Total Applications: 530 (across all art units)

Statute-Specific Performance

§101: 16.6% (-23.4% vs TC avg)
§103: 39.7% (-0.3% vs TC avg)
§102: 9.4% (-30.6% vs TC avg)
§112: 29.4% (-10.6% vs TC avg)
Deltas are against the Tech Center average estimate • Based on career data from 493 resolved cases
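The headline figures above can be reproduced from the displayed counts. A minimal sketch, assuming the dashboard's convention of "rate minus TC average" for the deltas (the TC averages themselves are implied values, not official USPTO statistics):

```python
# Derive the dashboard's headline numbers from the displayed counts.
# Counts (282 granted / 493 resolved) come from the page above; the
# Tech Center averages are implied, not official figures.

def allow_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage, rounded to one decimal."""
    return round(100.0 * granted / resolved, 1)

def delta_vs_tc(examiner_rate: float, tc_avg: float) -> float:
    """Signed difference between the examiner's rate and the TC average."""
    return round(examiner_rate - tc_avg, 1)

career = allow_rate(282, 493)                 # 57.2, shown as "57%"
implied_tc_avg = career - 2.2                 # "+2.2% vs TC avg" implies ~55.0
statute_101_delta = delta_vs_tc(16.6, 40.0)   # -23.4, matching the §101 row
```

The per-statute rows follow the same pattern: each delta is the examiner's statute-specific rate minus the implied Tech Center average.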

Office Action

§101 §103 §112
DETAILED ACTION

Remarks

The present application was filed 6 December 2023. Claims 1-14 are pending.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings

The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: 414 (Fig. 4). Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b), are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Objections

Claims 7, 11, 12 and 14 are objected to for the following informalities:

Claim 7 recites “wherein at least one match is defined by: wherein a match involves two models”, which does not appear to be grammatically correct and should perhaps read --wherein at least one match involves two models-- instead.

Claim 7 refers to “our P prompts”, which appears to be a typographical error that should perhaps read --P prompts-- instead.

Claim 11 refers to “leaving with”, which appears to be a typographical error that should perhaps read --leaving-- instead.

Claim 12 refers to “determines”, which appears to be a typographical error that should perhaps read --determining-- instead.
Claim 14 refers to “which one or more models are used”, which appears to be a typographical error that should perhaps read --which of one or more models are used-- instead.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-14 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

As to claim 1, the claim refers to “a set of top models” at line 10. It is not clear from the claims or specification what distinguishes a set of top models from a set of models that are not top models, and it does not appear from anything of record that one of ordinary skill in the art would nonetheless be able to make this distinction. See M.P.E.P. § 2173.05(b). It is assumed, for the purposes of examination, that “a set of top models” refers to any models having higher Elo scores than other models. Note that this is not suggested claim language.

As to claims 2-14, the claims are dependent on claim 1 and are rejected for the same reasons.

As to claim 9, the claim refers to “the K provisional matches.” There is insufficient antecedent basis for this limitation in the claim and it is unclear to which previously recited element, if any, the claim is referring.
For the purposes of examination, “the K provisional matches” will be interpreted as --K provisional matches--.

Further as to claim 9, the claim refers to “one of the prompts that are sampled.” It is unclear from this language how many prompts are “sampled”. For the purposes of examination, only one prompt will be construed as being sampled, i.e., “one of the prompts that are sampled” will be construed as --one of the prompts, which is sampled--.

Further as to claims 10-14, the claims are dependent on claim 9 and are rejected for the same reasons.

As to claim 13, the claim refers to “the baseline models”. There is insufficient antecedent basis for this limitation in the claim and it is unclear to which previously recited element, if any, the claim is referring. For the purposes of examination, “the baseline models” will be interpreted as --a plurality of baseline models--.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-14 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception without significantly more.
As to claim 1, the claim recites: [a] computerized method for Darwinian Elo frameworks for chatbot evaluation comprising: implementing an ad-hoc development testing, wherein the ad-hoc development testing comprises a first phase of chatbot evaluation; implementing a response generation, wherein once a model version used for response generation is ready by flagging the model version for evaluation arena candidacy; implementing a simulated Elo evaluation, wherein once one or more generations of models are generated for the candidate model, each candidate model is evaluated in a simulated evaluation arena, and wherein each of the one or more generations of models undergo regular matches against one another; and implementing a live Elo evaluation, wherein a set of top models are used in a live environment.

Although a process is claimed (Step 1), under the broadest reasonable interpretation in light of the specification the above underlined elements recite a mental process because they describe a process performable by the human mind with aid of pen and paper. Evaluation is a concept performable in the human mind per M.P.E.P. § 2106.04(a), and the human mind with aid of pen and paper is capable of “flagging” a model, in the sense that the human mind with aid of pen and paper is capable of marking a model for attention or specific treatment. The claim therefore recites an abstract idea. (Step 2A Prong 1).

None of the additional elements integrate the judicial exception into a practical application. (Step 2A Prong 2). Reference to the method as “computerized” only amounts to mere instructions to implement the abstract idea on a computer. See M.P.E.P. § 2106.05(f). And the “implementing a response generation”, as best understood by the examiner, only appears to describe using a chatbot to generate a response, which only describes a technological environment in which to apply the judicial exception.
This only amounts to insignificant extra-solution activity because those steps are merely nominal or tangential data gathering. See M.P.E.P. § 2106.05(h).

Looking at the claim limitations as an ordered combination yields the same conclusion as that reached when looking at the elements individually. Their collective function is merely to implement the abstract idea using a generic computer along with the necessary extra-solution data-gathering. The claim does not include additional elements that amount to significantly more than the judicial exception for substantially the same reasons discussed above with respect to a practical application. (Step 2B). Note that re-evaluation of the “implementing a response generation” noted above does not indicate that this element is anything more than what is well-understood, routine and conventional in the field, at least because Zheng et al., “Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings”, shows 9 popular LLMs all generating responses.

As to claim 2, the features of this claim do not integrate the abstract idea into a practical application or amount to significantly more because triggering the flagging “automatically” via CI/CD pipelines only amounts to mere instructions to implement this step of the abstract idea using generic computing components in a particular technological environment. See M.P.E.P. §§ 2106.05(f) and 2106.05(h).

As to claim 3, the features of this claim do not integrate the abstract idea into a practical application or amount to significantly more because adding new prompt configurations only further describes the abstract idea, because pushing a model to a model registry only appears to be nominal or tangential extra-solution activity, and because courts have recognized that storing and retrieving information in memory is well-understood, routine and conventional (see M.P.E.P. § 2106.05(d)).
As to claim 4, the features of this claim do not integrate the abstract idea into a practical application or amount to significantly more. Pushing a new model version only appears to be nominal or tangential extra-solution activity, and courts have recognized that storing and retrieving information in memory is well-understood, routine and conventional (see M.P.E.P. § 2106.05(d)). Prompting the model to generate a response is only insignificant pre-solution data gathering and is well-understood, routine and conventional as demonstrated by the prompting of various LLMs in Zheng et al., “Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings.”

As to claim 5, the features of this claim do not integrate the abstract idea into a practical application or amount to significantly more, at least because managing the responses in a document database as opposed to the human mind with aid of pen and paper only amounts to mere instructions to implement the abstract idea using generic computer components. See M.P.E.P. § 2106.05(f).

As to claim 6, the features of this claim do not integrate the abstract idea into a practical application or amount to significantly more, at least because they only further describe the abstract idea itself.

As to claim 7, the features of this claim do not integrate the abstract idea into a practical application or amount to significantly more, at least because they only further describe the abstract idea itself, apart from the passing of the prompt to an LLM, which is insignificant pre-solution data gathering that is well-understood, routine and conventional as set forth above with respect to claim 4.

As to claim 8, the features of this claim do not integrate the abstract idea into a practical application or amount to significantly more because they only amount to mere instructions to implement this step of the abstract idea using generic computing components or in a particular technological environment. See M.P.E.P. §§ 2106.05(f) and 2106.05(h).
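For context on the claim language at issue, the “simulated Elo evaluation” and “regular matches” recited in claim 1 presumably rely on the standard Elo update used by systems such as Chatbot Arena. A minimal sketch, using conventional defaults (K-factor of 32, ratings near 1000) rather than values from the application's specification:

```python
# Standard Elo update for one match between two models.
# K-factor and starting ratings are conventional defaults, not values
# taken from the application under examination.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b); score_a is 1 for an A win, 0 for a
    loss, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return (r_a + k * (score_a - e_a),
            r_b + k * ((1.0 - score_a) - (1.0 - e_a)))

# Two equally rated models: a win moves the winner up by K/2.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Running many such matches per generation of models is what produces the leaderboard rankings the claims and the cited Zheng reference describe.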
As to each of claims 9-14, the features of the claim do not integrate the abstract idea into a practical application or amount to significantly more, at least because they only further describe the abstract idea itself.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 1 is rejected under 35 U.S.C.
103 as being unpatentable over Zhang et al. (CN 116501592) (art made of record – hereinafter Zhang) in view of Zheng et al., “Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings” (art made of record – hereinafter Zheng). Note: Zhang is in Chinese. Citations to Zhang herein refer to the machine translation of the reference made of record with this action.

As to claim 1, Zhang discloses a computerized method for Darwinian Elo frameworks for chatbot evaluation comprising:

implementing an ad-hoc development testing, wherein the ad-hoc development testing comprises a first phase of chatbot evaluation; (e.g., Zhang, par. [0029]: the human interaction model in this application can be a pre-trained language model, a multimodal pre-trained model, etc. [A pre-trained model is a model for which development testing has been implemented because training a model by definition involves evaluating its accuracy or performance. That testing/training is ad-hoc in the sense that it is done for a particular purpose or need]; par. [0026]: during a session, the user inputs a command and the human-computer interaction system responds to that command. This is one round of dialogue [chat]; par. [0043] and subsequent two paragraphs: Information: What is the largest Animal on Earth? Response information: Blue whale is a marine mammal belonging to the family Baleen whales and the genus Baleen Whales [so the models are chatbots because they receive and respond to questions from the user])

implementing a response generation, (e.g., Zhang, par. [0034]: the server inputs the input information into the model and outputs the response results)

wherein once a model version used for response generation is ready by flagging the model version for evaluation arena candidacy; (e.g., Zhang, par. [0036]: users can send an evaluation request for the multi-turn capability of the new version of the human-computer interaction model [flag the model for evaluation candidacy]; par.
[0039]: the first server compares the evaluation of the multi-turn interaction capabilities of each human-computer interaction model. The comparison results guide users in selecting a model with stronger capabilities [the evaluation is an arena because the models compete against each other])

implementing a simulated Elo evaluation, (e.g., Zhang, par. [0043] and the following paragraph: an example of pre-annotated multi-turn interaction information is as follows: Input information: What is the largest animal on Earth?; pars. [0047]-[0048]: in subsequent steps the multi-turn capabilities of one or more human-computer interaction models are evaluated. Step S202: based on the pre-annotated multi-turn input information, input the input information into the human-computer interaction model [this is simulated evaluation because a human is not actually interacting with the model; pre-annotated input is used instead]; par. [0115]: the Elo rating of each human-computer interaction model is calculated. By outputting the Elo of each model, the relative quality of each model among models to be compared is displayed)

wherein once one or more generations of models are generated for the candidate model, each candidate model is evaluated in a simulated evaluation arena, (e.g., Zhang, par. [0039]: the first server compares the evaluation of the multi-turn interaction capabilities of each human-computer interaction model. The comparison results guide users in selecting a model with stronger capabilities)

and wherein each of the one or more generations of models undergo regular matches against one another; (e.g., Zhang, par. [0060]: by comparing the evaluation information of the multi-turn interaction capability of multiple human-computer interaction model versions [generations], a superior human-computer interaction model version can be determined; par.
[0036]: during the iterative optimization of the human-computer interaction model, the resulting new version is evaluated [so each time there is a new version, it is evaluated, i.e., regular evaluations; and since the versions are compared with other versions during the evaluation to determine the superior version, the versions are matched against each other]).

Zhang does not explicitly disclose implementing a live Elo evaluation, wherein a set of top models are used in a live environment. However, in an analogous art, Zheng discloses: implementing a live Elo evaluation, wherein a set of top models are used in a live environment (e.g., Zheng, p. 3, last two pars.: Chatbot Arena adopts the Elo rating system. We launched the arena with several popular open-sourced LLMs one week ago. In the arena, a user can chat with two anonymous models side-by-side and vote for which one is better [a live environment because users are actually chatting with the models]; pp. 1-2 [note that various models have higher Elo rankings than others]).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the model evaluation of Zhang to include implementing wherein a set of top models are used in a live environment, as taught by Zheng, as Zheng would provide the advantage of a means of evaluating model performance based on real “in the wild” use cases. (See Zheng, p. 3, last par.).

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang (CN 116501592) in view of Zheng (“Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings”) in further view of Goyer et al. (US 2024/0256242) (art made of record – hereinafter Goyer).
As to claim 2, Zhang/Zheng discloses the computerized method of claim 1 (see rejection of claim 1 above) and further discloses flagging the model version for evaluation arena candidacy (see rejection of claim 1 above), but does not explicitly disclose wherein the flagging the model version for evaluation arena candidacy is triggered automatically via a plurality of CI/CD pipelines.

However, in an analogous art, Goyer discloses: wherein the flagging the version is triggered automatically via a plurality of CI/CD pipelines (e.g., Goyer, Fig. 1 and associated text, par. [0061]: the validation system 104 may be an intermediary system between any number of developer devices and any number of CI/CD systems; par. [0031]: a software change request from a developer may be initially transmitted to the validation system 104, and then forwarded to an appropriate CI/CD system [this process, the process of each developer making a change request and forwarding the request to the appropriate CI/CD system, is a CI/CD pipeline; since there are multiple developers and multiple CI/CD systems, there are multiple pipelines]; par. [0024]: a software deployment request “(e.g., change event 106)” may include information describing [flagging] the change “(e.g., the updated source code or executable)”).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the flagging of model versions for evaluation in an evaluation arena taught by Zhang such that the versions are flagged automatically via a plurality of CI/CD pipelines, as taught by Goyer, as Goyer would provide the advantage of a means of automatically evaluating each new version once it has been created. (See Goyer, par. [0008]).

Claims 3-6 are rejected under 35 U.S.C.
103 as being unpatentable over Zhang (CN 116501592) in view of Zheng (“Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings”) in view of Goyer (US 2024/0256242) in further view of Ramanujan et al. (US 2024/0143414) (art made of record – hereinafter Ramanujan).

As to claim 3, Zhang/Zheng/Goyer discloses the computerized method of claim 2 (see rejection of claim 2 above). Zhang further discloses a new model version (e.g., Zhang, par. [0036]: a new version of the human-computer interaction model) but does not explicitly disclose wherein a new model version is pushed to a model registry and a new plurality of prompt configurations are added.

However, in an analogous art, Goyer discloses: wherein a new version is pushed to a registry (e.g., Goyer, par. [0032]: the developer may submit a request to check-in the modified code back into the application codebase; par. [0033]: the CI/CD system 102 may receive or upload the modified code and merge the modified code into the codebase within the shared repository). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the new model versions of Zhang by incorporating pushing the new version to a registry, as taught by Goyer, as Goyer would provide the advantage of a means of persisting the new version. (See Goyer, par. [0032]).

Further, in an analogous art, Ramanujan discloses: a new plurality of prompt configurations are added (e.g., Ramanujan, par. [0041]: through the modified input dataset 144 [new plurality of prompt configurations], the feedback layer enables the system 100 to evaluate the performance of the artificial intelligence model 126 under different scenarios and/or contexts; par. [0031]: the input data sets 106 are the data that the artificial intelligence model will process as part of the performance benchmark [data input to a model being a prompt]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the prompts taught by Zhang such that a new plurality of prompt configurations is added, as taught by Ramanujan, as Ramanujan would provide the advantage of a means of evaluating the performance of the model under different scenarios. (See Ramanujan, par. [0041]).

As to claim 4, Zhang/Zheng/Goyer/Ramanujan discloses the computerized method of claim 3 (see rejection of claim 3 above). Zhang further discloses: wherein for each of a plurality of curated prompts for evaluation, a new model version is pushed and is prompted to generate a response (e.g., Zhang, par. [0067]: uploading models; par. [0036]: a new version of the model; par. [0048]: input the input information [prompts] of each turn of input information into the human-computer interaction model; par. [0028]: pre-annotated [curated] input information [prompts]; par. [0056]: Input information: What is the largest animal on Earth?).

As to claim 5, Zhang/Zheng/Goyer/Ramanujan discloses the computerized method of claim 4 (see rejection of claim 4 above). Zhang further discloses: wherein the responses are managed in a database (e.g., Zhang, par. [0056]: the first server can receive the result output by the human-computer interaction model [the response must be stored to exist within a computer; that storage is the database]; par. [0139]: the memory 601 can be configured to store various other data to support operations of the server). Zhang/Zheng/Goyer does not explicitly disclose a document database.

However, in an analogous art, Ramanujan discloses: a document database (e.g., Ramanujan, par.
[0082]: datastores 726 can store documents, data structures and/or other data utilized by any application program). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the database of Zhang such that it is a document database, as taught by Ramanujan, as Ramanujan would provide the advantage of a means of storing documents or data in documents. (See Ramanujan, par. [0082]).

As to claim 6, Zhang/Zheng/Goyer/Ramanujan discloses the computerized method of claim 5 (see rejection of claim 5 above). Zhang further discloses: wherein each of the one or more generations of models undergo at least one match against one another (e.g., Zhang, par. [0060]: by comparing the evaluation information of the multi-turn interaction capability of multiple human-computer interaction model versions [generations], a superior human-computer interaction model version can be determined).

Claims 7-10 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang (CN 116501592) in view of Zheng (“Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings”) in view of Goyer (US 2024/0256242) in view of Ramanujan (US 2024/0143414) in further view of Kieffer et al. (US 2025/0068937) (art made of record – hereinafter Kieffer) and Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (art made of record – hereinafter ZhengB).

As to claim 7, Zhang/Zheng/Goyer/Ramanujan discloses the computerized method of claim 6 (see rejection of claim 6 above). Zhang further discloses wherein the at least one match is defined by: wherein a match involves two models, (e.g., Zhang, par.
[0037]: the first server compares the evaluation information of the new version of the model with the evaluation information of the previous version). Zhang/Zheng does not explicitly disclose wherein a random prompt from our P prompts is selected, and the generation from each model for this prompt is passed to an LLM along with a prompt to evaluate which response is better.

However, in an analogous art, Kieffer discloses: wherein a random prompt from our P prompts is selected (e.g., Kieffer, par. [0080]: a number of additional verification questions [prompts] may be selected at random from the verification question repository; par. [0028]: verification questions “(e.g., questions that test the artificial intelligence model’s accuracy on particular subject matter)” and present those questions to the artificial intelligence model). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the prompts taught by Zhang to include selecting random prompts from P prompts, as taught by Kieffer, as Kieffer would provide the advantage of a means to limit the ability of the model to learn trends in the interrogation process. (See Kieffer, par. [0069]).

Further, in an analogous art, ZhengB discloses wherein the at least one match is defined by: the generation from each model for this prompt is passed to an LLM along with a prompt to evaluate which response is better (e.g., ZhengB, p. 4, last par.: GPT-4 [an LLM] is tasked to evaluate two responses [generations] from GPT-3.5 [a model] and Vicuna-13B [a model] to an open-ended question [prompt]; p. 4, Sec. 3.1: an LLM judge is presented with a question [prompt] and two answers [generations], and tasked to determine which one is better).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the questions provided to chatbots, the answers to those questions from the chatbots, and the judgement of a winning answer taught by Zhang/Zheng, by incorporating passing the generated answer from each chatbot to an LLM along with the prompt to perform the judging, as taught by ZhengB, as ZhengB would provide the advantage of a means of providing a scalable and expandable way of approximating human judgements. (See ZhengB, p. 1, abstract).

As to claim 8, Zhang/Zheng/Goyer/Ramanujan/Kieffer/ZhengB discloses the computerized method of claim 7 (see rejection of claim 7 above), but Zhang does not explicitly disclose wherein the LLM comprises a GPT-4 LLM as a judge model. However, in an analogous art, ZhengB discloses wherein the LLM comprises a GPT-4 LLM as a judge model (e.g., ZhengB, p. 4, last par.: GPT-4 [an LLM] is tasked to evaluate [judge] two responses from GPT-3.5 and Vicuna-13B to an open-ended question). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the questions provided to chatbots, the answers to those questions from the chatbots, and the judgement of a winning answer taught by Zhang/Zheng, by incorporating passing the generated answer from each chatbot to a GPT-4 LLM along with the prompt to perform the judging, as taught by ZhengB, as ZhengB would provide the advantage of a means of providing a scalable and expandable way of approximating human judgements. (See ZhengB, p. 1, abstract).

As to claim 9, Zhang/Zheng/Goyer/Ramanujan/Kieffer/ZhengB discloses the computerized method of claim 8 (see rejection of claim 8 above). Zhang further discloses: wherein each of the K provisional matches comprises the candidate model, one of the other models and one of the prompts (e.g., Zhang, par.
[0034]: the server inputs the input information [prompt, see above] into the model and outputs the response results; par. [0039]: the first server compares the evaluation of the multi-turn interaction capabilities of each human-computer interaction model. The comparison results guide users in selecting a model with stronger capabilities), but does not explicitly disclose one of the prompts that are sampled as per a sampling policy.

However, in an analogous art, Kieffer discloses: one of the prompts that are sampled as per a sampling policy (e.g., Kieffer, par. [0080]: a number of additional verification questions [prompts] may be selected [sampled] at random [a policy] from the verification question repository; par. [0028]: verification questions “(e.g., questions that test the artificial intelligence model’s accuracy on particular subject matter)” and present those questions to the artificial intelligence model). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the prompts taught by Zhang to include prompts sampled per a policy, such as by selecting random prompts from P prompts, as taught by Kieffer, as Kieffer would provide the advantage of a means to limit the ability of the model to learn trends in the interrogation process. (See Kieffer, par. [0069]).

As to claim 10, Zhang/Zheng/Goyer/Ramanujan/Kieffer/ZhengB discloses the computerized method of claim 9 (see rejection of claim 9 above), but Zhang does not explicitly disclose wherein after the provisional matches are complete, performing a smoothing match. However, in an analogous art, Zheng discloses: wherein after the provisional matches are complete, performing a smoothing match (e.g., Zheng, p. 4, last par.: after getting the responses from the two models, users can vote for the model they think is better. Once a vote is submitted, users can restart a new battle with two new randomly chosen models [a smoothing match; per par.
[0043] of the specification, smoothing matches are additional matches]; p. 9: you can visit [a website] to vote for better models). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the provisional matches taught by Zhang to include performing a smoothing match after the provisional matches, as taught by Zheng, as Zheng would provide the advantage of a means of performing a battle with additional models. (See Zheng, p. 4 last par.).

Claims 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang (CN 116501592) in view of Zheng (“Chatbot arena: Benchmarking LLMs in the Wild with Elo Ratings”) in view of Goyer (US 2024/0256242) in view of Ramanujan (US 2024/0143414) in view of Kieffer (US 2025/0068937) in view of ZhengB (“Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”) in further view of Dahl et al. (US 2009/0036214) (art made of record – hereinafter Dahl).

As to claim 11, Zhang/Zheng/Goyer/Ramanujan/Kieffer/ZhengB discloses the computerized method of claim 10 (see rejection of claim 10 above), but Zhang does not explicitly disclose wherein at the end of a specified number of matches, a lowest-rated model on a model leaderboard is deprecated, leaving a preset number of models. However, in an analogous art, Zheng discloses a lowest-rated model on a model leaderboard and models (e.g., Zheng, pp. 1-2: a benchmark platform for large language models that features anonymous randomized battles. In this blog post, we are releasing a leaderboard). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the comparison of model performance taught by Zhang to include a leaderboard of models with a lowest-rated model, as taught by Zheng, as Zheng would provide the advantage of a means of identifying the relative performance of each model with respect to the others. (See Zheng, pp. 1-2).
Further, in an analogous art, Dahl discloses: wherein at the end of a specified number of matches, a lowest-rated object on an object leaderboard is deprecated, leaving a preset number of objects (e.g., Dahl, par. [0088]: in a first round of games, each team meets another team and after the first round of games, the teams are arranged in a table. After a number of rounds at least one team is removed from the table; Fig. 3 and associated text, par. [0064]: the computational device is arranged to remove at least one object from the table. This is done periodically, for instance once after each event round [pre-specified number of matches] or after a number of event rounds. The computational device removes a pre-selected amount of lowest ranked objects from the table; par. [0077]: the ranking process according to the present invention may be stopped when a pre-determined number of objects remain). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the leaderboard of models taught by Zhang/Zheng such that at the end of a specified number of matches, a lowest-rated object on the leaderboard is deprecated, leaving a preset number of objects, as taught by Dahl, as Dahl would provide the advantage of a means of providing a ranking of a pre-determined number of top models (see Dahl, par. [0077], [0062]), such as in the case of a top ten list.

As to claim 12, Zhang/Zheng/Goyer/Ramanujan/Kieffer/ZhengB/Dahl discloses the computerized method of claim 11 (see rejection of claim 11 above), and Zhang further discloses: wherein based on a specified performance metric, determining that the candidate model has outperformed a set of other models (e.g., Zhang, par. [0115]: the Elo rating of each model is calculated and the average is taken as the relative evaluation information.
By comparing the average Elo rating of each model, the relative quality of each model among models can be displayed; par. [0117]: by comparing the evaluation information, a superior model version can be determined).

Claims 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang (CN 116501592) in view of Zheng (“Chatbot arena: Benchmarking LLMs in the Wild with Elo Ratings”) in view of Goyer (US 2024/0256242) in view of Ramanujan (US 2024/0143414) in view of Kieffer (US 2025/0068937) in view of ZhengB (“Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”) in view of Dahl (US 2009/0036214) in further view of Cross et al. (US 2017/0315523) (art made of record – hereinafter Cross).

As to claim 13, Zhang/Zheng/Goyer/Ramanujan/Kieffer/ZhengB/Dahl discloses the computerized method of claim 12 (see rejection of claim 12 above), but does not explicitly disclose wherein one of the baseline models is replaced by the candidate model. However, in an analogous art, Cross discloses: wherein one of the baseline models is replaced by the candidate model (e.g., Cross, par. [0157]: to modify one or more existing models from the pool, such as to remove one or more existing models based on analysis of their past performance “(e.g., to remove the lowest ranked M forecasting models)” and/or to add one or more new forecasting models to the pool [so the new model replaces the one(s) removed]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the ranked models of Zhang/Zheng such that one of the baseline models is replaced by a candidate model, as taught by Cross, as Cross would provide the advantage of a means of providing a constant number of highest performing models. (See Cross, par. [0074]).
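Taken together, the claim elements mapped above describe one evaluation loop: sample one of P prompts per a sampling policy (claim 9, Kieffer), run a match judged by an LLM (claims 7-8, ZhengB), update Elo ratings and compare them (claim 12, Zhang), and deprecate the lowest-rated model so a candidate can take a baseline's place (claims 11 and 13, Dahl/Cross). The general technique can be sketched in Python. This is a hypothetical illustration only; the function names, K-factor, toy judge, and model names are assumptions, not the applicant's or any cited reference's actual implementation:

```python
import random

def expected_score(r_a, r_b):
    # Standard Elo expected score for a model rated r_a against one rated r_b.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def play_match(ratings, model_a, model_b, prompt, judge, k=32):
    """One match: both models answer the same sampled prompt, a judge
    picks the winner, and both Elo ratings are updated."""
    winner = judge(prompt, model_a, model_b)
    loser = model_b if winner == model_a else model_a
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_win)
    ratings[loser] -= k * (1.0 - e_win)
    return winner

def deprecate_lowest(ratings, preset_size):
    """After a specified number of matches, remove the lowest-rated
    model(s) until a preset number of models remains."""
    while len(ratings) > preset_size:
        del ratings[min(ratings, key=ratings.get)]

# Toy setup (hypothetical): three models, a trivial judge that always
# favors the candidate, and a uniform-random prompt-sampling policy.
prompts = ["prompt-1", "prompt-2", "prompt-3"]
ratings = {"baseline-1": 1000.0, "baseline-2": 1000.0, "candidate": 1000.0}
judge = lambda prompt, a, b: "candidate" if "candidate" in (a, b) else a
rng = random.Random(0)

for baseline in ("baseline-1", "baseline-2"):
    prompt = rng.choice(prompts)  # sampling policy: uniform random over P prompts
    play_match(ratings, "candidate", baseline, prompt, judge)

deprecate_lowest(ratings, preset_size=2)  # lowest-rated model is dropped
```

Under these assumptions the candidate wins both provisional matches, its rating rises above the 1000-point start, and the lowest-rated baseline is deprecated, leaving the preset number of two models on the leaderboard.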
As to claim 14, Zhang/Zheng/Goyer/Ramanujan/Kieffer/ZhengB/Dahl discloses the computerized method of claim 12 (see rejection of claim 12 above), but does not explicitly disclose wherein an update to which one or more models are used is performed per a specified policy comprising each time a new candidate model replaces an existing baseline model or based on a manual trigger. However, in an analogous art, Cross discloses: wherein an update to which one or more models are used is performed per a specified policy comprising each time a new candidate model replaces an existing baseline model or based on a manual trigger (e.g., Cross, par. [0157]: to modify one or more existing models from the pool, such as to remove one or more existing models based on analysis of their past performance “(e.g., to remove the lowest ranked M forecasting models)” and/or to add one or more new forecasting models to the pool [so the new model replaces the one(s) removed]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the ranked models of Zhang/Zheng by incorporating updating which models are used per a specified policy comprising each time a new candidate model replaces an existing baseline model, as taught by Cross, as Cross would provide the advantage of a means of providing a number of highest performing models. (See Cross, par. [0074]).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TODD AGUILERA whose telephone number is (571)270-5186. The examiner can normally be reached M-F 11AM - 7:30PM EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hyung S Sough, can be reached at (571)272-6799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/TODD AGUILERA/
Primary Examiner, Art Unit 2192

1. There are at least two paragraphs labeled [0056] in Zhang.

Prosecution Timeline

Dec 06, 2023
Application Filed
Mar 19, 2026
Non-Final Rejection — §101, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12596638
SYSTEMS AND METHODS FOR SELECTING TEST COMBINATIONS OF HARDWARE AND SOFTWARE FEATURES FOR FEATURE VALIDATION
2y 5m to grant Granted Apr 07, 2026
Patent 12554623
AUTOMATIC METAMORPHIC TESTING
2y 5m to grant Granted Feb 17, 2026
Patent 12554627
TESTING FRAMEWORK WITH DYNAMIC APPLICABILITY MANAGEMENT
2y 5m to grant Granted Feb 17, 2026
Patent 12547532
CONFIGURATION-BASED SYSTEM AND METHOD FOR HANDLING TRANSIENT DATA IN COMPLEX SYSTEMS
2y 5m to grant Granted Feb 10, 2026
Patent 12541352
CONTROLLING INSTALLATION OF DRIVERS BASED ON HARDWARE AND SOFTWARE COMPONENTS PRESENT ON INFORMATION TECHNOLOGY ASSETS
2y 5m to grant Granted Feb 03, 2026
Based on the 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
57%
Grant Probability
99%
With Interview (+57.1%)
3y 8m
Median Time to Grant
Low
PTA Risk
Based on 493 resolved cases by this examiner. Grant probability derived from career allow rate.
