DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDSs) submitted on 07/09/2024, 01/15/2025, and 10/23/2025 were filed in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Ruochen Zhao et al ("Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-Battles and Committee Discussions", ARXIV.ORG, Cornell University Library, 201 Olin Library Cornell University Ithaca, NY 14853, 30 May 2024 (2024-05-30), XP091772881, IDS supplied).
Regarding Claim 1, Ruochen Zhao et al discloses a non-transitory computer-readable media storing computer instructions which when executed by one or more processors of a device cause the device to evolve a large language model (LLM)-based chatbot (Peer Battle) (page 2, Figure 2, Section 3.2) over at least one iteration (Overall, the peer battle consists of 3 rounds, where the candidates take turns to speak) (page 2, Figure 2, Section 3.2) that includes: presenting, by a large language model (LLM)-based evaluator, a question to a LLM-based chatbot during a dialog with the LLM-based chatbot (For debate questions, as using a static dataset could incur data contamination concerns and result in unfair evaluations, we ask an LLM examiner agent to dynamically generate questions. The examiner agent could be any capable LLM) (page 4, Section 3.1) comprised of a sequence of question and answer pairs (The process is illustrated in Figure 2. In the first round, A gives an initial response to the examiner's question; B criticizes the weaknesses in A's response and raises a targeted follow-up question; and A responds to B's question) (page 2, Figure 2, Section 3.2); receiving, by the LLM-based evaluator, an answer to the question from the LLM-based chatbot (In the first round, A gives an initial response to the examiner's question) (page 2, Figure 2, Section 3.2: Peer Debate); evaluating, by the LLM-based evaluator, the answer according to one or more evaluation metrics and a ground truth (Given the questions, the LLM-produced answers are compared to ground-truth answers using metrics such as accuracy) (page 1, Section 1); determining, by the LLM-based evaluator, that a result of the evaluation is unsatisfactory (Candidate A (powered by Yi-34B-Chat) gives a wrong answer as it miscounts occurrences for repeated letters and miscalculates factorials) (page 9, Section 5.1); and presenting, by the LLM-based evaluator, a follow-up question to the LLM-based chatbot designed to 
encourage a new answer of the LLM-based chatbot (The opponent B (powered by Claude-3-Haiku) quickly and precisely points out these two issues and skillfully raised a follow-up that targets A's weaknesses: "how about the word 'BANANA'?") (page 9, Section 5.1) to be satisfactory with respect to the ground truth and to cause an optimization of the LLM-based chatbot (Given the questions, the LLM-produced answers are compared to ground-truth answers using metrics such as accuracy) (page 1, Section 1).
Regarding Claim 2, Ruochen Zhao et al discloses the non-transitory computer-readable media, wherein the LLM-based chatbot is evolved over a plurality of iterations each corresponding to a different question and answer pair in the sequence of question and answer pairs (Overall, the peer battle consists of 3 rounds, where the candidates take turns to speak. The entire dialogue history is visible to both candidates. The process is illustrated in Figure 2. In the first round, A gives an initial response to the examiner's question; B criticizes the weaknesses in A's response and raises a targeted follow-up question; and A responds to B's question. In the second round, A and B are reversed: B gives an initial response to the examiner's question (without seeing A's response); A criticizes and raises questions; and B responds to A's question. In the third round, A and B cross-examine each other. A starts by criticizing B's previous loopholes and raises follow-up questions. After responding, B also criticizes A's loopholes and raises questions. A concludes the battle by responding again. In this process, both A and B get an equal number of each action to ensure fairness. To further reduce position bias, A and B's order is randomly shuffled at the beginning of each debate) (pages 4 and 5, Section 3.2).
Regarding Claim 3, Ruochen Zhao et al discloses the non-transitory computer-readable media, wherein when the LLM-based evaluator determines that a result of the evaluation for a given question and answer pair is satisfactory with respect to the ground truth (Given the questions, the LLM-produced answers are compared to ground-truth answers using metrics such as accuracy) (page 1, Section 1), then the LLM-based evaluator begins a next iteration of the plurality of iterations (Overall, the peer battle consists of 3 rounds, where the candidates take turns to speak. The entire dialogue history is visible to both candidates. The process is illustrated in Figure 2. In the first round, A gives an initial response to the examiner's question; B criticizes the weaknesses in A's response and raises a targeted follow-up question; and A responds to B's question. In the second round, A and B are reversed: B gives an initial response to the examiner's question (without seeing A's response); A criticizes and raises questions; and B responds to A's question. In the third round, A and B cross-examine each other. A starts by criticizing B's previous loopholes and raises follow-up questions. After responding, B also criticizes A's loopholes and raises questions. A concludes the battle by responding again. In this process, both A and B get an equal number of each action to ensure fairness. To further reduce position bias, A and B's order is randomly shuffled at the beginning of each debate) (pages 4 and 5, Section 3.2).
Regarding Claim 4, Ruochen Zhao et al discloses the non-transitory computer-readable media, wherein the evaluating of the answer is further performed according to prior question and answer pairs occurring in the dialog (In the third round, A and B cross-examine each other. A starts by criticizing B's previous loopholes and raises follow-up questions. After responding, B also criticizes A's loopholes and raises questions. A concludes the battle by responding again. In this process, both A and B get an equal number of each action to ensure fairness. To further reduce position bias, A and B's order is randomly shuffled at the beginning of each debate) (pages 4 and 5, Section 3.2).
Regarding Claim 5, Ruochen Zhao et al discloses the non-transitory computer-readable media, wherein the one or more evaluation metrics include one or more automatically calculable natural language processing (NLP) measures (This is a competitive chatbot arena. You are competing against another chatbot assistant in a debate and being judged by a committee on factors such as helpfulness, relevance, accuracy, depth, and creativity) (Page 15, Section A.1.2, Prompts).
Regarding Claim 6, Ruochen Zhao et al discloses the non-transitory computer-readable media, wherein evaluating, by the LLM-based evaluator, the answer according to the one or more evaluation metrics and the ground truth includes: calculating a score for the answer based on the one or more evaluation metrics and the ground truth (For logical-reasoning questions that have ground-truth answers (reasoning, code, math), LLM-as-a-judge is known to show weak performances in judging the quality of responses. We adopt prior approaches to establish the reference-based judge [32]. Specifically, we utilize the strongest model (according to the current ranking) to generate a reference answer and provide it to the judge when evaluating the peer battle) (page 5, Section 3.3).
Regarding Claim 7, Ruochen Zhao et al discloses the non-transitory computer-readable media, wherein the result of the evaluation is unsatisfactory when the score is below a predefined threshold (In the first round, the committee is initialized with MMLU [15] scores to approximate LLM performances. They will first be asked to read through the battle history, elaborate judgment reasons, and give a verdict on whether A is better, or B is better, or if there is a tie) (page 5, Section 3.3).
Regarding Claim 8, Ruochen Zhao et al discloses the non-transitory computer-readable media, wherein the LLM-based evaluator presents up to a threshold number of follow-up questions until the new answer of the LLM-based chatbot is evaluated to be satisfactory with respect to the ground truth (Each pair of candidates engage in 40 peer battles, with 5 questions from each of the 8 categories. The questions are generated by GPT-4. As each battle consists of 3 rounds (each candidate speaks for 4 times), we expect the competition scale to be approximately the same as MT-Bench (80 questions, each candidate speaks twice)) (Page 6, Section 4.1).
Regarding Claim 9, Ruochen Zhao et al discloses the non-transitory computer-readable media, wherein when the LLM-based evaluator presents the threshold number of follow-up questions without the new answer of the LLM-based chatbot being evaluated as satisfactory with respect to the ground truth, then an error analysis is caused to be performed on the LLM-based chatbot (Each pair of candidates engage in 40 peer battles, with 5 questions from each of the 8 categories. The questions are generated by GPT-4. As each battle consists of 3 rounds (each candidate speaks for 4 times), we expect the competition scale to be approximately the same as MT-Bench (80 questions, each candidate speaks twice)) (Page 6, Section 4.1).
Regarding Claim 10, Ruochen Zhao et al discloses the non-transitory computer-readable media, wherein the LLM-based chatbot is initially trained on a dataset comprised of individual question and answer pairs (One line of research conducts automatic evaluation with static datasets. Among these, static datasets with predefined metrics, such as GSM8k [9] and MMLU [15], are constructed with aspect-specific input-output pairs, such as questions and their corresponding answers) (page 1, Section 1).
Regarding Claim 11, Ruochen Zhao et al discloses the non-transitory computer-readable media, wherein the LLM-based chatbot is evolved to include a multi-turn question and answer dataset (Secondly, two candidate LLMs interact with each other and engage in a multi-round peer battle by answering the seed question individually, criticizing the opponent's weaknesses, and raising targeted follow-up queries to challenge the opponent further) (page 2, Section 1).
Regarding Claim 12, Ruochen Zhao et al discloses the non-transitory computer-readable media, wherein the device is further caused to: output the evolved LLM-based chatbot for use (A noticeable example is Chatbot Arena [32], which is a crowdsourced voting platform that gathers anonymous votes on LLM performances and calculates ELO scores to rank these models) (page 2, Section 1).
Claims 13 and 20 are rejected for the same reason as claim 1.
Claim 14 is rejected for the same reason as claim 2.
Claim 15 is rejected for the same reason as claim 3.
Claim 16 is rejected for the same reason as claim 4.
Claim 17 is rejected for the same reason as claim 5.
Claim 18 is rejected for the same reason as claim 6.
Claim 19 is rejected for the same reason as claim 7.
Cited Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Barron et al. (US 2024/0311407) discloses an artificial intelligence agricultural advisor chatbot system powered by large language models (LLMs) and customized for the agricultural domain using a blend of agricultural datasets can include tools providing custom context relevant to user queries.
Gado et al. (US 2025/0384280) discloses training data generation for large language model (LLM) training and/or benchmarking.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SATWANT K SINGH whose telephone number is (571)272-7468. The examiner can normally be reached Monday through Friday, 9:00 AM to 6:00 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Paras D Shah, can be reached at (571)270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SATWANT K SINGH/Primary Examiner, Art Unit 2653