Prosecution Insights
Last updated: April 19, 2026
Application No. 18/759,471

EVALUATION OF ARTIFICIAL INTELLIGENCE MODELS USABLE FOR A CONVERSATIONAL ARTIFICIAL INTELLIGENCE SYSTEM

Final Rejection §103
Filed: Jun 28, 2024
Examiner: HOLZMACHER, DERICK J
Art Unit: 3625
Tech Center: 3600 — Transportation & Electronic Commerce
Assignee: PayPal, Inc.
OA Round: 2 (Final)
Grant Probability: 44% (Moderate)
Expected OA Rounds: 3-4
Time to Grant: 3y 3m
Grant Probability with Interview: 73%

Examiner Intelligence

Career Allow Rate: 44% of resolved cases (120 granted / 270 resolved; -7.6% vs TC avg)
Interview Lift: +28.4%, a strong lift for resolved cases with an interview
Typical Timeline: 3y 3m average prosecution; 33 applications currently pending
Career History: 303 total applications across all art units
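The examiner statistics above are simple ratios; as a minimal sketch (function name is my own, counts taken from the figures in this report):

```python
# Sketch of how the examiner statistics above are computed.
# The function name is illustrative; the counts come from this report.

def allow_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage of resolved cases."""
    return 100.0 * granted / resolved

career = allow_rate(120, 270)   # 120 granted / 270 resolved
pending = 303 - 270             # total applications minus resolved cases
print(f"{career:.1f}% allow rate, {pending} pending")
# → 44.4% allow rate, 33 pending
```

The +28.4% interview lift is the percentage-point difference between allow rates with and without an interview; the underlying with/without counts are not shown in this report, so they are omitted from the sketch.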

Statute-Specific Performance

§101: 42.6% (+2.6% vs TC avg)
§103: 28.9% (-11.1% vs TC avg)
§102: 6.0% (-34.0% vs TC avg)
§112: 16.1% (-23.9% vs TC avg)

Tech Center averages are estimates. Based on career data from 270 resolved cases.
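The "vs TC avg" figures are percentage-point differences between this examiner's per-statute allowance rate and the Tech Center average. A small sketch (examiner rates taken from the table above; the TC averages are back-derived from the stated deltas, not independently sourced):

```python
# Examiner per-statute allowance rates vs. Tech Center averages,
# in percentage points. The TC averages are back-derived from the
# deltas stated in this report (each works out to 40.0 here).
examiner_rate = {"101": 42.6, "103": 28.9, "102": 6.0, "112": 16.1}
tc_average = {"101": 40.0, "103": 40.0, "102": 40.0, "112": 40.0}

delta = {s: round(examiner_rate[s] - tc_average[s], 1) for s in examiner_rate}
print(delta)  # {'101': 2.6, '103': -11.1, '102': -34.0, '112': -23.9}
```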

Office Action

§103
DETAILED ACTION

1. The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA. The following FINAL office action is in response to the Applicant communication filed on 12/01/2025 regarding application 18/759,471. Claims 1-5, 8-16 and 19 have been amended. Claims 1-20 are currently pending and have been rejected.

Response to Amendments

2. Applicant's amendment filed on 12/01/2025 necessitated the new grounds of rejection set forth in this office action.

Information Disclosure Statements (IDS)

3. The Information Disclosure Statement (IDS) filed on 01/22/2026 complies with the provisions of 37 CFR 1.97, 1.98 and MPEP § 609 and has been considered by the Examiner.

Foreign Priority

4. The Examiner notes that Applicant claims Foreign Priority from Application Nos. IN202341054463, filed on 08/14/2023, and IN202341076442, filed on 11/08/2023, and that Child Application PCT/US24/40299 of Case No. 18/759,471 was filed on 07/31/2024. Moreover, receipt is acknowledged of papers submitted under 35 U.S.C. § 119(a)-(d), which papers have been placed of record in the file. The earliest effective filing date examined for this application is 08/14/2023.

Response to Arguments

5. Applicant's arguments, see page 8 of 13, filed on 12/01/2025, with respect to the 35 U.S.C. § 112(b) Claim Rejection of Claim 9 have been fully considered and are found to be persuasive. Therefore, the 35 U.S.C. § 112(b) Claim Rejection of Claim 9 has been withdrawn.

6. Applicant's arguments, see pages 8-11 of 13, filed on 12/01/2025, with respect to the 35 U.S.C. § 101 Claim Rejections of Claims 1-20 have been fully considered and are found to be persuasive. Therefore, the 35 U.S.C. § 101 Claim Rejections of Claims 1-20 have been withdrawn. See the 35 U.S.C. § 101 Subject Matter Eligibility Analysis section below explaining why Claims 1-20 are deemed patent eligible under 35 U.S.C. § 101.

7. Applicant's arguments, see pages 11-12 of 13, filed on 12/01/2025, with respect to the 35 U.S.C. § 103 Claim Rejections of Claims 1-20 have been fully considered but are not found to be persuasive. Applicant's arguments with respect to Claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

35 U.S.C. § 101 Subject Matter Eligibility Analysis

8. 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

9. Step 1: Claims 1-20 are directed to a statutory category, namely, a “system” or an “apparatus” (Claims 1-7), a “method” or a “process” (Claims 8-14), and a “non-transitory machine-readable medium” or “article of manufacture” (Claims 15-20).
Step 2A Prong One: Independent Claims 1, 8 and 15 recite limitations that set forth the abstract idea(s), namely (see in bold except where strikethrough): “monitor a conversation between an model and a user conducted during a chat session, wherein the model is trained to provide a first instructions for performing a task based on a first utterance provided by the user during the chat session” (see Independent Claim 1); “derive a set of quality metrics associated with the model based on a second utterance provided by the user during the chat session” (see Independent Claim 1); “determine that a performance level of the model in generating instructions is below a threshold based on the set of quality metrics” (see Independent Claim 1); “generate training data for retraining the model, wherein the training data comprises a second instructions usable and generated based on the first utterance” (see Independent Claim 1); “retrain the model using the training data” (see Independent Claim 1); “monitoring, , interactions between an model during an online chat session, wherein the interactions comprise a first utterance provided by a user to the model , and wherein the model is configured and trained for performing tasks based on user submitted utterances” (see Independent Claim 8); “provided, by the model, a first instruction for performing a task for the user, wherein the first instruction is generated by the model based on the first utterance provided by the user” (see Independent Claim 8); “deriving a set of quality metrics associated with the model based on a second utterance provided by the user during the online chat session” (see Independent Claim 8); “determining that a performance level of the model in generating instructions is below a threshold based on the set of quality metrics” (see Independent Claim 8); “in response to determining that the performance level is below
the threshold, generating training data for retraining the model, wherein the training data comprises a second instruction usable and generated based on the first utterance” (see Independent Claim 8); “retraining the model using the training data” (see Independent Claim 8); “accessing a conversation conducted between an model and a user during a chat session, wherein the conversation comprises a first utterance provided by the user, and wherein the model is configured and trained to provide a first instruction for performing a task for the user based on the first utterance” (see Independent Claim 15); “deriving a set of quality metrics associated with the model based on a second utterance provided by the user during the chat session” (see Independent Claim 15); “determining that a performance level of the model in generating instructions is below a threshold based on the set of quality metrics” (see Independent Claim 15); “generating training data for retraining the model, wherein the training data comprises a second instruction usable and generated based on the first utterance” (see Independent Claim 15); “retraining the model using the training data” (see Independent Claim 15). Here, for Independent Claims 1, 8 and 15, the system is directed to the abstract idea of generating, evaluating, and improving conversational AI instructions based on human feedback. The first step of "Monitor a conversation between an AI model and a user... to provide a first computer module instruction” describes monitoring a conversation and interpreting it to generate a command, which is a process that can be performed in the human mind or by a human reviewing chatlogs. It is a "collection of information" and "translation" of user intent into a command. This is classified under the “Mental Process” category as (receiving information/observing) or “Certain Method of Organizing Human Activities” category as (managing user interaction).
This process involves tracking user input (utterances) in a chat interface to analyze interaction. Monitoring, collecting, and storing conversational data is a fundamental activity that humans do (e.g., watching a service agent) and can be done mentally. The second step of "Derive a set of quality metrics associated with the AI model based on a second utterance" involves analyzing data (user utterances) to calculate metrics (e.g., accuracy, satisfaction). Deriving metrics is a form of mathematical calculation and assessment of performance. This is classified under the “Mathematical Concept” category (calculations) or the “Mental Process” category (comparing/evaluating). This is a Mental Process. It involves analyzing data (conversations) to generate, analyze, and apply metrics (e.g., accuracy, precision). Calculating metrics and interpreting conversational quality based on user input is a cognitive, mathematical, and analytical process. The third step of "Determine that a performance level of the AI model... is below a threshold" is analyzed as follows: comparing a calculated value to a threshold is a fundamental mental step of judgment or evaluation, which is categorized under “Mental Processes” as (evaluations/judgments). It is a comparison step: comparing calculated metrics against a predetermined threshold (performance evaluation). Determining if a result is "good" or "bad" compared to a standard is a mental activity, akin to grading or evaluating a person. The fourth step of "Generate training data for retraining the AI model, wherein the training data comprises a second computer module instruction" is analyzed as follows: creating training data sets is analogous to organizing information or creating a checklist to manage how an agent learns, which is categorized under “Certain Methods of Organizing Human Activities” as (compiling, organizing and selecting data). It is the act of formulating instructions (creating data) based on the previous evaluation.
The information content (the instruction) is not intrinsically technological. The last step of "Retrain the AI model using the training data" describes retraining, which involves applying learning algorithms to update weights, which is a mathematical process, and is categorized under “Mental Process” (learning/improving) or “Mathematical Concept” (algorithm updating). Therefore, these abstract idea limitations (as identified above in bold), under their broadest reasonable interpretation of the claims as a whole, cover performance of their limitations as “Mental Processes” which pertains to (1) concepts performed in the human mind (including observations or evaluations or judgments) or (2) using pen and paper as a physical aid to help perform these mental steps, which does not negate the mental nature of these limitations. The use of "physical aids" in implementing the abstract mental process does not preclude the claim from reciting an abstract idea. See MPEP § 2106.04(a) III C. Additionally, or alternatively, these abstract idea limitations (as identified above in bold), under their broadest reasonable interpretation of the claims as a whole, cover performance of their limitations as “Certain Methods of Organizing Human Activities” which pertains to (3) managing personal behavior or relationships or interactions between people (including teachings or following rules or instructions) and additionally or alternatively cover performance of their limitations as “Mathematical Concepts” which pertains to (4) mathematical calculations or (5) mathematical relationships.
That is, other than reciting (e.g., “a non-transitory memory” & “chat interface” & “computer module” & “a plurality of computer modules” & “a computer system” & “first computer module” & “a second computer module” & “one or more computer modules” & “user device”, etc…), nothing in the claim elements precludes the steps from being performed as “Mental Processes” which pertains to (1) concepts performed in the human mind (including observations or evaluations or judgments) or (2) using pen and paper as a physical aid and additionally or alternatively as “Certain Methods of Organizing Human Activities” which pertains to (3) managing personal behavior or relationships or interactions between people (including teachings or following rules or instructions) and additionally or alternatively cover performance of their limitations as “Mathematical Concepts” which pertains to (4) mathematical calculations or (5) mathematical relationships. Therefore, at Step 2A Prong One, Yes, Claims 1-20 recite an abstract idea. We proceed to analyzing the claims at Step 2A Prong Two.

Step 2A Prong Two: With respect to Step 2A Prong Two of the eligibility inquiry (as explained in MPEP § 2106.04(d)), the judicial exception is integrated into a practical application. For Independent Claims 1, 8 and 15, while the steps involve analyzing conversation (which could be argued as a "method of organizing human activity" or "mental process"), the claimed invention, taken as a whole, goes beyond simply performing these tasks on a computer. It is directed to a specific, technological improvement in the functioning of a computer (an AI model's capability to generate accurate computer module instructions) by implementing a specialized feedback loop (monitoring, metrics calculation, and automatic retraining). For example, Independent Claim 1 is broken down according to the following steps: Step 1: Monitor a conversation between an AI model and a user...
This step represents a generic "monitoring" or "receiving" function commonly found in user-interface contexts. It uses conventional computer capabilities (chat interface, AI) to observe input. Likely does not add an inventive concept, as it describes a fundamental, conventional, and well-understood computer activity (input monitoring). Step 2: Derive a set of quality metrics associated with the AI model based on a second utterance… Deriving metrics is a mathematical, analytical process, often considered abstract. The metrics are derived from conversational attributes (e.g., accuracy, confidence). Likely does not add an inventive concept unless the metrics themselves or the method of deriving them is novel and unconventional. Measuring performance (precision/recall) is standard AI MLOps practice. Step 3: Determine that a performance level of the AI model... is below a threshold... This is a classic "if-then" algorithmic determination. Comparing a calculated metric against a predefined threshold is a well-known, conventional, and fundamental logical operation (a "mental process" or "mathematical concept"). Does not add an inventive concept because it is a routine, automated decision-making step commonly used to trigger further processes. Step 4: Generate training data for retraining the AI model... generated based on the first utterance... This step involves formatting or creating new data based on previous inputs (second instruction based on first utterance). While essential, automatic generation of data for retraining is becoming a standard technique in MLOps and automated machine learning (AutoML). Adds an inventive concept if the specific, non-obvious method of generating the training data from the interaction is novel (e.g., a proprietary or specialized heuristic), but generally, it is considered a conventional data manipulation step. Step 5. “Retrain the AI model using the training data…” The act of retraining a model is a standard machine learning technique. 
Retraining, by itself, is a well-understood, conventional, and ubiquitous technique in AI. If interpreted as a whole: While individual steps are common (monitoring, calculating metrics, thresholding, retraining), Examiner argues that the specific, automated, instantaneous, and combined nature of this closed-loop, on-the-fly, retraining based on a single conversation (not just batch retraining) creates a "technical improvement to the functioning of the computer system" (specifically, improving the efficiency and accuracy of a chatbot). The following additional elements (e.g., such as “training of artificial intelligence (AI) model” & “retraining of the AI model” & “first computer module” & “second computer module” & “ a plurality of computer modules” & “chat interface”) go beyond "routine or conventional" computing and provide the "meaningful limitations" required by Step 2A, Prong 2: Closed-Loop Automated Feedback: The system doesn't just use AI; it monitors its own output, assesses the quality of the interaction (metrics), and automatically triggers retraining, providing a specific, self-improving technical system. Conversion of Utterance to Task Instruction: The system translates natural language ("first utterance") directly into a functional "computer module instruction," improving the technical capability of the conversational agent. Context-Aware Data Generation: The training data is generated specifically to correct the failure identified in the current chat session, ensuring the AI model is retrained on high-value, relevant data. "Attributes Associated with the Conversation": Using these attributes to derive quality metrics indicates a specific, non-routine application of data analytics to evaluate AI performance. Direct Interaction with a "Computer Module": The AI isn't just generating text; it is generating commands for another computer component, framing the invention within a "technological environment". 
Therefore, Independent Claims 1, 8 and 15 recite additional elements that integrate the judicial exception into a practical application according to (1) improvements to the functioning of a computer, or to any other technology or technical field (e.g., conversational AI chatbots) (see MPEP § 2106.05(a)) or (2) applying or using the judicial exception in some other meaningful way beyond generally linking the use of the judicial exception to a particular technological environment, such that the claims as a whole are more than a drafting effort designed to monopolize the judicial exception (see MPEP § 2106.05(e)). Thus, Claims 1-20 are patent eligible at Step 2A Prong Two of the 35 U.S.C. § 101 analysis.

Step 2B: Assuming arguendo that the claim limitation steps of Independent Claims 1, 8 and 15 do not recite additional elements to integrate the judicial exception into a practical application, Examiner alternatively notes that Independent Claims 1, 8 and 15 are patent eligible at Step 2B of the 35 U.S.C. § 101 analysis as reciting additional elements that are significantly more than the recited judicial exceptions. For example, Independent Claim 1 is broken down according to the following steps: Step 1. Monitor a conversation between an AI model and a user... If this monitoring is purely for recording or data entry, it may be deemed abstract. However, in the context of this claim, the monitoring is a necessary, non-conventional step to trigger technical evaluation. If the "monitoring" is integrated with a specialized chat interface and specifically designed to parse for "first computer module instructions" (a technical action) rather than general language, it contributes to a non-conventional, technical process. Step 2. Derive a set of quality metrics associated with the AI model... While evaluating accuracy is often abstract, "deriving a set of quality metrics...
based on attributes associated with the conversation" (such as, for instance, response latency, semantic coherence to a technical task, or error rates in instructions) suggests a technical, rather than manual, evaluation. By using machine-readable attributes specifically related to the "computer module instruction," this step moves beyond human-based feedback to a technical, automated evaluation of AI behavior. Step 3. Determine that a performance level of the AI model... is below a threshold... This step is a "threshold" determination, which can sometimes be considered an abstract mental process. However, when integrated with step 2, this is the technical triggering mechanism for an automated, closed-loop, machine-based correction. It operates as part of an improvement to the functionality of the AI model itself, which is a technical, not mental, process. Step 4. Generate training data for retraining the AI model... This step is likely the most significant "inventive concept" in the sequence. Generating training data automatically (specifically, a new "computer module instruction" generated from an existing, flawed interaction) represents a technological, automated "closed-loop" feedback system. It directly improves the AI model's capability to generate better technical instructions, thus improving the computer technology itself. Step 5. Retrain the AI model using the training data... Retraining is the culmination of the improvement. Following the reasoning in Ex parte Desjardins (2025), which the USPTO now follows, automatically retraining a model to maintain or improve performance on specific tasks (like generating module instructions) constitutes a technical solution to a technical problem (AI performance degradation). This is not just using a computer as a tool but, instead, improving the computer-implemented AI model. 
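The five-step closed loop the Examiner walks through here (monitor the conversation, derive quality metrics, compare against a threshold, generate targeted training data, retrain) can be sketched in code. This is an illustrative outline only: every class, function, and metric below is hypothetical and stands in for whatever the Applicant's specification actually discloses.

```python
# Hypothetical sketch of the claimed closed-loop retraining pipeline.
# It mirrors the five steps of Independent Claim 1 as paraphrased in
# this Office Action; none of these names come from the application.
from dataclasses import dataclass, field

@dataclass
class ChatSession:
    utterances: list = field(default_factory=list)  # step 1: monitored input

def derive_quality_metrics(session: ChatSession) -> dict:
    """Step 2: score the model from the user's second utterance.

    A real system might use latency, semantic coherence, or error
    rates; this toy signal just checks whether the follow-up
    utterance reads like a complaint about the first instruction."""
    follow_up = session.utterances[1].lower()
    failed = any(w in follow_up for w in ("wrong", "again", "didn't"))
    return {"task_success": 0.0 if failed else 1.0}

def closed_loop(session: ChatSession, model, threshold: float = 0.5) -> dict:
    """Steps 3-5: threshold check, training-data generation, retraining."""
    metrics = derive_quality_metrics(session)
    if metrics["task_success"] < threshold:            # step 3
        # Step 4: pair the original utterance with a corrected
        # instruction so retraining targets the observed failure.
        corrected = model.correct_instruction(session.utterances[0])
        model.retrain([(session.utterances[0], corrected)])  # step 5
    return metrics
```

A stub model object with `correct_instruction` and `retrain` methods is enough to exercise the loop; the point is only that the retraining trigger is the in-session metric, which is what the "closed-loop" characterization above turns on.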
Summary of Step 2B: The combination of these steps (monitoring specific technical interactions, generating performance metrics from them, creating new training data on-the-fly, and retraining the AI) forms a specialized, automated, and non-conventional, closed-loop machine learning pipeline. This, in the aggregate, constitutes an improvement to the technological functioning of the computer system, thus adding "significantly more" than an abstract idea. Claims 1, 8 and 15 are patent eligible at Step 2B of the 35 U.S.C. § 101 analysis as reciting additional elements that are significantly more than the recited judicial exception due to: (1) improvements to the functioning of a computer, or to any other technology or technical field (e.g., conversational AI chatbots) (see MPEP § 2106.05(a)) or (2) applying or using the judicial exception in some other meaningful way beyond generally linking the use of the judicial exception to a particular technological environment, such that the claims as a whole are more than a drafting effort designed to monopolize the judicial exception (see MPEP § 2106.05(e)). Therefore, Claims 1-20 are alternatively deemed patent eligible at Step 2B of the 35 U.S.C. § 101 analysis.

Claim Rejections - 35 USC § 103

10. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

11. This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.
Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

12. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

13. Claims 1-3, 5, 8-9, 11-16 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over US Patent No. US 11,743,378 B1, hereinafter Johnston, in view of US PG Pub No. US 2022/0058347 A1, hereinafter Singaraju. Regarding Independent Claim 1, Johnston's system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following: - a non-transitory memory (see at least Johnston: Col. 22, Lns. 30-51.); - one or more hardware processors (see at least Johnston: Col. 22, Lns. 48-51.) coupled with the non-transitory memory (see at least Johnston: Col. 22, Lns. 30-51.) and configured to execute instructions (see at least Johnston: Col. 22, Lns. 21-29.) from the non-transitory memory (see at least Johnston: Col. 22, Lns. 30-51.) to cause the system (see at least Johnston: Col. 22, Lns. 21-29.)
to: - monitor a conversation between an artificial intelligence (AI) model and a user conducted via a chat interface during a chat session (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 5, Lns. 20-32. Johnston teaches that the system in its simplest form can be seen in FIG. 2, in which a consumer interacts with the system, and the system uses a dialog 600 to manage the conversation and uses services 800 and 900 to understand and respond to the consumer (AI services 800, or human interaction 900). The next steps are determined by the dialog 600. In case of asynchronous conversations, such as with a social post, SMS, and chat, the steps can be asynchronous. There may also be asynchronous steps within a synchronous conversation, such as during live voice communication, in which the system may proceed to interpret the next task of the user while also performing an action (e.g., handling payment) corresponding to prior recognized tasks. See also Johnston at Col. 4, Lns. 40-47: Johnston teaches that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. A conversation 103 is then started. See also Johnston at Col. 6, Lns. 17-27: “AI 800 would train on AI data, which can be reinforced with AI models.”), wherein the AI model is trained (see at least Johnston: Col. 6, Lns. 15-20. Johnston teaches that the system tries to maximize the quality of the data it stores. The system distinguishes data that is obtained from AI 800 from data that is obtained from HI 900. Without this distinction, AI 800 would train on AI data, which can be reinforced with AI models.) 
to provide a first computer module instruction to a computer module for performing a task based on a first utterance provided by the user during the chat session (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 4, Lns. 40-47 & Col. 18, Lns. 34-58. Johnston teaches that FIG. 10 provides a schematic view of the key elements of the agent desktop interface 1000 provided by some embodiments of the agent assistance module. The dialog view 1010 contains a view of the transcription of the conversation that unfolds in real time as the customer and agent communicate. This view also contains annotation of utterance classification or the results of other types of NLP on the transcription or chat. The concept scratchpad 120 contains visual elements corresponding to semantic concepts that have appeared in the dialog (e.g., the location "Portland, Oregon", and the date "Jan. 5, 2020"). These can be dragged from the scratchpad and dropped into particular fields in the form 1030, representing a workflow that the agent is completing (e.g., in the example of FIG. 10, to identify available rooms for the customer). See also Johnston at Col. 4, Lns. 40-47: Johnston notes that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. See also Johnston at Figs. 10-11 and 13 noting: “a first computer module instruction to a computer module for performing a task based on a first utterance”, specifically “AI assist to HI module 1360”. See also Johnston at Col. 10, Lns. 
14-20 noting “a credit card number can be broken up into several utterances (e.g., the credit identification number, and two sets of four digits, the expiry date, and the security code) and distributed to different agents 950.”); - derive a set of quality metrics (see at least Johnston: Col. 6, Lns. 22-27. Johnston notes that the system uses success metrics, such as time to perform the same tasks across all understanding services 550, or the number of conversational turns to perform the same tasks, as examples of quality metrics. Quality metrics are used by machine learning to weight HI data for either supervised or unsupervised learning. See also Johnston at Col. 10, Lns. 50-67 & Col. 11, Lns. 1-3. Johnston teaches that the data provided by HI 900 becomes valuable training data for the machine learning 300. The system also tracks the agent 950, using analytics and metrics to assess and quantify agent quality. Quality of agent performance enables the machine learning 300 to weight the data for building better models. Also, measurements of agent quality can be obtained by comparing the effort and time an agent takes to perform certain tasks to the effort and time that other agents take. The system understands how to perform "gold standard" transactions. A new agent, or existing agents, are asked to perform these transactions by sending text, media, or recorded consumer conversations for specific tasks to the agent. The results of these transactions are then graded as a quality metric. Agent voice quality is analyzed and used as one of the characteristics of the agent. Other indications of agent quality can be obtained through surveys. See also Johnston at Col. 11, Lns. 4-8: “Agent quality grades, and data entered by the agent (successful transactions).” See also Johnston at Col. 14, Lns. 18-25.) associated with the AI model (see at least Johnston: Col. 6, Lns. 15-20 & Col. 14, Lns. 55-62. 
Johnston teaches that the system tries to maximize the quality of the data it stores. The system distinguishes data that is obtained from AI 800 from data that is obtained from HI 900. Without this distinction, AI 800 would train on AI data, which can be reinforced with AI models. See also Johnston at Col. 14, Lns. 55-62. Johnston teaches that the knowledge graph gives a dialog manager an organization and method regarding what can be ordered and how to select choices, where the agent scripts give the workflow-script analysis 150A the types of "prompts" associated with order, the many consumer conversations provide data for the AI models (e.g., transcribed words, associated sounds, entities and intents from transcriptions.) based on a second utterance (see also Johnston at Figs. 10-11 & Fig. 13 & at Col. 10, Lns. 14-20 noting “a credit card number can be broken up into several utterances (e.g., the credit identification number, and two sets of four digits, the expiry date, and the security code) and distributed to different agents 950”. See also Johnston at Col. 18, Lns. 34-58. Johnston teaches that FIG. 10 provides a schematic view of the key elements of the agent desktop interface 1000 provided by some embodiments of the agent assistance module. The dialog view 1010 contains a view of the transcription of the conversation that unfolds in real time as the customer and agent communicate. This view also contains annotation of utterance classification or the results of other types of NLP on the transcription or chat. The concept scratchpad 120 contains visual elements corresponding to semantic concepts that have appeared in the dialog (e.g., the location "Portland, Oregon", and the date "Jan. 5, 2020"). These can be dragged from the scratchpad and dropped into particular fields in the form 1030, representing a workflow that the agent is completing (e.g., in the example of FIG. 10, to identify available rooms for the customer).) 
provided by the user during the chat session (see also Johnston at Col. 4, Lns. 40-47 & Figs. 10-11 & Fig. 13. Johnston notes that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples.). Johnston's system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system does not explicitly disclose, but Singaraju in the analogous art for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following limitations: - determine that a performance level (see at least Singaraju: Fig. 3 & ¶ [0138] & Figs. 11-16. Singaraju teaches that the new bot system may need to be monitored, debugged, and modified in order to improve the performance of the bot system and user experience with the bot system. In many cases, it may be difficult to more specifically identify the root causes of the lower-than-desired performance of the bot system and determine how to improve the bot system without using analytics or optimization tools. See also Singaraju at ¶ [0141]: “FIG. 3 depicts an integrated system 300 including a bot system (such as bot system 220) and a bot analytic system for monitoring, analyzing, visualizing, and improving the performance of the bot system”. See also Singaraju at Fig. 11, where step 1150 loops back to step 1110 for re-training.) of the AI model (see at least Singaraju: ¶ [abstract] & ¶ [0052-0053]. Singaraju notes that the one or more synthetic utterances may be used to generate a new training dataset for training a machine-learning model. The training dataset may be refined according to threshold confidence values to filter out datasets for training.) 
in generating computer module instructions to the computer module (see at least Singaraju: ¶ [0151-0153] & Figs. 11-16. Singaraju notes that the events may be generated based upon one or more instructions included in the bot system. For example, an event may be generated when the bot system has entered into a particular state, where the particular state is defined by an administrator or developer of the bot system. Custom components 318 may include customized modules for the specific bot system. For example, a financial bot may include custom components that may be used to, for example, check balance, transfer funds, or pay bills.) is below a threshold (see at least Singaraju: ¶ [0014] & ¶ [0226-0228] & Figs. 11-16. Singaraju notes “one or more utterances having a confidence value less than or equal to the threshold value”. See also Singaraju at Fig. 11 and ¶ [0228]: “At 1130, a subset of training utterances is determined, each utterance of the subset of training utterances corresponding to a confidence value less than or equal to the threshold confidence value. Any utterance in the one or more received utterances corresponding to a confidence level less than or equal to the threshold confidence value will be included in the subset of training utterances and ultimately used to retrain the model”. See also Singaraju at Fig. 12 and ¶ [0236]: The selected utterance in interactive intent selector 1250 may be used as the ground truth intent for the corresponding utterance and cause inclusion of the utterance in the subset of training utterances, provided the utterance is associated with a confidence score less than or equal to the threshold confidence score.) based on the set of quality metrics (see at least Singaraju: ¶ [0163-0165] & ¶ [0198]. Singaraju teaches that “higher confidence metrics will correspond to relatively higher degrees of correspondence between the nodes of the trained neural network when the predicted intent/skill is output. 
The confidence value of an associated set of anchors may be derived, in part, from the confidence metrics of each of the synthetic utterances generated for the associated set of anchors”. See also Singaraju at ¶ [0163-0165]: “REST server 380 may analyze the enriched events and other information and generate various reports based on certain aggregate metrics 372. The reports may be displayed to an owner, administrator, or developer of the bot system on user interface 392 through UI server 390. For example, if the skill enables users to perform various banking transactions, the intents for the skill can include, for example, “Check Balance” or “Transfer Money.” Intents not only describe what the skill can do, but may also be an integral part of the skill's intelligence. The intents enable the skill to recognize user input because each intent can have a set of typical user statements (i.e., utterances) associated with it. While these utterances may share the same meaning, they may be different (for example, “What's my savings account balance?” and “How much is in my checking account?”); - generate training data (see at least Singaraju: ¶ [0099] & ¶ [0143-0145] & ¶ [0219]. Singaraju notes that a skill bot is trained based upon the intents configured for the skill bot and the example utterances associated with the intents (collectively, the training data), so that the skill bot can resolve user input to one of its configured intents. A skill bot is represented by a model that is trained using the training data and allows the skill bot to discern what end users say (or in some cases, are trying to say). See also Singaraju at ¶ [0143]: “Intent modeler may also include logic to detect words which have the same meaning within an end user message. 
For example, if the training dataset includes: “Mary ran to Texas” and “Bob walked to Detroit,” both mapped to the same intent, and run/walk appear in the same set of intents, intent modeler 314 may learn that for the purposes of intent resolution run=walk. In one illustrative example, “Mary ran to Texas” may become “PERSON run to LOCATION” and “Bob walked to Detroit” may become “PERSON walk to LOCATION.” See also Singaraju at ¶ [0145]: Every sentence in a training dataset, once normalized, may automatically become a rule. In such examples, a training dataset may include a very small number of short sentences. The template rule may return a probability of 1. New rules may be generated from rules via a process of induction. For example, the following sentences may belong to track spending: “How much did I spend last month on gas?” and “How much did I spend in May on food?”. The sentences may be used to induce the rule “How much did I spend” as that is the part which is shared between them. In other examples, the training dataset may include the phrase “How much did I spend” to achieve the same result. See also Singaraju at ¶ [0166]: The skill may be trained to infer user intents when it parses the user input. Specifically, the skill may be trained with the intents and their utterances (collectively, the training data), so that the skill can resolve the user input to one of the intents. The trained skill may not only recognize the sample phrases that belong to each intent, but also recognize similar phrases that correspond to each intent. See also Singaraju at ¶ [0219]: Then a training dataset must be generated that will effectively reduce those deficiencies. The training dataset will include sets of training utterances and “ground truth” intents/skills corresponding to the training utterances. The ML model will then be trained using the training utterances using the ground truth intent/skills to refine the model to better predict intents/skills for utterances. 
Obtaining the training dataset is difficult and manually selecting and building a training dataset is inefficient and resource intensive.) for retraining the AI model (see at least Singaraju: Figs. 11-13 & ¶ [0026] & ¶ [0055]. Singaraju notes that the number of utterances are presented in a comprehensive user interface for selecting a correspondence between an expected intent of a particular utterance for retraining a machine learning model. The number of utterances are altered by an utterance generation engine to form one or more altered utterances, which are used to retrain a machine learning model. See also Singaraju at Fig. 11 noting “a process for retraining models using artificial utterances”, Fig. 12 noting “an interface for retraining models using artificial utterances” and Fig. 13 “an interface for retraining models using artificial utterances”.), wherein the training data (see at least Singaraju: ¶ [0099] & ¶ [0143-0145] & ¶ [0219].) comprises a second computer module instruction usable by the computer module (see at least Singaraju: Figs. 11-16 & ¶ [0108] & ¶ [0151]. Singaraju teaches that custom components 318 may include customized modules for the specific bot system. For example, a financial bot may include custom components that may be used to, for example, check balance, transfer funds, or pay bills. See also Singaraju at ¶ [0108]: “There might be times when it is desired to provide end users with the option to temporarily leave a first skill they are engaged with to do something in a second skill within the digital assistant. In one example, if an end user is engaged in a conversation with a shopping skill (e.g., the user has made some selections for purchase), the end user may want to jump to a banking skill (e.g., the end user may want to ensure that he/she has enough money for the purchase), and then return to the shopping skill to complete the end user's order. 
To address this, an action in the first skill can be configured to initiate an interaction with the second different skill in the same digital assistant and then return to the original flow.” See also Singaraju at ¶ [0272].) and generated based on the first utterance (see at least Singaraju: ¶ [0228] & Fig. 11 (step 1130) & Figs. 12-16. Singaraju teaches that a confidence value corresponding to a first utterance of the one or more utterances may be 0.90 (or 90% confidence that the utterance corresponds to a particular intent/skill). A confidence value corresponding to a second utterance may be 0.67 (or 67%). The first utterance is not determined to be in the subset of training utterances, because its corresponding confidence level is greater than the threshold confidence value.); - retrain the AI model (see at least Singaraju: Figs. 11-13 & ¶ [0026] & ¶ [0055]. Singaraju notes that the number of utterances are presented in a comprehensive user interface for selecting a correspondence between an expected intent of a particular utterance for retraining a machine learning model. The number of utterances are altered by an utterance generation engine to form one or more altered utterances, which are used to retrain a machine learning model. See also Singaraju at Fig. 11 noting “a process for retraining models using artificial utterances”, Fig. 12 noting “an interface for retraining models using artificial utterances” and Fig. 13 “an interface for retraining models using artificial utterances”.) using the training data (see at least Singaraju: ¶ [0099] & ¶ [0143-0145] & ¶ [0219]. Singaraju notes that a skill bot is trained based upon the intents configured for the skill bot and the example utterances associated with the intents (collectively, the training data), so that the skill bot can resolve user input to one of its configured intents. 
A skill bot is represented by a model that is trained using the training data and allows the skill bot to discern what end users say (or in some cases, are trying to say). See also Singaraju at ¶ [0143]: “Intent modeler may also include logic to detect words which have the same meaning within an end user message. For example, if the training dataset includes: “Mary ran to Texas” and “Bob walked to Detroit,” both mapped to the same intent, and run/walk appear in the same set of intents, intent modeler 314 may learn that for the purposes of intent resolution run=walk. In one illustrative example, “Mary ran to Texas” may become “PERSON run to LOCATION” and “Bob walked to Detroit” may become “PERSON walk to LOCATION.” See also Singaraju at ¶ [0145]: Every sentence in a training dataset, once normalized, may automatically become a rule. In such examples, a training dataset may include a very small number of short sentences. The template rule may return a probability of 1. New rules may be generated from rules via a process of induction. For example, the following sentences may belong to track spending: “How much did I spend last month on gas?” and “How much did I spend in May on food?”. The sentences may be used to induce the rule “How much did I spend” as that is the part which is shared between them. In other examples, the training dataset may include the phrase “How much did I spend” to achieve the same result. See also Singaraju at ¶ [0166]: The skill may be trained to infer user intents when it parses the user input. Specifically, the skill may be trained with the intents and their utterances (collectively, the training data), so that the skill can resolve the user input to one of the intents. The trained skill may not only recognize the sample phrases that belong to each intent, but also recognize similar phrases that correspond to each intent. 
See also Singaraju at ¶ [0219]: Then a training dataset must be generated that will effectively reduce those deficiencies. The training dataset will include sets of training utterances and “ground truth” intents/skills corresponding to the training utterances. The ML model will then be trained using the training utterances using the ground truth intent/skills to refine the model to better predict intents/skills for utterances. Obtaining the training dataset is difficult and manually selecting and building a training dataset is inefficient and resource intensive.). It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston's system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: determine that a performance level of the AI model in generating computer module instructions to the computer module is below a threshold based on the set of quality metrics & generate training data for retraining the AI model, wherein the training data comprises a second computer module instruction usable by the computer module and generated based on the first utterance and retrain the AI model using the training data, and in view of Singaraju, whereby an intelligent bot, generally powered by artificial intelligence (AI), can communicate more intelligently and contextually in live conversations, and thus may allow for a more natural conversation between the bot and the end users for improved conversational experience. Rather than the end user learning a fixed set of keywords or commands that the bot knows how to respond to, an intelligent bot may be able to understand the end user's intention based upon user utterances in natural language and respond accordingly (see at least Singaraju: ¶ [0003].). 
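The confidence-threshold selection cited from Singaraju at ¶ [0228] and Fig. 11 (step 1130) amounts to a simple filter over per-utterance confidence values. The following is a minimal sketch of that step only, under an assumed data layout; the function name and record structure are illustrative and do not appear in either reference:

```python
# Minimal sketch of the selection step Singaraju describes at ¶ [0228]
# (Fig. 11, step 1130). Data layout and names are illustrative assumptions.

def select_retraining_subset(utterances, threshold):
    """Keep each utterance whose intent-confidence value is less than or
    equal to the threshold; per ¶ [0228], this subset is ultimately used
    to retrain the model."""
    return [u for u in utterances if u["confidence"] <= threshold]

# Mirrors the worked example in ¶ [0228]: the 0.90-confidence utterance is
# excluded, while the 0.67-confidence utterance is kept for retraining.
utterances = [
    {"text": "first utterance", "confidence": 0.90},
    {"text": "second utterance", "confidence": 0.67},
]
subset = select_retraining_subset(utterances, threshold=0.70)
# subset contains only the second (0.67-confidence) utterance
```

High-confidence predictions are left out because, per Singaraju, they already resolve to the correct intent; only the low-confidence utterances feed the retraining loop (Fig. 11, step 1150 back to step 1110).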
The outputs of the machine learning model or the output for a particular inference may not provide insights into the machine learning model such that a user may understand the behavior of the machine learning model, such as why a particular input would generate a particular output, to determine whether a model and/or a particular prediction is trustworthy and how to improve the training data and the model. Time for training or retraining machine learning models may be constrained by the ability to generate these training sets. The system of Singaraju provides a comprehensive way to integrate an inference explanation system into a retraining system to improve a machine learning model (see at least Singaraju: ¶ [0005-0007].). Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Singaraju, the results of the combination were predictable. Regarding Independent Claim 8, Johnston's method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following: - monitoring, by a computer system (see at least Johnston: Figs. 1-3 & Col. 1, Lns. 64-67 & Col. 2, Lns. 1-14. Johnston notes that it is also difficult for human agents during a customer conversation simultaneously both to respond to the customer and also to interact with a computer system to carry out the customer's wishes. See also Johnston at Col. 2, Lns. 1-14. Johnston teaches that as travel agents interact with customers they must simultaneously type data into forms that enable them to look up available flights, car reservations, hotel rooms and so on. 
This process can be cumbersome and adversely impact the experience for the customer while the customer waits for the agent to complete an interaction with the system and resume focusing on the conversation, and additionally puts cognitive strain on the agent.), interactions between an artificial intelligence (AI) model (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 5, Lns. 20-32. Johnston teaches that the system in its simplest form can be seen in FIG. 2, in which a consumer interacts with the system, and the system uses a dialog 600 to manage the conversation and uses services 800 and 900 to understand and respond to the consumer (AI services 800, or human interaction 900). The next steps are determined by the dialog 600. In case of asynchronous conversations, such as with a social post, SMS, and chat, the steps can be asynchronous. There may also be asynchronous steps within a synchronous conversation, such as during live voice communication, in which the system may proceed to interpret the next task of the user while also performing an action (e.g., handling payment) corresponding to prior recognized tasks. See also Johnston at Col. 4, Lns. 40-47: Johnston teaches that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. A conversation 103 is then started. See also Johnston at Col. 6, Lns. 17-27: “AI 800 would train on AI data, which can be reinforced with AI models.”) and a user device (see at least Johnston: Col. 4, Lns. 34-47. 
Johnston teaches that this disclosure will focus on a conversation which can include a variety of channels and modes of communication, but is simplified into the concept of a conversation, which could include a multimodal conversation on a smart device, such as a phone or tablet.) during an online chat session (see also Johnston at Col. 4, Lns. 40-47 & Figs. 10-11 & Fig. 13. Johnston teaches that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. A conversation 103 is then started.”), wherein the instructions comprise a first utterance (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 4, Lns. 40-47 & Col. 18, Lns. 34-58. Johnston teaches that FIG. 10 provides a schematic view of the key elements of the agent desktop interface 1000 provided by some embodiments of the agent assistance module. The dialog view 1010 contains a view of the transcription of the conversation that unfolds in real time as the customer and agent communicate. This view also contains annotation of utterance classification or the results of other types of NLP on the transcription or chat. The concept scratchpad 120 contains visual elements corresponding to semantic concepts that have appeared in the dialog (e.g., the location "Portland, Oregon", and the date "Jan. 5, 2020"). These can be dragged from the scratchpad and dropped into particular fields in the form 1030, representing a workflow that the agent is completing (e.g., in the example of FIG. 10, to identify available rooms for the customer). See also Johnston at Col. 4, Lns. 
40-47: Johnston notes that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. See also Johnston at Figs. 10-11 and 13 noting: “a first computer module instruction to a computer module for performing a task based on a first utterance”, specifically “AI assist to HI module 1360”. See also Johnston at Col. 10, Lns. 14-20 noting “a credit card number can be broken up into several utterances (e.g., the credit identification number, and two sets of four digits, the expiry date, and the security code) and distributed to different agents 950.”) provided by a user of the user device (see at least Johnston: Col. 4, Lns. 34-47. Johnston teaches that this disclosure will focus on a conversation which can include a variety of channels and modes of communication, but is simplified into the concept of a conversation, which could include a multimodal conversation on a smart device, such as a phone or tablet.) to the AI model via a chat interface (see at least Johnston: Col. 6, Lns. 15-20 & Figs. 8-9), and wherein the AI model is configured and trained (see also Johnston at Col. 6, Lns. 17-27 & Figs. 10-11 & Fig. 13: “AI 800 would train on AI data, which can be reinforced with AI models. See at least Johnston: Col. 6, Lns. 15-20. Johnston teaches that the system tries to maximize the quality of the data it stores. The system distinguishes data that is obtained from AI 800 from data that is obtained from HI 900. Without this distinction, AI 800 would train on AI data, which can be reinforced with AI models.) to use a plurality of computer modules for performing tasks based on user submitted utterances (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 
4, Lns. 40-47 & Col. 18, Lns. 34-58. Johnston teaches that FIG. 10 provides a schematic view of the key elements of the agent desktop interface 1000 provided by some embodiments of the agent assistance module. The dialog view 1010 contains a view of the transcription of the conversation that unfolds in real time as the customer and agent communicate. This view also contains annotation of utterance classification or the results of other types of NLP on the transcription or chat. The concept scratchpad 120 contains visual elements corresponding to semantic concepts that have appeared in the dialog (e.g., the location "Portland, Oregon", and the date "Jan. 5, 2020"). These can be dragged from the scratchpad and dropped into particular fields in the form 1030, representing a workflow that the agent is completing (e.g., in the example of FIG. 10, to identify available rooms for the customer). See also Johnston at Col. 4, Lns. 40-47: Johnston notes that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. See also Johnston at Figs. 10-11 and 13 noting: “a first computer module instruction to a computer module for performing a task based on a first utterance”, specifically “AI assist to HI module 1360”. See also Johnston at Col. 10, Lns. 14-20 noting “a credit card number can be broken up into several utterances (e.g., the credit identification number, and two sets of four digits, the expiry date, and the security code) and distributed to different agents 950.”); - providing, by the AI model (see at least Johnston: Col. 5, Lns. 20-32 & Col. 6, Lns. 15-20. Johnston teaches that the system tries to maximize the quality of the data it stores. 
The system distinguishes data that is obtained from AI 800 from data that is obtained from HI 900. Without this distinction, AI 800 would train on AI data, which can be reinforced with AI models.), a first instruction to a first computer module from the plurality of computer modules for performing a task for the user (see at least Johnston: Figs. 10-11 & Fig. 13.), wherein the first instruction is generated by the AI model (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 5, Lns. 20-32.) based on the first utterance provided by the user (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 4, Lns. 40-47 & Col. 18, Lns. 34-58. Johnston teaches that FIG. 10 provides a schematic view of the key elements of the agent desktop interface 1000 provided by some embodiments of the agent assistance module. The dialog view 1010 contains a view of the transcription of the conversation that unfolds in real time as the customer and agent communicate. This view also contains annotation of utterance classification or the results of other types of NLP on the transcription or chat. The concept scratchpad 120 contains visual elements corresponding to semantic concepts that have appeared in the dialog (e.g., the location "Portland, Oregon", and the date "Jan. 5, 2020"). These can be dragged from the scratchpad and dropped into particular fields in the form 1030, representing a workflow that the agent is completing (e.g., in the example of FIG. 10, to identify available rooms for the customer). See also Johnston at Col. 4, Lns. 40-47: Johnston notes that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. See also Johnston at Figs. 
10-11 and 13 noting: “a first computer module instruction to a computer module for performing a task based on a first utterance”, specifically “AI assist to HI module 1360”. See also Johnston at Col. 10, Lns. 14-20 noting “a credit card number can be broken up into several utterances (e.g., the credit identification number, and two sets of four digits, the expiry date, and the security code) and distributed to different agents 950.”); - deriving a set of quality metrics (see at least Johnston: Col. 6, Lns. 22-27. Johnston notes that the system uses success metrics, such as time to perform the same tasks across all understanding services 550, or the number of conversational turns to perform the same tasks, as examples of quality metrics. Quality metrics are used by machine learning to weight HI data for either supervised or unsupervised learning. See also Johnston at Col. 10, Lns. 50-67 & Col. 11, Lns. 1-3. Johnston teaches that the data provided by HI 900 becomes valuable training data for the machine learning 300. The system also tracks the agent 950, using analytics and metrics to assess and quantify agent quality. Quality of agent performance enables the machine learning 300 to weight the data for building better models. Also, measurements of agent quality can be obtained by comparing the effort and time an agent takes to perform certain tasks to the effort and time that other agents take. The system understands how to perform "gold standard" transactions. A new agent, or existing agents, are asked to perform these transactions by sending text, media, or recorded consumer conversations for specific tasks to the agent. The results of these transactions are then graded as a quality metric. Agent voice quality is analyzed and used as one of the characteristics of the agent. Other indications of agent quality can be obtained through surveys. See also Johnston at Col. 11, Lns. 
4-8: “Agent quality grades, and data entered by the agent (successful transactions).” See also Johnston at Col. 14, Lns. 18-25.) associated with the AI model (see at least Johnston: Col. 6, Lns. 15-20 & Col. 14, Lns. 55-62. Johnston teaches that the system tries to maximize the quality of the data it stores. The system distinguishes data that is obtained from AI 800 from data that is obtained from HI 900. Without this distinction, AI 800 would train on AI data, which can be reinforced with AI models. See also Johnston at Col. 14, Lns. 55-62. Johnston teaches that the knowledge graph gives a dialog manager an organization and method regarding what can be ordered and how to select choices, where the agent scripts give the workflow-script analysis 150A the types of "prompts" associated with order, the many consumer conversations provide data for the AI models (e.g., transcribed words, associated sounds, entities and intents from transcriptions.) based on a second utterance (see also Johnston at Figs. 10-11 & Fig. 13 & at Col. 10, Lns. 14-20 noting “a credit card number can be broken up into several utterances (e.g., the credit identification number, and two sets of four digits, the expiry date, and the security code) and distributed to different agents 950”. See also Johnston at Col. 18, Lns. 34-58. Johnston teaches that FIG. 10 provides a schematic view of the key elements of the agent desktop interface 1000 provided by some embodiments of the agent assistance module. The dialog view 1010 contains a view of the transcription of the conversation that unfolds in real time as the customer and agent communicate. This view also contains annotation of utterance classification or the results of other types of NLP on the transcription or chat. The concept scratchpad 120 contains visual elements corresponding to semantic concepts that have appeared in the dialog (e.g., the location "Portland, Oregon", and the date "Jan. 5, 2020"). 
These can be dragged from the scratchpad and dropped into particular fields in the form 1030, representing a workflow that the agent is completing (e.g., in the example of FIG. 10, to identify available rooms for the customer).) provided by the user during the chat session (see also Johnston at Col. 4, Lns. 40-47 & Figs. 10-11 & Fig. 13. Johnston notes that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples.). Johnston's method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system does not explicitly disclose, but Singaraju in the analogous art for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following limitations: - determining that a performance level (see at least Singaraju: Fig. 3 & ¶ [0138] & Figs. 11-16. Singaraju teaches that the new bot system may need to be monitored, debugged, and modified in order to improve the performance of the bot system and user experience with the bot system. In many cases, it may be difficult to more specifically identify the root causes of the lower-than-desired performance of the bot system and determine how to improve the bot system without using analytics or optimization tools. See also Singaraju at ¶ [0141]: “FIG. 3 depicts an integrated system 300 including a bot system (such as bot system 220) and a bot analytic system for monitoring, analyzing, visualizing, and improving the performance of the bot system”. See also Singaraju at Fig. 11, where step 1150 loops back to step 1110 for re-training.) of the AI model (see at least Singaraju: ¶ [abstract] & ¶ [0052-0053]. 
Singaraju notes that the one or more synthetic utterances may be used to generate a new training dataset for training a machine-learning model. The training dataset may be refined according to a threshold confidence value to filter out datasets for training.) in generating instructions to the plurality of computer modules (see at least Singaraju: ¶ [0151-0153] & Figs. 11-16. Singaraju notes that the events may be generated based upon one or more instructions included in the bot system. For example, an event may be generated when the bot system has entered into a particular state, where the particular state is defined by an administrator or developer of the bot system. Custom components 318 may include customized modules for the specific bot system. For example, a financial bot may include custom components that may be used to, for example, check balance, transfer funds, or pay bills.) is below a threshold (see at least Singaraju: ¶ [0014] & ¶ [0226-0228] & Figs. 11-16. Singaraju notes “one or more utterances having a confidence value less than or equal to the threshold value”. See also Singaraju at Fig. 11 and ¶ [0228]: “At 1130, a subset of training utterances is determined, each utterance of the subset of training utterances corresponding to a confidence value less than or equal to the threshold confidence value. Any utterance in the one or more received utterances corresponding to a confidence level less than or equal to the threshold confidence value will be included in the subset of training utterances and ultimately used to retrain the model”. See also Singaraju at Fig. 12 and ¶ [0236]: The selected utterance in interactive intent selector 1250 may be used as the ground truth intent for the corresponding utterance and cause inclusion of the utterance in the subset of training utterances, provided the utterance is associated with a confidence score less than or equal to the threshold confidence score.)
based on the set of quality metrics (see at least Singaraju: ¶ [0163-0165] & ¶ [0198]. Singaraju teaches that “higher confidence metrics will correspond to relatively higher degrees of correspondence between the nodes of the trained neural network when the predicted intent/skill is output. The confidence value of an associated set of anchors may be derived, in part, from the confidence metrics of each of the synthetic utterances generated for the associated set of anchors”. See also Singaraju at ¶ [0163-0165]: “REST server 380 may analyze the enriched events and other information and generate various reports based on certain aggregate metrics 372. The reports may be displayed to an owner, administrator, or developer of the bot system on user interface 392 through UI server 390. For example, if the skill enables users to perform various banking transactions, the intents for the skill can include, for example, “Check Balance” or “Transfer Money.” Intents not only describe what the skill can do, but may also be an integral part of the skill's intelligence. The intents enable the skill to recognize user input because each intent can have a set of typical user statements (i.e., utterances) associated with it. While these utterances may share the same meaning, they may be different (for example, “What's my savings account balance?” and “How much is in my checking account?”); - in response to determining that the performance (see at least Singaraju: Fig. 3 & ¶ [0138] & Figs. 11-16.) level is below the threshold (see at least Singaraju: ¶ [0014] & ¶ [0226-0228] & Figs. 11-16. Singaraju notes “one or more utterances having a confidence value less than or equal to the threshold value”. See also Singaraju at Fig. 11 and ¶ [0228]: “At 1130, a subset of training utterances is determined, each utterance of the subset of training utterances corresponding to a confidence value less than or equal to the threshold confidence value. 
Any utterance in the one or more received utterances corresponding to a confidence level less than or equal to the threshold confidence value will be included in the subset of training utterances and ultimately used to retrain the model”. See also Singaraju at Fig. 12 and ¶ [0236]: The selected utterance in interactive intent selector 1250 may be used as the ground truth intent for the corresponding utterance and cause inclusion of the utterance in the subset of training utterances, provided the utterance is associated with a confidence score less than or equal to the threshold confidence score.), generating training data (see at least Singaraju: ¶ [0099] & ¶ [0143-0145] & ¶ [0219]. Singaraju notes that a skill bot is trained based upon the intents configured for the skill bot and the example utterances associated with the intents (collectively, the training data), so that the skill bot can resolve user input to one of its configured intents. A skill bot is represented by a model that is trained using the training data and allows the skill bot to discern what end users say (or in some cases, are trying to say). See also Singaraju at ¶ [0143]: “Intent modeler may also include logic to detect words which have the same meaning within an end user message. For example, if the training dataset includes: “Mary ran to Texas” and “Bob walked to Detroit,” both mapped to the same intent, and run/walk appear in the same set of intents, intent modeler 314 may learn that for the purposes of intent resolution run=walk. In one illustrative example, “Mary ran to Texas” may become “PERSON run to LOCATION” and “Bob walked to Detroit” may become “PERSON walk to LOCATION.” See also Singaraju at ¶ [0145]: Every sentence in a training dataset, once normalized, may automatically become a rule. In such examples, a training dataset may include a very small number of short sentences. The template rule may return a probability of 1. New rules may be generated from rules via a process of induction.
For example, the following sentences may belong to track spending: “How much did I spend last month on gas?” and “How much did I spend in May on food?”. The sentences may be used to induce the rule “How much did I spend” as that is the part which is shared between them. In other examples, the training dataset may include the phrase “How much did I spend” to achieve the same result. See also Singaraju at ¶ [0166]: The skill may be trained to infer user intents when it parses the user input. Specifically, the skill may be trained with the intents and their utterances (collectively, the training data), so that the skill can resolve the user input to one of the intents. The trained skill may not only recognize the sample phrases that belong to each intent, but also recognize similar phrases that correspond to each intent. See also Singaraju at ¶ [0219]: Then a training dataset must be generated that will effectively reduce those deficiencies. The training dataset will include sets of training utterances and “ground truth” intents/skills corresponding to the training utterances. The ML model will then be trained using the training utterances using the ground truth intent/skills to refine the model to better predict intents/skills for utterances. Obtaining the training dataset is difficult and manually selecting and building a training dataset is inefficient and resource intensive.) for retraining the AI model (see at least Singaraju: Figs. 11-13 & ¶ [0026] & ¶ [0055]. Singaraju notes that the number of utterances are presented in a comprehensive user interface for selecting a correspondence between an expected intent of a particular utterance for retraining a machine learning model. The number of utterances are altered by an utterance generation engine to form one or more altered utterance which are used to retrain a machine learning model. See also Singaraju at Fig. 11 noting “a process for retraining models using artificial utterances”, Fig. 
12 noting “an interface for retraining models using artificial utterances” and Fig. 13 “an interface for retraining models using artificial utterances”.), wherein the training data (see at least Singaraju: ¶ [0099] & ¶ [0143-0145] & ¶ [0219].) comprises a second instruction usable by one or more computer modules from the plurality of computer modules (see at least Singaraju: Figs. 11-16 & ¶ [0108] & ¶ [0151]. Singaraju teaches that custom components 318 may include customized modules for the specific bot system. For example, a financial bot may include custom components that may be used to, for example, check balance, transfer funds, or pay bills. See also Singaraju at ¶ [0108]: “There might be times when it is desired to provide end users with the option to temporarily leave a first skill they are engaged with to do something in a second skill within the digital assistant. In one example, if an end user is engaged in a conversation with a shopping skill (e.g., the user has made some selections for purchase), the end user may want to jump to a banking skill (e.g., the end user may want to ensure that he/she has enough money for the purchase), and then return to the shopping skill to complete the end user's order. To address this, an action in the first skill can be configured to initiate an interaction with the second different skill in the same digital assistant and then return to the original flow.” See also Singaraju at ¶ [0272].) and generated based on the first utterance (see at least Singaraju: ¶ [0228] & Fig. 11 (step 1130) & Figs. 12-16. Singaraju teaches that a confidence value corresponding to a first utterance of the one or more utterances may be 0.90 (or 90% confidence that the utterance corresponds to a particular intent/skill). A confidence value corresponding to a second utterance may be 0.67 (or 67%).
The first utterance is not determined to be in the subset of training utterances, because its corresponding confidence level is greater than the threshold confidence value.); - retraining the AI model (see at least Singaraju: Figs. 11-13 & ¶ [0026] & ¶ [0055]. Singaraju notes that a number of utterances are presented in a comprehensive user interface for selecting a correspondence between an expected intent of a particular utterance for retraining a machine learning model. The utterances are altered by an utterance generation engine to form one or more altered utterances, which are used to retrain a machine learning model. See also Singaraju at Fig. 11 noting “a process for retraining models using artificial utterances”, Fig. 12 noting “an interface for retraining models using artificial utterances” and Fig. 13 “an interface for retraining models using artificial utterances”.) using the training data (see at least Singaraju: ¶ [0099] & ¶ [0143-0145] & ¶ [0219]. Singaraju notes that a skill bot is trained based upon the intents configured for the skill bot and the example utterances associated with the intents (collectively, the training data), so that the skill bot can resolve user input to one of its configured intents. A skill bot is represented by a model that is trained using the training data and allows the skill bot to discern what end users say (or in some cases, are trying to say). See also Singaraju at ¶ [0143]: “Intent modeler may also include logic to detect words which have the same meaning within an end user message. For example, if the training dataset includes: “Mary ran to Texas” and “Bob walked to Detroit,” both mapped to the same intent, and run/walk appear in the same set of intents, intent modeler 314 may learn that for the purposes of intent resolution run=walk.
In one illustrative example, “Mary ran to Texas” may become “PERSON run to LOCATION” and “Bob walked to Detroit” may become “PERSON walk to LOCATION.” See also Singaraju at ¶ [0145]: Every sentence in a training dataset, once normalized, may automatically become a rule. In such examples, a training dataset may include a very small number of short sentences. The template rule may return a probability of 1. New rules may be generated from rules via a process of induction. For example, the following sentences may belong to track spending: “How much did I spend last month on gas?” and “How much did I spend in May on food?”. The sentences may be used to induce the rule “How much did I spend” as that is the part which is shared between them. In other examples, the training dataset may include the phrase “How much did I spend” to achieve the same result. See also Singaraju at ¶ [0166]: The skill may be trained to infer user intents when it parses the user input. Specifically, the skill may be trained with the intents and their utterances (collectively, the training data), so that the skill can resolve the user input to one of the intents. The trained skill may not only recognize the sample phrases that belong to each intent, but also recognize similar phrases that correspond to each intent. See also Singaraju at ¶ [0219]: Then a training dataset must be generated that will effectively reduce those deficiencies. The training dataset will include sets of training utterances and “ground truth” intents/skills corresponding to the training utterances. The ML model will then be trained using the training utterances using the ground truth intent/skills to refine the model to better predict intents/skills for utterances. Obtaining the training dataset is difficult and manually selecting and building a training dataset is inefficient and resource intensive.). 
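For readers tracing the cited mechanism, the Fig. 11 / ¶ [0228] selection logic that the rejection relies on reduces to a simple confidence filter: any utterance scored at or below the threshold joins the retraining subset. The sketch below is purely illustrative, with hypothetical names; it is not code from either cited reference.

```python
# Illustrative sketch of Singaraju Fig. 11 / para. [0228]: utterances whose
# intent-classification confidence is less than or equal to a threshold are
# collected into the subset used (after ground-truth labeling) to retrain.
# All names here are hypothetical; not code from the record.

def select_retraining_subset(scored_utterances, threshold=0.70):
    """Keep (utterance, confidence) pairs at or below the threshold."""
    return [(text, conf) for text, conf in scored_utterances if conf <= threshold]

# Mirrors the 0.90 / 0.67 example cited from para. [0228]: only the
# low-confidence utterance survives the filter.
scored = [
    ("What's my savings account balance?", 0.90),
    ("How much is in my checking account?", 0.67),
]
subset = select_retraining_subset(scored)
```

Note the inclusive comparison: per the quoted passage, an utterance exactly at the threshold is still included in the retraining subset.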
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston's method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: determining that a performance level of the AI model in generating instructions to the plurality of computer modules is below a threshold based on the set of quality metrics & in response to determining that the performance level is below the threshold, generating training data for retraining the AI model, wherein the training data comprises a second instruction usable by one or more computer modules from the plurality of computer modules and generated based on the first utterance and retraining the AI model using the training data, and in view of Singaraju, whereby an intelligent bot, generally powered by artificial intelligence (AI), can communicate more intelligently and contextually in live conversations, and thus may allow for a more natural conversation between the bot and the end users for improved conversational experience. Rather than the end user learning a fixed set of keywords or commands that the bot knows how to respond to, an intelligent bot may be able to understand the end user's intention based upon user utterances in natural language and respond accordingly (see at least Singaraju: ¶ [0003].). The outputs of the machine learning model or the output for a particular inference may not provide insights into the machine learning model such that a user may understand the behavior of the machine learning model, such as why a particular input would generate a particular output, to determine whether a model and/or a particular prediction is trustworthy and how to improve the training data and the model. Time for training or retraining machine learning models may be constrained by the ability to generate these training sets.
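The intent-modeler behavior quoted earlier from Singaraju ¶ [0143]-[0145] (entity normalization to placeholder tokens and induction of a rule from the part two training sentences share) can be sketched as follows. This is an illustrative sketch only: the tiny entity and lemma tables stand in for what the reference describes as learned behavior (e.g., that run = walk for intent resolution), and none of the names come from the record.

```python
# Illustrative sketch of Singaraju paras. [0143]-[0145]: normalize entity
# mentions to placeholder types (treating run/walk as equivalent lemmas),
# then induce a rule from the leading words two sentences have in common.
# Hypothetical names and toy lookup tables; not code from either reference.

ENTITIES = {"Mary": "PERSON", "Bob": "PERSON", "Texas": "LOCATION", "Detroit": "LOCATION"}
LEMMAS = {"ran": "run", "walked": "walk"}  # stands in for learned run = walk

def normalize(sentence):
    """Replace entity mentions with their types and verbs with their lemmas."""
    return " ".join(ENTITIES.get(w, LEMMAS.get(w, w)) for w in sentence.split())

def induce_shared_rule(a, b):
    """Induce a rule from the shared leading words of two training sentences."""
    shared = []
    for wa, wb in zip(a.split(), b.split()):
        if wa != wb:
            break
        shared.append(wa)
    return " ".join(shared)

# "Mary ran to Texas" -> "PERSON run to LOCATION", per the quoted example.
normalized = normalize("Mary ran to Texas")
# The two track-spending sentences share "How much did I spend".
rule = induce_shared_rule("How much did I spend last month on gas?",
                          "How much did I spend in May on food?")
```

The shared-prefix induction is the simplest reading of the "part which is shared between them" language; the reference itself leaves the induction procedure unspecified.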
The system of Singaraju provides a comprehensive way to integrate an inference explanation system into a retraining system to improve a machine learning model (see at least Singaraju: ¶ [0005-0007].). Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Singaraju, the results of the combination were predictable. Regarding Independent Claim 15, Johnston's non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following: - having stored thereon machine-readable instructions executable to cause a machine to perform operations comprising (see at least Johnston: Col. 22, Lns. 21-29.); - accessing a conversation conducted between an artificial intelligence (AI) model and a user via a chat interface during a chat session (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 5, Lns. 20-32. Johnston teaches that the system in its simplest form can be seen in FIG. 2, in which a consumer interacts with the system, and the system uses a dialog 600 to manage the conversation and uses services 800 and 900 to understand and respond to the consumer (AI services 800, or human interaction 900). The next steps are determined by the dialog 600. In case of asynchronous conversations, such as with a social post, SMS, and chat, the steps can be asynchronous.
There may also be asynchronous steps within a synchronous conversation, such as during live voice communication, in which the system may proceed to interpret the next task of the user while also performing an action (e.g., handling payment) corresponding to prior recognized tasks. See also Johnston at Col. 4, Lns. 40-47: Johnston teaches that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. A conversation 103 is then started. See also Johnston at Col. 6, Lns. 17-27: “AI 800 would train on AI data, which can be reinforced with AI models.”), wherein the conversation comprises a first utterance provided by the user (see also Johnston at Figs. 10-11 and 13 noting: “a first computer module instruction to a computer module for performing a task based on a first utterance”, specifically “AI assist to HI module 1360”. See also Johnston at Col. 10, Lns. 14-20 noting “a credit card number can be broken up into several utterances (e.g., the credit identification number, and two sets of four digits, the expiry date, and the security code) and distributed to different agents 950.”), and wherein the AI model is configured and trained (see at least Johnston: Col. 6, Lns. 15-20. Johnston teaches that the system tries to maximize the quality of the data it stores. The system distinguishes data that is obtained from AI 800 from data that is obtained from HI 900. Without this distinction, AI 800 would train on AI data, which can be reinforced with AI models.) to provide a first instruction to a computer module for performing a task for the user based on the first utterance (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 4, Lns. 40-47 & Col. 18, Lns. 
34-58. Johnston teaches that FIG. 10 provides a schematic view of the key elements of the agent desktop interface 1000 provided by some embodiments of the agent assistance module. The dialog view 1010 contains a view of the transcription of the conversation that unfolds in real time as the customer and agent communicate. This view also contains annotation of utterance classification or the results of other types of NLP on the transcription or chat. The concept scratchpad 120 contains visual elements corresponding to semantic concepts that have appeared in the dialog (e.g., the location "Portland, Oregon", and the date "Jan. 5, 2020"). These can be dragged from the scratchpad and dropped into particular fields in the form 1030, representing a workflow that the agent is completing (e.g., in the example of FIG. 10, to identify available rooms for the customer). See also Johnston at Col. 4, Lns. 40-47: Johnston notes that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. See also Johnston at Figs. 10-11 and 13 noting: “a first computer module instruction to a computer module for performing a task based on a first utterance”, specifically “AI assist to HI module 1360”. See also Johnston at Col. 10, Lns. 14-20 noting “a credit card number can be broken up into several utterances (e.g., the credit identification number, and two sets of four digits, the expiry date, and the security code) and distributed to different agents 950.”); - deriving a set of quality metrics (see at least Johnston: Col. 6, Lns. 22-27. 
Johnston notes that the system uses success metrics, such as time to perform the same tasks across all understanding services 550, or the number of conversational turns to perform the same tasks, as examples of quality metrics. Quality metrics are used by machine learning to weight HI data for either supervised or unsupervised learning. See also Johnston at Col. 10, Lns. 50-67 & Col. 11, Lns. 1-3. Johnston teaches that the data provided by HI 900 becomes valuable training data for the machine learning 300. The system also tracks the agent 950, using analytics and metrics to assess and quantify agent quality. Quality of agent performance enables the machine learning 300 to weight the data for building better models. Also, measurements of agent quality can be obtained by comparing the effort and time an agent takes to perform certain tasks to the effort and time that other agents take. The system understands how to perform "gold standard" transactions. A new agent, or existing agents, are asked to perform these transactions by sending text, media, or recorded consumer conversations for specific tasks to the agent. The results of these transactions are then graded as a quality metric. Agent voice quality is analyzed and used as one of the characteristics of the agent. Other indications of agent quality can be obtained through surveys. See also Johnston at Col. 11, Lns. 4-8: “Agent quality grades, and data entered by the agent (successful transactions).” See also Johnston at Col. 14, Lns. 18-25.) associated with the AI model (see at least Johnston: Col. 6, Lns. 15-20 & Col. 14, Lns. 55-62. Johnston teaches that the system tries to maximize the quality of the data it stores. The system distinguishes data that is obtained from AI 800 from data that is obtained from HI 900. Without this distinction, AI 800 would train on AI data, which can be reinforced with AI models. See also Johnston at Col. 14, Lns. 55-62. 
Johnston teaches that the knowledge graph gives a dialog manager an organization and method regarding what can be ordered and how to select choices, where the agent scripts give the workflow-script analysis 150A the types of "prompts" associated with order, the many consumer conversations provide data for the AI models (e.g., transcribed words, associated sounds, entities and intents from transcriptions.) based on a second utterance (see also Johnston at Figs. 10-11 & Fig. 13 & at Col. 10, Lns. 14-20 noting “a credit card number can be broken up into several utterances (e.g., the credit identification number, and two sets of four digits, the expiry date, and the security code) and distributed to different agents 950”. See also Johnston at Col. 18, Lns. 34-58. Johnston teaches that FIG. 10 provides a schematic view of the key elements of the agent desktop interface 1000 provided by some embodiments of the agent assistance module. The dialog view 1010 contains a view of the transcription of the conversation that unfolds in real time as the customer and agent communicate. This view also contains annotation of utterance classification or the results of other types of NLP on the transcription or chat. The concept scratchpad 120 contains visual elements corresponding to semantic concepts that have appeared in the dialog (e.g., the location "Portland, Oregon", and the date "Jan. 5, 2020"). These can be dragged from the scratchpad and dropped into particular fields in the form 1030, representing a workflow that the agent is completing (e.g., in the example of FIG. 10, to identify available rooms for the customer).) provided by the user during the chat session (see also Johnston at Col. 4, Lns. 40-47 & Figs. 10-11 & Fig. 13. Johnston notes that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. 
This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples.). Johnston's non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system does not explicitly disclose, but Singaraju in the analogous art for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following limitations: - determining that a performance level (see at least Singaraju: Fig. 3 & ¶ [0138] & Figs. 11-16. Singaraju teaches that the new bot system may need to be monitored, debugged, and modified in order to improve the performance of the bot system and user experience with the bot system. In many cases, it may be difficult to more specifically identify the root causes of the lower-than-desired performance of the bot system and determine how to improve the bot system without using analytics or optimization tools. See also Singaraju at ¶ [0141]: “FIG. 3 depicts an integrated system 300 including a bot system (such as bot system 220) and a bot analytic system for monitoring, analyzing, visualizing, and improving the performance of the bot system”. See also Singaraju at Fig. 11, in which step 1150 loops back to step 1110 for re-training.) of the AI model (see at least Singaraju: ¶ [abstract] & ¶ [0052-0053]. Singaraju notes that the one or more synthetic utterances may be used to generate a new training dataset for training a machine-learning model. The training dataset may be refined according to a threshold confidence value to filter out datasets for training.) in generating instructions to the computer module (see at least Singaraju: ¶ [0151-0153] & Figs. 11-16. Singaraju notes that the events may be generated based upon one or more instructions included in the bot system.
For example, an event may be generated when the bot system has entered into a particular state, where the particular state is defined by an administrator or developer of the bot system. Custom components 318 may include customized modules for the specific bot system. For example, a financial bot may include custom components that may be used to, for example, check balance, transfer funds, or pay bills.) is below a threshold (see at least Singaraju: ¶ [0014] & ¶ [0226-0228] & Figs. 11-16. Singaraju notes “one or more utterances having a confidence value less than or equal to the threshold value”. See also Singaraju at Fig. 11 and ¶ [0228]: “At 1130, a subset of training utterances is determined, each utterance of the subset of training utterances corresponding to a confidence value less than or equal to the threshold confidence value. Any utterance in the one or more received utterances corresponding to a confidence level less than or equal to the threshold confidence value will be included in the subset of training utterances and ultimately used to retrain the model”. See also Singaraju at Fig. 12 and ¶ [0236]: The selected utterance in interactive intent selector 1250 may be used as the ground truth intent for the corresponding utterance and cause inclusion of the utterance in the subset of training utterances, provided the utterance is associated with a confidence score less than or equal to the threshold confidence score.) based on the set of quality metrics (see at least Singaraju: ¶ [0163-0165] & ¶ [0198]. Singaraju teaches that “higher confidence metrics will correspond to relatively higher degrees of correspondence between the nodes of the trained neural network when the predicted intent/skill is output. The confidence value of an associated set of anchors may be derived, in part, from the confidence metrics of each of the synthetic utterances generated for the associated set of anchors”.
See also Singaraju at ¶ [0163-0165]: “REST server 380 may analyze the enriched events and other information and generate various reports based on certain aggregate metrics 372. The reports may be displayed to an owner, administrator, or developer of the bot system on user interface 392 through UI server 390. For example, if the skill enables users to perform various banking transactions, the intents for the skill can include, for example, “Check Balance” or “Transfer Money.” Intents not only describe what the skill can do, but may also be an integral part of the skill's intelligence. The intents enable the skill to recognize user input because each intent can have a set of typical user statements (i.e., utterances) associated with it. While these utterances may share the same meaning, they may be different (for example, “What's my savings account balance?” and “How much is in my checking account?”); - generating training data (see at least Singaraju: ¶ [0099] & ¶ [0143-0145] & ¶ [0219]. Singaraju notes that a skill bot is trained based upon the intents configured for the skill bot and the example utterances associated with the intents (collectively, the training data), so that the skill bot can resolve user input to one of its configured intents. A skill bot is represented by a model that is trained using the training data and allows the skill bot to discern what end users say (or in some cases, are trying to say). See also Singaraju at ¶ [0143]: “Intent modeler may also include logic to detect words which have the same meaning within an end user message. For example, if the training dataset includes: “Mary ran to Texas” and “Bob walked to Detroit,” both mapped to the same intent, and run/walk appear in the same set of intents, intent modeler 314 may learn that for the purposes of intent resolution run=walk.
In one illustrative example, “Mary ran to Texas” may become “PERSON run to LOCATION” and “Bob walked to Detroit” may become “PERSON walk to LOCATION.” See also Singaraju at ¶ [0145]: Every sentence in a training dataset, once normalized, may automatically become a rule. In such examples, a training dataset may include a very small number of short sentences. The template rule may return a probability of 1. New rules may be generated from rules via a process of induction. For example, the following sentences may belong to track spending: “How much did I spend last month on gas?” and “How much did I spend in May on food?”. The sentences may be used to induce the rule “How much did I spend” as that is the part which is shared between them. In other examples, the training dataset may include the phrase “How much did I spend” to achieve the same result. See also Singaraju at ¶ [0166]: The skill may be trained to infer user intents when it parses the user input. Specifically, the skill may be trained with the intents and their utterances (collectively, the training data), so that the skill can resolve the user input to one of the intents. The trained skill may not only recognize the sample phrases that belong to each intent, but also recognize similar phrases that correspond to each intent. See also Singaraju at ¶ [0219]: Then a training dataset must be generated that will effectively reduce those deficiencies. The training dataset will include sets of training utterances and “ground truth” intents/skills corresponding to the training utterances. The ML model will then be trained using the training utterances using the ground truth intent/skills to refine the model to better predict intents/skills for utterances. Obtaining the training dataset is difficult and manually selecting and building a training dataset is inefficient and resource intensive.) for retraining the AI model (see at least Singaraju: Figs. 11-13 & ¶ [0026] & ¶ [0055]. 
Singaraju notes that the number of utterances are presented in a comprehensive user interface for selecting a correspondence between an expected intent of a particular utterance for retraining a machine learning model. The number of utterances are altered by an utterance generation engine to form one or more altered utterances which are used to retrain a machine learning model. See also Singaraju at Fig. 11 noting “a process for retraining models using artificial utterances”, Fig. 12 noting “an interface for retraining models using artificial utterances” and Fig. 13 noting “an interface for retraining models using artificial utterances”.), wherein the training data (see at least Singaraju: ¶ [0099] & ¶ [0143-0145] & ¶ [0219].) comprises a second computer module instruction usable by the computer module (see at least Singaraju: Figs. 11-16 & ¶ [0108] & ¶ [0151]. Singaraju teaches that custom components 318 may include customized modules for the specific bot system. For example, a financial bot may include custom components that may be used to, for example, check balance, transfer funds, or pay bills. See also Singaraju at ¶ [0108]: “There might be times when it is desired to provide end users with the option to temporarily leave a first skill they are engaged with to do something in a second skill within the digital assistant. In one example, if an end user is engaged in a conversation with a shopping skill (e.g., the user has made some selections for purchase), the end user may want to jump to a banking skill (e.g., the end user may want to ensure that he/she has enough money for the purchase), and then return to the shopping skill to complete the end user's order. To address this, an action in the first skill can be configured to initiate an interaction with the second different skill in the same digital assistant and then return to the original flow.” See also Singaraju at ¶ [0272].) and generated based on the first utterance (see at least Singaraju: ¶ [0228] & Fig.
11 (step 1130) & Figs. 12-16. Singaraju teaches that a confidence value corresponding to a first utterance of the one or more utterances may be 0.90 (or 90% confidence that the utterance corresponds to a particular intent/skill). A confidence value corresponding to a second utterance may be 0.67 (or 67%). The first utterance is not determined to be in the subset of training utterances, because its corresponding confidence level is greater than the threshold confidence value.); - retraining the AI model (see at least Singaraju: Figs. 11-13 & ¶ [0026] & ¶ [0055]. Singaraju notes that the number of utterances are presented in a comprehensive user interface for selecting a correspondence between an expected intent of a particular utterance for retraining a machine learning model. The number of utterances are altered by an utterance generation engine to form one or more altered utterances which are used to retrain a machine learning model. See also Singaraju at Fig. 11 noting “a process for retraining models using artificial utterances”, Fig. 12 noting “an interface for retraining models using artificial utterances” and Fig. 13 noting “an interface for retraining models using artificial utterances”.) using the training data (see at least Singaraju: ¶ [0099] & ¶ [0143-0145] & ¶ [0219]. Singaraju notes that a skill bot is trained based upon the intents configured for the skill bot and the example utterances associated with the intents (collectively, the training data), so that the skill bot can resolve user input to one of its configured intents. A skill bot is represented by a model that is trained using the training data and allows the skill bot to discern what end users say (or in some cases, are trying to say). See also Singaraju at ¶ [0143]: “Intent modeler may also include logic to detect words which have the same meaning within an end user message.
For example, if the training dataset includes: “Mary ran to Texas” and “Bob walked to Detroit,” both mapped to the same intent, and run/walk appear in the same set of intents, intent modeler 314 may learn that for the purposes of intent resolution run=walk. In one illustrative example, “Mary ran to Texas” may become “PERSON run to LOCATION” and “Bob walked to Detroit” may become “PERSON walk to LOCATION.” See also Singaraju at ¶ [0145]: Every sentence in a training dataset, once normalized, may automatically become a rule. In such examples, a training dataset may include a very small number of short sentences. The template rule may return a probability of 1. New rules may be generated from rules via a process of induction. For example, the following sentences may belong to track spending: “How much did I spend last month on gas?” and “How much did I spend in May on food?”. The sentences may be used to induce the rule “How much did I spend” as that is the part which is shared between them. In other examples, the training dataset may include the phrase “How much did I spend” to achieve the same result. See also Singaraju at ¶ [0166]: The skill may be trained to infer user intents when it parses the user input. Specifically, the skill may be trained with the intents and their utterances (collectively, the training data), so that the skill can resolve the user input to one of the intents. The trained skill may not only recognize the sample phrases that belong to each intent, but also recognize similar phrases that correspond to each intent. See also Singaraju at ¶ [0219]: Then a training dataset must be generated that will effectively reduce those deficiencies. The training dataset will include sets of training utterances and “ground truth” intents/skills corresponding to the training utterances. The ML model will then be trained using the training utterances using the ground truth intent/skills to refine the model to better predict intents/skills for utterances. 
Obtaining the training dataset is difficult and manually selecting and building a training dataset is inefficient and resource intensive.). It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston's non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: determining that a performance level of the AI model in generating instructions to the computer module is below a threshold based on the set of quality metrics & generating training data for retraining the AI model, wherein the training data comprises a second instruction usable by the computer module and generated based on the first utterance and retraining the AI model using the training data, and in view of Singaraju, whereby an intelligent bot, generally powered by artificial intelligence (AI), can communicate more intelligently and contextually in live conversations, and thus may allow for a more natural conversation between the bot and the end users for an improved conversational experience. Rather than the end user learning a fixed set of keywords or commands that the bot knows how to respond to, an intelligent bot may be able to understand the end user's intention based upon user utterances in natural language and respond accordingly (see at least Singaraju: ¶ [0003].). The outputs of the machine learning model or the output for a particular inference may not provide insights into the machine learning model such that a user may understand the behavior of the machine learning model, such as why a particular input would generate a particular output, to determine whether a model and/or a particular prediction is trustworthy and how to improve the training data and the model.
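The utterance-normalization and rule-induction behavior quoted above from Singaraju ¶ [0143-0145] can be sketched as follows. This is an illustrative approximation only, not code from either reference; the entity lists, the lemma map, and the function names are all hypothetical.

```python
# Illustrative sketch of the intent-normalization and shared-prefix rule
# induction described in Singaraju ¶ [0143-0145]; all names are hypothetical.
PERSONS = {"Mary", "Bob"}
LOCATIONS = {"Texas", "Detroit"}
LEMMAS = {"ran": "run", "walked": "walk"}

def normalize(utterance: str) -> str:
    """Replace named entities with placeholder tokens and lemmatize verbs."""
    out = []
    for word in utterance.split():
        if word in PERSONS:
            out.append("PERSON")
        elif word in LOCATIONS:
            out.append("LOCATION")
        else:
            out.append(LEMMAS.get(word, word))
    return " ".join(out)

def induce_rule(a: str, b: str) -> str:
    """Induce a rule from the shared prefix of two same-intent sentences."""
    shared = []
    for wa, wb in zip(a.split(), b.split()):
        if wa != wb:
            break
        shared.append(wa)
    return " ".join(shared)

print(normalize("Mary ran to Texas"))   # PERSON run to LOCATION
print(induce_rule("How much did I spend last month on gas?",
                  "How much did I spend in May on food?"))
# How much did I spend
```

This mirrors the quoted examples: “Mary ran to Texas” normalizes to “PERSON run to LOCATION”, and the two track-spending sentences induce the shared rule “How much did I spend”.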
Time for training or retraining machine learning models may be constrained by the ability to generate these training sets. The system of Singaraju provides a comprehensive way to integrate an inference explanation system into a retraining system to improve a machine learning model (see at least Singaraju: ¶ [0005-0007].). Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Singaraju, the results of the combination were predictable. Regarding Dependent Claim 2, Johnston / Singaraju system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the limitations of Independent Claim 1 above, and Johnston further teaches the system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system comprising: - obtain a first response provided by the user via the chat interface during the chat session, the first response being responsive to a first question provided by the AI model during the chat session (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 9, Lns. 1-9. Johnston teaches that when the consumer is asked for choice of payment type (e.g., "how do you want to pay for ... "), the consumer may answer "by credit card". See also Johnston at Col. 4, Lns. 48-57: Johnston notes that one of the first responses from the system may be to identify the consumer. For example, the consumer may hear or see a system greeting such as, "Welcome to ABC company's virtual agent! What is your name?" 
Identification of the consumer may use consumer security services 106, biometrics, and/or other means of identification, such as personal information. The user may answer the system by stating his or her name, and the system could use both name biometrics, and name lookup based on the user's provided name, as two factor modes of identification. See also Johnston at Col. 6, Lns. 17-27: “AI 800 would train on AI data, which can be reinforced with AI models.”) - evaluate a quality of the first response, wherein the set of quality metrics is derived further based on the quality of the first response (see at least Johnston: Col. 9, Lns. 21-30 & Col. 10, Lns. 50-67 & Col. 11, Lns. 1-8. Johnston teaches that when the automation obtains a configured success threshold, which can be measured as the percent successful AI performance of the tasks, vs HI task performance, these tasks 640 are added into the capabilities of the advanced dialog 600 and can be assigned by the distributor 500. This threshold could be calculated automatically by the system, where the system weighs the goals of support costs vs. consumer time, and determines that 90% of HI success is an acceptable threshold, and/or could be added into the system in 185. See also Johnston at Col. 6, Lns. 21-27: Johnston notes that the system uses success metrics, such as time to perform the same tasks across all understanding services 550, or the number of conversational turns to perform the same tasks, as examples of quality metrics. Quality metrics are used by machine learning to weight HI data for either supervised or unsupervised learning. See also Johnston at Col. 10, Lns. 50-67 & Col. 11, Lns. 1-8. Also, Johnston at Col. 14, Lns. 18-25.
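The configured success-threshold comparison Johnston describes (AI task performance measured against a fraction, e.g. 90%, of human-intelligence success) can be sketched roughly as below. The function name, parameter names, and example rates are hypothetical illustrations, not taken from the reference.

```python
# Illustrative sketch of Johnston's configured success threshold: a task is
# promoted to automation once AI success reaches a configured fraction
# (e.g., 90%) of human-intelligence (HI) success. All names are hypothetical.
def ai_ready_for_task(ai_success_rate: float,
                      hi_success_rate: float,
                      threshold: float = 0.90) -> bool:
    """True if AI success is at least `threshold` times HI success."""
    return ai_success_rate >= threshold * hi_success_rate

print(ai_ready_for_task(0.85, 0.90))  # True  (0.85 >= 0.81)
print(ai_ready_for_task(0.70, 0.90))  # False (0.70 <  0.81)
```

In Johnston the threshold itself may be computed automatically by weighing support costs against consumer time; here it is simply a fixed parameter.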
Johnston notes that successful approaches to performing a function (as indicated by confidence scores), whether automated or performed by an agent, together with characteristics of the consumer, capabilities of the agent, and their quality as measured by results, will be given higher weighting, and the distributor will use this information to assign work appropriately.). Regarding Dependent Claim 3, Johnston / Singaraju system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the limitations of Claims 1-2 above, and Singaraju further teaches the system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system comprising: - determine that the quality of the first response falls below a response quality threshold (see at least Singaraju: ¶ [0231] & Figs. 11-16.); - in response to determining that the quality of the first response falls below the response quality threshold (see at least Singaraju: ¶ [0231] & Figs. 11-16.), cause the AI model (see at least Singaraju: Figs. 11-13 & ¶ [0112] & ¶ [0231].) to generate a second question based on modifying the first question (see at least Singaraju: ¶ [0135] & ¶ [0212] & ¶ [0247]. Singaraju teaches that FIG. 8A illustrates an example of simulated CPU usage during an explanation computation. In the illustrated example, the utterance to be classified and explained is “another agent question”. See also Singaraju at ¶ [0247] & Fig. 18: The utterance analyzed is “another agent question” as shown in the table above. The output may include a link 1810 to a report, a list of anchors 1820 selected, for example as part of 530 of process 500, and the corresponding confidence levels 1830 for each set of anchors of the list of anchors 1820. For example, the confidence level associated with anchor set “question” may be about 33.8%, corresponding to a 0.33 confidence metric that a correct intent was predicted for an input utterance.
The confidence associated with anchor “question another”, using the anchor words “question” and “another”, is about 100%. This represents a predicted correct answer with complete accuracy given the anchor words (e.g., the ML model has previously been trained on those anchor words in great volume).) It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston / Singaraju system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: determine that the quality of the first response falls below a response quality threshold and in response to determining that the quality of the first response falls below the response quality threshold, cause the AI model to generate a second question based on modifying the first question, and in further view of Singaraju, whereby an intelligent bot, generally powered by artificial intelligence (AI), can communicate more intelligently and contextually in live conversations, and thus may allow for a more natural conversation between the bot and the end users for improved conversational experience. Rather than the end user learning a fixed set of keywords or commands that the bot knows how to respond to, an intelligent bot may be able to understand the end user's intention based upon user utterances in natural language and respond accordingly (see at least Singaraju: ¶ [0003].). The outputs of the machine learning model or the output for a particular inference may not provide insights into the machine learning model such that a user may understand the behavior of the machine learning model, such as why a particular input would generate a particular output, to determine whether a model and/or a particular prediction is trustworthy and how to improve the training data and the model.
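The threshold-based selection of retraining utterances cited from Singaraju ¶ [0228] (a 0.90-confidence utterance is excluded from the retraining subset, a 0.67-confidence utterance included) can be sketched as below. The 0.70 threshold, the function name, and the sample utterances are hypothetical; the reference states only that utterances with confidence less than or equal to the threshold enter the subset.

```python
# Illustrative sketch of the threshold-based selection of retraining
# utterances in Singaraju ¶ [0228]: utterances whose intent-prediction
# confidence is at or below the threshold enter the retraining subset.
# The 0.70 threshold and the sample utterances are hypothetical.
def retraining_subset(scored, threshold=0.70):
    """Return utterances whose confidence is <= the threshold."""
    return [utt for utt, conf in scored if conf <= threshold]

scored = [("transfer money to savings", 0.90),  # confident -> excluded
          ("move some cash over", 0.67)]        # low confidence -> included
print(retraining_subset(scored))  # ['move some cash over']
```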
Time for training or retraining machine learning models may be constrained by the ability to generate these training sets. The system of Singaraju provides a comprehensive way to integrate an inference explanation system into a retraining system to improve a machine learning model (see at least Singaraju: ¶ [0005-0007].). Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Singaraju, the results of the combination were predictable. Regarding Dependent Claim 5, Johnston / Singaraju system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the limitations of Independent Claim 1 above, and Johnston further teaches the system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system comprising: - obtain data generated by the AI model and provided to the user via the chat interface during the chat session based on monitoring the conversation (see at least Johnston: Col. 4, Lns. 40-47 & Figs. 10-11 & Fig. 13. Johnston teaches that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. A conversation 103 is then started. See also Johnston at Col. 6, Lns. 
17-27: “AI 800 would train on AI data, which can be reinforced with AI models.” See also Johnston at Col. 9, Lns. 60-67: “Recognition/redaction component 850 is configured to run on live conversation, or on recorded conversations, transcribing and monitoring conversations.” See the chat interfaces monitoring the conversation shown at Johnston in Figs. 10-11 & Fig. 13.); - evaluate a quality of the data generated by the AI model, wherein the set of quality metrics is derived further based on the quality of the data (see at least Johnston: Col. 6, Lns. 10-27. Johnston notes that the lake 150 can consist of conversation data, "consumer inputs into the system", and "responses system outputs to the consumer" (which can be live conversations), recordings, recording transcriptions, text, images, and logs, events, error messages, biometrics AI data, models, model revisions, revisions of the dialog, HI data, HI quality data, consumer data, consumer profile data, company history data, knowledge-bases data, distributor data, and dialog state data. The system tries to maximize the quality of the data it stores. The system distinguishes data that is obtained from AI 800 from data that is obtained from HI 900. See also Johnston at Col. 10, Lns. 50-67 & Col. 11, Lns. 1-3. Johnston teaches that the data provided by HI 900 becomes valuable training data for the machine learning 300. The system also tracks the agent 950, using analytics and metrics to assess and quantify agent quality. Quality of agent performance enables the machine learning 300 to weight the data for building better models. Also, measurements of agent quality can be obtained by comparing the effort and time an agent takes to perform certain tasks to the effort and time that other agents take. The system understands how to perform "gold standard" transactions.
A new agent, or existing agents, are asked to perform these transactions by sending text, media, or recorded consumer conversations for specific tasks to the agent. The results of these transactions are then graded as a quality metric. Agent voice quality is analyzed and used as one of the characteristics of the agent. Other indications of agent quality can be obtained through surveys. See also Johnston at Col. 11, Lns. 4-8: “Agent quality grades, and data entered by the agent (successful transactions).” See also Johnston at Col. 14, Lns. 18-25. See also Johnston at Col. 15, Lns. 10-16: “If AI understanding at a task 640 doesn't understand what the consumer wants (i.e., the confidence score of the Al's interpretation is below a given threshold), it elevates the part of the conversation to the dialog component "above" in the hierarchy.”) Regarding Dependent Claim 9, Johnston / Singaraju method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the limitations of Independent Claim 8 above, and Johnston further teaches the method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system comprising: - determining a request for performing a transaction for the user based on the first utterance (see at least Johnston: Col. 10, Lns. 60-67 & Col. 11, Lns. 1-8 & Col. 13, Lns. 39-40. Johnston teaches that the capabilities of the resources of the system can be adjusted in real time, as the AI learns more transactions. See also Johnston at Col. 9, Lns. 1-11: Johnston notes that the task manager 630 could represent the process of paying by credit card. In this case, the tasks 640 could be to collect credit card number, expiry date, or security code. When the consumer is asked for choice of payment type (e.g., "how do you want to pay for ... "), the consumer may answer "by credit card". 
The Function Manager 620 for "payment" may manage the task managers 630 for all sorts of payment options, such as a check, credit card, PayPal™, Rewards, etc., assigning the work to the "credit card payment" task manager to perform credit card payments. See also Johnston at Col. 10, Lns. 60-67. Johnston teaches that the system understands how to perform "gold standard" transactions. A new agent, or existing agents, are asked to perform these transactions by sending text, media, or recorded consumer conversations for specific tasks to the agent. The results of these transactions are then graded as a quality metric. See also Johnston at Figs. 10-11 and Fig. 13 noting the utterances.), wherein the deriving the set of quality metrics for the interactions (see at least Johnston: Col. 6, Lns. 22-27. Johnston notes that the system uses success metrics, such as time to perform the same tasks across all understanding services 550, or the number of conversational turns to perform the same tasks, as examples of quality metrics. Quality metrics are used by machine learning to weight HI data for either supervised or unsupervised learning. See also Johnston at Col. 10, Lns. 50-67 & Col. 11, Lns. 1-3. Johnston teaches that the data provided by HI 900 becomes valuable training data for the machine learning 300. The system also tracks the agent 950, using analytics and metrics to assess and quantify agent quality. Quality of agent performance enables the machine learning 300 to weight the data for building better models. Also, measurements of agent quality can be obtained by comparing the effort and time an agent takes to perform certain tasks to the effort and time that other agents take. The system understands how to perform "gold standard" transactions. A new agent, or existing agents, are asked to perform these transactions by sending text, media, or recorded consumer conversations for specific tasks to the agent. 
The results of these transactions are then graded as a quality metric. Agent voice quality is analyzed and used as one of the characteristics of the agent. Other indications of agent quality can be obtained through surveys. See also Johnston at Col. 11, Lns. 4-8: “Agent quality grades, and data entered by the agent (successful transactions).” See also Johnston at Col. 14, Lns. 18-25.) comprises determining a number of dialogue turns between the AI model and the user before the transaction is completed (see at least Johnston: Col. 6, Lns. 22-27 & Col. 14, Lns. 36-45. Johnston teaches that the system uses success metrics, such as time to perform the same tasks across all understanding services 550, or the number of conversational turns to perform the same tasks, as examples of quality metrics. Quality metrics are used by machine learning to weight HI data for either supervised or unsupervised learning. See also Johnston at Figs. 10-11 and Fig. 13.). Regarding Dependent Claim 11, Johnston / Singaraju method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the limitations of Independent Claim 8 above, and Johnston further teaches the method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system comprising: - selecting, from a plurality of prompt templates (see at least Johnston: Col. 5, Lns. 12-19 & Col. 15, Lns. 41-63 & Col. 20, Lns. 45-55. Johnston teaches selecting a sample template for their industry, from a GUI listing industry types, e.g. Hospitality, and framework dialogs to configure (reservations, rewards, billing, etc.). The default forms from the template would show the required fields for entry by the user, where data entry could add or delete fields within the forms. The form field ordering may dynamically change by specification. For example, a hotel booking may order the fields as property, then day, then number of nights.
Data entry could change the order to day, then number of nights, and on another form facilitate a property search to select a hotel. See also Johnston at Col. 20, Lns. 45-55: “To design the appropriate language model, three different language models were interpolated, including the current generic ASR model, a database of conversations about hotel reservations ("real hotel data"), and synthetic data generated using the templates of known applicable intents and entities for this domain.”), a particular prompt template usable by the AI model to conduct a set of dialogues with the user (see at least Johnston: Col. 5, Lns. 12-19 & Col. 14, Lns. 56-62 & Col. 17, Lns. 9-54: “For example, the advanced dialog 600 may create dynamic prompting by using NLG (Natural Language Generation) and text-to-speech 107 and media services 108 by asking "How can I help you <first name of consumer>?", where the prompt can include text-to-speech audio and recorded audio.” See also Johnston at Col. 11, Lns. 46-57: “For instance, the prompt may be "welcome to"+<ABC Company>+"how can I help you?", where <ABC Company> represents a variable data item that may be filled with a company name. If a dialog component lacks the required data to execute (e.g., it lacks a value for <ABC Company>), the system can use HI to converse with the consumer while collecting data to configure the task 640.”), wherein the AI model (see at least Johnston: Col. 14, Lns. 56-62: Johnston teaches that the knowledge graph gives a dialog manager an organization and method regarding what can be ordered and how to select choices, where the agent scripts give the workflow-script analysis 150A the types of "prompts" associated with order, the many consumer conversations provide data for the AI models (e.g., transcribed words, associated sounds, entities and intents from transcriptions.) is configured to: - generate a set of questions based on the particular prompt template (see at least Johnston: Col. 5, Lns. 12-19 & Col.
14, Lns. 56-62 & Col. 17, Lns. 9-54: “For example, the advanced dialog 600 may create dynamic prompting by using NLG (Natural Language Generation) and text-to-speech 107 and media services 108 by asking "How can I help you <first name of consumer>?", where the prompt can include text-to-speech audio and recorded audio.” See also Johnston at Col. 4, Lns. 59-62: “The dialog 600 may respond with additional questions.” See also Johnston at Col. 11, Lns. 46-57: “For instance, the prompt may be "welcome to"+<ABC Company>+"how can I help you?", where <ABC Company> represents a variable data item that may be filled with a company name. If a dialog component lacks the required data to execute (e.g., it lacks a value for <ABC Company>), the system can use HI to converse with the consumer while collecting data to configure the task 640.”), and wherein the method further comprises modifying the particular prompt template (see at least Johnston: Col. 11, Lns. 45-67 & Col. 20, Lns. 25-30 & Figs. 10-12. Johnston notes that a "beginning turn" event from the ASR recognizer can be used to detect the time slot when there is a new input for the stream (either the agent stream or the caller stream) and based on this parameter the new tentative prompt can be separated from the rest of the accumulated context for the stream. See also Johnston at Col. 8, Lns.
29-35: “When execution of a dialog component is completed and the business rules 700 are satisfied, the advanced dialog 600 updates the state 695 of the dialog, determines the next step in execution (and possibly executes another prompt to obtain additional needed information, if any), and then supplies the distributor 530 with dialog requirements for an understanding service.”) Regarding Dependent Claim 12, Johnston / Singaraju method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the limitations of Independent Claim 8 above, and Johnston further teaches the method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system comprising: - determining a request for performing a transaction for the user (see at least Johnston: Col. 10, Lns. 60-67 & Col. 11, Lns. 1-8 & Col. 13, Lns. 39-40. Johnston teaches that the capabilities of the resources of the system can be adjusted in real time, as the AI learns more transactions. See also Johnston at Col. 9, Lns. 1-11: Johnston notes that the task manager 630 could represent the process of paying by credit card. In this case, the tasks 640 could be to collect credit card number, expiry date, or security code. When the consumer is asked for choice of payment type (e.g., "how do you want to pay for ... "), the consumer may answer "by credit card". The Function Manager 620 for "payment" may manage the task managers 630 for all sorts of payment options, such as a check, credit card, PayPal™, Rewards, etc., assigning the work to the "credit card payment" task manager to perform credit card payments. See also Johnston at Col. 10, Lns. 60-67. Johnston teaches that the system understands how to perform "gold standard" transactions. A new agent, or existing agents, are asked to perform these transactions by sending text, media, or recorded consumer conversations for specific tasks to the agent. 
The results of these transactions are then graded as a quality metric.) based on the first utterance (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 4, Lns. 40-47 & Col. 18, Lns. 34-58. Johnston teaches that FIG. 10 provides a schematic view of the key elements of the agent desktop interface 1000 provided by some embodiments of the agent assistance module. The dialog view 1010 contains a view of the transcription of the conversation that unfolds in real time as the customer and agent communicate. This view also contains annotation of utterance classification or the results of other types of NLP on the transcription or chat. The concept scratchpad 120 contains visual elements corresponding to semantic concepts that have appeared in the dialog (e.g., the location "Portland, Oregon", and the date "Jan. 5, 2020"). These can be dragged from the scratchpad and dropped into particular fields in the form 1030, representing a workflow that the agent is completing (e.g., in the example of FIG. 10, to identify available rooms for the customer). See also Johnston at Col. 4, Lns. 40-47: Johnston notes that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. See also Johnston at Figs. 10-11 and 13 noting: “a first computer module instruction to a computer module for performing a task based on a first utterance”, specifically “AI assist to HI module 1360”. See also Johnston at Col. 10, Lns. 
14-20 noting “a credit card number can be broken up into several utterances (e.g., the credit identification number, and two sets of four digits, the expiry date, and the security code) and distributed to different agents 950.”) - selecting, from a plurality of prompt templates (see at least Johnston: Col. 5, Lns. 12-19 & Col. 15, Lns. 41-63 & Col. 20, Lns. 45-55. Johnston teaches selecting a sample template for their industry, from a GUI listing industry types, e.g. Hospitality, and framework dialogs to configure (reservations, rewards, billing, etc.). The default forms from the template would show the required fields for entry by the user, where data entry could add or delete fields within the forms. The form field ordering may dynamically change by specification. For example, a hotel booking may order the fields as property, then day, then number of nights. Data entry could change the order to day, then number of nights, and on another form facilitate a property search to select a hotel. See also Johnston at Col. 20, Lns. 45-55: “To design the appropriate language model, three different language models were interpolated, including the current generic ASR model, a database of conversations about hotel reservations ("real hotel data"), and synthetic data generated using the templates of known applicable intents and entities for this domain.”) and based on the request (see at least Johnston: Col. 5, Lns. 53-57 & Col. 7, Lns. 14-16. Johnston teaches that any time that AI services 800 return less than desirable confidence recognition results to the proxy 145, the proxy requests the distributor 500 to use HI services 900 to assist with the understanding of the conversation. See also Johnston at Col. 7, Lns. 14-16: “tasks could be scheduled in the distributor 500 and sent via the API to request.”), a particular prompt template usable by the AI model (see at least Johnston: Col. 11, Lns. 46-67 & Col. 14, Lns.
56-62: Johnston teaches that the knowledge graph gives a dialog manager an organization and method regarding what can be ordered and how to select choices, where the agent scripts give the workflow-script analysis 150A the types of "prompts" associated with order, the many consumer conversations provide data for the AI models (e.g., transcribed words, associated sounds, entities and intents from transcriptions.) to instruct the first computer module for performing the transaction (see at least Johnston: Col. 10, Lns. 50-67 & Col. 11, Lns. 1-13. Johnston notes that given a sufficient amount of data such as tasks performed by all agents 950, transcribed conversations, agent quality grades, and data entered by the agent (successful transactions), machine learning 300 can use that data to generate better models and thus improve AI in the system. The application frameworks 450 consist of executable components which are configurable and designed to execute typical company applications, such as reservations, tech support, collections, or banking. See also Johnston at Col. 13, Lns. 39-40: The capabilities of the resources of the system can be adjusted in real time, as the AI learns more transactions. See also Johnston at Figs. 10-11 & Fig. 13.); - causing the AI model (see at least Johnston: Fig. 8 & Col. 6, Lns 18-20 & Col. 14, Lns. 59-62: “Workflow-script analysis 150A the types of "prompts" associated with order, the many consumer conversations provide data for the AI models (e.g., transcribed words, associated sounds, entities and intents from transcriptions).” AI 800 would train on AI data, which can be reinforced with AI models.) to generate the first instruction for the first computer module based on the particular prompt template (see at least Johnston: Figs. 10-11 & Fig. 13.) 
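For illustration only, and not as a characterization of Johnston's actual implementation: the template-driven selection mapped above (an industry template such as Hospitality with framework dialogs such as reservations, and a prompt template chosen based on the request) reduces to a simple lookup. The registry contents, field names, and function names below are hypothetical.

```python
# Hypothetical sketch of selecting, from a plurality of prompt templates, a
# particular template usable by the AI model based on the request. The template
# registry and its entries are illustrative assumptions, not Johnston's code.

PROMPT_TEMPLATES = {
    ("hospitality", "reservation"): "Find available rooms at {property} for {date}, {nights} night(s).",
    ("hospitality", "billing"): "Retrieve the bill for reservation {reservation_id}.",
}

def select_template(industry: str, task: str) -> str:
    """Select a prompt template keyed on the request's industry and task."""
    try:
        return PROMPT_TEMPLATES[(industry, task)]
    except KeyError:
        raise ValueError(f"No template for industry={industry!r}, task={task!r}")

# The selected template is filled in to form the first instruction for the module.
instruction = select_template("hospitality", "reservation").format(
    property="Hotel Portland", date="Jan. 5, 2020", nights=2
)
print(instruction)
```

The selected template, once populated from the conversation, would serve as the instruction the AI model passes to the first computer module.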
Regarding Dependent Claim 13, Johnston / Singaraju method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the limitations of Claims 8 and 12 above, and Singaraju further teaches the method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system comprising: - further comprising modifying the particular prompt template (see at least Singaraju: ¶ [0119] & ¶ [0145-0146].) in response to determining that the performance (see at least Singaraju: Fig. 3 & ¶ [0138] & Figs. 11-16. Singaraju teaches that the new bot system may need to be monitored, debugged, and modified in order to improve the performance of the bot system and user experience with the bot system. In many cases, it may be difficult to more specifically identify the root causes of the lower than desired performance of the bot system and determine how to improve the bot system without using analytics or optimization tools. See also Singaraju at ¶ [0141]: “FIG. 3 depicts an integrated system 300 including a bot system (such as bot system 220) and a bot analytic system for monitoring, analyzing, visualizing, and improving the performance of the bot system”. See also Singaraju at Fig. 11, steps 1150 back to 1110, showing re-training.) level is below the threshold (see at least Singaraju: ¶ [0014] & ¶ [0226-0228] & Figs. 11-16. Singaraju notes “one or more utterances having a confidence value less than or equal to the threshold value”. See also Singaraju at Fig. 11 and ¶ [0228]: “At 1130, a subset of training utterances is determined, each utterance of the subset of training utterances corresponding to a confidence value less than or equal to the threshold confidence value. 
Any utterance in the one or more received utterances corresponding to a confidence level less than or equal to the threshold confidence value will be included in the subset of training utterances and ultimately used to retrain the model”. See also Singaraju at Fig. 12 and ¶ [0236]: The selected utterance in interactive intent selector 1250 may be used as the ground truth intent for the corresponding utterance and cause inclusion of the utterance in the subset of training utterances, provided the utterance is associated with a confidence score less than or equal to the threshold confidence score.). It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston / Singaraju method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: further comprising modifying the particular prompt template in response to determining that the performance level is below the threshold, and in further view of Singaraju, whereby an intelligent bot, generally powered by artificial intelligence (AI), can communicate more intelligently and contextually in live conversations, and thus may allow for a more natural conversation between the bot and the end users for improved conversational experience. Rather than the end user learning a fixed set of keywords or commands that the bot knows how to respond to, an intelligent bot may be able to understand the end user's intention based upon user utterances in natural language and respond accordingly (see at least Singaraju: ¶ [0003].). 
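For illustration only, and not as Singaraju's implementation: the threshold-based selection quoted from ¶ [0228], where every utterance whose confidence value is less than or equal to the threshold confidence value enters the retraining subset, reduces to a simple filter. The names below are hypothetical.

```python
def retraining_subset(scored_utterances, threshold):
    """Paraphrasing Singaraju ¶ [0228]: keep each utterance whose
    intent-classification confidence is less than or equal to the
    threshold confidence value, for use in retraining the model."""
    return [text for (text, confidence) in scored_utterances if confidence <= threshold]

# Hypothetical scored utterances from a chat session.
scored = [("cancel my card", 0.42), ("check balance", 0.91), ("lost my pin", 0.70)]
subset = retraining_subset(scored, threshold=0.70)
print(subset)  # the low-confidence utterances selected for retraining
```

Note that the boundary case (confidence exactly equal to the threshold) is included, matching the "less than or equal to" language of the quoted passage.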
Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Singaraju, the results of the combination were predictable. Regarding Dependent Claim 14, Johnston / Singaraju method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the limitations of Independent Claim 8 above, and Johnston further teaches the method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system comprising: - wherein the first utterance (see also Johnston at Figs. 10-11 and 13 noting: “a first computer module instruction to a computer module for performing a task based on a first utterance”, specifically “AI assist to HI module 1360”. See also Johnston at Col. 10, Lns. 14-20 noting “a credit card number can be broken up into several utterances (e.g., the credit identification number, and two sets of four digits, the expiry date, and the security code) and distributed to different agents 950.”) is associated with a request for information related to a topic provided by the user (see at least Johnston: Col. 10, Lns. 40-49 & Col. 21, Lns. 20-26 & Figs. 10-13.), wherein the first computer module (see at least Johnston: Fig. 8 & Figs. 10-13) is configured to obtain data related to the topic from a plurality of data sources based on the first instruction (see at least Johnston: Col. 14, Lns. 46-53 & Figs. 10-13. Johnston teaches that the advanced dialog 600, the workflow-script analysis 400, and the machine learning 300, may use information from external sources to build the dialog. See also Col. 14, Lns. 
46-53. Johnston notes that the advanced dialog 600, the workflow-script analysis 400, and the machine learning 300, may use information from external sources to build the dialog. For an example, a food menu with agent "scripts" can be used to build a knowledge graph of items, and the menu data, with agent conversation, agent scripts and transcriptions of conversations, can be used to match an application of the application framework 450 to order food.) and wherein the method further comprises: - generating, by the AI model (see at least Johnston: Fig. 8 & Col. 6, Lns. 18-20 & Col. 14, Lns. 59-62: “Workflow-script analysis 150A the types of "prompts" associated with order, the many consumer conversations provide data for the AI models (e.g., transcribed words, associated sounds, entities and intents from transcriptions).” AI 800 would train on AI data, which can be reinforced with AI models.), content for the user based on the data, wherein the deriving the set of quality metrics (see at least Johnston: Col. 9, Lns. 21-30 & Col. 19, Lns. 34-43.) comprises evaluating a quality of the content generated by the AI model (see at least Johnston: Col. 6, Lns. 22-27. Johnston notes that the system uses success metrics, such as time to perform the same tasks across all understanding services 550, or the number of conversational turns to perform the same tasks, as examples of quality metrics. Quality metrics are used by machine learning to weight HI data for either supervised or unsupervised learning. See also Johnston at Col. 10, Lns. 50-67 & Col. 11, Lns. 1-3. Johnston teaches that the data provided by HI 900 becomes valuable training data for the machine learning 300. The system also tracks the agent 950, using analytics and metrics to assess and quantify agent quality. Quality of agent performance enables the machine learning 300 to weight the data for building better models. 
Also, measurements of agent quality can be obtained by comparing the effort and time an agent takes to perform certain tasks to the effort and time that other agents take. The system understands how to perform "gold standard" transactions. A new agent, or existing agents, are asked to perform these transactions by sending text, media, or recorded consumer conversations for specific tasks to the agent. The results of these transactions are then graded as a quality metric. Agent voice quality is analyzed and used as one of the characteristics of the agent. Other indications of agent quality can be obtained through surveys. See also Johnston at Col. 11, Lns. 4-8: “Agent quality grades, and data entered by the agent (successful transactions).” See also Johnston at Col. 14, Lns. 18-25.). Regarding Dependent Claim 16, Johnston / Singaraju non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the limitations of Independent Claim 15 above, and Johnston further teaches the non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system comprising: - obtain a first response provided by the user via the chat interface during the chat session based on the accessing the conversation, the first response being responsive to a first question provided by the AI model during the chat session (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 9, Lns. 1-9. Johnston teaches that when the consumer is asked for choice of payment type (e.g., "how do you want to pay for ... "), the consumer may answer "by credit card". See also Johnston at Col. 4, Lns. 48-57: Johnston notes that one of the first responses from the system may be to identify the consumer. For example, the consumer may hear or see a system greeting such as, "Welcome to ABC company's virtual agent! What is your name?" 
Identification of the consumer may use consumer security services 106, biometrics, and/or other means of identification, such as personal information. The user may answer the system by stating his or her name, and the system could use both name biometrics, and name lookup based on the user's provided name, as two factor modes of identification. See also Johnston at Col. 6, Lns. 17-27: “AI 800 would train on AI data, which can be reinforced with AI models.”) - evaluate a quality of the first response, wherein the set of quality metrics is derived further based on the quality of the first response (see at least Johnston: Col. 9, Lns. 21-30 & Col. 10, Lns. 50-67 & Col. 11, Lns. 1-8. Johnston teaches that when the automation obtains a configured success threshold, which can be measured as the percent successful AI performance of the tasks, vs HI task performance, these tasks 640 are added into the capabilities of the advanced dialog 600 and can be assigned by the distributor 500. This threshold could be calculated automatically by the system, where the system weighs the goals of support costs vs. consumer time, and determines that 90% of HI success is an acceptable threshold, and/or could be added into the system in 185. See also Johnston at Col. 6, Lns. 21-27: Johnston notes that the system uses success metrics, such as time to perform the same tasks across all understanding services 550, or the number of conversational turns to perform the same tasks, as examples of quality metrics. Quality metrics are used by machine learning to weight HI data for either supervised or unsupervised learning. See also Johnston at Col. 10, Lns. 50-67 & Col. 11, Lns. 1-8. Also, Johnston at Col. 14, Lns. 18-25. 
Johnston notes successful approaches to performing a function (as indicated by confidence scores), whether automated or by agent, characteristic of the consumer, capability of the agent, their quality as measured by results, these successful features will be given higher weighting and the distributor will use this information to assign work appropriately.) Regarding Dependent Claim 19, Johnston / Singaraju non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the limitations of Independent Claim 15 above, and Johnston further teaches the non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system comprising: - obtain content generated by the AI model and provided to the user via the chat interface during the chat session based on the accessing the conversation (see at least Johnston: Col. 4, Lns. 40-47 & Figs. 10-11 & Fig. 13. Johnston teaches that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. A conversation 103 is then started. See also Johnston at Col. 6, Lns. 17-27: “AI 800 would train on AI data, which can be reinforced with AI models.” See also Johnston at Col. 9, Lns. 60-67: “Recognition/redaction component 850 is configured to run on live conversation, or on recorded conversations, transcribing and monitoring conversations.” “See the chat interfaces monitoring the conversation shown at Johnston in Figs. 10-11 & Fig. 
13.”); - evaluate a quality of the content, wherein the set of quality metrics is derived further based on the quality of the data (see at least Johnston: Col. 6, Lns. 10-27. Johnston notes that the lake 150 can consist of conversation data, "consumer inputs into the system", and "responses system outputs to the consumer" (which can be live conversations), recordings, recording transcriptions, text, images, and logs, events, error messages, biometrics AI data, models, model revisions, revisions of the dialog, HI data, HI quality data, consumer data, consumer profile data, company history data, knowledge-bases data, distributor data, and dialog state data. The system tries to maximize the quality of the data it stores. The system distinguishes data that is obtained from AI 800 from data that is obtained from HI 900. See also Johnston at Col. 10, Lns. 50-67 & Col. 11, Lns. 1-3. Johnston teaches that the data provided by HI 900 becomes valuable training data for the machine learning 300. The system also tracks the agent 950, using analytics and metrics to assess and quantify agent quality. Quality of agent performance enables the machine learning 300 to weight the data for building better models. Also, measurements of agent quality can be obtained by comparing the effort and time an agent takes to perform certain tasks to the effort and time that other agents take. The system understands how to perform "gold standard" transactions. A new agent, or existing agents, are asked to perform these transactions by sending text, media, or recorded consumer conversations for specific tasks to the agent. The results of these transactions are then graded as a quality metric. Agent voice quality is analyzed and used as one of the characteristics of the agent. Other indications of agent quality can be obtained through surveys. See also Johnston at Col. 11, Lns. 4-8: “Agent quality grades, and data entered by the agent (successful transactions).” See also Johnston at Col. 14, Lns. 
18-25. See also Johnston at Col. 15, Lns. 10-16: “If AI understanding at a task 640 doesn't understand what the consumer wants (i.e., the confidence score of the AI's interpretation is below a given threshold), it elevates the part of the conversation to the dialog component "above" in the hierarchy.”) 14. Claims 6-7, 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over US Patent # (US 11,743,378 B1) hereinafter Johnston, et al., in view of US PG Pub (US 2022/0058347 A1) hereinafter Singaraju, et al., and in further view of US PG Pub (US 2023/0342557 A1) hereinafter Hasan, et al. Regarding Dependent Claim 6, Johnston / Singaraju system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system as applied to Claims 1 and 5 does not explicitly disclose, but Hasan in the analogous art for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following limitations: - wherein the quality of the data comprises a semantic metric and a syntactic metric (see at least Hasan: Fig. 6 & Figs. 9-10 & ¶ [0231-0232]. Hasan teaches that the “Qscore the primary inputs are syntactic similarity 604, semantic similarity 601, word error rate 606 and language code 608 a person skilled in the art would understand that the mentioned inputs are not to be construed as limiting. The mentioned inputs are combined in different weightages to obtain the Qscore. The method 900, at step 908 may mark utterances with Qscores greater than a first threshold and lesser than a second threshold as optimal. In an embodiment, the first threshold is 70% and the second threshold is 80%”. See also Hasan at ¶ [0237-0238]: “FIG. 10 for using optimal utterances to train a virtual agent may include, at step 1004, computing Qscore for utterances. 
For computing the Qscore the primary inputs are syntactic similarity 604, semantic similarity 601, word error rate 606 and language code 608 a person skilled in the art would understand that the mentioned inputs are not to be construed as limiting. The mentioned inputs are combined in different weightages to obtain the Qscore.” See also Hasan at ¶ [0248].). It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston / Singaraju system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: wherein the quality of the data comprises a semantic metric and a syntactic metric, and in further view of Hasan, whereby analyzing the flow of conversations helps understand the structure and dynamics of the interactions. Techniques such as sequence analysis and dialogue modeling can reveal patterns in the order and structure of user and virtual agent turns. This analysis can identify frequently occurring sequences of actions or identify areas where the conversation flow could be improved (see at least Hasan: ¶ [0131].). Also by considering the language code, scoring mechanisms can be tailored to account for language-specific nuances and improve accuracy in assessing the quality or relevance of the utterance (see at least Hasan: ¶ [0191].). Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Hasan, the results of the combination were predictable. 
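For illustration only: the Qscore described in Hasan ¶ [0231-0232] is a weighted combination of syntactic similarity, semantic similarity, word error rate, and language code, with utterances marked optimal when the score lies between two thresholds (70% and 80% in Hasan's embodiment). Hasan does not fix the weightages, so the weights below are assumptions, and all names are hypothetical.

```python
def qscore(syntactic_sim, semantic_sim, word_error_rate, language_weight,
           weights=(0.35, 0.35, 0.20, 0.10)):
    """Weighted combination per Hasan ¶ [0231-0232]; the specific weightages
    here are assumed for illustration. WER counts against quality, so the
    complement (1 - WER) contributes to the score."""
    w_syn, w_sem, w_wer, w_lang = weights
    return (w_syn * syntactic_sim + w_sem * semantic_sim
            + w_wer * (1.0 - word_error_rate) + w_lang * language_weight)

def is_optimal(score, lower=0.70, upper=0.80):
    """Hasan ¶ [0232]: utterances with Qscores greater than a first threshold
    and lesser than a second threshold are marked optimal."""
    return lower < score < upper

score = qscore(syntactic_sim=0.80, semantic_sim=0.70,
               word_error_rate=0.10, language_weight=0.90)
```

With these illustrative inputs the score falls inside the (0.70, 0.80) band, so the utterance would be marked optimal.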
Regarding Dependent Claim 7, Johnston / Singaraju system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system as applied to Claims 1 and 5 does not explicitly disclose, but Hasan in the analogous art for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following limitations: - wherein evaluating the quality of the data (see at least Hasan: Fig. 6 & Figs. 9-10 & ¶ [0231-0232]. Hasan teaches that the “Qscore the primary inputs are syntactic similarity 604, semantic similarity 601, word error rate 606 and language code 608 a person skilled in the art would understand that the mentioned inputs are not to be construed as limiting. The mentioned inputs are combined in different weightages to obtain the Qscore. The method 900, at step 908 may mark utterances with Qscores greater than a first threshold and lesser than a second threshold as optimal. In an embodiment, the first threshold is 70% and the second threshold is 80%”. See also Hasan at ¶ [0237-0238]: “FIG. 10 for using optimal utterances to train a virtual agent may include, at step 1004, computing Qscore for utterances. For computing the Qscore the primary inputs are syntactic similarity 604, semantic similarity 601, word error rate 606 and language code 608 a person skilled in the art would understand that the mentioned inputs are not to be construed as limiting. The mentioned inputs are combined in different weightages to obtain the Qscore.” See also Hasan at ¶ [0248].) comprises comparing the data against benchmark data (see at least Hasan: ¶ [0151-0153] & ¶ [0164-0168] & ¶ [0189]. Hasan teaches that WER provides a measure of the accuracy of the ASR system by quantifying the errors made in recognizing the utterance compared to the ground truth. WER is a widely used metric for evaluating and benchmarking ASR and virtual agent systems. 
See also Hasan at ¶ [0151-0153]: This may be done by comparing the similarity between the user's query and the virtual agent's response. Techniques such as cosine similarity, semantic matching, or vector representations can be employed to measure relevance. If the virtual agent provided accurate and helpful information, those utterances may be ranked higher. This may be determined by comparing the response with a ground truth or through manual review by experts. See also Hasan at Fig. 6 and Fig. 8). It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston / Singaraju system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: wherein evaluating the quality of the data comprises comparing the data against benchmark data, and in further view of Hasan, whereby analyzing the flow of conversations helps understand the structure and dynamics of the interactions. Techniques such as sequence analysis and dialogue modeling can reveal patterns in the order and structure of user and virtual agent turns. This analysis can identify frequently occurring sequences of actions or identify areas where the conversation flow could be improved (see at least Hasan: ¶ [0131].). Also, by considering the language code, scoring mechanisms can be tailored to account for language-specific nuances and improve accuracy in assessing the quality or relevance of the utterance (see at least Hasan: ¶ [0191].). 
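For illustration only: the word error rate (WER) that Hasan uses for benchmarking against a ground truth is the textbook metric (word-level edit distance over reference length), not code from Hasan. A self-contained sketch:

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Standard WER: word-level edit distance (substitutions + insertions +
    deletions) divided by the number of words in the ground-truth reference.
    This is the generic benchmark metric, not Hasan's implementation."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edits to turn the first i hypothesis words into the first j reference words
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(hyp)][len(ref)] / len(ref)

# One substitution against a five-word reference yields a WER of 0.2.
wer = word_error_rate("book a room in portland", "book a room at portland")
```

Lower WER against the benchmark transcript indicates higher recognition accuracy, which is how the comparison against benchmark data is quantified.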
Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Hasan, the results of the combination were predictable. Regarding Dependent Claim 10, Johnston / Singaraju method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system as applied to Independent Claim 8 does not explicitly disclose, but Hasan in the analogous art for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following limitations: - wherein the retraining the AI model comprises adjusting one or more parameters of the AI model (see at least Hasan: ¶ [0051] & [0097] & ¶ [0204-0205]. Hasan teaches to train the model using the training data, optimizing a suitable loss function with an optimization algorithm such as gradient descent. This involves feeding the training data through the model iteratively, adjusting the model's parameters to minimize the loss and improve classification accuracy. See also Hasan at ¶ [0097]: Hasan notes a review and retrain module 408 feeds the optimal utterances 410 to the virtual agent for training. See also Hasan at ¶ [0051]: The term “artificial intelligence” may be used to refer to a model built using simple or complex Neural Networks using deep learning techniques and computer vision algorithms. Artificial intelligence model learns from the data and applies that learning to achieve specific pre-defined objectives. See also Hasan at Figs. 10-11.). 
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston / Singaraju method for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: wherein the retraining the AI model comprises adjusting one or more parameters of the AI model, and in further view of Hasan, whereby analyzing the flow of conversations helps understand the structure and dynamics of the interactions. Techniques such as sequence analysis and dialogue modeling can reveal patterns in the order and structure of user and virtual agent turns. This analysis can identify frequently occurring sequences of actions or identify areas where the conversation flow could be improved (see at least Hasan: ¶ [0131].). Also, by considering the language code, scoring mechanisms can be tailored to account for language-specific nuances and improve accuracy in assessing the quality or relevance of the utterance (see at least Hasan: ¶ [0191].). Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Hasan, the results of the combination were predictable. 
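For illustration only: the retraining that Hasan ¶ [0204-0205] describes, feeding training data through the model iteratively and adjusting its parameters via gradient descent to minimize a loss, can be shown on a deliberately tiny model. This is a generic sketch of the technique, not Hasan's model; all names are hypothetical.

```python
def retrain_parameter(w, examples, lr=0.1, epochs=100):
    """Fit y ≈ w * x by plain gradient descent on mean squared error,
    iteratively adjusting the single parameter w to minimize the loss
    (a toy instance of Hasan's 'adjusting the model's parameters')."""
    for _ in range(epochs):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in examples) / len(examples)
        w -= lr * grad
    return w

examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # underlying relation y = 2x
w = retrain_parameter(w=0.0, examples=examples)
```

Starting from w = 0, the parameter converges to approximately 2.0, the value that minimizes the loss on these examples.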
Regarding Dependent Claim 20, Johnston / Singaraju non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system as applied to Claims 15 and 19 does not explicitly disclose, but Hasan in the analogous art for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following limitations: - wherein the quality of the content comprises a semantic quality and a syntactic quality (see at least Hasan: Fig. 6 & Figs. 9-10 & ¶ [0231-0232]. Hasan teaches that the “Qscore the primary inputs are syntactic similarity 604, semantic similarity 601, word error rate 606 and language code 608 a person skilled in the art would understand that the mentioned inputs are not to be construed as limiting. The mentioned inputs are combined in different weightages to obtain the Qscore. The method 900, at step 908 may mark utterances with Qscores greater than a first threshold and lesser than a second threshold as optimal. In an embodiment, the first threshold is 70% and the second threshold is 80%”. See also Hasan at ¶ [0237-0238]: “FIG. 10 for using optimal utterances to train a virtual agent may include, at step 1004, computing Qscore for utterances. For computing the Qscore the primary inputs are syntactic similarity 604, semantic similarity 601, word error rate 606 and language code 608 a person skilled in the art would understand that the mentioned inputs are not to be construed as limiting. The mentioned inputs are combined in different weightages to obtain the Qscore.” See also Hasan at ¶ [0248].). 
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston / Singaraju non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: wherein the quality of the content comprises a semantic quality and a syntactic quality, and in further view of Hasan, whereby analyzing the flow of conversations helps understand the structure and dynamics of the interactions. Techniques such as sequence analysis and dialogue modeling can reveal patterns in the order and structure of user and virtual agent turns. This analysis can identify frequently occurring sequences of actions or identify areas where the conversation flow could be improved (see at least Hasan: ¶ [0131].). Also, by considering the language code, scoring mechanisms can be tailored to account for language-specific nuances and improve accuracy in assessing the quality or relevance of the utterance (see at least Hasan: ¶ [0191].). Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Hasan, the results of the combination were predictable. 15. Claims 4 and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over US Patent # (US 11,743,378 B1) hereinafter Johnston, et al., in view of US PG Pub (US 2022/0058347 A1) hereinafter Singaraju, et al., and in further view of US PG Pub (US 2024/0184991 A1) hereinafter Mahabaleshwarkar, et al. 
Regarding Dependent Claim 4, Johnston / Singaraju system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system as applied to Claims 1-3 does not explicitly disclose, but Mahabaleshwarkar in the analogous art for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following limitations: - wherein the first question was generated by the AI model (see at least Mahabaleshwarkar: ¶ [0006-0008] & ¶ [0016] & ¶ [0021].) to prompt the user for data corresponding to a particular data type or particular content using a first syntax (see at least Mahabaleshwarkar: ¶ [0021] & ¶ [0025] & ¶ [0055-0057].), and wherein the second question is generated to prompt the user for the data corresponding to the particular data type or the particular content using a second syntax different from the first syntax (see at least Mahabaleshwarkar: ¶ [0016] & ¶ [0055-0057] & ¶ [0075].). It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston / Singaraju system for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: wherein the first question was generated by the AI model to prompt the user for data corresponding to a particular data type or particular content using a first syntax and wherein the second question is generated to prompt the user for the data corresponding to the particular data type or the particular content using a second syntax different from the first syntax, and in further view of Mahabaleshwarkar, by generating dialogue responses from structured data for conversational artificial intelligence (AI) systems and applications. 
Systems and methods are disclosed for training or updating a machine learning model—such as a deep neural network—for deployment using structured data from dialogues of multiple domains. The systems and methods can generate responses to users to provide a more natural user experience, such as by generating alternative outputs that vary in syntax with respect to how the outputs incorporate data used to respond to user utterances, while still accurately providing information to satisfy requests from users (see at least Mahabaleshwarkar: ¶ [abstract].). Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Mahabaleshwarkar, the results of the combination were predictable. Regarding Dependent Claim 17, Johnston / Singaraju non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system as applied to Claims 15-16 does not explicitly disclose, but Mahabaleshwarkar in the analogous art for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following limitations: - wherein the first question (see at least Mahabaleshwarkar: ¶ [0006] & (Table 1 at ¶ [0055]). Mahabaleshwarkar teaches that the query can be a first query, and the plurality of fields corresponding to the query can be a plurality of first fields. The second query can be linked to the first query.) 
prompts the user for first data corresponding to a first data type, and wherein the evaluating the quality of the first response comprises determining whether the first response comprises the first data corresponding to the first data type (see at least Mahabaleshwarkar: ¶ [0055-0057] & ¶ [0061]. Mahabaleshwarkar notes that the training system 100 can use a function that assigns scores to the candidate outputs based on (i) a response metric and (ii) one or more variation metrics. The response metric can indicate an accuracy in responding to the query represented by the input of the training data element 116 used to generate the candidate output; for example, the training system 100 can determine the response metric according to matching values represented in the candidate output with values represented in the sample response 128.) It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston / Singaraju non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: wherein the first question prompts the user for first data corresponding to a first data type, and wherein the evaluating the quality of the first response comprises determining whether the first response comprises the first data corresponding to the first data type, and in further view of Mahabaleshwarkar, by generating dialogue responses from structured data for conversational artificial intelligence (AI) systems and applications. Systems and methods are disclosed for training or updating a machine learning model—such as a deep neural network—for deployment using structured data from dialogues of multiple domains. 
The systems and methods can generate responses to users to provide a more natural user experience, such as by generating alternative outputs that vary in syntax with respect to how the outputs incorporate data used to respond to user utterances, while still accurately providing information to satisfy requests from users (see at least Mahabaleshwarkar: ¶ [abstract].). Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Mahabaleshwarkar, the results of the combination were predictable. Regarding Dependent Claim 18, Johnston / Singaraju non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the limitations of Independent Claim 15 above, and Johnston further teaches the non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system comprising: - obtaining a first question generated by the AI model and provided to the user via the chat interface based on the accessing the conversation (see at least Johnston: Figs. 10-11 & Fig. 13 & Col. 5, Lns. 20-32. Johnston teaches that the system in its simplest form can be seen in FIG. 2, in which a consumer interacts with the system, and the system uses a dialog 600 to manage the conversation and uses services 800 and 900 to understand and respond to the consumer (AI services 800, or human interaction 900). The next steps are determined by the dialog 600. 
In case of asynchronous conversations, such as with a social post, SMS, and chat, the steps can be asynchronous. There may also be asynchronous steps within a synchronous conversation, such as during live voice communication, in which the system may proceed to interpret the next task of the user while also performing an action (e.g., handling payment) corresponding to prior recognized tasks. See also Johnston at Col. 4, Lns. 40-47: Johnston teaches that a conversation 103 is initiated by a consumer or the system (e.g., by an individual consumer, or by an employee of a company), by connecting a device to the system or connecting to the consumer. This connection could be made by a traditional phone call, mobile call, text message, email, social direct message (DM), a chat session 105, or a social post 160, as non-limiting examples. A conversation 103 is then started. See also Johnston at Col. 6, Lns. 17-27: “AI 800 would train on AI data, which can be reinforced with AI models.”). Regarding Dependent Claim 18, Johnston / Singaraju non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system as applied to Independent Claim 15 does not explicitly disclose, but Mahabaleshwarkar in the analogous art for evaluation of artificial intelligence models usable for a conversational artificial intelligence system teaches the following limitations: - wherein the first question (see at least Mahabaleshwarkar: ¶ [0006] & (Table 1 at ¶ [0055]). Mahabaleshwarkar teaches that the query can be a first query, and the plurality of fields corresponding to the query can be a plurality of first fields. The second query can be linked to the first query.) prompts the user for data corresponding to a particular data type using a first syntax (see at least Mahabaleshwarkar: ¶ [0021] & ¶ [0025] & (Table 1 at ¶ [0055]). 
Mahabaleshwarkar notes that the variational outputs can include at least a first output having a first syntax and a second output having a second syntax that is a variant of the first syntax. The method can include obtaining the one or more values using an application programming interface (API) corresponding to a domain associated with at least one query of the one or more queries. See also Mahabaleshwarkar at ¶ [0055-0056].) - determining that a first answer provided by the user does not correspond to the particular data type (see at least Mahabaleshwarkar: ¶ [0038] & ¶ [0052]. Mahabaleshwarkar notes that the system can include a post-processor to convert the output of the model into the response to be presented to the user (e.g., an answer to the question of “What is the weather in Mountain View tomorrow?”). As noted herein, because the model is trained using training data examples that have sample output responses with variational speech or sentence structure, the output that the model generates can similarly have variations to provide a more natural user experience. This can include generating output that more succinctly and/or precisely provides the information requested in the utterance, as compared with rules-based template response generators that may provide information from all slots for a template even where all the information may not be necessary to satisfy the information requested. See also Mahabaleshwarkar at ¶ [0052]: Mahabaleshwarkar notes that the sample responses 128 can be variational by having variations of syntax (e.g., length; whether particular values 124 are included or not included in the sample responses 128).
See also Mahabaleshwarkar at ¶ [0075]: Each of these model outputs is variational in syntax based on features such as whether or not the high temperature is included in the model output, the length of the model outputs, and the relative positioning of values such as “Mountain View,” “sunny” and “warm.” The machine learning model 180 can generate a first output responsive to receiving an input at a first instance and a second output having a different syntax than the first output responsive to receiving the same input at a second instance.) - causing the AI model to generate a second question (see at least Mahabaleshwarkar: ¶ [0012] & ¶ [0055-0056]. A second training data instance includes a second query linked to the first query, second values corresponding to a plurality of second fields corresponding to the second query, and a plurality of second sample responses corresponding to the second query. The one or more circuits can further update the one or more parameters of the neural network based at least on the plurality of second sample responses, the second values, and a third query comprising the first query and the second query.) to prompt the user for the data corresponding to the particular data type using a second syntax different from the first syntax (see at least Mahabaleshwarkar: ¶ [0055-0056] & ¶ [0075].).
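The limitation mapped above, detecting that a first answer does not supply the expected data type and then re-prompting for the same data with a different syntax, can be sketched as follows. The regular expressions, prompt strings, and data-type names are illustrative assumptions, not taken from Mahabaleshwarkar or the claims.

```python
import re

# Hypothetical patterns for the data types a question may request
PATTERNS = {
    "date": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "amount": r"\$\d+(\.\d{2})?",
}

# Hypothetical prompt variants: same data type, different syntax per attempt
PROMPTS = {
    "date": [
        "What date works for you?",
        "Please enter a date in MM/DD/YYYY format.",
    ],
}

def matches_type(answer: str, data_type: str) -> bool:
    """True if the free-text answer contains the expected data type."""
    return re.search(PATTERNS[data_type], answer) is not None

def next_question(data_type: str, attempt: int) -> str:
    """Re-ask for the same data type, using a different syntax on retry."""
    variants = PROMPTS[data_type]
    return variants[min(attempt, len(variants) - 1)]

# The first answer lacks a recognizable date, so a second, differently
# phrased question is generated for the same data type
first_answer_ok = matches_type("sometime next week", "date")   # False
second_question = next_question("date", attempt=1)
```

The second prompt conveys the same information need with different syntax, which is the behavior the cited paragraphs of Mahabaleshwarkar describe at the model-output level.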
It would have been obvious for one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teachings of Johnston / Singaraju non-transitory machine-readable medium for evaluation of artificial intelligence models usable for a conversational artificial intelligence system with the aforementioned teachings of: wherein the first question prompts the user for data corresponding to a particular data type using a first syntax & determining that a first answer provided by the user does not correspond to the particular data type & causing the AI model to generate a second question to prompt the user for the data corresponding to the particular data type using a second syntax different from the first syntax, and in further view of Mahabaleshwarkar, by generating dialogue responses from structured data for conversational artificial intelligence (AI) systems and applications. Systems and methods are disclosed for training or updating a machine learning model—such as a deep neural network—for deployment using structured data from dialogues of multiple domains. The systems and methods can generate responses to users to provide a more natural user experience, such as by generating alternative outputs that vary in syntax with respect to how the outputs incorporate data used to respond to user utterances, while still accurately providing information to satisfy requests from users (see at least Mahabaleshwarkar: ¶ [abstract].).
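The candidate-scoring mechanism cited earlier from Mahabaleshwarkar ¶ [0055-0057], a response metric that rewards matching values from the structured data, combined with variation metrics such as length, can be sketched under simple assumptions. The metric definitions below are illustrative stand-ins, not Mahabaleshwarkar's actual training functions.

```python
def response_metric(candidate: str, required_values: list) -> float:
    """Fraction of required values appearing in the candidate output,
    a simple stand-in for the accuracy-of-response metric."""
    if not required_values:
        return 1.0
    hits = sum(v.lower() in candidate.lower() for v in required_values)
    return hits / len(required_values)

def rank_candidates(candidates, required_values):
    """Rank candidates by response metric, preferring shorter outputs
    among ties (length serving as a crude variation metric)."""
    return sorted(
        candidates,
        key=lambda c: (-response_metric(c, required_values), len(c)),
    )

candidates = [
    "Tomorrow in Mountain View it will be sunny and warm.",
    "It will be sunny.",
]
best = rank_candidates(candidates, ["Mountain View", "sunny", "warm"])[0]
# best is the candidate that mentions all three required values
```

Scoring accuracy and syntactic variation separately is what lets a training system keep outputs that phrase the same values differently, rather than collapsing to a single template.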
Further, the claimed invention is merely a combination of old elements in a similar field for evaluation of artificial intelligence models usable for a conversational artificial intelligence system, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that, given the existing technical ability to combine the elements as evidenced by Mahabaleshwarkar, the results of the combination were predictable.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DERICK HOLZMACHER whose telephone number is (571) 270-7853. The examiner can normally be reached on Monday-Friday 9:00 AM – 6:30 PM EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, Applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Brian Epstein, can be reached at 571-270-5389. The fax phone number for the organization where this application or proceeding is assigned is 571-270-8853.

Information regarding the status of an application may be obtained from Patent Center. Status information for unpublished applications is available through Patent Center for authorized users only. Should you have questions about access to Patent Center, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/DERICK J HOLZMACHER/
Patent Examiner, Art Unit 3625

/BRIAN M EPSTEIN/
Supervisory Patent Examiner, Art Unit 3625

Prosecution Timeline

Jun 28, 2024
Application Filed
Aug 23, 2025
Non-Final Rejection — §103
Oct 01, 2025
Interview Requested
Oct 09, 2025
Applicant Interview (Telephonic)
Oct 09, 2025
Examiner Interview Summary
Dec 01, 2025
Response Filed
Mar 05, 2026
Final Rejection — §103
Apr 02, 2026
Interview Requested

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586015
RESOURCE-RELATED FORECASTING USING MACHINE LEARNING TECHNIQUES
2y 5m to grant Granted Mar 24, 2026
Patent 12561708
SYSTEMS AND METHODS FOR PREDICTING CHURN IN A MULTI-TENANT SYSTEM
2y 5m to grant Granted Feb 24, 2026
Patent 12499404
SYSTEM AND METHOD FOR QUALITY PLANNING DATA EVALUATION USING TARGET KPIS
2y 5m to grant Granted Dec 16, 2025
Patent 12493838
Translation Decision Assistant
2y 5m to grant Granted Dec 09, 2025
Patent 12450541
SYSTEMS AND METHODS FOR PROVIDING TIERED SUBSCRIPTION DATA STORAGE IN A MULTI-TENANT SYSTEM
2y 5m to grant Granted Oct 21, 2025
Based on the examiner's 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
44%
Grant Probability
73%
With Interview (+28.4%)
3y 3m
Median Time to Grant
Moderate
PTA Risk
Based on 270 resolved cases by this examiner. Grant probability derived from career allow rate.
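The note above says the with-interview figure is derived from the examiner's career allow rate plus the observed interview lift. One plausible reading, assuming the lift is simply additive, reproduces the displayed numbers:

```python
base_rate = 0.44        # examiner career allow rate (44%)
interview_lift = 0.284  # interview lift in resolved cases (+28.4%)

# Additive model: with-interview probability = base rate + lift
with_interview = base_rate + interview_lift
# 0.724, i.e. roughly the 73% shown; the display presumably rounds
# less-rounded underlying rates
```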
