Last updated: May 29, 2026
Application No. 18/517,290
VIDEO VERIFICATION SYSTEMS WITH GENERATIVE ARTIFICIAL INTELLIGENCE

Non-Final OA §103
Filed: Nov 22, 2023
Examiner: WINDER, PATRICE L
Art Unit: 2453
Tech Center: 2400 (Computer Networks)
Assignee: Paypal Inc.
OA Round: 3 (Non-Final)
Interview: Optional

Interview lift: +11.2%. An interview has already been conducted in this application's prosecution history. This examiner has an 87% grant rate with a +11.2% interview lift. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 637 resolved cases, 2023–2026
Examiner Intelligence

WINDER, PATRICE L
Career Allowance Rate: 87% (554 granted / 637 resolved), above average, +29.0% vs TC avg
Interview Lift: +11.2% (moderate), measured on resolved cases with an interview
Avg Prosecution: 3y 4m (typical timeline); 16 currently pending
Total Applications: 661 across all art units (career history)
Statute-Specific Performance

§101: 2.7% (-37.3% vs TC avg)
§103: 77.5% (+37.5% vs TC avg)
§102: 9.7% (-30.3% vs TC avg)
§112: 4.3% (-35.7% vs TC avg)
Tech Center averages are estimates. Based on career data from 637 resolved cases.
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Applicant's request for reconsideration of the finality of the rejection of the last Office action is persuasive and, therefore, the finality of that action is withdrawn.


Election/Restrictions
Applicant's request for reconsideration of the restriction of the rejection of the last Office action is persuasive and, therefore, the finality of that action is withdrawn.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. (See claims 1-15).
Applicant argues – “Agnihotram then teaches that ‘the context of the contextual variable 213 derived from Yi may be determined as the context corresponding to the emotion variable having highest context score.’ Id., [0054]. Accordingly, while context score between range 0 and 1 may be mapped to happy, sad, and angry emotions, the ranges are not actually mapped to ‘a range corresponding to the difficulty level,’ as recited in claim 16.”
There is only one range claimed by applicant, “range corresponding to the difficulty level,” which only implies that there is more than one level or that different levels would not overlap. As described by Agnihotram, the highest context score is chosen as the context score for determining the score for the question. And p. 50, the emotion variable having a highest context score may be identified as the emotion variable corresponding to “the contextual variable 213.” The claim limitation “where the score is mapped to a range corresponding to the difficulty level” is met by Agnihotram because the context score is used first to determine which emotion and then the range is used to determine the difficulty level. Said another way, applicant’s claim language does not recite a hierarchy or plurality of ranges corresponding to a plurality of difficulty levels, as would be needed for the interpretation argued.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1 and 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over DeRosa-Grund, US 20210089640 A1 (hereafter referred to as DeRosa-Grund) in view of Timem et al., US 20160125884 A1 (hereafter referred to as Timem). 

Claim 1, DeRosa-Grund teaches a video verification system comprising: 
a memory (p. 5, “memory”); and one or more hardware processors coupled to a non-transitory memory and configured to read instructions from the non-transitory memory (p. 5, “a verification device including a processor and a non-transitory computer-readable memory storing an identity verification module that when executed by the processor causes the verification device to receive an identification request from the person”) to cause the video verification system (p. 68, “FIG. 4 illustrates a block diagram of an exemplary computing device, server or other device 400 configured to implement the system 100 according to some embodiments.”) to perform operations comprising: 
generating a dynamic set of questions based on account data associated with a user (p. 21, “The identification verification device using the biometric data to verify the identity of the person by dynamically generating and/or selecting challenge questions and comparing answers to those questions with the stored biometric data.” And p. 39, “the registration module is able to randomly generate the target word set independent of the words of the master list. This target word set is then associated with the unique identifier and identity as a part of the PPI.” See also p. 35, “private personal information (PPI)”), wherein questions in the dynamic set of questions correspond to different difficulty levels (p. 29, “This target word set is able to be updated by the data management module of application 112 by adding words each time the associated biometric data is captured (e.g. during the verification process) and/or selectively removing words.” And p. 32, “The challenge question repository 104 is able to store, update and/or maintain a master list of verification words as well as challenge data comprising a plurality of pairs of challenge questions and answers that are each associated with a set of one or more target words (a target word set) from the master list.” Difficulty based on changes to the number of words in the list and those words must be in the answer.); 
providing a question in the dynamic set of questions to the video verification system (p. 26, “The local device 106 and the network device 108 are each able to comprise a processor, a local memory and an identity verification application 112 stored on the local memory for execution by the processor.” And p. 29, application 112 adds and removes the target words used to generate the challenge questions. See also p. 33, “the challenge question repository 104 is able to be omitted and/or replaced with an artificial intelligence challenge question generator module that is a part of the application 112 and that dynamically generates challenge questions and associated answers based on one or more target words selected by the application 112.”); 
receiving, at the video verification system an answer to the question (p. 43, “The challenge question requires a verbal response from the person in order to verify their identity. In some embodiments, the question is presented verbally using a microphone of the device 106, 108, textually using a display of the device 106, 108 or both.”), and one or more of a telemetry data, an audio data, and a video data (p. 44, “Specifically, the verification module enables the person to submit a live audio and/or video capture of their voice and/or face speaking their response to the challenge question.”). 
DeRosa-Grund does not specifically teach converting the one or more of the answer to the question, the telemetry data, the audio data, and the video data into one or more vectors with parameters; assessing the one or more vectors using one or more policies comprising one or more rules to determine a difficulty level of a subsequent question. However, in the same field of endeavor, Timem teaches converting the one or more of the answer to the question, the telemetry data, the audio data (p. 39, “… a voice sample may be obtained over the phone (e.g., during a conversation between a customer and a customer service representative, in response to a voice prompt provided by an interactive voice response (IVR) system, and/or the like). In other examples, a voice sample may be obtained over the internet (e.g., via a web interface) and/or from a software application (e.g., via a mobile application being executed on a customer's mobile device).” And p. 60, “Such analysis of the voice sample may include employing various analysis techniques, such as frequency estimation, hidden Markov models, pattern matching, and/or other techniques.”), and the video data into one or more vectors with parameters (applied hidden Markov models, p. 60); 
assessing the one or more vectors using one or more policies comprising one or more rules to determine a difficulty level of a subsequent question (p. 60, “selecting one or more authentication questions based on the voice biometric confidence score (e.g., in step 315) may include selecting one or more certain types of questions based on the voice biometric confidence score. For example, depending on the voice biometric confidence score (determined, e.g., by the computing device in step 310), the computing device may select questions with different levels of specificity and/or questions requiring different levels of knowledge. For instance, a voice biometric confidence score at or above a first threshold may correspond to a first set of type(s) of questions, a voice biometric confidence score at or above a second threshold less than the first threshold may correspond to a second set of type(s) of questions, and a voice biometric confidence score below the second threshold may correspond to a third set of type(s) of questions. If, for example, the voice biometric confidence score is relatively high, the one or more types of questions that are selected may be relatively easy to answer, such as the customer's birthdate, the customer's mother's maiden name, and/or the customer's billing address. If the voice biometric confidence score is moderately high, the one or more types of questions that are selected may be moderately easy to answer, such as the state in which the customer's account(s) were opened, the retail location or banking center that the customer has most recently visited, and/or the expiration date and/or verification value of the customer's credit card or debit card. If the voice biometric confidence score is relatively low, the one or more types of questions that are selected may be more intensive.”). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify DeRosa-Grund to incorporate converting vectors and assessing vectors from Timem into the analysis of answers to questions of DeRosa-Grund, enabling more nuanced determination of emotion attributes and facilitating more reliable data for determination of emotional state (Timem) that improves the ability to provide customized questions. 
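
A minimal sketch of the two-threshold question selection quoted from Timem, p. 60, may help in reading the mapping to the claimed "difficulty level of a subsequent question." The function name, the tier labels, and the numeric threshold values are hypothetical; Timem is quoted only as describing a first threshold and a second, lower threshold.

```python
def select_question_tier(confidence: float,
                         first_threshold: float = 0.8,
                         second_threshold: float = 0.5) -> str:
    """Map a voice biometric confidence score to a question tier.

    The threshold defaults are illustrative assumptions, not values
    taken from Timem or from the Office action.
    """
    if second_threshold >= first_threshold:
        raise ValueError("second_threshold must be less than first_threshold")
    if confidence >= first_threshold:
        # Relatively high confidence: relatively easy questions
        # (e.g. birthdate, billing address).
        return "relatively_easy"
    if confidence >= second_threshold:
        # Moderately high confidence: moderately easy questions
        # (e.g. state in which the account was opened).
        return "moderately_easy"
    # Relatively low confidence: more intensive questions.
    return "more_intensive"


if __name__ == "__main__":
    for score in (0.9, 0.6, 0.2):
        print(score, select_question_tier(score))
```

Under this reading, the confidence score plays the role of the assessed input and the tier plays the role of the difficulty level; whether that reading satisfies the claim language is the point in dispute, not something the sketch resolves.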

Claim 8, DeRosa-Grund + Timem teaches the video verification system of claim 1, wherein the data associated with the user corresponds to account data of the user collected during a predefined time period (DeRosa-Grund, p. 51, “the data management module is able to assign the unusable marking with an expiration value (e.g. time) wherein after the expiration value is reached, the unusable marking is removed and the word (and the biometric data) is able to be used again by the verification module for verification for that group.”).


Claim(s) 2-5 is/are rejected under 35 U.S.C. 103 as being unpatentable over DeRosa-Grund and Timem as applied to claim 1 above, and further in view of Rule et al., US 20200211532 A1 (hereafter referred to as Rule).

Claim 2, DeRosa-Grund + Timem teaches the video verification system of claim 1, wherein the operations further comprise: 
generating, using a visual artificial intelligence bot of the video verification system, a segment in a dialogue (DeRosa-Grund, p.  59, “the verification module selects or generates (using a variety, or combination, of protocols including but not limited to AI and/or Machine Learning) the challenge question and answer pair from the challenge question repository 104 based on the selected words.”), wherein the segment in the dialogue incorporates the question (DeRosa-Grund, p. 60, “the verification module is able to select the word “blue” from the target word set  … and determine the challenge question “Please repeat the following sentence while filling in the blank ‘The American flag is ______, white and blue’” and correct answer “The American flag is red, white and blue.””); 
communicating, using the visual artificial intelligence bot, the segment in the dialogue to the user, during a video connection (DeRosa-Grund, p. 62, “The question is able to be presented verbally using a microphone of the device 106, 108, textually using a display of the device 106, 108 or both.” “presenting the challenge question to the person”). DeRosa-Grund + Timem does not specifically teach the video connection is a video call. However in the same field of endeavor, Rule teaches a video call with an artificial intelligence bot (p. 42, “data handling module 110 may receive data from user devices 103A-103C via server system 135.” “The received data may be in the form of streaming data e.g., audio data of a call from a user, video data of a video call from a user, text message data from a user and/or the like.” And p. 103, “FIG. 12 shows a block diagram of a user interacting with interaction system 105 using various channels of communication such as video call 1201A, chat 1201B and audio call 1201C.” See also p. 104, “the system may be configured to respond to the user participating in the video conference via a live agent or a VA having attributes that are preferred for that user…”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify DeRosa-Grund + Timem to substitute video call from Rule for the multi-modal communication to expand the features of the automated response system by integrating the audio and video interaction streams. The motivation would have been to substitute a combined communication stream to provide seamless interaction and thereby improve the user experience by coordinating the plurality of communication streams into a multi-modal communication experience.

Claim 3, DeRosa-Grund + Timem + Rule teaches the video verification system of claim 2, wherein the operations further comprise: 
collecting one or more of the telemetry data, the audio data, and the video data (DeRosa-Grund, p. 61, “The question is able to be presented verbally using a microphone of the device 106, 108, textually using a display of the device 106, 108 or both.”) during a predefined time interval after communicating the segment in the dialogue incorporating the question to the user (DeRosa-Grund, p. 61, “the verification module starts a question timer (and optionally displays the timer to the person on the device 106, 108), wherein if the person does not submit an answer to the challenge question before the time expires the verification module indicates that the verification has failed and/or selects/issues a new challenge question.”).
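
The question timer quoted from DeRosa-Grund, p. 61, can be sketched as a deadline loop, assuming a polling callback that returns the answer once it is available. The names `await_answer`, `get_answer`, `timeout_s`, and `poll_s` are hypothetical and do not come from the reference.

```python
import time
from typing import Callable, Optional


def await_answer(get_answer: Callable[[], Optional[str]],
                 timeout_s: float,
                 poll_s: float = 0.01) -> Optional[str]:
    """Return the answer if one arrives before the timer expires, else None.

    A None result corresponds to the quoted behavior in which the
    verification fails and/or a new challenge question is issued.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        answer = get_answer()
        if answer is not None:
            return answer
        time.sleep(poll_s)  # wait briefly before polling again
    return None


if __name__ == "__main__":
    print(await_answer(lambda: "The American flag is red, white and blue", 0.05))
    print(await_answer(lambda: None, 0.03))
```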

Claim 4, DeRosa-Grund + Timem + Rule teaches the video verification system of claim 2, wherein the operations further comprise: collecting one or more of the telemetry data, the audio data, and the video data during a predefined time interval before and after the communicating the segment in the dialogue incorporating the question to the user (DeRosa-Grund, p. 62, “Like the live facial image and target word capture of the registration process, this audio and/or video must be captured contemporaneously with the verification process, not previously recorded.” “for both the registration process and the verification process, during the recording of biometric target words and/or challenge question responses, the application 112 (via the modules) requires that the person (or their face) stay within view of the camera during the entirety of the audio/video recording.” Must stay in view before and during the challenge.).

Claim 5, DeRosa-Grund + Timem + Rule teaches the video verification system of claim 1, wherein the operations further comprise: 
determining the dynamic set of questions does not include the subsequent question corresponding to the difficulty level (DeRosa-Grund, p. 60, “the verification module is able to select the word “blue” from the target word set  … and determine the challenge question “Please repeat the following sentence while filling in the blank ‘The American flag is ______, white and blue’” and correct answer “The American flag is red, white and blue.”” The verification module dynamically adds words and deletes words changing the possible answers.); and generating the subsequent question with the determined difficulty level during a video session with an artificial intelligence bot (DeRosa-Grund, p.  59, “the verification module selects or generates (using a variety, or combination, of protocols including but not limited to AI and/or Machine Learning) the challenge question and answer pair from the challenge question repository 104 based on the selected words.”). 


Claim(s) 6-7 is/are rejected under 35 U.S.C. 103 as being unpatentable over DeRosa-Grund and Timem as applied to claim 1 above, further in view of Agnihotram et al., US 20200210528 A1 (hereafter referred to as Agnihotram) and Rule.

Claim 6, DeRosa-Grund + Timem teaches the video verification system of claim 1, wherein the operations further comprise:
determining that the answer to the question is a correct answer (DeRosa-Grund, p. 45, “As described above, these selected words are able to be found in the challenge question itself, in the correct answer to the challenge question, or both.”). DeRosa-Grund + Timem does not specifically teach determining a sentiment of the user using the one or more of the audio data, the video data, and the telemetry data during a video connection with an artificial intelligence bot; and determining the difficulty level of the subsequent question based on the correct answer and the sentiment. However, in the same field of endeavor, Agnihotram teaches determining a sentiment of the user using the one or more of the audio data, the video data, and the telemetry data during a video connection with an artificial intelligence bot (p. 20, “The predetermined application may have a provision to capture all the user 101 logs such as voice utterances, textual inputs, visual information and sensory information.” “The visual information may be captured using an image and/or video capturing device associated with the computing device of the user 101. Similarly, the sensory information may be obtained using one or more sensors placed on the user 101.” And p. 25, “The user 101 may be a customer, an interviewee or any person carrying out an interaction with the response generation system 105.”); and 
determining the difficulty level of the subsequent question based on the correct answer and the sentiment (p. 21, “suppose the goal-oriented approach indicates a ‘Negative’ outcome. This may mean that the candidate is confident, but the candidate is providing incorrect answers. Accordantly, the response generation system may change the context of the user interaction and may ask easier questions to the candidate.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify DeRosa-Grund + Timem to substitute changing level of subsequent question from Agnihotram for the subsequent question from DeRosa-Grund + Timem to improve the accuracy of response system interactions by including context from the client interaction and thereby improve the conversation. 
DeRosa-Grund + Timem + Agnihotram does not specifically teach the connection is a video call (DeRosa-Grund, p. 44, “Like the live facial image and target word capture of the registration process, this audio and/or video must be captured contemporaneously with the verification process, not previously recorded.” And p. 26, using “a cellular/mobile telephone” or “smartphone”). However in the same field of endeavor, Rule teaches a video call with an artificial intelligence bot (p. 42, “data handling module 110 may receive data from user devices 103A-103C via server system 135.” “The received data may be in the form of streaming data e.g., audio data of a call from a user, video data of a video call from a user, text message data from a user and/or the like.” And p. 103, “FIG. 12 shows a block diagram of a user interacting with interaction system 105 using various channels of communication such as video call 1201A, chat 1201B and audio call 1201C.” See also p. 104, “the system may be configured to respond to the user participating in the video conference via a live agent or a VA having attributes that are preferred for that user…”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify DeRosa-Grund + Timem + Agnihotram to substitute video call from Rule for the multi-modal communication to expand the features of the automated response system by integrating the audio and video interaction streams. The motivation would have been to substitute a combined communication stream to provide seamless interaction and thereby improve the user experience by coordinating the plurality of communication streams into a multi-modal communication experience.

Claim 7, DeRosa-Grund + Timem + Agnihotram + Rule teaches the video verification system of claim 6, wherein the sentiment of the user corresponds to a score (Agnihotram, p. 28, “the plurality of interaction parameters may include, without limiting to, emotion of a user 101, gestures and facial expressions of the user 101 and values of physiological factors associated with the user 101.”), and to determine the difficulty level of the subsequent question (Agnihotram, p. 20, “the context of the user interaction is determined to be ‘Happy’, …” And p. 21, “the response generation system … may ask easier questions to the candidate.” See also p. 29, “determining the context of the user interaction includes assigning a context score to the contextual variable. The context score may be assigned based on comparison of the contextual variable with each of a plurality of predetermined emotion variables stored in the response generation system 105.”), the operations further comprise: 
mapping the score to the difficulty level of the subsequent question (Agnihotram, p. 20, “the context of the user interaction is determined to be ‘Happy’, …” And p. 21, “the response generation system … may ask easier questions to the candidate.” The questions get less difficult based on a combination of sentiment context and negative “goal-oriented” predictions. The opposite holds for positive “goal-oriented” predictions.).
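
The mapping relied on for claims 6-7 can be sketched in two steps, assuming the reading given above: the context is the emotion variable with the highest context score, and a ‘Negative’ goal-oriented outcome lowers the difficulty while a ‘Positive’ one raises it. The integer difficulty scale and the function names are hypothetical; Agnihotram is quoted only as asking "easier questions."

```python
from typing import Mapping


def determine_context(context_scores: Mapping[str, float]) -> str:
    """Pick the emotion variable having the highest context score."""
    if not context_scores:
        raise ValueError("at least one emotion variable is required")
    return max(context_scores, key=lambda emotion: context_scores[emotion])


def next_difficulty(current: int, goal_outcome: str,
                    lowest: int = 1, highest: int = 5) -> int:
    """Adjust a hypothetical integer difficulty level.

    'Negative' outcome -> easier question; 'Positive' -> harder question.
    The 1..5 scale is an assumption made for illustration.
    """
    if goal_outcome == "Negative":
        return max(lowest, current - 1)
    if goal_outcome == "Positive":
        return min(highest, current + 1)
    return current


if __name__ == "__main__":
    scores = {"happy": 0.7, "sad": 0.2, "angry": 0.1}
    print(determine_context(scores), next_difficulty(3, "Negative"))
```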

Claim(s) 9-13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Agnihotram in view of Datta, US 20200202369 A1 (hereafter referred to as Datta) further in view of DeRosa-Grund.

Claim 9, Agnihotram teaches a method comprising:
collecting a segment of a dialogue between a user and an artificial intelligence bot and telemetry data, wherein the segment of the dialogue comprises one or more of a video data and an audio data (p. 5, “the present disclosure discloses collecting various data related to the user interaction from different modalities such as voice, video, sensory and textual data output modalities and fusing the collected data in an efficient manner using user-specific or group-specific learning models.”); 
determining a sentiment associated with the segment of the dialogue using the one or more of the video data, the audio data, and the telemetry data (p. 28, “The one or more processing models 107 may be configured with predetermined techniques for processing the input data received from each of the corresponding plurality of input channels 103 and extracting plurality of interaction parameters from the input data. As an example, the plurality of interaction parameters may include, without limiting to, emotion of a user 101, gestures and facial expressions of the user 101 and values of physiological factors associated with the user 101.”); 
determining a score associated with the sentiment, wherein the score corresponds to a difficulty level of a question (p. 29, “The context score may be assigned based on comparison of the contextual variable with each of a plurality of predetermined emotion variables stored in the response generation system 105. As an example, suppose the predetermined emotion variables stored in the response generation system 105 are, without limitation, ‘happy’, ‘sad’ and ‘angry’.”); 
determining the question using the difficulty level and data of the user (p. 51-54, Selecting difficulty level of “easier”. And p. 58, “the outcome of the goal-oriented prediction model Zi helps in generating more meaningful responses to the user interaction, thereby making the user interaction more interactive, contextually meaningful and comparable to natural human-like conversation.” The generated more meaningful responses to user interaction includes questions added to the continued interaction.). Agnihotram does not specifically teach converting the one or more of the video data, the audio data, and the telemetry data into one or more vectors with parameters; determining a sentiment associated with the segment of the dialogue by assessing the one or more vectors using one or more policies comprising one or more rules. However, in the same field of endeavor, Datta teaches converting the one or more of the video data, the audio data, and the telemetry data into one or more vectors with parameters (p. 60, “to determine a respondent in act 208 from the sent image from act 206, the digital survey system 104 can also apply face recognition algorithms to the image. By applying face recognition algorithms, the digital survey system 104, measures certain features of the face and head of the person (i.e., a survey respondent 120a) captured within the image.” And p. 61, “…[T]he digital survey system 104 applies well-known techniques or more recently developed facial recognition algorithms. For example, the digital survey system 104 optionally applies Elastic Bunch Graph Matching, a hidden Markov model…”);
determining a sentiment associated with the segment of the dialogue by assessing the one or more vectors using one or more policies comprising one or more rules (p. 62, “Specifically, a machine learning model can include a neural network that can be, or is trained, to analyze images to accurately determine either an identity of a person (i.e., a respondent identifier) and/or emotion attributes of a person.” And p. 69, “Moreover, in some embodiments, the unique scores can be weighted. Based on the average, sum, or product of the weighted scores, the digital survey system 104 determines that the survey respondent 120a in the sent image from act 206 expresses a positive emotion, a neutral emotion, or a negative emotion in act 210 (or any type of emotion attribute).” Policy or rules using weighted average, sum, or product for sentiment.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Agnihotram to incorporate conversion to vectors and assessing vectors by policy from Datta to further provide engagement and therefore provide more context for determining the emotion of the user and thereby provide better assessment for providing questions.
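
The "policy or rules" reading of Datta, p. 69, can be sketched as a weighted average followed by cutoffs. This shows only the weighted-average variant (Datta is also quoted as allowing a sum or product), and the cutoff values and the function name are hypothetical.

```python
from typing import Sequence


def classify_emotion(scores: Sequence[float],
                     weights: Sequence[float],
                     positive_cutoff: float = 0.6,
                     negative_cutoff: float = 0.4) -> str:
    """Classify sentiment from weighted per-feature scores.

    The cutoffs are illustrative assumptions; the reference is quoted
    only as deriving positive, neutral, or negative emotion from the
    average, sum, or product of weighted scores.
    """
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_avg = sum(s * w for s, w in zip(scores, weights)) / total_weight
    if weighted_avg >= positive_cutoff:
        return "positive"
    if weighted_avg <= negative_cutoff:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify_emotion([0.9, 0.8], [1.0, 1.0]))
    print(classify_emotion([0.1, 0.3], [2.0, 1.0]))
```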
Agnihotram + Datta does not specifically teach providing the question to the artificial intelligence bot, wherein the artificial intelligence bot incorporates the question into a second segment of the dialogue that the artificial intelligence bot communicates to the user. However, in the same field of endeavor, DeRosa-Grund teaches providing the question to the artificial intelligence bot, wherein the artificial intelligence bot incorporates the question into a second segment of the dialogue that the artificial intelligence bot communicates to the user  (p. 49, “For example, if in a successful verification process the correct response to a challenge question included 10 words, but only 2 of the words were the selected target words, the remaining 8 words are able to be added to the set of target words for that group (assuming they are not already words in the target word set that were just not selected for that verification process)…” And p. 50, “Using the same example as above, the 2 selected words of the 10 words in the response are able to be removed from the target word set so that they are not used in a subsequent verification process for that group.” Removing words making the next challenge question more difficult and adding more words making the next challenge question less difficult.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a selected further question from DeRosa-Grund  for the easier/harder question from Agnihotram + Datta as an equivalent substitution. Agnihotram + Datta teaches continuing the dialogue with easier/harder questions and DeRosa-Grund teaches a dynamic question delivery system for adding questions to a dialogue using an automated bot which would enable further updates to the available questions. The motivation would have been to be adaptive to the interpretation of the modeled inputs.
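
The target-word example quoted from DeRosa-Grund, pp. 49-50 (a 10-word correct response in which 2 words were the selected target words), can be worked through with set operations. The function names and the sample words are hypothetical; only the counts come from the quoted passages.

```python
from typing import Iterable, Set


def add_unselected_words(target_words: Iterable[str],
                         response_words: Iterable[str],
                         selected_words: Iterable[str]) -> Set[str]:
    """p. 49: add response words that were not the selected target words."""
    return set(target_words) | (set(response_words) - set(selected_words))


def remove_selected_words(target_words: Iterable[str],
                          selected_words: Iterable[str]) -> Set[str]:
    """p. 50: remove the selected words so they are not reused."""
    return set(target_words) - set(selected_words)


if __name__ == "__main__":
    # Ten distinct response words, two of which were the selected targets.
    response = ["the", "american", "flag", "is", "red",
                "white", "and", "blue", "today", "friends"]
    selected = {"red", "blue"}
    target = {"red", "blue", "green"}

    grown = add_unselected_words(target, response, selected)
    shrunk = remove_selected_words(target, selected)
    print(len(grown))   # 3 existing words plus the 8 remaining response words
    print(shrunk)
```

In this sketch the set grows by the 8 remaining words under the p. 49 variant and shrinks by the 2 selected words under the p. 50 variant, which is the behavior the rejection characterizes as making the next challenge question less or more difficult.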

Claim 10, Agnihotram + Datta + DeRosa-Grund teaches the method of claim 9, further comprising: 
conducting the dialogue between the user and the artificial intelligence bot over a visual interface and as part of a verification process that verifies the user to a computing system that stores confidential information associated with the user (DeRosa-Grund,  p. 44, “the verification module is able to access (or direct the person to access) the microphone and/or camera on the device 106, 108 (or another device) and provide command prompts with a graphical user interface of the application 112 presenting the challenge question and instructing the person how to capture and submit audio and/or video of their voice and/or face speaking their response to the challenge question. Like the live facial image and target word capture of the registration process, this audio and/or video must be captured contemporaneously with the verification process, not previously recorded.”).

Claim 11, Agnihotram + Datta + DeRosa-Grund teaches the method of claim 9, wherein the telemetry data comprises data associated with a computing device that receives a user communication of the segment of the dialogue, the telemetry data including a location of the computing device (Agnihotram, p. 25, “In an embodiment, the response generation system 105 may be implemented in a location of the user 101. In another embodiment, the response generation system 105 may be implemented and operated from a remote location.” And p. 26, “The sensory data may include various physiological parameters such as heart rate and blood pressure of the user 101, sensed by one or more sensors configured and/or placed with the user 101.” By placing with the user the sensor readings are location data.); and 
wherein determining the sentiment is based on the user communication and the location of the computing device (Agnihotram, p. 27, “each type of the input data may be received using one of the plurality of input channels 103 corresponding to the type of the input data.” And p. 29, “once the interaction parameters are extracted from the input data, the response generation system 105 may combine each of the plurality of interaction parameters for computing a contextual variable corresponding to the user interaction.” See also p. 30, “the response generation system 105 may generate one or more responses corresponding to the context of the user interaction for providing the contextual responses 109 to the user interaction.”).

Claim 12, Agnihotram + Datta + DeRosa-Grund teaches the method of claim 9, wherein determining the question is based on a user account information associated with the user participating in the dialogue with the artificial intelligence bot (DeRosa-Grund, p. 35, “The registration module provides a graphical user interface and executable instructions that enable a person to create an account and establish a baseline identity group/entry on the system 100…” And p. 41, “The verification module provides a graphical user interface that enables a person to submit an identity verification request indicating one of the identities stored in the biometric data repository 102…”).

Claim 13, Agnihotram + Datta + DeRosa-Grund teaches the method of claim 9, further comprising: 
determining a second sentiment of the user associated with the third segment of the dialogue based on one or more of the second audio data, the second video data and the second telemetry data (Agnihotram, p. 5, “the present disclosure discloses collecting various data related to the user interaction from different modalities such as voice, video, sensory and textual data output modalities and fusing the collected data in an efficient manner using user-specific or group-specific learning models.” And p. 21, “in an interview scenario, the goal-oriented approach may be used for assessing the answers provided by a candidate. Suppose, during the interview, facial expressions of the candidate are indicating that the candidate is confident about the interview.”); 
determining to increase or decrease the difficulty level using the answer and the second sentiment (Agnihotram, p. 21, “the goal-oriented approach indicates a ‘Negative’ outcome. This may mean that the candidate is confident, but the candidate is providing incorrect answers. Accordantly, the response generation system may change the context of the user interaction and may ask easier questions to the candidate.”); and 
generating a second question based on the increased or decreased difficulty level (Agnihotram, p. 21, “the goal-oriented approach indicates a ‘Negative’ outcome. This may mean that the candidate is confident, but the candidate is providing incorrect answers. Accordantly, the response generation system may change the context of the user interaction and may ask easier questions to the candidate.”).
Agnihotram teaches p. 21, “interview scenario” providing multiple questions during conducting the interview. Agnihotram + Datta does not specifically teach a third segment of the dialogue between the user and the artificial intelligence bot and a second telemetry data. However, in the same field of endeavor, DeRosa-Grund teaches receiving a third segment of the dialogue between the user and the artificial intelligence bot and a second telemetry data (p. 44, “Like the live facial image and target word capture of the registration process, this audio and/or video must be captured contemporaneously with the verification process, not previously recorded.” “the verification module enables the person to submit a live audio and/or video capture of their voice and/or face speaking their response to the challenge question.”), the third segment of the dialogue comprising one or more of a second video data and a second audio data (p. 44, “the verification module enables the person to submit a live audio and/or video capture of their voice and/or face speaking their response to the challenge question.”); 
determining an answer to the question from the audio data (DeRosa-Grund, p. 45, “After receiving a response to the challenge question, the verification module determines if the response includes the answer that is associated with the selected challenge question. Concurrently, the verification module identifies/parses the selected words from audio and/or video data of the response”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate operations to select further questions from DeRosa-Grund for the interview scenario dialogue from Agnihotram + Datta as an equivalent substitution. Agnihotram teaches continuing the dialogue with easier/harder questions, and DeRosa-Grund + Datta teaches a dynamic question delivery system for adding questions to a dialogue using an automated bot.

Claim(s) 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Agnihotram, Datta, and DeRosa-Grund as applied to claim 9 above, and further in view of Rule.

Claim 14, Agnihotram + Datta + DeRosa-Grund teaches the method of claim 9, wherein the video data is collected during a video session between the user and the artificial intelligence bot (DeRosa-Grund, p. 44, “Like the live facial image and target word capture of the registration process, this audio and/or video must be captured contemporaneously with the verification process, not previously recorded.” And p. 26, using “a cellular/mobile telephone” or “smartphone”). Agnihotram + Datta + DeRosa-Grund does not specifically teach the video connection is a video call. However, in the same field of endeavor, Rule teaches a video call with an artificial intelligence bot (p. 42, “data handling module 110 may receive data from user devices 103A-103C via server system 135.” “The received data may be in the form of streaming data e.g., audio data of a call from a user, video data of a video call from a user, text message data from a user and/or the like.” And p. 103, “FIG. 12 shows a block diagram of a user interacting with interaction system 105 using various channels of communication such as video call 1201A, chat 1201B and audio call 1201C.” See also p. 104, “the system may be configured to respond to the user participating in the video conference via a live agent or a VA having attributes that are preferred for that user…”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Agnihotram + Datta + DeRosa-Grund to substitute the video call from Rule for the multi-modal communication to expand the features of the automated response system by integrating the audio and video interaction streams. The motivation would have been to substitute a combined communication stream to provide seamless interaction and thereby improve the user experience by coordinating the plurality of communication streams into a multi-modal communication experience.

Claim 15, Agnihotram + Datta + DeRosa-Grund + Rule teaches the method of claim 14, further comprising: 
communicating the question to the user in the dialogue during the video call (Rule, p. 103, “FIG. 12 shows a block diagram of a user interacting with interaction system 105 using various channels of communication such as video call 1201A, chat 1201B and audio call 1201C.” See also p. 104, “the system may be configured to respond to the user participating in the video conference via a live agent or a VA having attributes that are preferred for that user…”).

Claim(s) 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over Agnihotram, DeRosa-Grund, and Rule.

Claim 16, Agnihotram teaches a non-transitory computer readable medium having instructions stored thereon, that when executed by a processor cause the processor to perform operations, the operations comprising: 
collecting a communication between a user and an artificial intelligence bot (“the response generation system 105”), wherein the communication comprises one or more of a video data, an audio data, and a telemetry data (p. 5, “the present disclosure discloses collecting various data related to the user interaction from different modalities such as voice, video, sensory and textual data output modalities and fusing the collected data in an efficient manner using user-specific or group-specific learning models.”); 
determining a sentiment associated with the communication using the one or more of the video data, the audio data, and the telemetry data (p. 28, “The one or more processing models 107 may be configured with predetermined techniques for processing the input data received from each of the corresponding plurality of input channels 103 and extracting plurality of interaction parameters from the input data. As an example, the plurality of interaction parameters may include, without limiting to, emotion of a user 101, gestures and facial expressions of the user 101 and values of physiological factors associated with the user 101.”); 
determining a score associated with the sentiment (p. 29, “The context score may be assigned based on comparison of the contextual variable with each of a plurality of predetermined emotion variables stored in the response generation system 105. As an example, suppose the predetermined emotion variables stored in the response generation system 105 are, without limitation, ‘happy’, ‘sad’ and ‘angry’.”); 
determining a plurality of questions (p. 30, “subsequent to determining the context of the user interaction, the response generation system 105 may generate one or more responses corresponding to the context of the user interaction for providing the contextual responses 109 to the user interaction.”), selecting a question from the plurality of questions using the score (p. 51-54, “the context score corresponding to emotion variables—Happy, Sad and Angry may be computed” with each generated from a range less than 1. The highest context score is chosen as the context score. And p. 50, “the emotion variable having a highest context score may be identified as the emotion variable corresponding to the contextual variable 213.”), the plurality of questions based on a user data associated with the user (p. 58, “the outcome of the goal-oriented prediction model Zi helps in generating more meaningful responses to the user interaction, thereby making the user interaction more interactive, contextually meaningful and comparable to natural human-like conversation.” The more meaningful responses generated for the user interaction include questions.), and 
each question associated with a difficulty level corresponding to a range associated with the score (p. 51-54, Each sentiment context score has a range of values less than 1, and each is capable of mapping to the next question. The ranges can overlap because selection is based on additional criteria. See p. 50, “the emotion variable having a highest context score may be identified as the emotion variable corresponding to the contextual variable 213.”); 
Agnihotram teaches in p. 30, “subsequent to determining the context of the user interaction, the response generation system 105 may generate one or more responses corresponding to the context of the user interaction for providing the contextual responses 109 to the user interaction. In an embodiment, the one or more responses may be provided to the user 101 in various forms including, but not limited to, an audio response and textual and/or visual response on a display interface associated with the response generation system 105 and the user 101.” Agnihotram does not specifically teach providing the question to the artificial intelligence bot, wherein the artificial intelligence bot incorporates the question into a dialogue between the artificial intelligence bot and the user. However, in the same field of endeavor, DeRosa-Grund teaches providing the question to the artificial intelligence bot, wherein the artificial intelligence bot incorporates the question into a dialogue between the artificial intelligence bot and the user (p. 49, “For example, if in a successful verification process the correct response to a challenge question included 10 words, but only 2 of the words were the selected target words, the remaining 8 words are able to be added to the set of target words for that group (assuming they are not already words in the target word set that were just not selected for that verification process)…” And p. 50, “Using the same example as above, the 2 selected words of the 10 words in the response are able to be removed from the target word set so that they are not used in a subsequent verification process for that group.” Removing words makes the next challenge question more difficult, and adding more words makes the next challenge question less difficult.). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate operations to select further questions from DeRosa-Grund for the interview scenario dialogue from Agnihotram as an equivalent substitution. Agnihotram teaches continuing the dialogue with easier/harder questions, and DeRosa-Grund teaches a dynamic question delivery system for adding questions to a dialogue using an automated bot.
Agnihotram + DeRosa-Grund does not specifically teach the video session is a video call. However, in the same field of endeavor, Rule teaches a video call with an artificial intelligence bot (p. 42, “data handling module 110 may receive data from user devices 103A-103C via server system 135.” “The received data may be in the form of streaming data e.g., audio data of a call from a user, video data of a video call from a user, text message data from a user and/or the like.” And p. 103, “FIG. 12 shows a block diagram of a user interacting with interaction system 105 using various channels of communication such as video call 1201A, chat 1201B and audio call 1201C.” See also p. 104, “the system may be configured to respond to the user participating in the video conference via a live agent or a VA having attributes that are preferred for that user…”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Agnihotram + DeRosa-Grund to substitute the video call from Rule for the multi-modal communication to expand the features of the automated response system by integrating the audio and video interaction streams.

Claim 17, Agnihotram + DeRosa-Grund + Rule teaches the non-transitory computer readable medium of claim 16, further comprising: 
receiving an audio data over a real-time video call between the artificial intelligence bot and the user, wherein the audio data includes an answer and other audio data (Agnihotram, p. 20, “The predetermined application may have a provision to capture all the user 101 logs such as voice utterances, textual inputs, visual information and sensory information.” “The visual information may be captured using an image and/or video capturing device associated with the computing device of the user 101. Similarly, the sensory information may be obtained using one or more sensors placed on the user 101.” And p. 25, “The user 101 may be a customer, an interviewee or any person carrying out an interaction with the response generation system 105.”); 
determining a second sentiment using the answer and the other audio data associated with the real-time video call (Agnihotram, p. 18, “The conversation may become meaningful, interactive and match the natural human-like conversation only when the responses provided by the automated systems are in same context of an ongoing conversation.”  And p. 21, “blending of all available multi-modal input data for detecting context of the user interaction and generating accurate and precise responses to the user interaction.”); 
selecting a second question from the plurality of questions, wherein a difficulty level of the second question corresponds to the second sentiment (Agnihotram, p. 21, “suppose the goal-oriented approach indicates a ‘Negative’ outcome. This may mean that the candidate is confident, but the candidate is providing incorrect answers. Accordantly, the response generation system may change the context of the user interaction and may ask easier questions to the candidate.” Whether a conversation or an interview the next question is based on “model” used and the emotional indications.); and 
communicating the second question in the dialogue over the real-time video call (Agnihotram, p. 21, “ask easier questions to the candidate.”). 

Claim 18, Agnihotram + DeRosa-Grund teaches the non-transitory computer readable medium of claim 16, further comprising: 
receiving an answer to the question over multimodal input stream, wherein the video stream includes a second video data (Agnihotram, next stage of the interaction, p. 25, “The user 101 may be a customer, an interviewee or any person carrying out an interaction with the response generation system 105. The plurality of input channels 103 are the means that capture and/or collect input data related to user transaction from the user 101 and transmit it to the response generation system 105.”); 
determining a second sentiment using the second video data and the answer (Agnihotram, p. 21, “Suppose, during the interview, facial expressions of the candidate are indicating that the candidate is confident about the interview. However, suppose the goal-oriented approach indicates a ‘Negative’ outcome. This may mean that the candidate is confident, but the candidate is providing incorrect answers.”);
 selecting a second question from the plurality of questions, wherein a difficulty level of the second question corresponds to the second sentiment (Agnihotram, p. 21, “Accordantly, the response generation system may change the context of the user interaction and may ask easier questions to the candidate.”); and
communicating the second question in the dialogue over the video connection (DeRosa-Grund). Agnihotram + DeRosa-Grund does not specifically teach a real-time video call. However, in the same field of endeavor, Rule teaches a real-time video call (p. 42, “data handling module 110 may receive data from user devices 103A-103C via server system 135.” “The received data may be in the form of streaming data e.g., audio data of a call from a user, video data of a video call from a user, text message data from a user and/or the like.” And p. 103, “FIG. 12 shows a block diagram of a user interacting with interaction system 105 using various channels of communication such as video call 1201A, chat 1201B and audio call 1201C.” See also p. 104, “the system may be configured to respond to the user participating in the video conference via a live agent or a VA having attributes that are preferred for that user…”). For motivation for combination see claim 16 above. 

Claim 19, Agnihotram + DeRosa-Grund + Rule teaches the non-transitory computer readable medium of claim 16, further comprising:
receiving a second communication over a real-time video call (Agnihotram, next stage of the interaction, p. 25, “The user 101 may be a customer, an interviewee or any person carrying out an interaction with the response generation system 105. The plurality of input channels 103 are the means that capture and/or collect input data related to user transaction from the user 101 and transmit it to the response generation system 105.” Next interaction from user.); 
determining a second sentiment using the second communication (Agnihotram, p. 21, “Suppose, during the interview, facial expressions of the candidate are indicating that the candidate is confident about the interview. However, suppose the goal-oriented approach indicates a ‘Negative’ outcome. This may mean that the candidate is confident, but the candidate is providing incorrect answers.”); 
selecting a second question from the plurality of questions, wherein a difficulty level of the second question corresponds to the second sentiment and is higher than the difficulty level of the question (Agnihotram, p. 21, “Accordantly, the response generation system may change the context of the user interaction and may ask easier questions to the candidate.”); and 
communicating the second question in the dialogue over the video connection. Agnihotram + DeRosa-Grund does not specifically teach a real-time video call. However, in the same field of endeavor, Rule teaches a real-time video call (p. 42, “data handling module 110 may receive data from user devices 103A-103C via server system 135.” “The received data may be in the form of streaming data e.g., audio data of a call from a user, video data of a video call from a user, text message data from a user and/or the like.” And p. 103, “FIG. 12 shows a block diagram of a user interacting with interaction system 105 using various channels of communication such as video call 1201A, chat 1201B and audio call 1201C.” See also p. 104, “the system may be configured to respond to the user participating in the video conference via a live agent or a VA having attributes that are preferred for that user…”). For motivation for combination see claim 16 above. 

Claim(s) 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Agnihotram + DeRosa-Grund + Rule as applied to claim 16 above, and further in view of Anderson et al., US 20180047025 A1 (hereafter referred to as Anderson).

Claim 20, Agnihotram + DeRosa-Grund + Rule teaches the non-transitory computer readable medium of claim 16, as cited above. Agnihotram + DeRosa-Grund + Rule does not specifically teach wherein the plurality of questions are based on content of at least one textual document associated with a user account. However, in the same field of endeavor, Anderson teaches wherein the plurality of questions are based on content of at least one textual document associated with a user account (p. 49, “The cognitive system 100 and network 102 enables question/answer (QA) generation functionality for one or more cognitive system users via their respective computing devices 110-112.” And p. 49, “the content creator creates content in a document of the corpus of data 106 for use as part of a corpus of data with the cognitive system 100. The document includes any file, text, article, or source of data for use in the cognitive system 100.” “The cognitive system 100 parses and generates questions via a QA pipeline 108, and analyzes answers to the questions.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Agnihotram + DeRosa-Grund + Rule to substitute the document-based questions from Anderson for the questions of the combination to improve the type of questions asked and thereby customize the questions according to attributes of the user and improve the verification process (Anderson, p. 14-15).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
McNully et al., US 20240361827 A1, teaches speaker recognition: Speaker recognition is a technology that identifies the unique characteristics of a person's voice and uses them to authenticate or verify the identity of the speaker. Unlike voice cloning, speaker recognition does not aim to create a new voice but rather to distinguish one person's voice from another.
Jain et al., US 20190163807 A1, teaches in some embodiments, the question difficulty can be based on the question class. For example, a question classified as “computer hardware” can have a relatively higher difficulty value than a question classified as “behavioral.” In some embodiments, the question difficulty can be determined based on feedback. In these embodiments, the question difficulty can be determined by an individual (e.g., or group of individuals), and the feedback can be provided to the question difficulty analyzer 145.
Farallel et al., US 20120239398 A1, teaches a method for determining validity of an identity asserted by a speaker using a voice print associated with a user whose identity the speaker is asserting, the voice print obtained from characteristic features of at least one first voice signal obtained from the user uttering at least one enrollment utterance including at least one enrollment word, the method comprising acts of obtaining a second voice signal of the speaker uttering at least one challenge utterance, wherein the at least one challenge utterance includes at least one word that was not in the at least one enrollment utterance, obtaining at least one characteristic feature from the second voice signal, comparing the at least one characteristic feature with at least a portion of the voice print to determine a similarity between the at least one characteristic feature and the at least a portion of the voice print, and determining whether the speaker is the user based, at least in part, on the similarity between the at least one characteristic feature and the at least a portion of the voice print.
Raman et al., US 11062014 B1, teaches Step 604 includes generating a second prompt via the at least one user device interface in connection with an authentication request by the user to access a protected resource associated with the user device, wherein the second prompt is based at least in part on at least one of the one or more dynamic challenge questions. In one or more embodiments of the invention, the second prompt can be further based at least in part on a second set of one or more information-gathering questions, wherein user responses to the second set of one or more information-gathering questions are unknown. Additionally, at least one embodiment of the invention can also include storing one or more user responses to the second set of one or more information-gathering questions in a database.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PATRICE L WINDER whose telephone number is (571)272-3935. The examiner can normally be reached M-F 10am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KAMAL B DIVECHA can be reached at (571)272-5863. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Patrice L Winder/
Primary Examiner, Art Unit 2453
Prosecution Timeline

Nov 22, 2023: Application Filed
Jul 30, 2025: Non-Final Rejection mailed, §103
Oct 14, 2025: Applicant Interview (Telephonic)
Oct 17, 2025: Examiner Interview Summary
Oct 30, 2025: Response Filed
Feb 11, 2026: Final Rejection mailed, §103
Apr 01, 2026: Non-Final Rejection mailed, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

17/915,458, Patent 12634311: NETWORK ANOMALY DETECTION (3y 7m to grant; granted May 19, 2026)
17/892,439, Patent 12627674: AUTOMATICALLY MANAGING ACCESS POLICIES FOR ARCHIVED OBJECTS (3y 8m to grant; granted May 12, 2026)
18/194,263, Patent 12621321: AUTOMATIC GENERATION OF CAUSE AND EFFECT ATTACK PREDICTIONS MODELS VIA THREAT INTELLIGENCE DATA (3y 1m to grant; granted May 05, 2026)
18/234,060, Patent 12621360: NETWORK PROXY FOR ENERGY EFFICIENT VIDEO STREAMING ON MOBILE DEVICES (2y 8m to grant; granted May 05, 2026)
18/576,103, Patent 12619433: SECRET CALCULATION SYSTEM, APPARATUS, METHOD AND PROGRAM (2y 4m to grant; granted May 05, 2026)
Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 87%
With Interview (+11.2%): 98%
Median Time to Grant: 3y 4m (~10m remaining)
PTA Risk: High
Based on 637 resolved cases by this examiner. Grant probability derived from career allowance rate.