Prosecution Insights
Last updated: April 19, 2026
Application No. 18/412,679

INTERACTIVE TEXTUAL SYSTEM USING VISUAL GESTURE RECOGNITION

Final Rejection §103
Filed: Jan 15, 2024
Examiner: TZENG, FENG-TZER (Primary Examiner; SPE: SHAH, PARAS D)
Art Unit: 2653
Tech Center: 2600 — Communications
Assignee: NEC Corporation of America
OA Round: 2 (Final)

Grant Probability: 74% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 9m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 74% — above average (474 granted / 645 resolved; +11.5% vs TC avg)
Interview Lift: +31.1% — strong, measured across resolved cases with interview
Avg Prosecution: 3y 9m (typical timeline)
Career History: 669 total applications across all art units; 24 currently pending
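
These career metrics follow directly from per-case outcome data. A minimal sketch in Python, with hypothetical field names (the tool's actual schema is not shown):

```python
# Hypothetical sketch of the career metrics above; field names are assumed.
from dataclasses import dataclass

@dataclass
class ResolvedCase:
    granted: bool        # application issued as a patent
    had_interview: bool  # an examiner interview was held

def allow_rate(cases: list[ResolvedCase]) -> float:
    """Career allow rate: granted / resolved (here 474 / 645 ≈ 73.5%, shown as 74%)."""
    return sum(c.granted for c in cases) / len(cases)

def interview_lift(cases: list[ResolvedCase]) -> float:
    """Allow-rate gap between cases with and without an interview (+31.1% here)."""
    with_iv = [c for c in cases if c.had_interview]
    without_iv = [c for c in cases if not c.had_interview]
    return allow_rate(with_iv) - allow_rate(without_iv)
```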

Statute-Specific Performance

§101: 20.3% (-19.7% vs TC avg)
§103: 44.9% (+4.9% vs TC avg)
§102: 13.8% (-26.2% vs TC avg)
§112: 10.5% (-29.5% vs TC avg)

Tech Center average estimate used as the comparison baseline • Based on career data from 645 resolved cases
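
One detail worth noting: the Tech Center average implied by every displayed delta is exactly 40.0% (e.g., 20.3% - (-19.7%) = 40.0%), so the comparison reduces to a single baseline. A small sketch reproducing the displayed figures:

```python
# The TC average implied by each row above works out to 40.0%;
# this reproduces the displayed per-statute deltas.
TC_AVG = 0.400
examiner_rate = {"§101": 0.203, "§103": 0.449, "§102": 0.138, "§112": 0.105}

for statute, rate in examiner_rate.items():
    print(f"{statute}: {rate:.1%} ({rate - TC_AVG:+.1%} vs TC avg)")
# §101: 20.3% (-19.7% vs TC avg) ... §112: 10.5% (-29.5% vs TC avg)
```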

Office Action

§103
DETAILED ACTION

1. This communication is in response to the Amendments and Arguments filed on 11/6/2025. Claims 1-22 are pending and have been examined. Claim 22 is added.

Response to Amendments and Arguments

2. With respect to claim rejections under 35 USC 101 (abstract idea), the applicant's amendments and arguments are carefully considered, and the 35 USC 101 rejections are withdrawn. With respect to the 35 USC 103 rejections, the applicant's amendments and arguments are carefully considered; however, the examiner respectfully disagrees. In particular, the applicant argues that:

(1) The references do not teach "executing computer vision analysis function to infer a textual indication ..". Note that computer vision analysis results/output can be displayed in any of various formats (text such as "happy" or "sad", light on/off, high/low pitch sound, emoji happy/sad face, and so on), all of which represent the same semantic visual understanding for the purpose of generating an appropriate response or prompt. See [SPEC]: "The textual indication may be fed to the interaction model as text, encoded, for example as a number ranging from 1 to 10 indicating an emotional state for example from sad to content, as an embedding, and/or the like."

(2) The references do not teach "generating at least one prompt by processing the textual content and the textual indication ..". Note that BEN-KIKI teaches: [Abstract], identifies one or more topics from the received input data, ascertains a tone of the received input data, generates a mirroring prompt <reads on generation based on any input data, including 'the textual content and the textual indication'>; [0041], "The response .. may comprise a prompt, an answer to a question .."

(3) The references do not teach "acquiring, by a virtual human interaction agent, a textual content from a user ..". Note that BEN-KIKI teaches: [0106], the computing system to utilize AI in .. providing more in context response goes far beyond .. a typical current-day therapy session <reads on 'a virtual human interaction agent'>; [0046], capture user information .. recorded speech, typed texts, facial expression, etc.; [Abstract], identifies one or more topics <reads on 'understanding textual content'> from the received input data.

(4) The references do not teach "capturing a visual input pertaining to the user by an image sensor .." for cognitive/comprehension assessment purposes. Note that BEN-KIKI teaches: [0041], the camera/video monitor may be used to obtain visual data .. of the user <reads on capture for any subsequent application>. The applicant is requested to clearly recite the intended application, with the supporting steps, as a claimed limitation.

(5) The references do not teach "using at least one processing circuitry for executing at least one computer vision analysis function to infer a textual indication from the visual input, wherein the visual input corresponding to a non-verbal cue ..". Note that BEN-KIKI teaches: [0041], visual data (e.g., still or moving image for capturing facial expression or other bodily gestures) of the user <reads on 'non-verbal cue'> .. Once the processor analyzes the input data, the computing system provides an appropriate response to the user <reads on 'a level of the user's understanding, inferred from the visual input'>. Also see (1) above.

(6) The references do not teach "generating at least one prompt by processing the textual content and the textual indication using the at least one processing circuitry executing an interaction model ..". Note that BEN-KIKI teaches: [Abstract], The processor .. generates a mirroring prompt based on the ascertained tone of the received input data <reads on any input data, including 'the textual content and the textual indication'> and output to the user the generated mirroring prompt; [0041], "The response .. may comprise a prompt, an answer to a question, ..", which reads on user-facing prompts as the applicant argues.

(7) The references do not teach "displaying the at least one prompt to the user in at least one format selected from a group consisting of visually, vocally, and by movement of mechanical components ..". Note that BEN-KIKI teaches: [Abstract], output to the user the generated mirroring prompt; [0041], "The response .. may comprise a prompt .. The method of responding to the user may include, for example, synthetic speech, a visual avatar, typed or printed words, etc." The references do not have to teach "movement of mechanical components," because the claim requires only one format from the recited group.

The applicant further argues that, even if the cited references teach all recited limitations, the target application is different. However, the present invention generates a prompt/response based on the understanding of input data, which covers a very broad field (e.g., question-answering). Note that BEN-KIKI teaches: [Abstract] "providing more in context response <where context response reads on 'evaluating understanding'> .. identifies one or more topics <also reads on 'evaluating understanding'> from the received input data, ascertains a tone of the received input data <also reads on 'evaluating understanding'>, generates a mirroring prompt based on the ascertained tone .." and [0041] "visual data .. of the user .. Once the processor analyzes the input data, the computing system provides an appropriate response <also reads on 'evaluating understanding'> to the user <reads on 'a level of the understanding, inferred from the visual input'>".

As for any further arguments, including those repeating limitations (1)-(7) above, the applicant is referred to the examiner's detailed responses provided above, as well as the further elaborated responses in the next section. With no new technical limitations recited in the amended claims, no further prior art search was conducted.

Claim Rejections - 35 USC § 103

3. Claims 1-3, 5-13, and 15-22 are rejected under 35 U.S.C. 103 as being unpatentable over Ben-Kiki, et al. (US 20220164541; hereinafter BEN-KIKI) in view of Mishra, et al. (US 20190172462; hereinafter MISHRA).
As per claim 1, BEN-KIKI (Title: Systems and methods for dynamic user interaction for improving mental health) discloses "A method for operating an automated conversation system for evaluating understanding in a textual content, comprising: acquiring, by a virtual human interaction agent of the automated conversation system, a textual content from a user; capturing a visual input pertaining to the user by an image sensor (BEN-KIKI, [Abstract], an interactive session with a user; [0106], the computing system to utilize AI in demonstrating empathy and providing more in context response <where context response reads on 'understanding'> goes far beyond merely automating what may occur in a typical current-day therapy session <reads on 'an automated conversation system' and 'a virtual human interaction agent'>; [Abstract], The processor <reads on 'processing circuitry and storage/memory'> .. identifies one or more topics <also reads on 'understanding'> from the received input data, ascertains a tone of the received input data <also reads on 'understanding'>, generates a mirroring prompt based on the ascertained tone; [0046], capture user information .. recorded speech, typed texts, facial expression, etc.; [0041], the camera/video monitor <reads on 'image sensor'> may be used to obtain visual data .. of the user); using at least one processing circuitry for executing at least one [computer vision analysis function to infer a textual indication] representing a level of the user's understanding, inferred from the visual input, wherein the visual input corresponding to a non-verbal cue (BEN-KIKI, [0041], visual data (e.g., still or moving image for capturing facial expression or other bodily gestures) of the user <reads on 'non-verbal cue'> .. Once the processor analyzes the input data, the computing system provides an appropriate response <also reads on 'understanding'> to the user <reads on 'a level of the understanding, inferred from the visual input'>); generating at least one prompt, adapted to the evaluated level of understanding, by processing the textual content and the textual indication using the at least one processing circuitry executing an interaction model; and displaying the at least one prompt to the user in at least one format selected from a group consisting of visually, vocally, and by movement of mechanical components (BEN-KIKI, [Abstract], The processor .. generates a mirroring prompt based on the ascertained tone of the received input data <reads on 'adapted to the evaluated level of understanding' of any provided data>, and output to the user the generated mirroring prompt; [0041], The response .. may comprise a prompt, an answer to a question, a follow up question, a suggestion, an advice, a general statement, etc. The method of responding to the user may include, for example, synthetic speech, a visual avatar, typed or printed words, etc.)."

BEN-KIKI does not explicitly disclose "computer vision analysis function to infer a textual indication ..". However, this limitation is taught by MISHRA (Title: Audio analysis learning with video data). In the same field of endeavor, MISHRA teaches: [Abstract] "Video data is obtained .. wherein the video data includes images of one or more people. Audio data is obtained .. which corresponds to the video data. A face within the video data is identified. A first voice, from the audio data, is associated with the face within the video data. The face within the video data is analyzed for cognitive content ..". Note that computer vision analysis results/output can be displayed in any of various formats (text such as "happy" or "sad", light on/off, high/low pitch sound, emoji happy/sad face, and so on), all of which represent the same semantic visual understanding for the purpose of generating a response or prompt. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of MISHRA in the system (as taught by BEN-KIKI) to perform computer vision analysis on visual input, such as a user's facial expression, to augment the user's textual input for enhanced man-machine interaction with an appropriate response or prompt.

As per claim 2 (dependent on claim 1), BEN-KIKI in view of MISHRA further discloses "wherein the visual input comprising a user's face (BEN-KIKI, [0041], the camera/video monitor may be used to obtain visual data (e.g., still or moving image for capturing facial expression or other bodily gestures) of the user)."

As per claim 3 (dependent on claim 2), BEN-KIKI in view of MISHRA further discloses "wherein the at least one computer vision method comprising estimating at least one face muscle positions in the user's face (BEN-KIKI, [0046], capture user information .. recorded speech, typed texts, facial expression <where 'expression' reads on 'face muscle positions', which is subject to BRI>)."

As per claim 5 (dependent on claim 1), BEN-KIKI in view of MISHRA further discloses "wherein the at least one prompt comprising an element expected to cause an expected range of facial gestures (BEN-KIKI, [0046], capture user information .. facial expression; [Abstract], The processor outputs the mirroring prompt to the user during the interactive session to cause an increase in a level of engagement of the user <reads on 'cause an expected range of facial gestures', which is subject to BRI> with the interactive session)."

As per claim 6 (dependent on claim 5), BEN-KIKI in view of MISHRA further discloses "acquiring an additional visual input pertaining to the user; using the at least one processing circuitry for executing the at least one computer vision analysis function to infer an additional textual indication from the additional visual input; and generating at least one additional prompt by processing the additional textual indication using the at least one processing circuitry executing an interaction model (see the claim 1 rejections for exactly the same processing steps)."

As per claim 7 (dependent on claim 6), BEN-KIKI in view of MISHRA further discloses "wherein the at least one additional prompt is a hint aimed at clarifying the at least one prompt (BEN-KIKI, [0002], a computing system .. for dynamically interacting with a user .. in order to cause .. a higher level of engagement by the human with the computing system; [Abstract], The processor .. generates a mirroring prompt based on the ascertained tone of the received input data <where any additional prompts during user-system interaction can be considered 'a hint aimed at clarifying the at least one (previous) prompt'>)."

As per claim 8 (dependent on claim 1), BEN-KIKI in view of MISHRA further discloses "wherein the interaction model comprising a conversational language model (BEN-KIKI, [Abstract], A computing system for interacting with a user; [0063], The mirroring prompt demonstrates that the computing system "feels" what the user is feeling and, naturally, responds in a similar tone, mirrors the content of the conversation; [0072], running topic modeling or a similar language modeling technique to identify themes ..)."

As per claim 9 (dependent on claim 1), BEN-KIKI in view of MISHRA further discloses "wherein the textual content is received from the user as a voice input, and further comprising converting the voice to text using a text extraction module (BEN-KIKI, [0087], this conversation may not only be visual but also auditory and the user may not only type his or her response via a keyboard but also spoken speech may be captured via speech recognition techniques <reads on 'voice to text'>)."

As per claim 10 (dependent on claim 9), BEN-KIKI in view of MISHRA further discloses "synchronizing of the textual indication and the textual content, corresponding to respective timing of the visual input and the voice input (MISHRA, [0018], By capturing audio data as well as the video data, the audio data can be synchronized with the video data. The synchronization or association of the audio data and the video data augments the analysis of the cognitive content. A video of the face of a person while yawning, while occluded by a hand of the person covering their mouth, can be augmented by non-speech sounds such as inhalation, sighs, and so on; [0055], The synchronizing can include synchrony between a face and the voice in expressing cognitive states, mental states, emotional states, moods, and so on)."

Claims 11-13 and 15-20 (similar in scope to claims 1-3 and 5-10, respectively) are rejected under the same rationale as detailed above for claims 1-3 and 5-10.

Claim 21 (similar in scope to claim 1) is rejected under the same rationale as detailed above for claim 1. Regarding the [Specification]: "A computer program product embodiment … A computer readable storage medium, as that term is used in the present disclosure, is not to be construed as storage in the form of transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide, light pulses passing through a fiber optic cable, electrical signals communicated through a wire, and/or other transmission media." However, to be clear, the examiner advises that claim 21 be amended to recite "computer program products comprising instructions stored on non-transitory computer readable storage medium ..".

As per claim 22 (dependent on claim 1), BEN-KIKI in view of MISHRA further discloses "wherein inferring said textual indication from the visual input comprises using artificial intelligence (AI) model (BEN-KIKI, [0041], the processor <see the claim 1 functions and operations> is enhanced by artificial intelligence; MISHRA, [Abstract], Video data is obtained .. wherein the video data includes images of one or more people .. The face within the video data is analyzed for cognitive content <reads on computer vision analysis using an AI model>)."
4. Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over BEN-KIKI in view of MISHRA, and further in view of Yin, et al. (US 9311527; hereinafter YIN).

As per claim 4 (dependent on claim 2), BEN-KIKI in view of MISHRA further discloses "wherein the at least one computer vision method comprising [estimating a gaze direction] of the user." BEN-KIKI in view of MISHRA does not explicitly disclose "estimating a gaze direction." However, this limitation is taught by YIN (Title: Real time eye tracking for human computer interaction). In the same field of endeavor, YIN teaches: [Abstract] "A gaze direction determining system." Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of YIN in the system (as taught by BEN-KIKI and MISHRA) to determine/estimate the user's gaze direction, augmenting the user's textual input for enhanced man-machine interaction.

Claim 14 (similar in scope to claim 4) is rejected under the same rationale as detailed above for claim 4.

Conclusion

5. THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FENG-TZER TZENG, whose telephone number is 571-272-4609. The examiner can normally be reached M-F, 8:30-5:00. The fax number for this application or proceeding is 571-273-4609. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Paras Shah (SPE), can be reached at 571-270-1650.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/FENG-TZER TZENG/
2/14/2026
Primary Examiner, Art Unit 2653
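
For orientation, the claim at the center of the dispute describes a concrete pipeline: acquire the user's text, infer a "textual indication" from visual input via computer vision (the specification's example encodes emotional state as a number from 1 to 10), and feed both into an interaction model that generates a prompt adapted to the user's level of understanding. A minimal illustrative sketch, with every name and interface assumed rather than taken from the application:

```python
# Illustrative sketch only; all names and interfaces are assumed, not from the
# application. Shows the claimed flow:
# text + CV-inferred textual indication -> interaction model -> prompt.

def infer_textual_indication(frame: bytes) -> str:
    """Computer vision analysis of a non-verbal cue (e.g., facial expression).
    Per the spec's example, the indication could be encoded as a number from
    1 (sad) to 10 (content), as text, or as an embedding; plain text here."""
    emotion_level = 3  # stand-in for a real facial-expression model's output
    return f"emotional state {emotion_level}/10; user appears confused"

def generate_prompt(textual_content: str, textual_indication: str) -> str:
    """Interaction model: adapts the prompt to the evaluated level of
    understanding conveyed by the textual indication."""
    if "confused" in textual_indication:
        return f"Let me restate this more simply: {textual_content}"
    return f"Good, let's go deeper into: {textual_content}"

# One turn of the loop: capture -> infer -> generate -> display.
frame = b""  # stand-in for an image-sensor capture
indication = infer_textual_indication(frame)
print(generate_prompt("how the interaction model works", indication))
# The claim allows visual display, vocal output, or mechanical movement.
```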

Prosecution Timeline

Jan 15, 2024 — Application Filed
Aug 06, 2025 — Non-Final Rejection (§103)
Nov 06, 2025 — Response Filed
Feb 15, 2026 — Final Rejection (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586591
SOUND SIGNAL DECODING METHOD, SOUND SIGNAL DECODER, PROGRAM, AND RECORDING MEDIUM
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12579367
TWO-TOWER NEURAL NETWORK FOR CONTENT-AUDIENCE RELATIONSHIP PREDICTION
Granted Mar 17, 2026 (2y 5m to grant)
Patent 12579360
LEARNING SUPPORT APPARATUS FOR CREATING MULTIPLE-CHOICE QUIZ
Granted Mar 17, 2026 (2y 5m to grant)
Patent 12562173
WEARABLE DEVICE CONTROL BASED ON VOICE COMMAND OF VERIFIED USER
Granted Feb 24, 2026 (2y 5m to grant)
Patent 12559026
VEHICLE AND CONTROL METHOD THEREOF
Granted Feb 24, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 74%
With Interview: 99% (+31.1%)
Median Time to Grant: 3y 9m
PTA Risk: Moderate
Based on 645 resolved cases by this examiner. Grant probability derived from career allow rate.
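
How the headline numbers compose is not stated by the tool; one plausible reading, assuming the interview lift is added to the base grant probability and the result is capped, reproduces the displayed values:

```python
# Assumed composition (the tool's actual model is not stated): interview lift
# added to the base grant probability, capped at 99%.
base = 0.74                  # displayed career allow rate (474/645 ≈ 73.5%)
lift = 0.311                 # displayed interview lift
with_interview = min(base + lift, 0.99)   # 1.051 capped to 0.99
print(f"{with_interview:.0%}")            # -> 99%, matching the projection
```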

Free tier: 3 strategy analyses per month