Prosecution Insights
Last updated: April 19, 2026
Application No. 18/311,753

SYSTEM FOR PROCESSING TEXT, IMAGE AND AUDIO SIGNALS USING ARTIFICIAL INTELLIGENCE AND METHOD THEREOF

Current Status: Non-Final OA (§103)
Filed: May 03, 2023
Examiner: VO, HUYEN X
Art Unit: 2656
Tech Center: 2600 — Communications
Assignee: Kai Conversations Limited
OA Round: 3 (Non-Final)
Grant Probability: 83% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 10m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 83%, above average (869 granted / 1043 resolved; +21.3% vs TC avg)
Interview Lift: +19.9% among resolved cases with interview (strong)
Typical Timeline: 2y 10m average prosecution; 17 applications currently pending
Career History: 1060 total applications across all art units
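
As a quick check on the figures above, the 83% career allow rate follows directly from the resolved-case counts; the interview lift and with-interview numbers are reported by the tool and cannot be re-derived from this page alone. A minimal sketch of the arithmetic in Python:

    # Career allow rate from the counts shown above.
    granted, resolved = 869, 1043
    allow_rate = granted / resolved            # 0.833... -> reported as 83%
    print(f"Career allow rate: {allow_rate:.1%}")

    # The +19.9% interview lift is as reported; the underlying split of
    # resolved cases with vs. without an interview is not shown on this page.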

Statute-Specific Performance

§101: 24.9% (-15.1% vs TC avg)
§103: 33.0% (-7.0% vs TC avg)
§102: 23.7% (-16.3% vs TC avg)
§112: 5.7% (-34.3% vs TC avg)
Compared against a Tech Center average estimate • Based on career data from 1043 resolved cases
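
Reading each "vs TC avg" delta as the examiner's rate minus the Tech Center estimate, every statute implies the same baseline of about 40%, consistent with a single average estimate being applied across all four. The sketch below is an inference from the displayed numbers, not a documented methodology:

    # Implied Tech Center baseline behind each "vs TC avg" delta.
    examiner_rate = {"101": 24.9, "103": 33.0, "102": 23.7, "112": 5.7}
    delta_vs_tc  = {"101": -15.1, "103": -7.0, "102": -16.3, "112": -34.3}

    for statute, rate in examiner_rate.items():
        implied_tc_avg = rate - delta_vs_tc[statute]   # 40.0 for every statute
        print(f"§{statute}: examiner {rate:.1f}%, implied TC average {implied_tc_avg:.1f}%")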

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/1/2025 has been entered.

Response to Arguments

Applicant's arguments filed 12/1/2025 have been fully considered but they are not persuasive. Applicant essentially argues that the combination of the prior art on record fails to explicitly disclose "analyzing conversations between multiple participants, identifying potential issues or misunderstandings, and providing real-time feedback to participants to adjust their conversational actions" (REMARKS section). Also, applicant appears to place more emphasis on "understanding" or "misunderstandings" among users in a conversation.

Examiner maintains that the classroom dynamic interactions between students or students and teachers are valid conversations (paragraphs 34-35, 43, and 68, student-student interaction and students-teacher interaction in a classroom is recorded, and that includes both audio and video), identifying potential issues or misunderstandings (paragraph 40, identifying one or more learning gaps; misunderstandings are not considered due to the "or" operator), and providing real-time feedback to participants to adjust their conversational actions (paragraphs 41-43, providing feedback; "The data output module 220 is configured to output the one or more attention activities, the one or more non-attention activities, the determined set of contextual parameters and the identified one or more learning gaps on user interface screen of one or more electronic devices 108 associated with one or more users in real-time"; the non-attention or non-learning activities includes "students are talking to each other"; also "to adjust their conversational actions" is considered an intended use of the feedback). For these reasons, examiner maintains the prior art on record.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-14 are rejected under 35 U.S.C. 103 as being unpatentable over Shin et al. (USPG 2023/0360437, hereinafter Shin) in view of Kanuganti et al. (USPG 2022/0319181, hereinafter Kanuganti).

Regarding claims 1, 8, and 14, Shin discloses a system, method, and software product for processing at least concurrent audio and image signals to generate corresponding analysis data including emotional measurements (figure 6, audio and video are captured concurrently), wherein the system includes: a computing arrangement comprising at least a hardware audio processing module for processing the audio signals and an image processing module for processing the image signals, wherein the image processing module is a camera (figure 6, processing video and audio signals separately; also see paragraph 159; figure 5, camera), wherein each module is configured to use one or more artificial intelligence algorithms for processing its respective signal (figure 6 and/or paragraph 159, video signal is processed by CNN while audio signal is processed by LSTM), wherein the image processing module is configured to process facial image information present in the image signal to identify a plurality of key facial image points indicative of facial expression and to generate temporal facial status data (paragraph 89, various facial features), wherein the audio processing module is configured to process speech present in the audio signals by parsing the speech to correlate against a database of words to generate corresponding text data (figure 6, converting MFCC to text; also see paragraph 159), and by processing the speech to determine temporal speech frequency information indicative of at least one of emphasis, hesitation, speech word rate, and to temporally relate the temporal speech frequency information with the text data (figure 6, MFCC is temporal frequency information), and wherein the computing arrangement further includes an analysis module using one or more artificial intelligence algorithms to process the temporal facial status data, the text data and the temporal speech frequency information using emotional models to generate an interpretation of the audio and image signals to generate the analysis data including the emotional measurements (figure 6, the identification unit 422 processes the video feature, audio feature, and text feature to determine emotion; see paragraphs 160-162).

Shin fails to explicitly disclose, however, Kanuganti teaches the audio and image signals are of a plurality of participants interacting in a conversation (paragraphs 43 and/or 68, "conversation" of multiple users), wherein the analysis data identifies patterns and trends during the conversation of each of the plurality of participants for determining potential issues or misunderstandings during the conversation between the plurality of participants, and wherein a participant is provided the analysis data in real-time to adjust conversational actions based on the determined potential issues or misunderstanding (paragraphs 34-43 and 68, see Response to Arguments section above).

Since Shin and Kanuganti are analogous art because they are from the same field of endeavor, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to use the known technique of providing real-time feedback of the user's emotion so that the user can make adjustments. One of ordinary skill in the art would have recognized that the results of the combination were predictable since the use of that known technique provides the rationale to arrive at a conclusion of obviousness. See KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385 (U.S. 2007).
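
For orientation, the rejection maps the independent claims onto a two-branch pipeline: Shin's per-modality processing (CNN on video frames, MFCC/LSTM and speech-to-text on audio) fused into an emotional measurement, with Kanuganti supplying real-time feedback to conversation participants. The sketch below is illustrative only; the class and function names are invented and do not come from the cited references or the application:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SegmentAnalysis:
        text: str                   # speech-to-text output for the segment
        speech_rate_wpm: float      # temporal speech frequency information
        facial_points: List[float]  # key facial image points (placeholder values)
        emotion: str                # fused emotional measurement
        confidence: float

    def analyze_segment(audio_chunk: bytes, video_frame: bytes) -> SegmentAnalysis:
        """Stand-in for the CNN/LSTM fusion step described in the mapping to Shin."""
        # A real system would extract MFCCs, run speech-to-text, and detect facial
        # landmarks here; fixed values keep the sketch self-contained and runnable.
        return SegmentAnalysis(text="so as I was saying", speech_rate_wpm=145.0,
                               facial_points=[0.1, 0.4, 0.7], emotion="confused",
                               confidence=0.62)

    def realtime_feedback(analysis: SegmentAnalysis) -> str:
        """Stand-in for the Kanuganti-style real-time feedback step."""
        if analysis.emotion == "confused" and analysis.confidence > 0.5:
            return "Possible misunderstanding detected: consider rephrasing."
        return "No issues detected."

    if __name__ == "__main__":
        print(realtime_feedback(analyze_segment(b"", b"")))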
Regarding claims 2 and 9, Shin further discloses wherein the image processing module (106) is configured to process the image signals including video data that is captured concurrently with the audio signal (figure 6, audio and video signals are captured concurrently).

Regarding claims 3 and 10, Shin further discloses wherein the system and method (100, 200) is configured to process a text signal in addition to the at least audio and image signals (figure 6, converting MFCC to text), wherein information included in the text signal is used in conjunction with information included in the audio and image signals to generate the corresponding analysis data including the emotional measurements (figure 6, the identification unit 422 processes the video feature, audio feature, and text feature to determine emotion; see paragraphs 160-162).

Regarding claims 4 and 11, Shin further discloses wherein the one or more artificial intelligence algorithms include at least one of: neural networks, deep neural networks, Boltzmann machines, Hidden Markov Models, for processing at least the audio and image signals (figure 6, CNN or LSTM).

Regarding claims 5 and 12, Shin further discloses wherein the analysis module (110) is configured to use at least the emotional measurements to determine decision points occurring in a video discussion giving rise to the audio and image signals (paragraphs 161-162, "the identification unit 422 assigns the impression identification label to a captured content for each scene (or segment). The assigned impression identification label is recorded as, for example, meta information of the content"; determining emotion for each scene).

Regarding claims 6 and 13, Shin further discloses wherein the analysis module (110) is configured to use at least the emotional measurements to determine decision points occurring in a video discussion giving rise to the audio and image signals (paragraphs 161-162, "the identification unit 422 assigns the impression identification label to a captured content for each scene (or segment). The assigned impression identification label is recorded as, for example, meta information of the content"; determining emotion for each scene), wherein the decision points are determined by the analysis module from at least one of temporally abrupt changes in the emotional measurements, temporally abrupt changes in speech content of the audio signal (paragraphs 160-162, changes in emotion are dependent upon changes in signal content, which can be changes in the audio or video signal).

Regarding claim 7, Shin further discloses a method (300) for training the system (100, 200) of any one of claims 1 to 6, wherein the method includes: (i) assembling a first corpus of training material relating training values of emotional measurements to samples of audio signals including speech information (training process in figure 4 and/or paragraphs 132-138); (ii) assembling a second corpus of training material relating training values of emotional measurements of samples of image signals including facial expression information (training process in figure 4 and/or paragraphs 132-138); and (iii) applying the first and second corpus of training material to the one or more artificial intelligence algorithms to configure their analysis characteristics for processing at least the audio and video signals (process in figure 6, processing audio and video signals to determine emotion at runtime).
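
Claim 7's two-corpus training flow can be pictured as pairing emotion labels with audio samples in one corpus and with facial-expression images in another, then fitting each modality's model. The datasets and training step below are hypothetical stand-ins, not material from the record:

    # Hypothetical corpora: (sample identifier, emotion label) pairs.
    audio_corpus = [("utterance_001.wav", "neutral"), ("utterance_002.wav", "frustrated")]
    image_corpus = [("face_001.png", "neutral"), ("face_002.png", "surprised")]

    def train(corpus, modality):
        """Placeholder training step; a real system would fit an LSTM or CNN here."""
        labels = sorted({label for _, label in corpus})
        print(f"Training {modality} model on {len(corpus)} samples; labels: {labels}")

    train(audio_corpus, "audio")
    train(image_corpus, "image")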
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Shin in view of Kanuganti, and further in view of Byun et al. (USPG 2024/0087597).

Regarding claim 15, Shin further discloses the system of claim 1, wherein the analysis module is further configured to detect emotional information derived from the audio signals and emotional information derived from the image signals for at least one of the plurality of participants (figure 1, voice emotion and image emotion), and wherein the analysis module uses the detected discrepancies in determining the potential issues or misunderstandings during the conversation between the plurality of participants (see claim 1 above; "discrepancies" is interpreted as the audio emotion not agreeing with the video emotion as taught by Byun below; and only the audio emotion or video emotion that passes a certain threshold is selected for determining "potential issues" as discussed in Kanuganti: paragraphs 34-43 and 68, see Response to Arguments section above).

The modified Shin fails to explicitly disclose, however, Byun further teaches detecting discrepancies between emotional information derived from the audio signals and emotional information derived from the image signals (paragraphs 105-106, determining discrepancies between emotion in the audio and emotion in the video, e.g. one exceeds a threshold and the other falls below a threshold value; use the one that is above the threshold for further processing).

Since the modified Shin and Byun are analogous art because they are from the same field of endeavor, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to use the known technique of comparing audio emotion with video emotion. One of ordinary skill in the art would have recognized that the results of the combination were predictable since the use of that known technique provides the rationale to arrive at a conclusion of obviousness. See KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385 (U.S. 2007).
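
The claim 15 mapping turns on a threshold comparison between audio-derived and video-derived emotion, per the examiner's reading of Byun (paragraphs 105-106): flag a discrepancy when one modality clears the threshold and the other does not, and carry the above-threshold estimate forward. A minimal sketch of that check, with invented scores and threshold:

    THRESHOLD = 0.6  # illustrative value; no threshold is specified in the record

    def check_discrepancy(audio_score: float, video_score: float):
        """Return (discrepancy flag, modality to use for further processing)."""
        audio_ok = audio_score >= THRESHOLD
        video_ok = video_score >= THRESHOLD
        if audio_ok != video_ok:
            return True, ("audio" if audio_ok else "video")
        return False, None

    discrepant, use = check_discrepancy(audio_score=0.78, video_score=0.35)
    print(f"Discrepancy: {discrepant}; modality used: {use}")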
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Yadav et al. (USPG 2021/0201004, hereinafter Yadav) teach a method for processing speech, text and video to determine emotion that is considered pertinent to the claimed invention.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUYEN X VO whose telephone number is (571)272-7631. The examiner can normally be reached M-F, 8-4. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.

For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HUYEN X VO/
Primary Examiner, Art Unit 2656

Prosecution Timeline

May 03, 2023: Application Filed
Apr 18, 2025: Non-Final Rejection — §103
Jul 23, 2025: Response Filed
Jul 30, 2025: Final Rejection — §103
Dec 01, 2025: Request for Continued Examination
Dec 16, 2025: Response after Non-Final Action
Jan 09, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603083
ESTIMATION DEVICE, ESTIMATION METHOD, AND RECORDING MEDIUM
2y 5m to grant • Granted Apr 14, 2026
Patent 12596873
OPTIMIZATION OF RETRIEVAL AUGMENTED GENERATION USING DATA-DRIVEN TEMPLATES
2y 5m to grant • Granted Apr 07, 2026
Patent 12586594
GUIDING AMBISONIC AUDIO COMPRESSION BY DECONVOLVING LONG WINDOW FREQUENCY ANALYSIS
2y 5m to grant • Granted Mar 24, 2026
Patent 12579990
ENCODING DEVICE, DECODING DEVICE, ENCODING METHOD, AND DECODING METHOD
2y 5m to grant • Granted Mar 17, 2026
Patent 12572755
SYSTEM AND METHOD FOR AUGMENTING TRAINING DATA FOR NATURAL LANGUAGE TO MEANING REPRESENTATION LANGUAGE SYSTEMS
2y 5m to grant • Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 83%
With Interview: 99% (+19.9%)
Median Time to Grant: 2y 10m
PTA Risk: High
Based on 1043 resolved cases by this examiner. Grant probability derived from career allow rate.
