Prosecution Insights
Last updated: April 19, 2026
Application No. 18/311,753

SYSTEM FOR PROCESSING TEXT, IMAGE AND AUDIO SIGNALS USING ARTIFICIAL INTELLIGENCE AND METHOD THEREOF

Current Status: Non-Final OA (§103)
Filed: May 03, 2023
Examiner: VO, HUYEN X
Art Unit: 2656
Tech Center: 2600 — Communications
Assignee: Kai Conversations Limited
OA Round: 3 (Non-Final)
Grant Probability: 83% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 10m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 83%, above average (869 granted / 1043 resolved; +21.3% vs TC avg)
Interview Lift: +19.9% among resolved cases with interview (strong)
Typical Timeline: 2y 10m average prosecution; 17 applications currently pending
Career History: 1060 total applications across all art units
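
As a quick check on the figures above, the 83% career allow rate follows directly from the resolved-case counts; the interview lift and with-interview numbers are reported by the tool and cannot be re-derived from this page alone. A minimal sketch of the arithmetic in Python:

    # Career allow rate from the counts shown above.
    granted, resolved = 869, 1043
    allow_rate = granted / resolved            # 0.833... -> reported as 83%
    print(f"Career allow rate: {allow_rate:.1%}")

    # The +19.9% interview lift is as reported; the underlying split of
    # resolved cases with vs. without an interview is not shown on this page.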

Statute-Specific Performance

§101: 24.9% (-15.1% vs TC avg)
§103: 33.0% (-7.0% vs TC avg)
§102: 23.7% (-16.3% vs TC avg)
§112: 5.7% (-34.3% vs TC avg)
Compared against a Tech Center average estimate • Based on career data from 1043 resolved cases
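
Reading each "vs TC avg" delta as the examiner's rate minus the Tech Center estimate, every statute implies the same baseline of about 40%, consistent with a single average estimate being applied across all four. The sketch below is an inference from the displayed numbers, not a documented methodology:

    # Implied Tech Center baseline behind each "vs TC avg" delta.
    examiner_rate = {"101": 24.9, "103": 33.0, "102": 23.7, "112": 5.7}
    delta_vs_tc  = {"101": -15.1, "103": -7.0, "102": -16.3, "112": -34.3}

    for statute, rate in examiner_rate.items():
        implied_tc_avg = rate - delta_vs_tc[statute]   # 40.0 for every statute
        print(f"§{statute}: examiner {rate:.1f}%, implied TC average {implied_tc_avg:.1f}%")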

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/1/2025 has been entered.

Response to Arguments

Applicant's arguments filed 12/1/2025 have been fully considered but they are not persuasive. Applicant essentially argues that the combination of the prior art on record fails to explicitly disclose "analyzing conversations between multiple participants, identifying potential issues or misunderstandings, and providing real-time feedback to participants to adjust their conversational actions" (REMARKS section). Also, applicant appears to place more emphasis on "understanding" or "misunderstandings" among users in a conversation.

Examiner maintains that the classroom dynamic interactions between students or students and teachers are valid conversations (paragraphs 34-35, 43, and 68, student-student interaction and students-teacher interaction in a classroom is recorded, and that includes both audio and video), identifying potential issues or misunderstandings (paragraph 40, identifying one or more learning gaps; misunderstandings are not considered due to the "or" operator), and providing real-time feedback to participants to adjust their conversational actions (paragraphs 41-43, providing feedback; "The data output module 220 is configured to output the one or more attention activities, the one or more non-attention activities, the determined set of contextual parameters and the identified one or more learning gaps on user interface screen of one or more electronic devices 108 associated with one or more users in real-time"; the non-attention or non-learning activities includes "students are talking to each other"; also "to adjust their conversational actions" is considered an intended use of the feedback). For these reasons, examiner maintains the prior art on record.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-14 are rejected under 35 U.S.C. 103 as being unpatentable over Shin et al. (USPG 2023/0360437, hereinafter Shin) in view of Kanuganti et al. (USPG 2022/0319181, hereinafter Kanuganti).

Regarding claims 1, 8, and 14, Shin discloses a system, method, and software product for processing at least concurrent audio and image signals to generate corresponding analysis data including emotional measurements (figure 6, audio and video are captured concurrently), wherein the system includes: a computing arrangement comprising at least a hardware audio processing module for processing the audio signals and an image processing module for processing the image signals, wherein the image processing module is a camera (figure 6, processing video and audio signals separately; also see paragraph 159; figure 5, camera), wherein each module is configured to use one or more artificial intelligence algorithms for processing its respective signal (figure 6 and/or paragraph 159, video signal is processed by CNN while audio signal is processed by LSTM), wherein the image processing module is configured to process facial image information present in the image signal to identify a plurality of key facial image points indicative of facial expression and to generate temporal facial status data (paragraph 89, various facial features), wherein the audio processing module is configured to process speech present in the audio signals by parsing the speech to correlate against a database of words to generate corresponding text data (figure 6, converting MFCC to text; also see paragraph 159), and by processing the speech to determine temporal speech frequency information indicative of at least one of emphasis, hesitation, speech word rate, and to temporally relate the temporal speech frequency information with the text data (figure 6, MFCC is temporal frequency information), and wherein the computing arrangement further includes an analysis module using one or more artificial intelligence algorithms to process the temporal facial status data, the text data and the temporal speech frequency information using emotional models to generate an interpretation of the audio and image signals to generate the analysis data including the emotional measurements (figure 6, the identification unit 422 processes the video feature, audio feature, and text feature to determine emotion; see paragraphs 160-162).

Shin fails to explicitly disclose, however, Kanuganti teaches the audio and image signals are of a plurality of participants interacting in a conversation (paragraphs 43 and/or 68, "conversation" of multiple users), wherein the analysis data identifies patterns and trends during the conversation of each of the plurality of participants for determining potential issues or misunderstandings during the conversation between the plurality of participants, and wherein a participant is provided the analysis data in real-time to adjust conversational actions based on the determined potential issues or misunderstanding (paragraphs 34-43 and 68, see Response to Arguments section above).

Since Shin and Kanuganti are analogous art because they are from the same field of endeavor, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to use the known technique of providing real-time feedback of the user's emotion so that the user can make adjustments. One of ordinary skill in the art would have recognized that the results of the combination were predictable since the use of that known technique provides the rationale to arrive at a conclusion of obviousness. See KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385 (U.S. 2007).
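
For orientation, the rejection maps the independent claims onto a two-branch pipeline: Shin's per-modality processing (CNN on video frames, MFCC/LSTM and speech-to-text on audio) fused into an emotional measurement, with Kanuganti supplying real-time feedback to conversation participants. The sketch below is illustrative only; the class and function names are invented and do not come from the cited references or the application:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SegmentAnalysis:
        text: str                   # speech-to-text output for the segment
        speech_rate_wpm: float      # temporal speech frequency information
        facial_points: List[float]  # key facial image points (placeholder values)
        emotion: str                # fused emotional measurement
        confidence: float

    def analyze_segment(audio_chunk: bytes, video_frame: bytes) -> SegmentAnalysis:
        """Stand-in for the CNN/LSTM fusion step described in the mapping to Shin."""
        # A real system would extract MFCCs, run speech-to-text, and detect facial
        # landmarks here; fixed values keep the sketch self-contained and runnable.
        return SegmentAnalysis(text="so as I was saying", speech_rate_wpm=145.0,
                               facial_points=[0.1, 0.4, 0.7], emotion="confused",
                               confidence=0.62)

    def realtime_feedback(analysis: SegmentAnalysis) -> str:
        """Stand-in for the Kanuganti-style real-time feedback step."""
        if analysis.emotion == "confused" and analysis.confidence > 0.5:
            return "Possible misunderstanding detected: consider rephrasing."
        return "No issues detected."

    if __name__ == "__main__":
        print(realtime_feedback(analyze_segment(b"", b"")))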
Regarding claims 2 and 9, Shin further discloses wherein the image processing module (106) is configured to process the image signals including video data that is captured concurrently with the audio signal (figure 6, audio and video signals are captured concurrently).

Regarding claims 3 and 10, Shin further discloses wherein the system and method (100, 200) is configured to process a text signal in addition to the at least audio and image signals (figure 6, converting MFCC to text), wherein information included in the text signal is used in conjunction with information included in the audio and image signals to generate the corresponding analysis data including the emotional measurements (figure 6, the identification unit 422 processes the video feature, audio feature, and text feature to determine emotion; see paragraphs 160-162).

Regarding claims 4 and 11, Shin further discloses wherein the one or more artificial intelligence algorithms include at least one of: neural networks, deep neural networks, Boltzmann machines, Hidden Markov Models, for processing at least the audio and image signals (figure 6, CNN or LSTM).

Regarding claims 5 and 12, Shin further discloses wherein the analysis module (110) is configured to use at least the emotional measurements to determine decision points occurring in a video discussion giving rise to the audio and image signals (paragraphs 161-162, "the identification unit 422 assigns the impression identification label to a captured content for each scene (or segment). The assigned impression identification label is recorded as, for example, meta information of the content"; determining emotion for each scene).

Regarding claims 6 and 13, Shin further discloses wherein the analysis module (110) is configured to use at least the emotional measurements to determine decision points occurring in a video discussion giving rise to the audio and image signals (paragraphs 161-162, "the identification unit 422 assigns the impression identification label to a captured content for each scene (or segment). The assigned impression identification label is recorded as, for example, meta information of the content"; determining emotion for each scene), wherein the decision points are determined by the analysis module from at least one of temporally abrupt changes in the emotional measurements, temporally abrupt changes in speech content of the audio signal (paragraphs 160-162, changes in emotion are dependent upon changes in signal content, which can be changes in the audio or video signal).

Regarding claim 7, Shin further discloses a method (300) for training the system (100, 200) of any one of claims 1 to 6, wherein the method includes: (i) assembling a first corpus of training material relating training values of emotional measurements to samples of audio signals including speech information (training process in figure 4 and/or paragraphs 132-138); (ii) assembling a second corpus of training material relating training values of emotional measurements of samples of image signals including facial expression information (training process in figure 4 and/or paragraphs 132-138); and (iii) applying the first and second corpus of training material to the one or more artificial intelligence algorithms to configure their analysis characteristics for processing at least the audio and video signals (process in figure 6, processing audio and video signals to determine emotion at runtime).
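
Claim 7's two-corpus training flow can be pictured as pairing emotion labels with audio samples in one corpus and with facial-expression images in another, then fitting each modality's model. The datasets and training step below are hypothetical stand-ins, not material from the record:

    # Hypothetical corpora: (sample identifier, emotion label) pairs.
    audio_corpus = [("utterance_001.wav", "neutral"), ("utterance_002.wav", "frustrated")]
    image_corpus = [("face_001.png", "neutral"), ("face_002.png", "surprised")]

    def train(corpus, modality):
        """Placeholder training step; a real system would fit an LSTM or CNN here."""
        labels = sorted({label for _, label in corpus})
        print(f"Training {modality} model on {len(corpus)} samples; labels: {labels}")

    train(audio_corpus, "audio")
    train(image_corpus, "image")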
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Shin in view of Kanuganti, and further in view of Byun et al. (USPG 2024/0087597).

Regarding claim 15, Shin further discloses the system of claim 1, wherein the analysis module is further configured to detect emotional information derived from the audio signals and emotional information derived from the image signals for at least one of the plurality of participants (figure 1, voice emotion and image emotion), and wherein the analysis module uses the detected discrepancies in determining the potential issues or misunderstandings during the conversation between the plurality of participants (see claim 1 above; "discrepancies" is interpreted as the audio emotion not agreeing with the video emotion as taught by Byun below; and only the audio emotion or video emotion that passes a certain threshold is selected for determining "potential issues" as discussed in Kanuganti: paragraphs 34-43 and 68, see Response to Arguments section above).

The modified Shin fails to explicitly disclose, however, Byun further teaches detecting discrepancies between emotional information derived from the audio signals and emotional information derived from the image signals (paragraphs 105-106, determining discrepancies between emotion in the audio and emotion in the video, e.g. one exceeds a threshold and the other falls below a threshold value; use the one that is above the threshold for further processing).

Since the modified Shin and Byun are analogous art because they are from the same field of endeavor, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to use the known technique of comparing audio emotion with video emotion. One of ordinary skill in the art would have recognized that the results of the combination were predictable since the use of that known technique provides the rationale to arrive at a conclusion of obviousness. See KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385 (U.S. 2007).
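
The claim 15 mapping turns on a threshold comparison between audio-derived and video-derived emotion, per the examiner's reading of Byun (paragraphs 105-106): flag a discrepancy when one modality clears the threshold and the other does not, and carry the above-threshold estimate forward. A minimal sketch of that check, with invented scores and threshold:

    THRESHOLD = 0.6  # illustrative value; no threshold is specified in the record

    def check_discrepancy(audio_score: float, video_score: float):
        """Return (discrepancy flag, modality to use for further processing)."""
        audio_ok = audio_score >= THRESHOLD
        video_ok = video_score >= THRESHOLD
        if audio_ok != video_ok:
            return True, ("audio" if audio_ok else "video")
        return False, None

    discrepant, use = check_discrepancy(audio_score=0.78, video_score=0.35)
    print(f"Discrepancy: {discrepant}; modality used: {use}")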
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Yadav et al. (USPG 2021/0201004, hereinafter Yadav) teach a method for processing speech, text and video to determine emotion that is considered pertinent to the claimed invention.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUYEN X VO whose telephone number is (571)272-7631. The examiner can normally be reached M-F, 8-4. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.

For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HUYEN X VO/
Primary Examiner, Art Unit 2656

Prosecution Timeline

May 03, 2023: Application Filed
Apr 18, 2025: Non-Final Rejection — §103
Jul 23, 2025: Response Filed
Jul 30, 2025: Final Rejection — §103
Dec 01, 2025: Request for Continued Examination
Dec 16, 2025: Response after Non-Final Action
Jan 09, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603083
ESTIMATION DEVICE, ESTIMATION METHOD, AND RECORDING MEDIUM
2y 5m to grant • Granted Apr 14, 2026
Patent 12596873
OPTIMIZATION OF RETRIEVAL AUGMENTED GENERATION USING DATA-DRIVEN TEMPLATES
2y 5m to grant • Granted Apr 07, 2026
Patent 12586594
GUIDING AMBISONIC AUDIO COMPRESSION BY DECONVOLVING LONG WINDOW FREQUENCY ANALYSIS
2y 5m to grant • Granted Mar 24, 2026
Patent 12579990
ENCODING DEVICE, DECODING DEVICE, ENCODING METHOD, AND DECODING METHOD
2y 5m to grant • Granted Mar 17, 2026
Patent 12572755
SYSTEM AND METHOD FOR AUGMENTING TRAINING DATA FOR NATURAL LANGUAGE TO MEANING REPRESENTATION LANGUAGE SYSTEMS
2y 5m to grant • Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 83%
With Interview: 99% (+19.9%)
Median Time to Grant: 2y 10m
PTA Risk: High
Based on 1043 resolved cases by this examiner. Grant probability derived from career allow rate.
