Prosecution Insights
Last updated: April 19, 2026
Application No. 18/271,609

HUMAN-COMPUTER INTERACTION METHOD, APPARATUS AND SYSTEM, ELECTRONIC DEVICE AND COMPUTER MEDIUM

Final Rejection — §102, §103
Filed: Jul 10, 2023
Examiner: WOO, STELLA L
Art Unit: 2693
Tech Center: 2600 — Communications
Assignee: BEIJING JINGDONG CENTURY TRADING CO., LTD.
OA Round: 2 (Final)
Grant Probability: 80% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 9m
Grant Probability with Interview: 93%

Examiner Intelligence

Career Allow Rate: 80% (801 granted / 1007 resolved) — above average, +17.5% vs TC avg
Interview Lift: +13.2% among resolved cases with interview (moderate lift)
Avg Prosecution: 2y 9m typical timeline; 21 applications currently pending
Career History: 1028 total applications across all art units
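
The headline figures above follow from simple counts. A back-of-the-envelope check in Python, assuming the dashboard derives them as shown below (an inference about the tool's arithmetic, not something the source documents):

```python
# Sanity-check of the displayed examiner statistics (assumed derivations).
granted, resolved, pending = 801, 1007, 21

career_allow_rate = granted / resolved        # 0.795 -> shown rounded as 80%
total_applications = resolved + pending       # 1028 applications overall
implied_tc_average = 0.80 - 0.175             # ~62.5%, from the "+17.5% vs TC avg" delta

interview_lift = 0.132                        # +13.2 percentage points
with_interview = 0.80 + interview_lift        # 0.932 -> shown rounded as 93%

print(f"Career allow rate: {career_allow_rate:.1%}, with interview: {with_interview:.1%}")
print(f"Total applications: {total_applications}")
```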

Statute-Specific Performance

§101: 3.3% (-36.7% vs TC avg)
§103: 42.4% (+2.4% vs TC avg)
§102: 27.9% (-12.1% vs TC avg)
§112: 11.4% (-28.6% vs TC avg)
Tech Center averages are estimates • Based on career data from 1007 resolved cases

Office Action

§102, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed August 27, 2025 have been fully considered but they are not persuasive. In response to applicant's argument that the references fail to show certain features of the invention, it is noted that the features upon which applicant relies (i.e., text-type data input by the user such as via an input apparatus) are not recited in claims 1, 6-9, 11, 13-14, 17-18. Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). It is maintained that the text data converted from speech input by the user in Lembersky may be considered as “text data of the user” to the extent required by the claims.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claim(s) 1, 6-9, 11, 13-14, 17-18 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lembersky et al. (US 2019/0095775 A1, “Lembersky”).

As to claims 1, 8, 9, 11, Lembersky discloses a method for human-computer interaction, comprising: receiving information of at least one modality of a user, wherein the information of at least one modality comprises image data and text data of the user (user face/video input 106, Fig. 1, para. 0022-0024, 0073; user’s speech 102 is converted to text, para. 0022); recognizing intention information of the user and an emotional characteristic of the user corresponding to the intention information based on the information of the at least one modality (user’s speech is converted to text and user’s mood is determined based on facial expressions captured by video and tone, volume, speed, timing, etc. of audio, para. 0022-0028), comprising: recognizing an expression characteristic of the user based on the image data of the user (facial expression recognizer interprets the facial landmarks as indicating a facial expression and emotion, para. 0075); extracting the intention information of the user based on the text data (user intent based on converted text, para. 0025); obtaining the emotional characteristic of the user corresponding to the intention information based on the text data and the expression characteristic (user’s mood may be determined based on facial recognition, skeletal tracking, words, etc., para. 0022, 0027-0028); determining reply information to the user based on the intention information (a proper response to the user is determined based on the converted text and mood, para. 0022, 0027); selecting an emotional characteristic of a character to be fed back to the user based on the emotional characteristic of the user (AI character responds in the appropriate facial emotions and voice, e.g., if the user is determined to be worried, the AI character may respond in a calming way, para. 0027; AI character is smiling if the user is happy, or calming if the user is upset, or shocked if the user says something shocking, para. 0029); and generating a broadcast video of an animated character image corresponding to the emotional characteristic of the character based on the emotional characteristic of the character and the reply information (AI character is presented as a hologram/3D model, para. 0006, 0022, 0027, 0038, 0103).
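
For orientation, the claim 1 method as mapped above is a multimodal pipeline: image and text come in, an expression characteristic and intent are extracted, a user emotion is fused from both, and an emotion-matched reply plus an animated broadcast video go out. A minimal sketch of that flow, using hypothetical helper names that appear in neither the application nor Lembersky:

```python
# Illustrative sketch only; every helper below is a hypothetical stub, not code from
# the application or from Lembersky.
from dataclasses import dataclass


@dataclass
class UserInput:
    image_data: bytes   # e.g. a frame of the user's face
    text_data: str      # text of the user (typed, or ASR output on the rejection's reading)


def recognize_expression(image_data: bytes) -> str:
    """Map facial landmarks to an expression label (stub)."""
    return "worried"


def extract_intent(text_data: str) -> str:
    """Derive the user's intention information from text (stub)."""
    return "ask_order_status"


def fuse_user_emotion(text_data: str, expression: str) -> str:
    """Combine text cues and the expression characteristic into one emotion label (stub)."""
    return "anxious"


def render_broadcast_video(reply: str, character_emotion: str) -> str:
    """Placeholder for the video-generation stage sketched separately below."""
    return f"video<{character_emotion}:{reply}>"


def interact(user: UserInput) -> dict:
    expression = recognize_expression(user.image_data)
    intent = extract_intent(user.text_data)
    user_emotion = fuse_user_emotion(user.text_data, expression)
    reply = f"reply for {intent}"                                   # reply from intent
    character_emotion = {"anxious": "calming"}.get(user_emotion, "neutral")
    video = render_broadcast_video(reply, character_emotion)
    return {"reply": reply, "character_emotion": character_emotion, "video": video}
```

The Response to Arguments above turns on whether the claimed text data must be typed input or may be speech converted to text; the sketch is agnostic on that point.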
As to claims 6, 17, Lembersky discloses: wherein the generating the broadcast video of the animated character image corresponding to the emotional characteristic of the character based on the emotional characteristic of the character and the reply information comprises: generating a reply audio based on the reply information and the emotional characteristic of the character (AI engine 112 determines a proper response to the user, which results in the proper text and emotion response being sent to a processor 116, which then translates the responsive text back to synthesized speech 118, para. 0022); and obtaining the broadcast video of the animated character image corresponding to the emotional characteristic of the character based on the reply audio, the emotional characteristic of the character, and a pre-established animated character image model (2D or 3D AI character conveys the appropriate emotional response, the AI character may be based on any associated character model, such as a human, avatar, cartoon, or inanimate object character, para. 0022; avatar may be a celebrity, a fictional character, etc., para. 0090).

As to claims 7, 18, Lembersky discloses: wherein the obtaining the broadcast video of the animated character image corresponding to the emotional characteristic of the character based on the reply audio, the emotional characteristic of the character, and the pre-established animated character image model comprises: inputting the reply audio and the emotional characteristic of the character into a trained mouth shape driving model to obtain mouth shape data outputted from the mouth shape driving model (morph target animation, para. 0026, 0040-0049); inputting the reply audio and the emotional characteristic of the character into a trained expression driving model to obtain expression data outputted from the expression driving model (morph target animation, para. 0026, 0040-0049); driving the animated character image model based on the mouth shape data and the expression data to obtain a three-dimensional model action sequence (mouth shape data, para. 0041); rendering the three-dimensional model action sequence to obtain a video frame picture sequence (3D holographic character conveys appropriate emotional response and mouth movement, para. 0006, 0022); and synthesizing the video frame picture sequence to obtain the broadcast video of the animated character image corresponding to the emotional characteristic of the character (3D files and audio files are combined into an interactive holographic “character” which can interact with users, para. 0038), wherein the mouth shape driving model and the expression driving model are trained based on a pre-annotated audio of a same person and audio emotion information obtained from the audio (machine learning tools and techniques 122 may then be used to improve the virtual assistant's responses based on the user's past experiences, para. 0031; machine learning may be used to analyze all user interactions to further improve the face emotional response and verbal response over time, para. 0093).
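
Claims 6-7 and 17-18, as characterized in the rejection, add the generation stage: synthesize reply audio conditioned on the character's emotion, drive a pre-established character model with separate mouth-shape and expression driving models, then render the action sequence and synthesize the frames into the broadcast video. A sketch of that stage under the same caveat (all names hypothetical):

```python
# Illustrative sketch of the video-generation stage; all helpers are hypothetical stubs.
from typing import List


def synthesize_reply_audio(reply_text: str, character_emotion: str) -> bytes:
    """Text-to-speech conditioned on the character's emotional characteristic (stub)."""
    return b"pcm-audio"


def mouth_shape_model(reply_audio: bytes, character_emotion: str) -> List[float]:
    """Trained mouth-shape driving model: audio + emotion in, mouth shape data out (stub)."""
    return [0.1, 0.4, 0.2]


def expression_model(reply_audio: bytes, character_emotion: str) -> List[float]:
    """Trained expression driving model: audio + emotion in, expression data out (stub)."""
    return [0.3, 0.0, 0.7]


def generate_broadcast_video(reply_text: str, character_emotion: str,
                             character_model: str = "pre_established_avatar") -> List[str]:
    reply_audio = synthesize_reply_audio(reply_text, character_emotion)
    mouth = mouth_shape_model(reply_audio, character_emotion)
    face = expression_model(reply_audio, character_emotion)
    # Drive the character model frame by frame, render each pose, and return the
    # frame sequence that would be synthesized with the audio into the broadcast video.
    action_sequence = [(character_model, m, f) for m, f in zip(mouth, face)]
    return [f"frame({model}, mouth={m:.1f}, face={f:.1f})" for model, m, f in action_sequence]
```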
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 2, 13, 19-26 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lembersky in view of McDuff et al. (US 2020/0279553 A1, “McDuff”).

As to claims 2, 13, 22, Lembersky discloses: wherein the information of the at least one modality further comprises audio data of the user (user’s speech 102, para. 0022), and the recognizing the intention information of the user and the emotional characteristic of the user corresponding to the intention information based on the information of the at least one modality further comprises: obtaining text information from the audio data (user’s speech is converted to text, para. 0022); extracting the intention information of the user based on the text information and the text data (user intent based on converted text, para. 0025); wherein the obtaining the emotional characteristic of the user corresponding to the intention information based on the text data and the expression characteristic comprises: obtaining the emotional characteristic of the user corresponding to the intention information based on the text data, the audio data and the expression characteristic (user’s mood may be determined based on facial recognition, skeletal tracking, and tone, volume, speed, words, etc. of voice, para. 0022, 0027-0028, 0101).

Lembersky differs from claims 2, 13 in that it does not disclose two types of text, i.e. text data and text information. McDuff teaches an emotionally-intelligent conversation agent which interacts with a user based on the user’s multimodal inputs, i.e. audio, text and video, to identify content and emotional expression (Abstract, para. 0018, 0020, 0064), the text including typed text as well as text converted from speech (para. 0060, 0065, 0127). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lembersky with the above teaching of McDuff in order to accommodate input via typed text as well as spoken text, such as when the user is not able to speak or not comfortable speaking (McDuff: para. 0090).

As to claims 19-21, Lembersky in view of McDuff teaches: wherein the text data of the user comprises text data input by the user as text-type data (McDuff: typed text, para. 0060, 0065, 0090, 0127).

As to claim 23, Lembersky in view of McDuff teaches: the non-transitory computer-readable medium according to claim 11, wherein the text data of the user consists of text data input by the user as one or more of characters, symbols, and numerical data, using an input apparatus configured to receive input from a user's fingers (McDuff: text typed on a keyboard 310 or entered using any other type of input device, para. 0065, 0090); the information of the at least one modality consists essentially of the image data, the text data of the user, and audio data of the user (Lembersky: user face/video input 106, Fig. 1, para. 0022-0024, 0073; user’s speech 102 is converted to text, para. 0022; McDuff: audio, text and video inputs, para. 0018), and the recognizing the intention information of the user and the emotional characteristic of the user corresponding to the intention information based on the information of the at least one modality further comprises: obtaining text information from the audio data (Lembersky: user’s speech is converted to text, para. 0022; McDuff: text from speech, para. 0028); extracting the intention information of the user based on the text information and the text data (Lembersky: user intent based on converted text, para. 0025; McDuff: intent recognized from speech text and typed text, para. 0035, 0065, 0090); wherein the obtaining the emotional characteristic of the user corresponding to the intention information based on the text data and the expression characteristic comprises: obtaining the emotional characteristic of the user corresponding to the intention information based on text data, the audio data and the expression characteristic (Lembersky: user’s mood may be determined based on facial recognition, skeletal tracking, and tone, volume, speed, words, etc. of voice, para. 0022, 0027-0028, 0101; McDuff: sentiment based on text from speech, typed text and facial expressions, para. 0064-0065); wherein the generating the broadcast video of the animated character image corresponding to the emotional characteristic of the character based on the emotional characteristic of the character and the reply information comprises: generating a reply audio based on the reply information and the emotional characteristic of the character (Lembersky: AI engine 112 determines a proper response to the user, which results in the proper text and emotion response being sent to a processor 116, which then translates the responsive text back to synthesized speech 118, para. 0022); and obtaining the broadcast video of the animated character image corresponding to the emotional characteristic of the character based on the reply audio, the emotional characteristic of the character, and a pre-established animated character image model (Lembersky: 2D or 3D AI character conveys the appropriate emotional response, the AI character may be based on any associated character model, such as a human, avatar, cartoon, or inanimate object character, para. 0022; avatar may be a celebrity, a fictional character, etc., para. 0090), comprising: inputting the reply audio and the emotional characteristic of the character into a trained mouth shape driving model to obtain mouth shape data outputted from the mouth shape driving model (Lembersky: morph target animation, para. 0026, 0040-0049); inputting the reply audio and the emotional characteristic of the character into a trained expression driving model to obtain expression data outputted from the expression driving model (Lembersky: morph target animation, para. 0026, 0040-0049); driving the animated character image model based on the mouth shape data and the expression data to obtain a three-dimensional model action sequence (Lembersky: mouth shape data, para. 0041); rendering the three-dimensional model action sequence to obtain a video frame picture sequence (Lembersky: 3D holographic character conveys appropriate emotional response and mouth movement, para. 0006, 0022); and synthesizing the video frame picture sequence to obtain the broadcast video of the animated character image corresponding to the emotional characteristic of the character (Lembersky: 3D files and audio files are combined into an interactive holographic “character” which can interact with users, para. 0038), wherein the mouth shape driving model and the expression driving model are trained based on a pre-annotated audio of a same person and audio emotion information obtained from the audio (Lembersky: machine learning tools and techniques 122 may then be used to improve the virtual assistant's responses based on the user's past experiences, para. 0031; machine learning may be used to analyze all user interactions to further improve the face emotional response and verbal response over time, para. 0093).

As to claim 24, Lembersky in view of McDuff teaches: wherein the text data of the user comprises text data input by the user as one or more of characters, symbols, and numerical data, using an input apparatus (McDuff: text typed on a keyboard 310 or entered using any other type of input device, para. 0065, 0090).

As to claim 25, Lembersky in view of McDuff teaches: wherein the text data of the user comprises text data input by the user as one or more of characters, symbols, and numerical data, using an input apparatus configured to receive input from a user's extremities and/or digits (McDuff: text input may be generated writing freehand, para. 0090).

As to claim 26, Lembersky in view of McDuff teaches: wherein the information of the at least one modality consists essentially of the image data and text data of the user; wherein the text data of the user comprises text data input by the user as one or more of characters, symbols, and numerical data, using an input apparatus comprising a keyboard and/or a mouse (McDuff: text typed on a keyboard 310 or entered using any other type of input device, para. 0065, 0090).

Claim(s) 4, 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lembersky in view of McDuff, and further in view of Seo et al. (US 2020/0104670 A1, “Seo”).

Lembersky in view of McDuff differs from claims 4, 15 in that although it teaches the use of machine learning to analyze all user interactions and categorize the emotion (Lembersky: para. 0093, claim 3), it does not specifically teach: wherein the obtaining the emotional characteristic of the user corresponding to the intention information based on the text data, the audio data and the expression characteristic comprises: inputting the text data into a trained text emotion recognition model to obtain a text emotion characteristic outputted from the text emotion recognition model; inputting the audio data into a trained speech emotion recognition model to obtain a speech emotion characteristic outputted from the speech emotion recognition model; inputting the expression characteristic into a trained expression emotion recognition model to obtain an expression emotion characteristic outputted from the expression emotion recognition model; and performing weighted summation on the speech emotion characteristic and the expression emotion characteristic to obtain the emotional characteristic of the user corresponding to the intention information. Seo teaches obtaining an emotion prediction based on a weighted average of speech emotion and facial expression emotion using trained neural network models according to each modality of data, such as a voice model, a facial expression model, a language model, etc. (Figs. 3, 4, para. 0049, 0080-0082, 0089, 0134-0135), the multimedia data including image data, video data, audio data, text data, graphic data, etc. (para. 0045). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lembersky in view of McDuff with the above teaching of Seo in order to provide an accurate identification of human emotion (Seo: para. 0006).
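
Claims 4 and 15 recite per-modality emotion recognition models whose speech and expression outputs are combined by weighted summation, the step for which Seo is cited. A small numeric illustration of that fusion (the weights and scores are invented for the example):

```python
# Weighted summation of per-modality emotion characteristics (illustrative numbers only).
speech_emotion = {"happy": 0.2, "anxious": 0.6, "neutral": 0.2}   # speech emotion model output
face_emotion   = {"happy": 0.1, "anxious": 0.8, "neutral": 0.1}   # expression emotion model output

w_speech, w_face = 0.5, 0.5   # hypothetical weights

fused = {label: w_speech * speech_emotion[label] + w_face * face_emotion[label]
         for label in speech_emotion}

user_emotion = max(fused, key=fused.get)   # "anxious", with fused score 0.70
```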
Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Stella L Woo whose telephone number is (571)272-7512. The examiner can normally be reached Monday - Friday, 8 a.m. to 5 p.m. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ahmad Matar can be reached at 571-272-7488. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center.
Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Stella L. Woo/ Primary Examiner, Art Unit 2693

Prosecution Timeline

Jul 10, 2023: Application Filed
May 23, 2025: Non-Final Rejection — §102, §103
Aug 27, 2025: Response Filed
Oct 20, 2025: Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602416
HYBRID ARTIFICIAL INTELLIGENCE SYSTEM FOR SEMI-AUTOMATIC PATENT CLAIMS ANALYSIS
2y 5m to grant • Granted Apr 14, 2026
Patent 12587613
System and method for documenting and controlling meetings with labels and automated operations
2y 5m to grant • Granted Mar 24, 2026
Patent 12585681
Methods for Converting Electronic Presentations Into Autonomous Information Collection and Feedback Systems
2y 5m to grant • Granted Mar 24, 2026
Patent 12581038
AUDIO PROCESSING IN VIDEO CONFERENCING SYSTEM USING MULTIMODAL FEATURES
2y 5m to grant • Granted Mar 17, 2026
Patent 12568170
PRIORITIZING EMERGENCY CALLS BASED ON CALLER RESPONSE TO AUTOMATED QUERY
2y 5m to grant • Granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 80%
With Interview: 93% (+13.2%)
Median Time to Grant: 2y 9m
PTA Risk: Moderate
Based on 1007 resolved cases by this examiner. Grant probability derived from career allow rate.
