Last updated: May 29, 2026

Application No. 18/323,641

HUMAN-MACHINE DIALOGUE SYSTEM AND METHOD

Non-Final OA §103

Filed

May 25, 2023

Priority

Jun 01, 2022 — CN 202210615940.6

Examiner

ZHU, RICHARD Z

Art Unit

2654

Tech Center

2600 — Communications

Assignee

Alibaba Damo (Hangzhou) Technology Co., Ltd.

OA Round

3 (Non-Final)

Interview Optional

Interview lift: +15.5%. The examiner has a relatively high allowance rate (69%), so a written response may suffice.

Based on 721 resolved cases, 2023–2026

Examiner Intelligence

ZHU, RICHARD Z

Grants 69% — above average

Career Allowance Rate

500 granted / 721 resolved

+7.3% vs TC avg

Strong +16% interview lift

Interview Lift: +15.5% (with vs. without interview)

Typical timeline

3y 3m

Avg Prosecution

22 currently pending

Career history

754

Total Applications

across all art units

Statute-Specific Performance

§101: 2.3% (-37.7% vs TC avg)

§103: 89.2% (+49.2% vs TC avg)

§102: 5.9% (-34.1% vs TC avg)

§112: 1.1% (-38.9% vs TC avg)

Tech Center average is an estimate. Based on career data from 721 resolved cases.

Office Action

§103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 04/23/2026 has been entered.
Status of the Claims
Claims 1-20 are pending. Claims 5-13 are withdrawn. 
Response to Applicant’s Arguments
In response to “Applicant respectfully believes that, in the cited paragraphs, Shen, at best, merely discusses using the preset text threshold and matching the recognized words with prestored words in the preset mood bag database, after the voice-to-text conversion. However, Shen merely uses the traditional rule-based operations solely based on the recognized text and characters, without considering any voice feature. Thus, Shen still fails to disclose determining the intention by combining both the text feature and the voice feature of the inserted voice”.

In Shen, the determination of user’s intention to interrupt is based on recognition of user’s tone and user’s words expressing certain mood (Shen, p. 5, ¶26, “a second judgment is made on the recognition result of the incoming voice stream, and when the recognition result of the incoming voice stream is a preset tone and when the mood words in the vocabulary pack are judged, the user has an intention to interrupt”).
While the tone corresponds to a voice feature and mood words correspond to text feature, there is no teaching in Shen of a fusion feature obtained by combining a text feature and a voice feature of the inserted voice.
In view of such amendment to claims 1 and 14, previous rejections have been withdrawn. Upon further consideration and search, please see a new combination of references set forth below. 
Claim Rejections - 35 USC § 103
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 103 that form the basis for the rejections under this section made in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 14, and 20 are rejected under 35 USC 103(a) as being unpatentable over Chatterjee et al. (US 11087094 B2) in view of Shen et al. (CN 110853638 A, see attached EPO translation) and Xing et al. (US 2023/0223018 A1).
Regarding Claims 1 and 14, Chatterjee discloses a human-machine dialogue system (Fig. 1), comprising: 
a non-transitory computer-readable storage medium storing a set of instructions that are executable by one or more processors of a device to cause the device to perform a human-machine dialogue method and one or more processors configured to execute instructions to cause the human-machine dialogue system to perform operations of the human-machine dialogue method (Col 2, Rows 50-54 and Col 3, Rows 19-24) comprising: 
performing intention clustering of a dialogue data sample based on a semantic representation of the dialogue data sample (Col 15, Rows 52-63, receiving a set of conversations and related meta data (e.g., Col 17, Rows 6-7, conversation summary), converting each of the set of conversations and related meta data into a set of feature representations in a multi-dimensional vector space, and performing density based spatial clustering of applications with noise to identify clusters among the set of feature representations; per Col 17, Rows 28-34, once the cluster is correctly formed with a similar set of conversations, add a label to the cluster to create an intent model to infer user goals during conversations); 
constructing, based on a clustering result, a dialogue procedure corresponding to the dialogue data sample (Col 17, Rows 53-57, create nodes and assign word sequences to a node in a conversation graph and form corresponding transitional paths to represent the plurality of conversations that have occurred and were directed to a particular task, intent, goal, etc.; per Col 25, Rows 24-28, the conversation graph embeds utterance patterns for agents as a plurality of nodes and user utterances as a plurality of transitional paths or edges such that the conversation graph encodes an agent’s behavior based on user responses); 
receiving a voice dialogue from a user (Col 5, Rows 54-58, request 102 / caller voice request; Fig. 10, receiving user utterance 1010);
converting the voice dialogue into a dialogue text (Col 5, Rows 65-67, automatic speech recognition system 110 processes request 102); 
obtaining a semantic representation corresponding to the dialogue text of the voice dialogue of the user (Col 6, Rows 5-8, pass the sequence of words to natural language understanding system 112; Col 25, Rows 36-52, Fig. 10, passing a first input / current user utterances 1010 into an embedding layer 1012 and LSTM layer 1014 where embedding layer 1012 generates embedding vectors of the current user utterances 1010); 
performing intention analysis on the semantic representation corresponding to the voice dialog to obtain an intention analysis result (Col 6, Rows 10-20, natural language understanding system outputs dialogue act category, the intent of the user, and slot names and values; Col 25, Rows 36-41, Fig. 10, passing a first input / current user utterances 1010 into an embedding layer 1012 and LSTM layer 1014; in view of Col 9, Rows 4-18 and Col 11, Rows 28-33, the LSTM layer 1014 corresponds to a spoken language understanding system including a bidirectional RNN comprising LSTM to determine slot, intent, and dialogue act classification (per Col 8, Rows 44-46, determine if the dialog is a question, query, greeting, command, or request, and information)); 
determining, according to the intention analysis result and the dialogue procedure constructed in advance, a dialogue response (Col 25, Rows 55-60, neural network dense layer 1040 output for predicting the next best node and sample an utterance randomly from the node to automatically generate a conversation flow to allow a system to respond in real-time to different customer utterances and conversation paths); and 
performing voice interaction of the dialogue response with the user (Col 6, Rows 38-44, track current state of the dialog between virtual agent 100 and customer and to respond to the request in a conversational manner) by converting the dialogue response into voice so as to interact with the user by voice (Col 6, Rows 52-53, convert words into audible response 104 by text to speech synthesis unit 118), wherein the dialogue response is an answer response to the voice dialogue (Col 7, Rows 28-37, look up answer to a particular question), or a clarification response to clarify a dialogue intention of the voice dialogue (Col 6, Rows 45-53, “when would you like to leave?”).
Chatterjee does not disclose in a process of performing voice dialogue interaction with the user, performing: determining whether an inserted voice by the user is detected before the dialogue response being played is finished; in response to determining that the inserted voice is detected before the dialogue being played is finished, determining whether the inserted voice reflects an interrupt intention and in response to the determination that the inserted voice does not reflect the interrupt intention, continuing playing the dialogue response.
Shen teaches a human machine dialogue system (p. 1, “Summary of the Invention”, “a method and a device for interrupting a voice robot in real time during a voice interaction process, so that a user can interrupt a voice robot’s voice output at any time during the voice interaction process with the voice robot”) where in a process of performing voice dialogue interaction with the user (p. 3, ¶6, step S101, the voice robot performs voice interaction with the user’s voice), performing: 
determining whether an inserted voice by the user is detected before the dialogue response being played is finished (p. 3, ¶8 and p. 3, ¶10, step S102, determine whether the voice robot outputs a voice, Step S103, when the voice robot outputs a voice, start a detection device to detect whether the user sends an incoming voice stream); 
in response to determining that the inserted voice is detected before the dialogue being played is finished, determining whether the inserted voice reflects an interrupt intention (p. 4, ¶12, user’s incoming voice stream is the voice stream output by the user during the voice robot’s voice output process where purpose of user’s voice is to interrupt the voice robot or not interrupt; p. 4, ¶18, judging whether user sends out interrupted voice stream when the number of text in the incoming voice stream does not exceed a preset text threshold) and 
in response to the determination that the inserted voice does not reflect the interrupt intention, continuing playing the dialogue response (p. 5, ¶21 and ¶26, when the number of texts in the incoming voice stream does not exceed the preset text threshold, execute step S106 to match the recognition result of the incoming voice stream with a preset mood bag database corresponding to user having an intention to interrupt; p. 5, ¶27, when the recognition result of the incoming voice stream does not match the preset mood bag database, the detection device does not perform the action of interrupting the voice robot in response to the incoming voice stream).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to determine whether an inserted voice by the user is detected before the dialogue response being played is finished, determine whether the inserted voice reflects an interrupt intention, and continue playing the dialogue response when it is determined that the inserted voice does not reflect the interrupt intention in order to enable the user to interrupt the voice (i.e., dialogue response) being output by the voice robot in real-time during the voice interaction process between the user and the voice robot, and to filter the non-interrupted user’s input voice (Shen, p. 3, ¶3).
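The rule-based check attributed to Shen above (a text-count threshold followed by matching against a preset mood-word list) can be illustrated with a minimal sketch. This is not Shen's implementation: the function names, the threshold value, and the word list are hypothetical, the count is taken over whitespace-separated words although Shen's "number of text" may count characters, and the handling of utterances that exceed the threshold is an assumption, since the cited passages do not spell it out.

```python
def reflects_interrupt_intention(recognized_text, text_threshold, mood_words):
    """Return True if the inserted voice is treated as an interruption.

    Hypothetical helper: names, threshold, and word list are illustrative only.
    """
    tokens = recognized_text.lower().split()
    # Assumption: an utterance longer than the threshold counts as a deliberate
    # interruption. The cited passages do not state this branch explicitly.
    if len(tokens) > text_threshold:
        return True
    # Short utterance: interrupt only if it matches a preset mood word.
    return any(token in mood_words for token in tokens)


def handle_inserted_voice(recognized_text, text_threshold=6,
                          mood_words=frozenset({"stop", "wait"})):
    """Decide whether playback of the dialogue response continues."""
    if reflects_interrupt_intention(recognized_text, text_threshold, mood_words):
        return "interrupt"
    return "continue_playing"
```

With these illustrative defaults, a short backchannel such as "uh huh" leaves the response playing, while "wait" stops it.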
The combination of Chatterjee and Shen does not disclose determining whether the inserted voice reflects an interrupt intention by a joint modeling of speech and text, according to a fusion feature obtained by combining a text feature and a voice feature of the inserted voice.
Xing discloses a multimodal language understanding system making a semantic prediction representing speaker’s intent based on an audio-textual representation (¶30 and Figs. 2-4) by determining speaker’s intent by a joint modeling of speech and text, according to a fusion feature obtained by combining a text feature and a voice feature of the speaker’s inserted voice (¶62, generate a sequence of speech chunks 230 and corresponding text transcripts 240 from speech signal 210; ¶74, encode the speech chunks into encoded speech embeddings 414; ¶75, encode the text transcript 240 to generate encoded word embeddings 416; ¶¶75-77, generate a uniform representation 420 comprising temporally aligned speech embedding and word embedding; ¶¶78-79, concatenate the uniform representation to generate an audio-textual representation 424 constituting fused multimodal feature integration of speech and text; ¶80, transform the audio-textual embeddings 424 into speaker’s intent).
In Shen, the determination of user’s intention to interrupt is based on recognition of user’s tone and user’s words expressing certain mood (Shen, p. 5, ¶26) that Xing’s fusion feature can capture (Xing, ¶79, audio-textual embeddings 424 being a joint representation of both speech and text to help to capture important semantic cues not present in the text transcript; per ¶33, intent and emotion may be communicated through semantic cues such as timing, intensity, intonation, pitch not captured within a text transcript of speaker speech). 
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to determine whether the inserted voice reflects an interrupt intention by a joint modeling of speech and text, according to a fusion feature obtained by combining a text feature and a voice feature of the inserted voice in order to provide fusion feature combining text feature and voice feature to improve performance of spoken language understanding (Xing, ¶44) making semantic predictions representing a speaker’s intent (Xing, Abstract).
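The fusion feature relied on from Xing (temporally aligned speech and word embeddings concatenated into one audio-textual representation) can be sketched as follows. This is only an illustration under stated assumptions: the vectors are toy values, the function names are hypothetical, and the mean-pooling step is an assumed way to obtain a single utterance-level vector, not something the cited paragraphs describe.

```python
def fuse_features(speech_embeddings, word_embeddings):
    """Concatenate aligned speech and word vectors, one fused vector per step."""
    if len(speech_embeddings) != len(word_embeddings):
        raise ValueError("embeddings must be aligned to the same number of steps")
    return [list(s) + list(w) for s, w in zip(speech_embeddings, word_embeddings)]


def mean_pool(fused):
    """Average fused vectors over time (an assumed pooling choice)."""
    steps = len(fused)
    return [sum(column) / steps for column in zip(*fused)]


# Toy example: two time steps, 2-dim speech vectors, 3-dim word vectors.
speech = [[0.1, 0.9], [0.2, 0.8]]
words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
fused = fuse_features(speech, words)   # two vectors of length 5
utterance_vector = mean_pool(fused)    # one vector of length 5
```

A downstream classifier would take `utterance_vector` as input, so that tone-related cues in the speech dimensions and word cues in the text dimensions are available to the same decision.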
Regarding Claim 20, Chatterjee discloses wherein one part of the dialogue data sample is tagged data, and the other part of the dialogue data sample is untagged data (Col 17, Rows 28-46 and Col 18, Rows 38-44, 90%-97% of the conversation data can be annotated with a classifier for user intent; i.e., 3% to 10% of the conversation data are not annotated, which may require a human expert to add the label (per Col 17, Rows 21-27) in semi-supervised labeling algorithm).
Claims 2-3 and 18-19 are rejected under 35 USC 103(a) as being unpatentable over Chatterjee et al. (US 11087094 B2) in view of Shen et al. (CN 110853638 A) and Xing et al. (US 2023/0223018 A1) as applied to claims 1 and 14, in view of Dubey et al. (US 2019/0124202 A1).
Regarding claims 2 and 18, Chatterjee discloses wherein the operations further comprise: 
performing dialogue semantic cluster segmentation on the dialogue data sample in advance based on the semantic representation of the dialogue data sample (Col 15, Row 57 – Col 16, Row 4 and Col 16, Rows 48-55, convert each of the set of conversations and related meta data into feature representations in multi-dimensional vector space and using density based spatial clustering of applications with noise (DBSCAN) to determine a first set of clusters and a second set of conversations not mapped to the first set of clusters); 
performing density clustering according to a semantic cluster obtained by segmentation and a dialogue representation vector corresponding to the dialogue data sample (Col 15, Rows 3-7 and Col 16, Rows 8-12, clustering model runs multiple iterations over the data to adapt the problem space to find large clusters first and then increasingly smaller clusters in subsequent iterations); 
obtaining, according to the clustering result, at least one start intention and dialogue data corresponding to each start intention (Col 17, Rows 28-34, once the cluster is correctly formed with a similar set of conversations, add label to the cluster to train a classification model to create an intent model to infer user goals during conversations; Col 17, Rows 34-41, in an embodiment training is done on actual conversations where training data is collected by extracting first few utterances of the customer and an intent label is used as an end class to build a mapping function from conversation to intent); 
for each start intention, performing dialogue path mining based on dialogue data corresponding to the start intention (Col 17, Rows 55-60, create nodes and assign word sequences into a node in a conversation graph as well as forming corresponding transitional paths, the conversation graph represents a plurality of conversations that have occurred and directed to particular task, intent, goal, and objective; see e.g., Fig. 4 and Col 18, Rows 15-54, for each collection of conversations determined to have the same intent (per Col 18, Rows 15-17 and Rows 52-54, intent label), perform steps 410-440 to extract agent utterances comprising agent questions and corresponding question classification, agent resolutions and corresponding resolution classification); and 
constructing, according to a mining result, a dialogue procedure corresponding to the dialogue data sample (Col 18, Rows 54-56, convert extracted question-resolution sequence from each conversation to a graph; see e.g., Fig. 5, Rows 57-67, a conversation graph with set V of nodes and set E of edges (transitional paths)).
Chatterjee does not disclose that the density clustering is hierarchical density clustering. 
Dubey teaches using natural language processing techniques to identify topics associated with conversations and perform hierarchical density-based cluster analysis (H-DBSCAN) to group conversations under one or more topics (¶31).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to perform hierarchical density clustering according to the semantic cluster in order to group conversations under one or more topics (Dubey, ¶31) such as dialogue act classifications.
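For orientation, the density-based clustering cited from Chatterjee (DBSCAN, which yields a set of clusters plus conversations not mapped to any cluster) can be sketched in a few lines. This is plain DBSCAN over toy 2-D points, not the hierarchical variant cited from Dubey and not either reference's code; `eps`, `min_samples`, and the sample points are illustrative assumptions.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # core point: keep growing the cluster
    return labels


# Toy "conversation vectors": two dense groups and one outlier.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(vectors, eps=1.5, min_samples=3)
```

Here the points labeled `NOISE` play the role of the conversations that are not mapped to the first set of clusters.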
Regarding Claims 3 and 19, Chatterjee discloses wherein the dialogue procedure corresponding to the dialogue data sample is constructed according to the mining result by: 
obtaining, according to the mining result, dialogue semantic clusters respectively corresponding to a user and a robot customer service corresponding to the dialogue data (Col 16, Rows 34-41, clustering process obtains set of conversations comprising user’s utterances and responsive robot’s utterance); 
constructing a key dialogue transfer matrix according to the dialogue semantic clusters respectively corresponding to the user and the robot customer service (Col 16, Rows 40-55, use word2vec model to convert the set of conversations to feature representation, run the feature representation through density based spatial clustering of applications with noise to generate a first set of clusters and a second set of conversations not mapped to the first set of clusters, and generate a conversation feature matrix); 
generating, according to the key dialogue transfer matrix, a dialogue path used to indicate a dialogue procedure (Col 17, Rows 53-57, use the obtained data (i.e., conversation feature matrix) to create nodes and assign words sequences to a node in a conversation graph as well as form corresponding transitional paths); and 
mounting the generated dialogue path to the start intention to construct the dialogue procedure corresponding to the dialogue data sample (Col 18, Row 54 – Col 19, Row 22, convert to graph such as the conversation graph 500 of Fig. 5 comprise set V of nodes and set E of edges (transitional paths) with different sets of nodes representing different intents such as information seeking or action node to provide an action or resolution; per Col 17, Rows 57-60, a conversation graph represents a plurality of conversations that have occurred and were directed to a particular task, intent, goal, and/or objective).
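The claimed sequence of building a transfer matrix over dialogue clusters, generating a path from it, and mounting that path on a start intention can be sketched as below. This is a hypothetical illustration, not the applicant's method or Chatterjee's conversation graph: the cluster labels, the greedy most-frequent-transition rule, and the function names are all assumptions.

```python
from collections import Counter, defaultdict


def build_transition_counts(dialogues):
    """Count how often one cluster label directly follows another."""
    counts = defaultdict(Counter)
    for turns in dialogues:
        for current, following in zip(turns, turns[1:]):
            counts[current][following] += 1
    return counts


def most_likely_path(counts, start, max_steps=10):
    """Follow the most frequent transition from start until a dead end or loop."""
    path = [start]
    seen = {start}
    while len(path) <= max_steps:
        options = counts.get(path[-1])
        if not options:
            break  # no observed continuation from this cluster
        following, _ = options.most_common(1)[0]
        if following in seen:
            break  # avoid cycling through the same clusters
        path.append(following)
        seen.add(following)
    return path


# Each dialogue is a sequence of (hypothetical) semantic cluster labels.
dialogues = [
    ["greet", "ask_date", "confirm"],
    ["greet", "ask_date", "confirm"],
    ["greet", "ask_city", "confirm"],
]
counts = build_transition_counts(dialogues)
# "Mount" the mined path on its start intention.
procedure = {"book_flight": most_likely_path(counts, "greet")}
```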
Claims 4 and 15-17 are rejected under 35 USC 103(a) as being unpatentable over Chatterjee et al. (US 11087094 B2) in view of Shen et al. (CN 110853638 A) and Xing et al. (US 2023/0223018 A1) as applied to claims 1 and 14, in view of Walters et al. (US 2014/0244712 A1).
Regarding Claims 4 and 15, Chatterjee does not disclose in a process of performing dialogue interaction with the user, detecting whether an insertion timing for a set speech exists, and inserting the set speech when the insertion timing is detected.
Walters discloses interaction environment comprising a dialog interface to receive input from a user and an interaction engine core determines a user’s intent to provide feedback or responses back to the user (¶86) where in a process of performing dialogue interaction with the user, detecting whether an insertion timing for a set speech exists (¶142, when determining to resume a task after an interruption of a shorter time period or an interruption of a longer time period), and inserting the set speech when the insertion timing is detected (¶142, with an interruption of a shorter time period, virtual assistant 710 utters “So, what date did you want to fly” and with an interruption of a longer time period, virtual assistant 710 utters “Should we continue with the flight booking to New York?”; i.e., insert “So, what date did you want to fly” or “Should we continue with the flight booking to New York?” based on length of time of interruption and chance of user forgetting where the dialog left off).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to insert set speech when insertion timing is detected in order to implement task resumption strategies based on length of interruption timing and based on chance of user forgetting where the dialog left off (Walters, ¶142).
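The resumption behavior cited from Walters ¶142 (a different prompt depending on how long the dialog was interrupted) reduces to a simple selection, sketched here. The two prompts are the ones quoted above; the 300-second cutoff and the function name are hypothetical, as the cited passage gives no numeric boundary between a shorter and a longer interruption.

```python
SHORT_PROMPT = "So, what date did you want to fly"
LONG_PROMPT = "Should we continue with the flight booking to New York?"


def resumption_prompt(seconds_interrupted, short_limit_seconds=300.0):
    """Pick the set speech to insert when the dialog resumes.

    The cutoff is an illustrative assumption, not a value from Walters.
    """
    if seconds_interrupted < 0:
        raise ValueError("interruption length cannot be negative")
    if seconds_interrupted <= short_limit_seconds:
        # Short break: the user likely remembers the task, so just re-ask.
        return SHORT_PROMPT
    # Long break: confirm the task itself before continuing.
    return LONG_PROMPT
```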
Regarding Claims 4 and 16, Chatterjee does not disclose in a process of performing dialogue interaction with the user, detecting the inserted voice by the user, and in response to determining that an intention corresponding to the inserted voice is to interrupt a dialogue voice, processing the inserted voice.
Walters discloses in a process of performing dialogue interaction with the user, detecting an inserted voice by the user, and in response to determining that an intention corresponding to the inserted voice is to interrupt a dialogue voice, processing the inserted voice (¶¶108-110, when user requested “Can you check flights to Paris for Friday” and user interrupts the dialog “We will have to continue later I am home now”, virtual assistant executes the interruption, place a time stamp on the dialog, and store it in task repository).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to process inserted voice by a user in response to a determination that an intention corresponding to the inserted voice is to interrupt a dialogue voice in order to maintain a persistence of knowledge by preserving dialog sessions and techniques for dialog resumption (Walters, ¶105).
Regarding Claims 4 and 17, Chatterjee does not disclose detecting a pause of the user in a dialogue interaction process, and in response to a detection result indicating that a dialogue corresponding to the pause is incomplete, inserting a guide language to guide the user to complete the dialogue.
Walters discloses in a process of performing dialogue interaction with the user, detecting a pause of the user in a dialogue interaction process (¶¶108-110, as virtual assistant responds “I have found ten flights for Friday. When do you need to be there?” to user request “Can you check flights to Paris for Friday?”, determine that user has paused the conversation based on user interruption “We will have to continue later I am home now”), and in response to a detection result indicating that a dialogue corresponding to the pause is incomplete, inserting a guide language to guide the user to complete the dialogue (¶142, with an interruption (i.e., pause) of a shorter time period, virtual assistant 710 utters “So, what date did you want to fly”).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to insert a guide language to guide the user to complete the dialogue in response to a detection result indicating that a dialogue corresponding to the pause is incomplete in order to implement task resumption strategies based on chance of user forgetting where the dialog left off (Walters, ¶142).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to examiner Richard Z. Zhu whose telephone number is 571-270-1587 or examiner’s supervisor Hai Phan whose telephone number is 571-272-6338. Examiner Richard Zhu can normally be reached on M-Th, 0730-1700.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RICHARD Z ZHU/
Primary Examiner, Art Unit 2654
05/02/2026

Prosecution Timeline

May 25, 2023

Application Filed

Oct 01, 2025

Non-Final Rejection mailed — §103

Jan 07, 2026

Response Filed

Jan 28, 2026

Final Rejection mailed — §103

Mar 30, 2026

Response after Non-Final Action

Apr 23, 2026

Request for Continued Examination

Apr 24, 2026

Response after Non-Final Action

May 06, 2026

Non-Final Rejection mailed — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/525,620

Patent 12633287

DOMAIN MODEL DRIVEN PROCESSING OF DIALOG INCLUDING AMBIGUOUS INTENTS

2y 5m to grant Granted May 19, 2026

18/077,826

Patent 12626693

MODELING ATTENTION TO IMPROVE CLASSIFICATION AND PROVIDE INHERENT EXPLAINABILITY

3y 5m to grant Granted May 12, 2026

18/408,254

Patent 12619822

RESPONSIBLE PROMPT RECOMMENDATION

2y 3m to grant Granted May 05, 2026

18/543,093

Patent 12613895

SYSTEMS AND METHODS FOR QUERYING A DOCUMENT POOL

2y 4m to grant Granted Apr 28, 2026

18/247,441

Patent 12592228

SPEECH INTERACTION METHOD, AND APPARATUS, COMPUTER READABLE STORAGE MEDIUM, AND ELECTRONIC DEVICE

3y 0m to grant Granted Mar 31, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

3-4

Expected OA Rounds

69%

Grant Probability

85%

With Interview (+15.5%)

3y 3m (~3m remaining)

Median Time to Grant

High

PTA Risk

Based on 721 resolved cases by this examiner. Grant probability derived from career allowance rate.
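The projection figures can be reproduced from the counts shown on this page. The sketch below assumes that the interview lift is added to the career allowance rate as percentage points; the page does not state its exact formula, and the function names are hypothetical.

```python
def allowance_rate(granted, resolved):
    """Career allowance rate as a percentage."""
    return granted / resolved * 100


def with_interview(rate_percent, lift_points):
    """Assumed model: the interview lift is additive, in percentage points."""
    return rate_percent + lift_points


base = allowance_rate(500, 721)      # about 69.3, displayed as 69%
boosted = with_interview(base, 15.5)  # about 84.8, displayed as 85%
print(f"{base:.1f}% -> {boosted:.1f}%")
```

Under that assumption the rounded values match the 69% and 85% figures above.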