Prosecution Insights
Last updated: April 19, 2026
Application No. 18/892,786

Resolving a Co-Reference Based on a Multimodal User Input for Assistant Systems

Final Rejection (§102, §DP)
Filed: Sep 23, 2024
Examiner: LE, HUNG D
Art Unit: 2161
Tech Center: 2100 — Computer Architecture & Software
Assignee: Meta Platforms Inc.
OA Round: 2 (Final)
Grant Probability: 90% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 6m
With Interview: 97%

Examiner Intelligence

Grants 90% — above average.

Career Allow Rate: 90% (969 granted / 1073 resolved; +35.3% vs TC avg)
Interview Lift: +6.4% (moderate; compares outcomes of resolved cases with vs. without an interview)
Typical Timeline: 2y 6m average prosecution; 33 applications currently pending
Career History: 1106 total applications across all art units
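
The headline figures can be sanity-checked from the raw counts on the card. A minimal sketch, assuming the allow rate is simply grants over resolved cases and the "vs TC avg" figure is a percentage-point difference (the tool's exact formulas are not disclosed):

```python
# Sanity check of the examiner card, assuming allow rate = granted / resolved
# and "+35.3% vs TC avg" is a percentage-point delta (both are assumptions;
# the dashboard's exact formulas are not disclosed).
granted, resolved = 969, 1073
allow_rate = granted / resolved        # ≈ 0.903, displayed as 90%
tc_average = allow_rate - 0.353        # implied TC average ≈ 55.0%
print(f"Career allow rate: {allow_rate:.1%}, implied TC average: {tc_average:.1%}")
```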

Statute-Specific Performance

§101: 12.3% (-27.7% vs TC avg)
§103: 39.2% (-0.8% vs TC avg)
§102: 20.6% (-19.4% vs TC avg)
§112: 9.2% (-30.8% vs TC avg)
Tech Center averages are estimates (shown as a black line in the original chart). Based on career data from 1073 resolved cases.
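
Read the same way, the statute-specific deltas pin down the Tech Center baseline. A quick check, assuming each delta is the examiner's rate minus the TC average, shows every statute's implied baseline works out to 40.0%:

```python
# Recovering the implied TC baseline per statute, assuming
# delta = examiner rate - TC average (an assumption about the chart).
examiner_rate = {"§101": 12.3, "§103": 39.2, "§102": 20.6, "§112": 9.2}
delta_vs_tc = {"§101": -27.7, "§103": -0.8, "§102": -19.4, "§112": -30.8}

for statute, rate in examiner_rate.items():
    tc_avg = rate - delta_vs_tc[statute]   # every statute implies 40.0%
    print(f"{statute}: examiner {rate:.1f}% vs implied TC average {tc_avg:.1f}%")
```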

Office Action

§102 §DP
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

1. This Office Action is in response to the preliminary amendment filed on 11/15/2024. Claim 1 has been canceled. Claims 2-21 have been added. Claims 2-21 are pending.

Priority

2. This application's status as a Continuation of 18/185,258, filed on 03/16/2023, is acknowledged and considered.

Information Disclosure Statement

2. The five information disclosure statements (IDS) filed on 12/11/2024 comply with the provisions of M.P.E.P. 609. The examiner has considered them.

Double Patenting

3. The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory obviousness-type double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); and In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on a nonstatutory double patenting ground provided the conflicting application or patent either is shown to be commonly owned with this application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. Effective January 1, 1994, a registered attorney or agent of record may sign a terminal disclaimer. A terminal disclaimer signed by the assignee must fully comply with 37 CFR 3.73(b).

4. Claims 2-21 are provisionally rejected on the ground of nonstatutory obviousness-type double patenting as being unpatentable over claims 1-18 of application 18/185,258. Although the conflicting claims are not identical, they are not patentably distinct from each other.

Instant Application 18/892,786, Claim 2: A method comprising, by a client system: receiving, at the client system via an assistant application, an audio input comprising speech of a user; receiving, at the client system via the assistant application, a visual input comprising one or more subjects; determining that the speech comprises a co-reference to an entity associated with an attribute; analyzing the visual input to resolve, based at least in part on the attribute, the entity from the one or more subjects; determining that the speech comprises a request to perform a task associated with the entity; executing the requested task; and presenting, at the client system via the assistant application, an output associated with the executed task, wherein the output comprises audio information.
Application 18/185,258, Claim 1 (compared against instant claim 2): A method comprising, by a head-mounted device: receiving, at the head-mounted device, a speech input from a user and a visual input captured by one or more cameras of the head-mounted device, wherein the speech input comprises a textual input spoken by the user, and wherein the visual input depicts a real-time view captured by the one or more cameras, wherein the real-time view comprises one or more visual concepts and one or more attributes associated with the one or more visual concepts, and wherein the textual input comprises a co-reference to one or more of the visual concepts; resolving, based on the one or more attributes and the co-reference, one or more entities corresponding to the one or more visual concepts associated with the co-reference; and presenting, at the head-mounted device, a communication content responsive to the speech input and the visual input, wherein the communication content comprises information associated with executing results of one or more tasks corresponding to the one or more resolved entities.

Instant Application 18/892,786, Claim 8: A method of operating a client system, the method comprising: receiving, from a microphone of the client system, an audio input comprising speech of a user; receiving, from a camera of the client system, a visual input comprising a real-time view of one or more subjects; determining, based at least in part on a machine-learning model, one or more attributes of the one or more subjects; performing speech recognition on the speech to obtain a co-reference; performing a visual analysis of the visual input to resolve, based at least in part on the co-reference and the one or more attributes, an entity corresponding to a specific subject of the one or more subjects; in response to a request from the user, executing a task associated with the entity; and presenting, at the client system, an output associated with the executed task, wherein the output comprises audio information.

Application 18/185,258, Claim 17 (compared against instant claim 8): One or more computer-readable non-transitory storage media embodying software that is operable when executed by a client system to: receive, at a head-mounted device, a speech input from a user and a visual input captured by one or more cameras of the head-mounted device, wherein the speech input comprises a textual input spoken by the user, and wherein the visual input depicts a real-time view captured by the one or more cameras, wherein the real-time view comprises one or more visual concepts and one or more attributes associated with the one or more visual concepts, and wherein the textual input comprises a co-reference to one or more of the visual concepts; resolve, based on the one or more attributes and the co-reference, one or more entities corresponding to the one or more visual concepts associated with the co-reference; and present, at the head-mounted device, a communication content responsive to the speech input and the visual input, wherein the communication content comprises information associated with executing results of one or more tasks corresponding to the one or more resolved entities.
Instant Application 18/892,786, Claim 15: A method of operating a client system, the method comprising: receiving, from a microphone of the client system, an audio input comprising speech of a user; receiving, from a camera of the client system, a visual input comprising one or more subjects; processing the speech to obtain a co-reference; processing the visual input to resolve, based at least in part on the co-reference, an entity corresponding to a specific subject of the one or more subjects; executing a task associated with the entity; and presenting, at the client system, information associated with the executed task.

Application 18/185,258, Claim 18 (compared against instant claim 15): A client system comprising: one or more processors; and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: receive, at a head-mounted device, a speech input from a user and a visual input captured by one or more cameras of the head-mounted device, wherein the speech input comprises a textual input spoken by the user, and wherein the visual input depicts a real-time view captured by the one or more cameras, wherein the real-time view comprises one or more visual concepts and one or more attributes associated with the one or more visual concepts, and wherein the textual speech input comprises a co-reference to one or more of the visual concepts; resolve, based on the one or more attributes and the co-reference, one or more entities corresponding to the one or more visual concepts associated with the co-reference; and present, at the head-mounted device, a communication content responsive to the speech input and the visual input, wherein the communication content comprises information associated with executing results of one or more tasks corresponding to the one or more resolved entities.

Examiner’s Note

5. Preliminary mappings of some relevant prior art references:

Mohajer et al, US 10,418,032: [Column 3, lines 10-20 (“One such step has been so identified as the co-reference resolution problem, which is generally concerned with finding that (say) a reference to a person (‘Mr. Smith’) points to the same entity as another (‘the man with the felt hat’). A number of approaches to co-reference resolution have been suggested in the literature in computational linguistics and discourse analysis”)] [Column 12, lines 50-65 (“combination of multimedia outputs produced by client output devices, including a voice spoken through a speaker”, i.e., audio output)] [Column 13, lines 44-61 (“where user query 310 is specifically speech, and the front end is speech recognizer”, i.e., speech input)].

Finkelstein et al, US 20180233139: [Paragraph 39 (“The entity tracker 100 is configured to detect entities and their activities, including people, animals, or other living things, as well as non-living objects. … Voice listener 30 receives audio data and utilizes speech recognition functionality to translate spoken utterances into text”)] [Paragraph 62 (“the graphing may include identifying one or more links (e.g., syntactic, semantic, co-reference, discourse, etc.) between nodes in the text”)] [Paragraphs 145-147 and 152 (“For example, the entity tracker may determine that the recorded speech received from the microphone corresponds to lip movements of the person visible to the camera when the speech was received, and thereby conclude with relatively high confidence, such as 92%, that the person visible to the camera is the person speaking. In this manner the entity tracker 100 may combine the confidence values of two or more predictions to identify a person with a combined, higher confidence value”, i.e., multimodal input)].

Wu, US 20220129513: [Paragraph 40 (“the content may be presented as visual and/or audible output”, i.e., visual and audio output)] [Paragraph 47 (“Also, for example, the user interface input device(s) 102 may include a microphone, a voice-to-text processor that is separate from the system 120 may convert voice input received at the microphone into textual input, and the textual input may be provided to the system”)] [Paragraphs 53-54 (“the named entity tagger may rely on annotations from the coreference resolver and/or dependency parser in annotating all mentions to a particular entity. Also, for example, in some implementations the coreference resolver”)].

Bui et al, US 20170228366: [Paragraphs 34 and 42 (“As used herein, “utterance” broadly encompasses any type of conversational input including, but not limited to, speech, text entry, touch, and gestures” and “a microphone for accepting audio input (i.e., spoken utterances), a keyboard (physical or virtual via a touchscreen) for accepting text input (i.e., typed utterances), and a camera for accepting visual input (i.e., visual utterances that can be provided using sign language or hand writing”, i.e., multimodal input)] [Paragraph 59 (“a display screen for producing visual output and/or an audio transducer (e.g., a speaker) for generating audible output (e.g., computer generated speech)”, i.e., audio and visual output)] [Paragraphs 64, 78 and 82 (“Dialog state tracker 118 comprises string-matcher 150, slot-value pruner 152, coreference resolver 154”)] [Paragraphs 40 and 41 (“such as an application having personal assistant functionality”)] [Paragraphs 30, 42, 83 and 84 (“topics” = subjects; “words” or “characters” = attributes)].

Claim Rejections - 35 USC § 102

6. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

7. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

8. Claims 2-21 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Bui et al (US 20170228366).
Claim 2: Bui suggests a method comprising, by a client system: receiving, at the client system via an assistant application, an audio input comprising speech of a user [Paragraphs 34 and 42 (“As used herein, “utterance” broadly encompasses any type of conversational input including, but not limited to, speech, text entry, touch, and gestures” and “a microphone for accepting audio input (i.e., spoken utterances), a keyboard (physical or virtual via a touchscreen) for accepting text input (i.e., typed utterances), and a camera for accepting visual input (i.e., visual utterances that can be provided using sign language or hand writing”, i.e., multimodal input)] [Paragraphs 40 and 41 (“such as an application having personal assistant functionality”)]. Bui suggests receiving, at the client system via the assistant application, a visual input comprising one or more subjects [Paragraphs 34 and 42, quoted above (multimodal input)]. Bui suggests determining that the speech comprises a co-reference to an entity associated with an attribute [Paragraphs 64, 78 and 82 (“Dialog state tracker 118 comprises string-matcher 150, slot-value pruner 152, coreference resolver 154”)]. Bui suggests analyzing the visual input to resolve, based at least in part on the attribute, the entity from the one or more subjects [Paragraphs 64, 78 and 82, quoted above] [Paragraphs 30, 42, 83 and 84 (“topics” = subjects; “words” or “characters” = attributes)]. Bui suggests determining that the speech comprises a request to perform a task associated with the entity; and executing the requested task [Paragraphs 64, 78 and 82, quoted above]. Bui suggests presenting, at the client system via the assistant application, an output associated with the executed task, wherein the output comprises audio information [Paragraph 59 (“a display screen for producing visual output and/or an audio transducer (e.g., a speaker) for generating audible output (e.g., computer generated speech)”, i.e., audio and visual output)].

Claim 3: Bui suggests wherein the output further comprises text information presented on a display of the client system [Paragraphs 39 and 59 (“client devices on a client-side of operating environment” and “a display screen for producing visual output and/or an audio transducer (e.g., a speaker) for generating audible output (e.g., computer generated speech)”)].

Claim 4: Bui suggests wherein the task is executed by an assistant system in communication with the assistant application [Paragraphs 40 and 41 (“such as an application having personal assistant functionality”)].

Claim 5: Bui suggests wherein the task comprises retrieving information about the entity from a service [Paragraphs 30, 42, 83, 84 and 118 (object and people)].
Claim 6: Bui suggests checking an authorization setting associated with the entity before executing the task [Paragraphs 28 and 29 (“Other rules described herein allow slot-value pairs to be identified for dialog states based on coreferences between utterances and previous utterances of a dialog session. A coreference is a linguistic phenomenon in which words or phrases refer to the same entity.”, i.e., making sure it is the same person who performs multimodal inputs)].

Claim 7: Bui suggests wherein the one or more subjects comprise a plurality of persons, the co-reference identifies a particular person of the plurality of persons, and the entity corresponds to the particular person [Paragraphs 28 and 29, quoted above (making sure it is the same person)].

Claim 8: Claim 8 is essentially the same as claim 2 and is rejected for the same reasons as applied above. See paragraphs 42 and 118 for the camera and paragraphs 49 and 56 for the real-time view.

Claim 9: Claim 9 is essentially the same as claim 3 and is rejected for the same reasons as applied above.

Claim 10: Bui suggests wherein the speech comprises the request from the user [Paragraphs 20 and 30].

Claim 11: Bui suggests wherein the task is executed by an assistant system [Paragraphs 40 and 41 (“such as an application having personal assistant functionality”)].

Claim 12: Bui suggests wherein the task comprises retrieving information about the entity from a service [Paragraphs 37 and 38 (“Internet” and “cloud”)].

Claim 13: Bui suggests wherein the entity is resolved based at least in part on an association between specific attributes of the specific subject and the co-reference [Paragraphs 28 and 29, quoted above (making sure it is the same person)].

Claim 14: Bui suggests wherein the one or more subjects comprise a plurality of persons, the co-reference identifies a particular person of the plurality of persons, and the entity corresponds to the particular person [Paragraphs 28 and 29, quoted above (making sure it is the same person)].

Claim 15: Claim 15 is essentially the same as claim 2 and is rejected for the same reasons as applied above. See paragraphs 42 and 118 for the camera and paragraphs 49 and 56 for the real-time view.

Claim 16: Claim 16 is essentially the same as claim 12 and is rejected for the same reasons as applied above.

Claim 17: Bui suggests wherein the information is presented via a visual modality [Paragraph 59, quoted above (audio and visual output)].
Claim 18: Bui suggests wherein the information is presented via an audio modality [Paragraph 59, quoted above (audio and visual output)].

Claim 19: Bui suggests wherein the entity is resolved based at least in part on an attribute associated with the specific subject [Paragraph 29 (“slot-value pair”)].

Claim 20: Bui suggests wherein the one or more subjects comprise a plurality of persons, the co-reference identifies a particular person of the plurality of persons, and the entity corresponds to the particular person [Paragraphs 28 and 29, quoted above (making sure it is the same person)].

Claim 21: Bui suggests checking a privacy setting associated with the particular person before executing the task [Paragraphs 28 and 29, quoted above (making sure it is the same person)] [Paragraph 37 (“private networks”)].

Pertinent Arts

9. Mosterman et al, US 20160239751, discloses multimodal input processing, wherein the detailed implementation comprises: receiving, using a processor of a computing device, an input, the input comprising a plurality of input elements; parsing the input, the parsing: performed using the processor, identifying a first element from among the plurality of input elements, and identifying a second element from among the plurality of input elements, wherein the first element and the second element are represented according to a formalism type defining a syntax for organizing elements of the formalism type; accessing a library comprising entries for different formalism types; evaluating a likelihood that the first element and the second element coexist together in a formalism type from among the different formalism types in the library, and selecting, based on the evaluating, a selected formalism type that is consistent with a coexistence of the first element and the second element; and generating, using the processor, an output, the output: representing a translation of at least part of the input into an output type that is associated with the selected formalism type.
Anandarajah, US 20150019227, discloses interlaced multimodal input processing, wherein the detailed implementation comprises: receiving verbal input using a verbal input interface of the computing device; receiving, concurrently with at least part of the verbal input, at least one secondary input using a non-verbal input interface of the computing device; identifying one or more target objects from the at least one secondary input; recognizing text from the received verbal input; generating an interaction object, the interaction object comprising a natural language expression having references to the one or more identified target objects embedded within the recognized text, the generating of the interaction object comprising identifying at least one attribute associated with each of the one or more identified target objects or at least one operation associated with each of the one or more identified target objects; processing the interaction object to identify at least one operation to be executed on at least one of the one or more identified target objects; and executing the operation on the at least one of the one or more identified target objects.

10. Any inquiry concerning this communication or earlier communications from the examiner should be directed to Hung D. Le, whose telephone number is 571-270-1404. The examiner can normally be reached Monday to Friday, 9:00 A.M. to 5:00 P.M. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Apu Mofiz, can be reached at 571-272-4080. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, contact 800-786-9199 (in USA or Canada) or 571-272-1000.

Hung Le, 08/19/2025
/HUNG D LE/
Primary Examiner, Art Unit 2161
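
To make the claim mapping above easier to follow, here is a minimal, self-contained sketch of the pipeline recited in instant claim 2: a co-reference in speech is resolved against attributes of subjects detected in a visual input, and a task then runs on the resolved entity. The keyword-matching logic and all names are illustrative assumptions, not code from the application or from the Bui reference:

```python
# Illustrative sketch of instant claim 2's pipeline. The toy keyword-matching
# "co-reference resolution" below is an assumption for demonstration only,
# not the application's actual algorithm.
from dataclasses import dataclass

@dataclass
class Subject:
    label: str
    attributes: set[str]   # attributes detected in the visual input

def resolve_coreference(speech: str, subjects: list[Subject]) -> Subject:
    """Resolve a co-reference like 'the red one' to the subject whose
    attribute is mentioned in the speech."""
    words = set(speech.lower().split())
    for subject in subjects:
        if subject.attributes & words:   # an attribute appears in the speech
            return subject
    raise LookupError("co-reference not resolvable from the visual input")

# Audio input: "what is the red one?"; visual input: two detected subjects.
subjects = [Subject("mug", {"red"}), Subject("bottle", {"blue"})]
entity = resolve_coreference("what is the red one?", subjects)
print(f"Resolved entity: {entity.label}")   # -> mug; a task then runs on it
```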

Prosecution Timeline

Sep 23, 2024: Application Filed
Nov 15, 2024: Response after Non-Final Action
Aug 19, 2025: Non-Final Rejection (§102, §DP)
Oct 06, 2025: Applicant Interview (Telephonic)
Oct 16, 2025: Examiner Interview Summary
Nov 17, 2025: Response Filed
Dec 19, 2025: Final Rejection (§102, §DP)
Mar 20, 2026: Examiner Interview Summary
Mar 20, 2026: Applicant Interview (Telephonic)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12596684: SYSTEMS AND METHODS FOR SEARCHING DEDUPLICATED DATA (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596724: SYSTEMS AND METHODS FOR USE IN REPLICATING DATA (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596736: SYSTEMS AND METHODS FOR USING PROMPT DISSECTION FOR LARGE LANGUAGE MODELS (granted Apr 07, 2026; 2y 5m to grant)
Patent 12591489: POINT-IN-TIME DATA COPY IN A DISTRIBUTED SYSTEM (granted Mar 31, 2026; 2y 5m to grant)
Patent 12585625: SYSTEM AND METHOD FOR IMPLEMENTING A DATA QUALITY FRAMEWORK AND ENGINE (granted Mar 24, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 90%
With Interview: 97% (+6.4%)
Median Time to Grant: 2y 6m
PTA Risk: Moderate
Based on 1073 resolved cases by this examiner. Grant probability derived from career allow rate.
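
As a cross-check, the with-interview figure is consistent with simply adding the interview lift to the career allow rate (an assumption about the tool's model, which is not disclosed):

```python
# Cross-check of the projection card, assuming "with interview" is the
# career allow rate plus the interview lift (the tool's model may differ).
base = 969 / 1073                 # career allow rate ≈ 90.3%
with_interview = base + 0.064     # +6.4% interview lift
print(f"{base:.1%} base -> {with_interview:.1%} with interview")  # ≈ 97%
```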
