Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
This Office action is in response to application 18/634,991, filed on 04/14/2024. Claims 1-3 are pending in the application and have been considered.
Specification
The abstract of the disclosure is objected to because it is not within the range of 50 to 150 words and fails to describe the disclosure sufficiently to assist readers in deciding whether there is a need for consulting the full patent text for details. Correction is required. See MPEP § 608.01(b).
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
The following title is suggested: “Video-Assisted Background Noise Filtering System Based on Multimodal AI”.
Claim Objections
Claim 1 is objected to because of the following informalities: in line 5, should “a set of virtual agent” be “a set of virtual agents”? Appropriate correction is required.
Claim 1 is objected to because of the following informalities: in lines 12-13, should “the virtual agent serves to interact users” be “the virtual agent serves to interact with users”? Appropriate correction is required.
In claim 1, line 19, should “comprising” be “comprises”?
In claim 1, line 28, the text “wherein the lip movements,” appears to be extraneous or incomplete.
In claim 2, line 7, should “the set of virtual agent” be “the set of virtual agents”?
In claim 2, line 16, should “the set of virtual agent” be “the set of virtual agents”?
In claim 2, line 18, should “wherein a set of virtual agents coupled to the one or more cameras” be “wherein the set of virtual agents is coupled to the one or more cameras”?
In claim 2, line 22, the text “wherein the lip movements,” appears to be extraneous or incomplete.
In claim 2, line 25, should “exists” be “existing”?
In claim 3, line 6, should “these signals” be “these inputs”?
In claim 3, line 21, should “will only listed in the situation” be “will only listen in the situation”?
In claim 3, line 23, should “wherein multiple users are allowed the system” be “wherein if multiple users are allowed, the system”?
In claim 3, line 24, should “that interacts” be “that interact”?
In claim 3, line 24, should “the sessions” be “the session”?
In claim 3, lines 25, 26, 28, should “the single mode” be “the solo-user mode”?
In claim 3, second to last line, should “multimodal” be “the multimodal system”?
In claim 3, last line, should “multimodal” be “multimodal input”?
As seen above, the claims are replete with minor mistakes. Although the examiner has attempted to identify and suggest corrections for as many as possible, Applicant’s assistance is respectfully requested in ensuring that the claims are free of such informalities.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
Claims 1-3 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
Claim 1 recites the limitation "the one or more processors" in line 8. There is insufficient antecedent basis for this limitation in the claim.
Claim 1 recites the limitation "the user" in line 12. There is insufficient antecedent basis for this limitation in the claim.
Claim 1 recites the limitation "the set of customer-facing virtual agent" in lines 12-13. There is insufficient antecedent basis for this limitation in the claim.
Claim 1 recites the limitation "the virtual agent" in line 13. There is insufficient antecedent basis for this limitation in the claim.
Claim 1 recites the limitation "the artificial intelligence engine" in line 14. There is insufficient antecedent basis for this limitation in the claim.
Claim 2 recites the limitation "the set of virtual agent" in line 4. There is insufficient antecedent basis for this limitation in the claim.
Claim 2 recites the limitation "the user" in line 10. There is insufficient antecedent basis for this limitation in the claim.
Claim 2 recites the limitation "the set of customer-facing virtual agent" in line 10. There is insufficient antecedent basis for this limitation in the claim.
Claim 2 recites the limitation "the artificial intelligence engine" in lines 11-12. There is insufficient antecedent basis for this limitation in the claim.
Claim 3 recites the limitation "the representation" in line 16. There is insufficient antecedent basis for this limitation in the claim.
Claim 3 recites the limitation "the individual" in line 18. There is insufficient antecedent basis for this limitation in the claim.
Claim 3 recites the limitation "the person" in line 21. There is insufficient antecedent basis for this limitation in the claim.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1 and 2 are rejected under 35 U.S.C. 103 as being unpatentable over Morrison et al. (US 20110257971) in view of Mehmeri (US 20250028579), in further view of Rule et al. (US 20200211532), and in further view of Pearce (US 9990926).
Consider claim 1, Morrison discloses a background noise filtering system based on multimodal AI (noise cancellation module filters out portions of the original audio input that contain speech contributions from other speakers and ambient noise when the identified speaker is silent based on facial image sequence portions, [0033], [0034], based on the identification of whether the speaker is speaking, which is determined based on whether there is lip movement, [0026], using a trained classifier, i.e. artificial intelligence, [0021], the speech recognition and facial recognition making up “multimodal AI”), comprising:
a server (server 104, [0012], Fig. 1);
one or more cameras coupled to a server (camera 110, [0015], Fig. 1);
one or more microphones coupled to the server (microphone, [0045], Fig. 1);
a set of virtual agent coupled to the one or more cameras and server (electronic device responding to commands such as “compose email” and starting an email application, considered a set of one virtual agent, [0030], and having camera, [0015], connected to server, [0012]);
and a device coupled to the server, wherein the device comprising an artificial intelligence engine (speech recognition module and visual interpretation module perform speech and facial recognition using artificial intelligence, Fig. 1, elements 156, 158, [0017], [0018], [0021]) and one or more processors and memory storing instructions that (processor 148 executes instructions from memory 150, Fig. 1, [0046]), when executed by one of the processors, cause the device to:
obtain in real-time, from any of the one or more cameras, a set of videos of an individual of a plurality of individuals at a location (recording frames across time, including real-time, as a set of videos of an individual speaker, [0009], from among multiple speakers, [0018]);
select, from the set of videos for each individual, a preferred facial image for the individual (facial recognition module tracks the speaker across time, [0021-0023], the frames across time considered to make up a set of videos of an individual speaker, [0009], and digital image processing is used to crop a particular portion of the speaker’s face, i.e. select a preferred facial image, [0023-0024]);
determine whether lip movement of one of the individuals is visible in the set of images (detecting active movement of the lips, [0026]);
select, based on whether the lip movement of one of the individuals is visible in the set of images, at least one of a facial recognition algorithm and an audio algorithm to determine which individual is speaking, wherein the lip movements (based on detecting lip movement, noise cancellation module transforms the audio, i.e. selects an audio algorithm, into a modified audio input that includes the speech utterances of the speaker, i.e. determining which speaker is speaking, interspersed with moments of silence, [0026]);
record audio from the one of the individuals by one or more microphones (microphone captures audio from the environment, [0015], including spoken speech from one or more speakers, [0018]);
compare the preferred facial image of the one of the individuals and pre-recorded facial images of the one of the individuals (facial recognition module compares facial image, i.e. the preferred facial image, to reference example facial images by way of a trained classifier, [0021]);
determine identification of the one of the individuals who is speaking (noise cancellation module makes an identification of whether the individual whose face is being recognized and tracked is currently speaking, [0021], [0023], [0026]); and
filter, based on whether the lip movement of one of the individuals is visible in the set of images and the identification of the one of the individuals, other sounds from the others of the plurality of individuals, and background sounds (noise cancellation module filters out portions of the original audio input that contain speech contributions from other speakers and ambient noise when the identified speaker is silent, based on facial image sequence portions, [0033], [0034], and based on the identification of whether the speaker is speaking, which is determined based on whether there is lip movement, [0026]).
Morrison does not specifically mention wherein the set of virtual agent are configured to be displayed in LED/OLED displays, Android/iOS tablets, Laptops/PCs, smartphones, or VR/AR goggles, wherein a set of multi-layer info panels coupled to the one or more processors are configured to overlay graphics on top of the set of virtual agents, wherein any of the set of virtual agent are configured to be displayed with an appearance of an actual human or a humanoid or a cartoon character or an animated talking object, wherein any of the set of customer-facing virtual agent is configured to be displayed in whole or half body portrait mode, wherein the virtual agent serves to interact users, wherein the artificial intelligence engine is configured for real-time speech recognition, speech-to-text generation, real-time dialog generation, text-to-speech generation, real-time lip animation to sync with speech, and avatar generation, wherein the artificial intelligence engine is configured to emulate different voices and use different languages;
obtain in real-time, by one or more cameras, a set of videos of a plurality of individuals at a location.
Mehmeri discloses wherein the set of virtual agents are configured to be displayed in LED/OLED displays, Android/iOS tablets, Laptops/PCs, smartphones, or VR/AR goggles (client devices may be a desktop computer, laptop computer, smartphone, tablet, etc., [0177], and the UI displays video of other participants, [0057-0058], including the virtual assistant, considered a set of one virtual agent, which is coupled to the server, [0051], [0052], Fig. 2B), wherein a set of multi-layer info panels coupled to the one or more processors are configured to overlay graphics on top of the set of virtual agents (see text bubble overlaid on image of virtual assistant in Fig. 2B, [0105]), wherein any of the set of virtual agent are configured to be displayed with an appearance of a real human or a humanoid or a cartoon character (real human appearance, Fig. 2B, [0105]), wherein any of the set of customer-facing virtual agent is configured to be displayed in whole body or half body portrait mode (half body mode, Fig. 2B, [0105]), wherein the artificial intelligence engine is configured for real-time speech recognition, speech to text generation, real-time dialog generation, text to speech generation, voice-driven animation, and human avatar generation (other participants initiate and engage in a discussion with virtual assistant avatar in real time, [0047], [0069], for which each of real-time speech recognition, speech to text generation, real-time dialog generation, text to speech generation, voice-driven animation, and human avatar generation is inherent), wherein the artificial intelligence engine is configured to emulate different voices and use different languages (virtual assistant having regional variations and accents, as well as multilingual support, [0110]), wherein a device with an artificial intelligence engine is configured to be connected to one or more cameras and the set of virtual agent (AI model 145 is connected via network to client devices with cameras, Fig. 1, [0072]);
obtain in real-time, by one or more cameras, a set of videos of a plurality of individuals at a location (video cameras of client devices of participants at client device locations, Fig. 1, [0058]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Morrison such that the set of virtual agent are configured to be displayed in LED/OLED displays, Android/iOS tablets, Laptops/PCs, smartphones, or VR/AR goggles, wherein a set of multi-layer info panels coupled to the one or more processors are configured to overlay graphics on top of the set of virtual agents, wherein any of the set of virtual agent are configured to be displayed with an appearance of an actual human or a humanoid or a cartoon character or an animated talking object, wherein any of the set of customer-facing virtual agent is configured to be displayed in whole or half body portrait mode, wherein the virtual agent serves to interact users, wherein the artificial intelligence engine is configured for real-time speech recognition, speech-to-text generation, real-time dialog generation, text-to-speech generation, real-time lip animation to sync with speech, and avatar generation, wherein the artificial intelligence engine is configured to emulate different voices and use different languages; and obtain in real-time, by one or more cameras, a set of videos of a plurality of individuals at a location in order to reduce meeting duration, as suggested by Mehmeri ([0045]). Doing so would have led to predictable results of increased overall efficiency of the system, as suggested by Mehmeri ([0045]). The references cited are analogous art in the same field of speech recognition.
Morrison and Mehmeri do not specifically mention wherein any of the set of virtual agents' gender, age and ethnicity is determined by the artificial Intelligence's analysis on input from the user.
Rule discloses wherein any of the set of virtual agents' gender, age and ethnicity is determined by the artificial Intelligence's analysis on input from the user (data analysis module analyzes pitch, tone, and cadence of user speech and identifies a gender and age of the user, [0066], and instructs virtual assistant module to synthesize speech to match the gender and age of the user characteristics, [0067]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Morrison and Mehmeri such that any of the set of virtual agents' gender, age and ethnicity is determined by the artificial Intelligence's analysis on input from the user in order to reflect a style of communication of a user communicating with the system, as suggested by Rule ([0003]), predictably improving the user experience, as suggested by Rule ([0067]). The references cited are analogous art in the same field of speech recognition.
Morrison, Mehmeri, and Rule do not specifically mention comparing the audio from the one of the individuals and pre-recorded audio that belongs to the one of the individuals.
Pearce discloses comparing the audio from the one of the individuals and pre-recorded audio that belongs to the one of the individuals (if the speaker has already been enrolled, the speech sample is matched to the sample of the enrolled speaker, i.e. compared to a pre-recorded sample, Col 8, lines 53-61).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Morrison, Mehmeri, and Rule by comparing the audio from the one of the individuals and pre-recorded audio that belongs to the one of the individuals in order to avoid the loss of accuracy from people not talking normally during active enrollment, as suggested by Pearce (Col 3, lines 39-45), while predictably saving time, overhead, and commitment on behalf of the user speakers, as suggested by Pearce (Col 3, lines 39-45). The references cited are analogous art in the same field of speech recognition.
Consider claim 2, Morrison discloses a method to identify speakers and filter background noise with Artificial intelligence (noise cancellation module filters out portions of the original audio input that contain speech contributions from other speakers and ambient noise when the identified speaker is silent based on facial image sequence portions, [0033], [0034], based on the identification of whether the speaker is speaking, which is determined based on whether there is lip movement, [0026], using a trained classifier, i.e. artificial intelligence, [0021]) comprising:
obtaining in real-time, from any of the one or more cameras, a set of videos of an individual of a plurality of individuals at a location (recording frames across time, including real time, as a set of videos of an individual speaker, [0009], from among multiple speakers, [0018]);
selecting, from the set of videos for each individual, a preferred facial image for the individual (facial recognition module tracks the speaker across time, [0021-0023], the frames across time considered to make up a set of videos of an individual speaker, [0009], and digital image processing is used to crop a particular portion of the speaker’s face, i.e. select a preferred facial image, [0023-0024]), wherein a set of virtual agents coupled to the one or more cameras (electronic device responding to commands such as “compose email” and starting an email application, considered a virtual agent, [0030], and having camera, [0015]);
determining whether lip movement of one of the individuals is visible in the set of images (detecting active movement of the lips, [0026]);
selecting, based on whether the lip movement of one of the individuals is visible in the set of images, at least one of a facial recognition algorithm and an audio algorithm to determine which individual is speaking, wherein the lip movements (based on detecting lip movement, noise cancellation module transforms the audio, i.e. selects an audio algorithm, into a modified audio input that includes the speech utterances of the speaker, i.e. determining which speaker is speaking, interspersed with moments of silence, [0026]);
recording audio from the one of the individuals by one or more microphones (microphone captures audio from the environment, [0015], including spoken speech from one or more speakers, [0018]);
comparing the preferred facial image of the one of the individuals and pre-recorded facial images of the one of the individuals (facial recognition module compares facial image, i.e. the preferred facial image, to reference example facial images by way of a trained classifier, [0021]);
determining identification of the one of the individuals who is speaking (noise cancellation module makes an identification of whether the individual whose face is being recognized and tracked is currently speaking, [0021], [0023], [0026]); and
filtering, based on whether the lip movement of one of the individuals is visible in the set of images and the identification of the one of the individuals, other sounds from the others of the plurality of individuals, and background sounds (noise cancellation module filters out portions of the original audio input that contain speech contributions from other speakers and ambient noise when the identified speaker is silent, based on facial image sequence portions, [0033], [0034], and based on the identification of whether the speaker is speaking, which is determined based on whether there is lip movement, [0026]).
Morrison does not specifically mention obtaining in real-time, by one or more cameras, a set of videos of a plurality of individuals at a location, wherein the set of virtual agent are configured to be displayed in LED/OLED displays, Android/iOS tablets, Laptops/PCs, smartphones, or VR/AR goggles, wherein a set of multi-layer info panels coupled to the one or more processors are configured to overlay graphics on top of the set of virtual agents, wherein any of the set of virtual agent are configured to be displayed with an appearance of a real human or a humanoid or a cartoon character, wherein any of the set of customer-facing virtual agent is configured to be displayed in whole body or half body portrait mode, wherein the artificial intelligence engine is configured for real-time speech recognition, speech to text generation, real-time dialog generation, text to speech generation, voice-driven animation, and human avatar generation, wherein the artificial intelligence engine is configured to emulate different voices and use different languages, wherein a device with an artificial intelligence engine is configured to be connected to one or more cameras and the set of virtual agent.
Mehmeri discloses obtaining in real-time, by one or more cameras, a set of videos of a plurality of individuals at a location (video cameras of client devices of participants at client device locations, Fig. 1, [0058]), wherein the set of virtual agents are configured to be displayed in LED/OLED displays, Android/iOS tablets, Laptops/PCs, smartphones, or VR/AR goggles (client devices may be a desktop computer, laptop computer, smartphone, tablet, etc., [0177], and the UI displays video of other participants, [0057-0058], including the virtual assistant, considered a set of one virtual agent, Fig. 2B), wherein a set of multi-layer info panels coupled to the one or more processors are configured to overlay graphics on top of the set of virtual agents (see text bubble overlaid on image of virtual assistant in Fig. 2B, [0105]), wherein any of the set of virtual agent are configured to be displayed with an appearance of a real human or a humanoid or a cartoon character (real human appearance, Fig. 2B, [0105]), wherein any of the set of customer-facing virtual agent is configured to be displayed in whole body or half body portrait mode (half body mode, Fig. 2B, [0105]), wherein the artificial intelligence engine is configured for real-time speech recognition, speech to text generation, real-time dialog generation, text to speech generation, voice-driven animation, and human avatar generation (other participants initiate and engage in a discussion with virtual assistant avatar in real time, [0047], [0069], for which each of real-time speech recognition, speech to text generation, real-time dialog generation, text to speech generation, voice-driven animation, and human avatar generation is inherent), wherein the artificial intelligence engine is configured to emulate different voices and use different languages (virtual assistant having regional variations and accents, as well as multilingual support, [0110]), wherein a device with an artificial intelligence engine is configured to be connected to one or more cameras and the set of virtual agent (AI model 145 is connected via network to client devices with cameras, Fig. 1, [0072]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Morrison by obtaining in real-time, by one or more cameras, a set of videos of a plurality of individuals at a location, wherein the set of virtual agent are configured to be displayed in LED/OLED displays, Android/iOS tablets, Laptops/PCs, smartphones, or VR/AR goggles, wherein a set of multi-layer info panels coupled to the one or more processors are configured to overlay graphics on top of the set of virtual agents, wherein any of the set of virtual agent are configured to be displayed with an appearance of a real human or a humanoid or a cartoon character, wherein any of the set of customer-facing virtual agent is configured to be displayed in whole body or half body portrait mode, wherein the artificial intelligence engine is configured for real-time speech recognition, speech to text generation, real-time dialog generation, text to speech generation, voice-driven animation, and human avatar generation, wherein the artificial intelligence engine is configured to emulate different voices and use different languages, wherein a device with an artificial intelligence engine is configured to be connected to one or more cameras and the set of virtual agent for reasons similar to those for claim 1.
Morrison and Mehmeri do not specifically mention wherein any of the set of virtual agents' gender, age and ethnicity is determined by the artificial Intelligence's analysis on input from the user.
Rule discloses wherein any of the set of virtual agents' gender, age and ethnicity is determined by the artificial Intelligence's analysis on input from the user (data analysis module analyzes pitch, tone, and cadence of user speech and identifies a gender and age of the user, [0066], and instructs virtual assistant module to synthesize speech to match the gender and age of the user characteristics, [0067]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Morrison and Mehmeri such that any of the set of virtual agents' gender, age and ethnicity is determined by the artificial Intelligence's analysis on input from the user for reasons similar to those for claim 1.
Morrison, Mehmeri, and Rule do not specifically mention comparing the audio from the one of the individuals and pre-recorded audio that belongs to the one of the individuals if there is a pre-recorded audio exists;
saving the audio from the one of the individuals with a tag attached to the one of the individuals if there is no pre-recorded audio exists.
Pearce discloses comparing the audio from the one of the individuals and pre-recorded audio that belongs to the one of the individuals if there is a pre-recorded audio exists (if the speaker has already been enrolled, the speech sample is matched to the sample of the enrolled speaker, i.e. compared to a pre-recorded sample, Col 8, lines 53-61);
saving the audio from the one of the individuals with a tag attached to the one of the individuals if there is no pre-recorded audio exists (if the speaker is not enrolled, the speech sample is parsed into a keyword phrase sample and a command phrase sample, an identifying label is generated, and the command phrase sample with the identifying label, i.e. a tag attached to the one of the individuals, is inserted into the list of unenrolled command phrase samples, Col 8, line 62 to Col 9, line 10).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Morrison, Mehmeri, and Rule by comparing the audio from the one of the individuals and pre-recorded audio that belongs to the one of the individuals if there is a pre-recorded audio exists; and saving the audio from the one of the individuals with a tag attached to the one of the individuals if there is no pre-recorded audio exists for reasons similar to those for claim 1.
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Morrison et al. (US 20110257971) in view of Iyengar et al. (US 20240169984).
Consider claim 3, Morrison discloses a multimodal lip-sync background noise filtering system (filtering background noise by analyzing audio and video input to identify audio where lip movement is in sync, [0026]), comprising:
A virtual agent that is available for one or more users (electronic device responding to commands from a user such as “compose email” and starting an email application, considered a virtual agent, [0030]),
One or more cameras and one or more microphones (camera and microphone, [0015]),
wherein the one or more users interact via the one or more cameras and microphones that capture real-time inputs of its surroundings (cameras and microphones capture video and audio of the environment, [0015], which includes multiple speakers, [0018]),
wherein upon the one or more users activating the virtual agent, a speaker's face and voice are captured (facial image input and audio input, Fig. 1, [0018], [0021], and upon activating the virtual agent by means of a command such as “compose email”, the system listens and watches for further audio/video corresponding to the composition of the email, [0030]),
wherein the speaker is among the one or more users (image and spoken audio from one or more speakers, [0018], [0021]),
wherein these signals are used for the speaker re-identification (repeatedly identifying segments of audio in which the user is speaking, i.e. re-identification, [0026]),
An AI engine that couples to the virtual agent and the one or more cameras and microphones, wherein the AI engine uses re-identification to determine whether a given input audio signal is from the speaker(s) of interest (facial and audio recognition, which employ trained classifiers, i.e. AI, are coupled to electronic device and repeatedly identify segments of audio in which the speaker identified in the images is speaking, [0018], [0021], [0026]),
wherein background noise will be filtered out if any of the one or more users are not speaking in the system's field of view (background noise is filtered when the images show the speaker was not speaking, [0026]),
wherein a session starts when any of the one or more users are visually detected in front of the system (detected via facial recognition, [0021]),
wherein the AI engine captures face and speech samples from the speaker to later perform re-identification (repeatedly identifying segments of audio in which the user is speaking using captured images and audio samples, i.e. re-identification, [0026]),
wherein the AI engine's confidence is a function of the confidence of the re-identification recognition mechanism and the lip-sync detection mechanism (confidence score being the higher of visual transformation and audio transformation confidence scores, [0039-0040]),
wherein the face and speech samples are captured and encoded until the representation optimally discriminates (discriminating the symbol sequence with the highest confidence score, [0039-0040]),
wherein during a session, the AI engine decides whether a given input audio is actual speech input for the virtual agent to interact with, provided that the individual currently using the system is visually speaking, upon validating that a speaker is the current user by comparing the visual and audio samples previously captured (repeatedly discriminating audio which is speech from the speaker in the images from background noise segments using previously captured video and audio samples, [0026], to discern commands such as “compose email” for the virtual agent, [0030] from other speech and noise),
wherein the session can be configured to one solo user or multiple users (the audio inputs may include the spoken speech from one or more speakers, [0018]),
wherein the solo-user mode will only listed in the situation that the person that initiates the session is actively speaking such that the system can detect their lip movement upon re-identifying (repeatedly identifying segments of audio in which the user is speaking using captured images and audio samples, i.e. re-identification, [0026]),
wherein the sessions can consist of a single or multiple interactions (e.g. a “compose email” command, i.e. a single interaction, followed by speaking the composition of the email, considered another interaction, [0030]),
wherein single-mode persisting over multiple sessions can configure the virtual agent to only interact with that user (e.g. a “compose email” command, i.e. a first session, followed by speaking the composition of the email, considered another session, [0030], the system filtering out speech including commands from other users, [0026]),
wherein single mode for a single session ensures the virtual agent does not mistakenly respond to side conversations of bystanders of the individual using the system (the system filtering out speech including commands from other users, [0026], [0030]),
wherein a mechanism ensures that audio noise is not mistaken as input prompts for the one or more users (noise filtering mechanism ensuring that the system filters out speech including commands, i.e. input prompts, from other users, [0026], [0030]),
wherein speech from those around, but not using, the system, background music, or any other signal not intended to prompt the virtual agent can be considered noise (speech from other speakers and background noise is considered noise and filtered out, [0018], [0026], the examiner noting the term “or” not necessarily requiring filtering out background music, although this would also be filtered as it is not speech from the speaker in the video),
wherein multimodal can infer that the speaker of interest is prompting the virtual agent (audio and video are recognized to detect that the speaker is speaking a command, i.e. prompt for the virtual agent to launch email application, [0018], [0021], [0026], [0030]),
wherein multimodal comprises video and audio signals (audio and video signals, [0018], [0021]).
Morrison does not specifically mention:
wherein multiple users are allowed the system extends the re-identification to unique users that interacts in a given session,
wherein the single-mode has a database reset each time it starts a new conversation, and multiple modes persist over time with a growing database.
Iyengar discloses wherein multiple users are allowed the system extends the re-identification to unique users that interacts in a given session (identifying the user by recognizing the voice of the user, [0011], for N number of unique users, [0065], [0067], across multiple conversations, [0078]),
wherein the single-mode has a database reset each time it starts a new conversation (the processor is configured to clear conversation history and initiate a new conversation, [0009]), and multiple modes persist over time with a growing database (opinion sets for the N users are saved and grow over time, [0040], [0061], [0073]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Morrison such that multiple users are allowed the system extends the re-identification to unique users that interacts in a given session, wherein the single-mode has a database reset each time it starts a new conversation, and multiple modes persist over time with a growing database in order to reduce conversation system architecture complexity, as suggested by Iyengar ([0003]), predictably resulting in saving computing resources, as suggested by Iyengar ([0003]). The references cited are analogous art in the same field of speech recognition.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20200228358 Rampton discloses integrating a virtual assistant into a multi-party conference.
US 11082460 Nesta discloses audio source enhancement facilitated using video data.
US 20200145616 Nassar discloses enhancing meeting participation by an interactive virtual assistant.
US 20210358188 Lebaredian discloses a conversational AI platform with rendered graphical output.
US 20220051663 Sharifi discloses a digital assistant with a transient personalization mode that temporarily applies personalization to responses for users with a guest account.
US 20190005976 Peleg discloses video assisted speaker separation.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jesse Pullias whose telephone number is 571/270-5135. The examiner can normally be reached on M-F 8:00 AM - 4:30 PM. The examiner’s fax number is 571/270-6135.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached on 571/272-7516.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Jesse S Pullias/
Primary Examiner, Art Unit 2655 11/17/25