DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Specification
The disclosure is objected to because of the following informalities:
The specification uses the phrases "predicted spoken words" and "predicted next words" in a non-standard manner that could lead one of ordinary skill in the art to misinterpret the scope of the invention. For examination purposes, these concepts will be interpreted as "determined spoken words" and "determined next words," since the concept of "predicted spoken words," in which the system predicts a future word that has not yet been spoken, does not appear to have support in the disclosure. The examiner recommends amending the specification to replace the term "predicted" with a term such as "determined" or "identified".
Appropriate correction is required.
Claim Objections
Claims 1, 3, 8, 10, 15 and 17 are objected to because of the following informalities:
The claims use the phrases "predicted spoken words" and "predicted next words" in a non-standard manner that could lead one of ordinary skill in the art to misinterpret the scope of the invention. For examination purposes, these concepts will be interpreted as "determined spoken words" and "determined next words," since the concept of "predicted spoken words," in which the system predicts a future word that has not yet been spoken, does not appear to have support in the disclosure. The examiner recommends amending the claims to replace the term "predicted" with a term such as "determined" or "identified".
Appropriate correction is required.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-5, 8-12, and 15-18 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by PARK; Minjung et al. (US 20230230593 A1), hereinafter PARK.
Regarding claim 1, Park teaches:
A method, comprising: transforming a plurality of audio signals associated with at least one human speaker to respective ones of a plurality of first sets of predicted spoken words using at least one speech-to-text conversion model;
“In operation 409, the processor 120 may process the speech of the utterance target in different manners, based on the audio and the image. When the utterance target is the user, the processor 120 may perform deep learning by matching the audio of the user and the mouth shape of the user, and perform a function according to the utterance of the user based on the deep learning. The processor 120 may recognize speech of the audio (or audio signal) of the user to convert the same into a first text corresponding to the speech, and analyze the mouth shape of the user to convert the same into a second text corresponding to a change of the mouth shape (for example, movement of a mouth).” (PARK[0110]).
“According to various embodiments, the processor 120 may recognize the speech of the audio of the conversation counterpart by utilizing an audio filter specified (or personalized) for the utterance of the conversation counterpart. The processor 120 may collect utterances of the conversation counterpart for a predetermined period of time (for example, one day, seven days, thirty days, and the like) and analyze (or learn) the collected utterances of the conversation counterpart so as to generate a conversation counterpart-specific audio filter suitable for the conversation counterpart. The conversation counterpart may be multiple people, and the processor 120 may generate a conversation counterpart-specific audio filter only for the conversation counterpart (for example, a family member, or close friend) designated by the user. The processor 120 may store the generated conversation counterpart-specific audio filter in the memory (for example, the memory 130 in FIG. 1). The processor 120 may keep updating the conversation counterpart-specific audio filter stored in the memory 130 whenever the utterance of the conversation counterpart is collected.” (PARK[0114]).
transforming at least one video signal associated with the at least one human speaker to one or more second sets of predicted spoken words using at least one lip motion-to-text conversion model, wherein the one or more second sets of predicted spoken words are based at least in part on an analysis of at least one image associated with a respective human speaker;
“In operation 409, the processor 120 may process the speech of the utterance target in different manners, based on the audio and the image. When the utterance target is the user, the processor 120 may perform deep learning by matching the audio of the user and the mouth shape of the user, and perform a function according to the utterance of the user based on the deep learning. The processor 120 may recognize speech of the audio (or audio signal) of the user to convert the same into a first text corresponding to the speech, and analyze the mouth shape of the user to convert the same into a second text corresponding to a change of the mouth shape (for example, movement of a mouth).” (PARK[0110]).
iteratively adjusting a steering vector associated with the plurality of audio signals to compare at least one of the plurality of first sets of predicted spoken words with at least one of the one or more second sets of predicted spoken words;
“An electronic device according to various example embodiments of the disclosure may include: multiple cameras arranged at different positions, multiple microphones arranged at different positions, a memory, and a processor operatively connected to at least one of the multiple cameras, the multiple microphones, and the memory, wherein the processor is configured to: determine, using the multiple cameras, whether at least one of a user wearing the electronic device or a counterpart having a conversation with the user makes an utterance, configure directivity of the multiple microphones based on the determination, obtain audio from at least one of the multiple microphones based on the configured directivity, obtain an image including a mouth shape of the user or the counterpart from at least one of the multiple cameras, and process speech of an utterance target in a different manner based on the obtained audio and image.” (PARK[0007]).
“The multiple microphones 340 may include at least one of a first microphone 260, a second microphone 270, and a third microphone 280. The processor 120 may configure (or determine) directivity of the multiple microphones 340 based on an utterance target. The directivity of a microphone indicates an attribute of detecting an audio signal input through a microphone according to a distance and direction, and may include an attribute of representing sensitivity according to a direction with reference to the front side of a microphone. At least one of the first microphone 260 to the third microphone 280 may obtain, as an audio, speech of a user or speech of a counterpart (for example, conversation counterpart) having a conversation with the user. The processor 120 may perform speech recognition based on the audio obtained through at least one of the first microphone 260 to the third microphone 280.” (PARK[0084]).
and selecting an isolated human speaker audio signal associated with a particular one of the plurality of first sets of predicted spoken words and a particular one of the one or more second sets of predicted spoken words, wherein the selection is based at least in part on a result of the comparison;
“In operation 607, the processor 120 may perform deep learning by matching the audio and the mouth shape. The processor 120 may match the speech of the user and the mouth shape of the user. The processor 120 may recognize speech of the audio (or audio signal) of the user to convert the same into a first text corresponding to the speech, and analyze a change of the mouth shape of the user to convert the same into a second text corresponding to a change of the mouth shape (for example, movement of a mouth). The processor 120 may determine an accurate utterance content of the user through the comparison of the first text and the second text. The deep learning is a type of machine learning technology and may indicate learning data obtained for a predetermined period of time to output a result. The processor 120 may learn the speech of the user and the mouth shape of the user for the same text by matching the result of converting user’s speech collected for a predetermined period of time into a text and the result of converting user’s mouth shape collected for a predetermined period of time into a text.” (PARK[0123]).
wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
“Various embodiments of the disclosure provide a method and a device which includes multiple cameras arranged at different positions, multiple microphones arranged at different positions, a memory, and a processor operatively connected to at least one of the multiple cameras, the multiple microphones, and the memory, wherein the processor is configured to: determine, using at least one of the multiple cameras, whether at least one of a user wearing the electronic device or a counterpart having a conversation with the user makes an utterance, configure directivity of at least one of the multiple microphones based on the determination, obtain an audio from at least one of the multiple microphones based on the configured directivity, obtain an image including a mouth shape of the user or the counterpart from at least one of the multiple cameras, and process speech of an utterance target in a different manner based on the obtained audio and the image.” (PARK[Abstract]).
Regarding claim 2, Park teaches:
The method of claim 1, wherein the plurality of audio signals is obtained from a multi-directional microphone array.
“An electronic device according to various example embodiments of the disclosure may include: multiple cameras arranged at different positions, multiple microphones arranged at different positions, a memory, and a processor operatively connected to at least one of the multiple cameras, the multiple microphones, and the memory, wherein the processor is configured to: determine, using the multiple cameras, whether at least one of a user wearing the electronic device or a counterpart having a conversation with the user makes an utterance, configure directivity of the multiple microphones based on the determination, obtain audio from at least one of the multiple microphones based on the configured directivity, obtain an image including a mouth shape of the user or the counterpart from at least one of the multiple cameras, and process speech of an utterance target in a different manner based on the obtained audio and image.” (PARK[0007]).
Regarding claim 3, Park teaches:
The method of claim 1, further comprising validating the isolated human speaker audio signal over time by evaluating one or more predicted next words of the isolated human speaker audio signal with a corresponding set of predicted spoken words from the at least one speech-to-text conversion model.
“According to various embodiments, the processor 120 may recognize the speech of the audio of the user by utilizing an audio filter specified (or personalized) for the utterance of the user. Even when the same word is uttered, utterance characteristics may be different for each user. The processor 120 may collect (for example, big data) utterances of the user for a predetermined (e.g., specified) period of time (for example, one day, seven days, thirty days, and the like) and analyze (or learn) the collected utterances of the user so as to generate a user-specific audio filter suitable for the user. The processor 120 may store the generated user-specific audio filter in the memory (for example, the memory 130 in FIG. 1). The processor 120 may keep updating the user-specific audio filter stored in the memory 130 whenever the utterance of the user is collected.” (PARK[0112]).
“According to various example embodiments, multiple cameras included in an electronic device may be used to identify (or check) an utterance target (for example, a user wearing the electronic device or a counterpart conversing with the user), and a different speech recognition process may be applied depending on the utterance target.” (PARK[0010]).
Regarding claim 4, Park teaches:
The method of claim 1, wherein the at least one image associated with the respective human speaker comprises at least one coordinate in a multi-dimensional plane.
“The multiple microphones 340 may include at least one of a first microphone 260, a second microphone 270, and a third microphone 280. The processor 120 may configure (or determine) directivity of the multiple microphones 340 based on an utterance target. The directivity of a microphone indicates an attribute of detecting an audio signal input through a microphone according to a distance and direction, and may include an attribute of representing sensitivity according to a direction with reference to the front side of a microphone. At least one of the first microphone 260 to the third microphone 280 may obtain, as an audio, speech of a user or speech of a counterpart (for example, conversation counterpart) having a conversation with the user. The processor 120 may perform speech recognition based on the audio obtained through at least one of the first microphone 260 to the third microphone 280.” (PARK[0084]).
Regarding claim 5, Park teaches:
The method of claim 1, wherein the iteratively adjusting the steering vector associated with the plurality of audio signals employs beamforming techniques.
“According to various example embodiments, when a user or a counterpart conversing with the user alone makes an utterance, a micro beamforming technology may be utilized to process the speech of the utterance target. When the user and the counterpart make simultaneous utterances, the user’s utterance is processed by detecting the user’s lip shape acquired through a camera, and the counterpart’s utterance is processed utilizing the micro beamforming technology. As a result, intermingling of different speeches may be prevented and/or reduced, and the speech recognition rate may be improved according to each utterance.” (PARK[0011]).
Regarding claim 8, Park teaches:
An apparatus comprising: at least one processing device comprising a processor coupled to a memory;
“Various embodiments of the disclosure provide a method and a device which includes multiple cameras arranged at different positions, multiple microphones arranged at different positions, a memory, and a processor operatively connected to at least one of the multiple cameras, the multiple microphones, and the memory, wherein the processor is configured to: determine, using at least one of the multiple cameras, whether at least one of a user wearing the electronic device or a counterpart having a conversation with the user makes an utterance, configure directivity of at least one of the multiple microphones based on the determination, obtain an audio from at least one of the multiple microphones based on the configured directivity, obtain an image including a mouth shape of the user or the counterpart from at least one of the multiple cameras, and process speech of an utterance target in a different manner based on the obtained audio and the image.” (PARK[Abstract]).
the at least one processing device being configured to implement the following steps: transforming a plurality of audio signals associated with at least one human speaker to respective ones of a plurality of first sets of predicted spoken words using at least one speech-to-text conversion model;
“In operation 409, the processor 120 may process the speech of the utterance target in different manners, based on the audio and the image. When the utterance target is the user, the processor 120 may perform deep learning by matching the audio of the user and the mouth shape of the user, and perform a function according to the utterance of the user based on the deep learning. The processor 120 may recognize speech of the audio (or audio signal) of the user to convert the same into a first text corresponding to the speech, and analyze the mouth shape of the user to convert the same into a second text corresponding to a change of the mouth shape (for example, movement of a mouth).” (PARK[0110]).
“According to various embodiments, the processor 120 may recognize the speech of the audio of the conversation counterpart by utilizing an audio filter specified (or personalized) for the utterance of the conversation counterpart. The processor 120 may collect utterances of the conversation counterpart for a predetermined period of time (for example, one day, seven days, thirty days, and the like) and analyze (or learn) the collected utterances of the conversation counterpart so as to generate a conversation counterpart-specific audio filter suitable for the conversation counterpart. The conversation counterpart may be multiple people, and the processor 120 may generate a conversation counterpart-specific audio filter only for the conversation counterpart (for example, a family member, or close friend) designated by the user. The processor 120 may store the generated conversation counterpart-specific audio filter in the memory (for example, the memory 130 in FIG. 1). The processor 120 may keep updating the conversation counterpart-specific audio filter stored in the memory 130 whenever the utterance of the conversation counterpart is collected.” (PARK[0114]).
transforming at least one video signal associated with the at least one human speaker to one or more second sets of predicted spoken words using at least one lip motion-to-text conversion model, wherein the one or more second sets of predicted spoken words are based at least in part on an analysis of at least one image associated with a respective human speaker;
“In operation 409, the processor 120 may process the speech of the utterance target in different manners, based on the audio and the image. When the utterance target is the user, the processor 120 may perform deep learning by matching the audio of the user and the mouth shape of the user, and perform a function according to the utterance of the user based on the deep learning. The processor 120 may recognize speech of the audio (or audio signal) of the user to convert the same into a first text corresponding to the speech, and analyze the mouth shape of the user to convert the same into a second text corresponding to a change of the mouth shape (for example, movement of a mouth).” (PARK[0110]).
iteratively adjusting a steering vector associated with the plurality of audio signals to compare at least one of the plurality of first sets of predicted spoken words with at least one of the one or more second sets of predicted spoken words;
“An electronic device according to various example embodiments of the disclosure may include: multiple cameras arranged at different positions, multiple microphones arranged at different positions, a memory, and a processor operatively connected to at least one of the multiple cameras, the multiple microphones, and the memory, wherein the processor is configured to: determine, using the multiple cameras, whether at least one of a user wearing the electronic device or a counterpart having a conversation with the user makes an utterance, configure directivity of the multiple microphones based on the determination, obtain audio from at least one of the multiple microphones based on the configured directivity, obtain an image including a mouth shape of the user or the counterpart from at least one of the multiple cameras, and process speech of an utterance target in a different manner based on the obtained audio and image.” (PARK[0007]).
“The multiple microphones 340 may include at least one of a first microphone 260, a second microphone 270, and a third microphone 280. The processor 120 may configure (or determine) directivity of the multiple microphones 340 based on an utterance target. The directivity of a microphone indicates an attribute of detecting an audio signal input through a microphone according to a distance and direction, and may include an attribute of representing sensitivity according to a direction with reference to the front side of a microphone. At least one of the first microphone 260 to the third microphone 280 may obtain, as an audio, speech of a user or speech of a counterpart (for example, conversation counterpart) having a conversation with the user. The processor 120 may perform speech recognition based on the audio obtained through at least one of the first microphone 260 to the third microphone 280.” (PARK[0084]).
and selecting an isolated human speaker audio signal associated with a particular one of the plurality of first sets of predicted spoken words and a particular one of the one or more second sets of predicted spoken words, wherein the selection is based at least in part on a result of the comparison.
“In operation 607, the processor 120 may perform deep learning by matching the audio and the mouth shape. The processor 120 may match the speech of the user and the mouth shape of the user. The processor 120 may recognize speech of the audio (or audio signal) of the user to convert the same into a first text corresponding to the speech, and analyze a change of the mouth shape of the user to convert the same into a second text corresponding to a change of the mouth shape (for example, movement of a mouth). The processor 120 may determine an accurate utterance content of the user through the comparison of the first text and the second text. The deep learning is a type of machine learning technology and may indicate learning data obtained for a predetermined period of time to output a result. The processor 120 may learn the speech of the user and the mouth shape of the user for the same text by matching the result of converting user’s speech collected for a predetermined period of time into a text and the result of converting user’s mouth shape collected for a predetermined period of time into a text.” (PARK[0123]).
Regarding claim 9, arguments analogous to claim 2 are applicable.
Regarding claim 10, arguments analogous to claim 3 are applicable.
Regarding claim 11, arguments analogous to claim 4 are applicable.
Regarding claim 12, arguments analogous to claim 5 are applicable.
Regarding claim 15, Park teaches:
A non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes the at least one processing device to perform the following steps:
“Various embodiments of the disclosure provide a method and a device which includes multiple cameras arranged at different positions, multiple microphones arranged at different positions, a memory, and a processor operatively connected to at least one of the multiple cameras, the multiple microphones, and the memory, wherein the processor is configured to: determine, using at least one of the multiple cameras, whether at least one of a user wearing the electronic device or a counterpart having a conversation with the user makes an utterance, configure directivity of at least one of the multiple microphones based on the determination, obtain an audio from at least one of the multiple microphones based on the configured directivity, obtain an image including a mouth shape of the user or the counterpart from at least one of the multiple cameras, and process speech of an utterance target in a different manner based on the obtained audio and the image.” (PARK[Abstract]).
transforming a plurality of audio signals associated with at least one human speaker to respective ones of a plurality of first sets of predicted spoken words using at least one speech-to-text conversion model;
“In operation 409, the processor 120 may process the speech of the utterance target in different manners, based on the audio and the image. When the utterance target is the user, the processor 120 may perform deep learning by matching the audio of the user and the mouth shape of the user, and perform a function according to the utterance of the user based on the deep learning. The processor 120 may recognize speech of the audio (or audio signal) of the user to convert the same into a first text corresponding to the speech, and analyze the mouth shape of the user to convert the same into a second text corresponding to a change of the mouth shape (for example, movement of a mouth).” (PARK[0110]).
“According to various embodiments, the processor 120 may recognize the speech of the audio of the conversation counterpart by utilizing an audio filter specified (or personalized) for the utterance of the conversation counterpart. The processor 120 may collect utterances of the conversation counterpart for a predetermined period of time (for example, one day, seven days, thirty days, and the like) and analyze (or learn) the collected utterances of the conversation counterpart so as to generate a conversation counterpart-specific audio filter suitable for the conversation counterpart. The conversation counterpart may be multiple people, and the processor 120 may generate a conversation counterpart-specific audio filter only for the conversation counterpart (for example, a family member, or close friend) designated by the user. The processor 120 may store the generated conversation counterpart-specific audio filter in the memory (for example, the memory 130 in FIG. 1). The processor 120 may keep updating the conversation counterpart-specific audio filter stored in the memory 130 whenever the utterance of the conversation counterpart is collected.” (PARK[0114]).
transforming at least one video signal associated with the at least one human speaker to one or more second sets of predicted spoken words using at least one lip motion-to-text conversion model, wherein the one or more second sets of predicted spoken words are based at least in part on an analysis of at least one image associated with a respective human speaker;
“In operation 409, the processor 120 may process the speech of the utterance target in different manners, based on the audio and the image. When the utterance target is the user, the processor 120 may perform deep learning by matching the audio of the user and the mouth shape of the user, and perform a function according to the utterance of the user based on the deep learning. The processor 120 may recognize speech of the audio (or audio signal) of the user to convert the same into a first text corresponding to the speech, and analyze the mouth shape of the user to convert the same into a second text corresponding to a change of the mouth shape (for example, movement of a mouth).” (PARK[0110]).
iteratively adjusting a steering vector associated with the plurality of audio signals to compare at least one of the plurality of first sets of predicted spoken words with at least one of the one or more second sets of predicted spoken words;
“An electronic device according to various example embodiments of the disclosure may include: multiple cameras arranged at different positions, multiple microphones arranged at different positions, a memory, and a processor operatively connected to at least one of the multiple cameras, the multiple microphones, and the memory, wherein the processor is configured to: determine, using the multiple cameras, whether at least one of a user wearing the electronic device or a counterpart having a conversation with the user makes an utterance, configure directivity of the multiple microphones based on the determination, obtain audio from at least one of the multiple microphones based on the configured directivity, obtain an image including a mouth shape of the user or the counterpart from at least one of the multiple cameras, and process speech of an utterance target in a different manner based on the obtained audio and image.” (PARK[0007]).
“The multiple microphones 340 may include at least one of a first microphone 260, a second microphone 270, and a third microphone 280. The processor 120 may configure (or determine) directivity of the multiple microphones 340 based on an utterance target. The directivity of a microphone indicates an attribute of detecting an audio signal input through a microphone according to a distance and direction, and may include an attribute of representing sensitivity according to a direction with reference to the front side of a microphone. At least one of the first microphone 260 to the third microphone 280 may obtain, as an audio, speech of a user or speech of a counterpart (for example, conversation counterpart) having a conversation with the user. The processor 120 may perform speech recognition based on the audio obtained through at least one of the first microphone 260 to the third microphone 280.” (PARK[0084]).
and selecting an isolated human speaker audio signal associated with a particular one of the plurality of first sets of predicted spoken words and a particular one of the one or more second sets of predicted spoken words, wherein the selection is based at least in part on a result of the comparison.
“In operation 607, the processor 120 may perform deep learning by matching the audio and the mouth shape. The processor 120 may match the speech of the user and the mouth shape of the user. The processor 120 may recognize speech of the audio (or audio signal) of the user to convert the same into a first text corresponding to the speech, and analyze a change of the mouth shape of the user to convert the same into a second text corresponding to a change of the mouth shape (for example, movement of a mouth). The processor 120 may determine an accurate utterance content of the user through the comparison of the first text and the second text. The deep learning is a type of machine learning technology and may indicate learning data obtained for a predetermined period of time to output a result. The processor 120 may learn the speech of the user and the mouth shape of the user for the same text by matching the result of converting user’s speech collected for a predetermined period of time into a text and the result of converting user’s mouth shape collected for a predetermined period of time into a text.” (PARK[0123]).
Regarding claim 16, arguments analogous to claims 2 and 9 are applicable.
Regarding claim 17, arguments analogous to claims 3 and 10 are applicable.
Regarding claim 18, arguments analogous to claims 5 and 12 are applicable.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 6, 13 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over PARK in view of HARA MUNEHIRO (JP 2024032124 A) hereinafter MUNEHIRO.
Regarding claim 6, the rejection of claim 1 is incorporated herein by reference; however, Park fails to teach:
wherein the iteratively adjusting the steering vector associated with the plurality of audio signals adjusts a position of the steering vector until the comparison satisfies a designated threshold.
On the other hand, MUNEHIRO teaches:
“To provide a sound collecting device, a sound collecting program and a sound collecting method, a determination device, a determination program and a determination method, which suppress erroneous detection of noise and interference sound, and erroneous discard of target sound. SOLUTION: There is provided a sound collecting device. The sound collecting device includes: means for forming directivity in a direction of a target area on the basis of an input signal supplied from a plurality of microphone arrays, and acquiring a target area sound extraction signal and a non-target area sound extraction signal on the basis of beam forming output; means for determining whether or not target area sound exists in the target area sound extraction signal, using a target area sound threshold; means for determining whether or not non-target area sound exists in the non-target area sound extraction signal; means for adjusting the target area sound threshold on the basis of the results of the determination processing; and means for outputting the target area sound extraction signal according to the determination processing result of the target area sound. SELECTED DRAWING: Figure 1” (MUNEHIRO [Abstract]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified PARK to incorporate the teachings of MUNEHIRO to include instructions for the microphone beamforming to adjust the position until the comparison satisfies a designated threshold. The motivation for this modification is discussed by MUNEHIRO in: “When performing speech recognition under noise, the accuracy of speech recognition decreases due to the influence of noise, so high accuracy is required to detect speech sections that exclude noise sections where no speech is made and extract only the user's utterances.” (MUNEHIRO [Background-art Paragraph 1]).
Regarding claim 13, the rejection of claim 8 is incorporated herein by reference; furthermore, arguments analogous to claim 6 are applicable.
Regarding claim 19, the rejection of claim 15 is incorporated herein by reference; furthermore, arguments analogous to claim 6 are applicable.
Claims 7, 14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over PARK in view of Lokesh; Rahul et al. (US 20240331088 A1) hereinafter LOKESH.
Regarding claim 7, the rejection of claim 1 is incorporated herein by reference; however, Park does not teach:
wherein a human speaker associated with the isolated human speaker audio signal is interacting with at least one processor-based digital human.
On the other hand, LOKESH teaches:
“In one or more embodiments, a digital human is a computer-generated entity that is rendered visually with a human-like appearance. The digital human may be an avatar. In some embodiments, a digital human is a photorealistic avatar. In some embodiments, a digital human is a digital rendering of a hominid, a humanoid, a human, or other human-like character. A digital human may be an artificial human. A digital human can include elements of artificial intelligence (AI) for interpreting user input and responding to the input in a contextually appropriate manner. The digital human can interact with a user using verbal and/or non-verbal cues. Implementing natural language processing (NLP), a chatbot, and/or other software, the digital human can be configured to provide human-like interactions with a human being and/or perform activities such as scheduling, initiating, terminating, and/or monitoring of the operations of various systems and devices.” (LOKESH [0036]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified PARK to incorporate the teachings of LOKESH to include instructions wherein a human speaker interacts with at least one processor-based digital human. The motivation for this modification is discussed by LOKESH in: “The use of life-like avatars referred to as digital humans or virtual humans is becoming increasingly popular. Digital humans may be used in a variety of different contexts including, but not limited to, the metaverse, gaming, and as part of any of a variety of virtual experiences in which human beings increasingly wish to take part. Advances in computer technology and neural networks have enabled the rapid virtualization of many different “real world” activities.” (LOKESH [0004]).
Regarding claim 14, the rejection of claim 8 is incorporated herein by reference; furthermore, arguments analogous to claim 7 are applicable.
Regarding claim 20, the rejection of claim 15 is incorporated herein by reference; furthermore, arguments analogous to claim 7 are applicable.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HECTOR J. CRESPO FEBLES whose telephone number is (571)272-4512. The examiner can normally be reached Mon - Fri 7:30 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/HECTOR J. CRESPO FEBLES/Examiner, Art Unit 2657 3/6/2026
/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657