Detailed Action
This communication is in response to the Arguments and Amendments filed on 1/16/2026.
Claims 1-20 are pending and have been examined. This Action has been made FINAL.
Independent Claims 1, 14 and 20 are parallel device, method, and CRM claims, respectively.
Apparent priority: 10/17/2022.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Arguments and Amendments
Applicants have amended the independent claims to add the limitation "and provide the at least one sound source to a user through at least one sound device."
With respect to Claim Rejections - 35 U.S.C. § 101
Applicant notes "Reminders on evaluating subject matter eligibility of claims under 35 U.S.C. 101" (August 4, 2025, available at https://www.uspto.gov/sites/default/files/documents/memo-101-20250804.pdf) (hereinafter "Reminders"). Regarding Step 2A, Prong One, the Reminders specifically recite, "The mental process grouping is not without limits. Examiners are reminded not to expand this grouping in a manner that encompasses claim limitations that cannot practically be performed in the human mind. The MPEP and the AI-SME Update provide examples of claim limitations that cannot be practically performed in the human mind. Claim limitations that encompass AT in a way that cannot be practically performed in the human mind do not fall within this grouping." (Emphasis added).
Thus, the Examiner's arguments about the claimed Al models are entirely different from the above specific instructions of the Reminders.
Further, the Reminders state, "Examiners should be careful to distinguish claims that recite an exception (which require further eligibility analysis) from claims that merely involve an exception (which are eligible and do not require further eligibility analysis)." The Reminders continue, "Consider for example, the published USPTO examples 39, which illustrates claim limitations that merely involve an abstract idea, and 47, which shows limitations that recite an abstract idea. The claim limitation "training the neural network in a first stage using the first training set" of example 39 does not recite a judicial exception. Even though "training the neural network" involves a broad array of techniques and/or activities that may involve or rely upon mathematical concepts, the limitation does not set forth or describe any mathematical relationships, calculations, formulas, or equations using words or mathematical symbols."
The above elements of claim 1 are very similar to the above claim limitations of example 39 explained in the Reminders. Thus, Applicant respectfully submits that the Examiner's arguments should be withdrawn in view of the Reminders.
In response, Examiner notes that the August 2025 Memo's reminder that the mental process grouping has limits is acknowledged, but it does not preclude application of the mental process grouping here, because several claim limitations recite high-level cognitive/data-manipulation concepts (e.g., generating audio-related information from the mixed signal, separating sound sources) that can be characterized as mental or data-processing concepts. The fact that an AI model performs them in practice does not automatically remove them from the judicial exception analysis absent claim language or specification evidence showing a concrete technological improvement to computer functionality.
The claims recite a sequence of data transformations and computations: generating audio-related information, generating overlap information, manipulating signals, and separating signals. These steps are paradigmatic mathematical/data-processing operations (signal manipulation) and therefore fall within the "mathematical concepts" exception recognized by the USPTO and the Federal Circuit (see, e.g., Digitech, SAP America, Electric Power Group).
Examiner further notes that the statutory exception analysis asks whether the claim is "directed to" an abstract idea, and courts and the USPTO have recognized that claims which, at a high level, recite the performance of cognitive or mental tasks (e.g., signal manipulation) may be placed in the mental process grouping even if implemented by machines (see the examples in the August 2025 Memo and cases such as SAP America and Electric Power Group).
The claim language here is largely high level and functional: "generate ... audio-related information indicating a degree of overlap," "separate at least one sound source ... from the mixed audio signal," and "provide the at least one sound source to a user." These limitations can reasonably be characterized as information-processing or mental-like steps (conceptual transformations of visual and acoustic information). Absent claim detail tying the operations to specific technical mechanisms that go beyond mere data processing, these limitations are susceptible to classification as mental/data-manipulation concepts under Step 2A, Prong One.
Applicant notes Ex parte Desjardins (September 26, 2025, available at https://www.uspto.gov/sites/default/files/documents/202400567-arp-rehearing-decision-20250926.pdf) (hereinafter "Desjardins"). In Desjardins, USPTO Director Squires, along with Acting Commissioner Wallace and Judge Kim, specifically held that, "Categorically excluding AI innovations from patent protection in the United States jeopardizes America's leadership in this critical emerging technology. Yet, under the panel's reasoning, many AI innovations are potentially unpatentable - even if they are adequately described and nonobvious - because the panel essentially equated any machine learning with an unpatentable "algorithm" and the remaining additional elements as "generic computer components," without adequate explanation. [] Examiners and panels should not evaluate claims at such a high level of generality."
Here, just as in Desjardins, the Examiner evaluates the claims of the present application "at such a high level of generality" without adequate explanation. Thus, also based on Desjardins, Applicant respectfully submits that the Examiner's § 101 arguments about the claims should be withdrawn.
Examiner notes that the additional elements must supply an "inventive concept." The claim recites known functional components (electronic device, processor, memory, computer-readable medium) and high-level data/signal transformations. Without claim specificity tying those components to particular unconventional architectures, constrained parameterizations, training regimen steps, or demonstrable improvements, the recited elements appear to be routine, conventional uses of neural networks and generic software components, and therefore fail to supply an inventive concept (see Alice; under Berkheimer, a factual showing supported by evidence may rebut this).
Examiner further notes that a claim reciting operations that are inherently mathematical or data-transformative may be characterized as reciting a mathematical concept even without naming the algorithm (see Digitech, SAP America). The decisive inquiry is the claim's focus: if the focus is on data transformation or mental-process-like manipulation, the claim is vulnerable to categorization as an abstract idea.
With respect to Claim Rejections - 35 U.S.C. § 103
Applicant’s arguments with respect to the independent claims have been considered but are moot because the new ground of rejection does not rely on the primary reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Examiner notes that the amendments to the independent claims resulted in a new interpretation of the claims. Hence, new grounds of rejection have been made over SATO (U.S. Patent Application Publication No. US 20220335965 A1) in view of Wexler (U.S. Patent Application Publication No. US 20210390957 A1).
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The independent Claims are directed to statutory categories:
Claim 1 is a device claim and is directed to the machine category of patentable subject matter.
Claim 14 is a method claim and is directed to the process category of patentable subject matter.
Claim 20 is a CRM claim and is directed to the manufacture category of patentable subject matter.
Independent claim 1 recites,
“1. An electronic device for processing a video comprising an image signal and a mixed audio signal, the electronic device comprising: at least one processor; and a memory configured to store at least one program for processing the video; wherein, by executing the at least one program, the at least one processor is configured to:
generate, from the image signal and the mixed audio signal, audio-related information indicating a degree of overlap in a plurality of sound sources included in the mixed audio signal by using a first artificial intelligence (AI) model; (this relates to a human using visual and auditory processing in tandem to determine overlap information of the sound sources and using the human vocal system to generate information.)
and separate at least one sound source of the plurality of sound sources included in the mixed audio signal from the mixed audio signal, by applying the audio-related information to a second AI model; and (this relates to a human using natural auditory processes in the human mind to separate sources.)
provide the at least one sound source to a user through at least one sound device." (this relates to a human using speech to provide a sound source.)
The Dependent Claims do not include additional limitations that could integrate the abstract idea into a practical application or cause the Claims as a whole to amount to significantly more than the underlying abstract idea.
Regarding Independent claim 14, claim 14 is a method claim with limitations similar to those of Claim 1 and is rejected under the same rationale.
Regarding Independent claim 20, claim 20 is a CRM claim with limitations similar to those of Claim 1 and is rejected under the same rationale.
This judicial exception is not integrated into a practical application. In particular, claims 1 and 20 recite additional elements of "processors" and "memory". For example, paragraph [0045] of the as-filed specification describes that "[t]he computer program instructions may be provided to a processor of a general-purpose computer" and that "[t]he computer program instructions may also be stored in a computer-usable or computer-readable memory that may direct a computer or other programmable data processing equipment to function in a particular manner, and the instructions stored in the computer-usable or computer-readable memory may produce a manufactured article including instruction means that perform the functions specified in the flowchart block(s)." Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional elements amount to no more than the use of a generic computer. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Further, the additional limitations in the claims noted above are directed towards insignificant extra-solution activity. The claims are not patent eligible.
Dependent claim 2 recites,
“2. The electronic device of claim 1, wherein the audio-related information comprises a map indicating the degree of overlap in the plurality of sound sources, (this relates to a human using pen and paper to indicate overlap in sound sources in a map format.)
and wherein each bin of the map has a probability value corresponding to a degree to which one of the plurality of sound sources overlap with another in a time-frequency domain. (this relates to a human using pen and paper to create a probability value for sound sources.)
Dependent claim 3 recites,
“3. The electronic device of claim 1, wherein the first AI model comprises: a first submodel configured to generate, from the image signal, a plurality of pieces of mouth movement information representing temporal pronouncing information of a plurality of speakers corresponding to the plurality of sound sources; (this relates to a human using pen and paper to generate mouth movement information representing pronouncing information)
and a second submodel configured to generate, from the mixed audio signal, the audio-related information, based on the plurality of pieces of mouth movement information. (this relates to a human using speech or pen and paper to generate audio information)
Dependent claim 4 recites,
“4. The electronic device of claim 3, wherein the first AI model is trained by comparing training audio-related information estimated from a training image signal and a training audio signal with a ground truth, (this relates to a human training on image and audio signals using the auditory and visual systems.)
and wherein the ground truth is generated by a product operation between a plurality of probability maps generated from a plurality of spectrograms generated based on each of a plurality of individual training sound sources included in the training audio signal.” (this relates to a human creating ground truth information from probability maps and spectrogram data)
Dependent claim 5 recites,
“5. The electronic device of claim 4, wherein each of the plurality of probability maps is generated by MaxClip(log(1 + ||F||²), 1), where ||F||² is a size of a corresponding spectrogram from among the plurality of spectrograms, and MaxClip(x, 1) is a function that outputs x when x is less than 1, and outputs 1 when x is equal to or greater than 1.” (this relates to a mathematical function; an illustrative sketch follows.)
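For clarity of the record, the computation recited in claims 4-5 may be illustrated by the following sketch (Python with NumPy). The sketch is illustrative only; the interpretation of ||F||² as the squared magnitude of each time-frequency bin, the array shapes, and all variable names are assumptions of this illustration and are not drawn from the claims, the specification, or the cited references.

import numpy as np

def max_clip(x, limit=1.0):
    # MaxClip(x, 1): outputs x when x is less than 1, and outputs 1 otherwise.
    return np.minimum(x, limit)

def probability_map(spectrogram):
    # As understood from claim 5: MaxClip(log(1 + ||F||^2), 1), where ||F||^2
    # is taken here as the squared magnitude of each time-frequency bin.
    return max_clip(np.log(1.0 + np.abs(spectrogram) ** 2), 1.0)

rng = np.random.default_rng(0)
# Hypothetical complex spectrograms (frequency x time) for two individual
# training sound sources; the shapes are illustrative only.
spec_a = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
spec_b = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))

# As understood from claim 4: the ground truth is a product operation between
# the probability maps generated from the individual training sound sources.
ground_truth = probability_map(spec_a) * probability_map(spec_b)
print(ground_truth.shape, float(ground_truth.max()))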
Dependent claim 6 recites,
“6. The electronic device of claim 1, wherein the second AI model comprises an input layer, an encoder including a plurality of feature layers, and a bottleneck layer,
and wherein the applying of the audio-related information to the second AI model comprises at least one of applying of the audio-related information to the input layer, (this relates to a human applying audio information to an input)
applying of the audio-related information to each of the plurality of feature layers included in the encoder, or applying of the audio-related information to the bottleneck layer.” (this relates to a human applying audio information into layers; see the illustrative sketch following this claim)
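For context, conditioning of the kind recited in claim 6 (applying auxiliary information at an input layer, at encoder feature layers, or at a bottleneck layer) is often realized by concatenating the auxiliary information with the activations at the chosen layer. The following sketch is a minimal illustration of that general approach, not the claimed model: the layer dimensions, the use of random matrices in place of trained weights, the concatenation-based injection, and all variable names are assumptions made solely for illustration.

import numpy as np

rng = np.random.default_rng(0)

def dense(x, out_dim):
    # Stand-in for a trained feature layer: a random linear map with ReLU.
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.1
    return np.maximum(x @ w, 0.0)

def second_model(mixed_features, audio_related_info, inject_at="bottleneck"):
    x = mixed_features
    # Input layer: optionally concatenate the audio-related information here.
    if inject_at == "input":
        x = np.concatenate([x, audio_related_info], axis=-1)
    # Encoder: a stack of feature layers; optionally inject at each layer.
    for out_dim in (128, 64):
        if inject_at == "features":
            x = np.concatenate([x, audio_related_info], axis=-1)
        x = dense(x, out_dim)
    # Bottleneck layer: optionally inject the conditioning information here.
    if inject_at == "bottleneck":
        x = np.concatenate([x, audio_related_info], axis=-1)
    return dense(x, 32)

mixed_features = rng.standard_normal((10, 257))      # hypothetical mixture features
audio_related_info = rng.standard_normal((10, 257))  # hypothetical overlap-map slice
print(second_model(mixed_features, audio_related_info, inject_at="bottleneck").shape)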
Dependent claim 7 recites,
“7. The electronic device of claim 1, wherein the at least one processor is further configured to: generate, from the mixed audio signal or from the mixed audio signal and visual information, number-of-speakers related information included in the mixed audio signal by using a third AI model; (this relates to a human generating number of speakers information from audio.)
generate, from the image signal and the mixed audio signal, the audio-related information based on the number-of-speakers related information by using the first AI model; (this relates to a human generating audio information about the number of speakers using speech or pen and paper.)
and separate, from the mixed audio signal, at least one of the plurality of sound sources included in the mixed audio signal by applying the number-of-speakers related information and the audio-related information to the second AI model, (this relates to a human using natural auditory processing to separate sound sources based on number of speakers information)
wherein the visual information comprises at least one key frame included in the image signal, and wherein the at least one key frame comprises a facial area including lips of at least one speaker corresponding to at least one sound source included in the mixed audio signal.” (this relates to a human using natural facial recognition to identify a key frame of a speaker including lips.)
Dependent claim 8 recites,
“8. The electronic device of claim 7, wherein the number-of-speakers related information included in the mixed audio signal comprises at least one of first number-of-speakers related information about the mixed audio signal or second number-of-speakers related information about the visual information.” (this relates to a human generating number of speakers information using auditory processing or generating number of speakers information using visual processing.)
Dependent claim 9 recites,
“9. The electronic device of claim 8, wherein the first number-of-speakers related information comprises a probability distribution of a number of speakers corresponding to the plurality of sound sources included in the mixed audio signal, and wherein the second number-of-speakers related information comprises a probability distribution of the number of speakers included in the visual information.” (this relates to a human generating a probability distribution of a number of speakers information )
Dependent claim 10 recites,
“10. The electronic device of claim 7, wherein the second AI model comprises an input layer, an encoder including a plurality of feature layers, and a bottleneck layer
and wherein the applying of the number-of-speakers related information to the second AI model comprises at least one of applying of the number-of-speakers related information to the input layer, applying of the number-of-speakers related information to each of the plurality of feature layers included in the encoder, or applying of the number-of-speakers related information to the bottleneck layer.” (this relates to a human applying number of speakers information to layers)
Dependent claim 11 recites,
“11. The electronic device of claim 1, wherein the at least one processor is further configured to: obtain a plurality of pieces of mouth movement information associated with the plurality of speakers from the image signal; (this relates to a human using the visual system and human mind to obtain mouth movement information)
and separate, from the mixed audio signal, at least one of the plurality of sound sources included in the mixed audio signal by applying the obtained plurality of pieces of mouth movement information to the second AI model.” (this relates to a human using natural auditory processing to separate sources in tandem with visual mouth information)
Dependent claim 12 recites,
“12. The electronic device of claim 1,further comprising: an input/output interface configured to display a screen on which the video is played back and receive, from a user, an input for selecting at least one speaker from among a plurality of speakers corresponding to the plurality of sound sources included in the mixed audio signal;
and an audio output interface configured to output at least one sound source corresponding to the at least one speaker selected from among the plurality of sound sources included in the mixed audio signal.” A display screen and speakers are noted as additional elements.
Dependent claim 13 recites,
“13. The electronic device of claim 12, wherein the at least one processor is further configured to: display, on the screen, a user interface for adjusting a volume of at least one sound source corresponding to the selected at least one speaker and receive, from the user, adjustment of the volume of the at least one sound source; and based on the adjustment of the volume of the at least one sound source, adjust the volume of the at least one sound source that is output through the audio output interface.” A display screen and speakers are noted as additional elements.
As to dependent Claim 15, Claim 15 is a parallel Method claim with limitations similar to those of Claim 2 and is rejected under the same rationale.
As to dependent Claim 16, Claim 16 is a parallel Method claim with limitations similar to those of Claim 3 and is rejected under the same rationale.
As to dependent Claim 17, Claim 17 is a parallel Method claim with limitations similar to those of Claim 6 and is rejected under the same rationale.
As to dependent Claim 18, Claim 18 is a parallel Method claim with limitations similar to those of Claim 7 and is rejected under the same rationale.
As to dependent Claim 19, Claim 19 is a parallel Method claim with limitations similar to those of Claim 8 and is rejected under the same rationale.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 7-9, 14, 18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over SATO (U.S. Patent Application Publication No. US 20220335965 A1) in view of Wexler (U.S. Patent Application Publication No. US 20210390957 A1).
Regarding Claim 1, SATO teaches
1. An electronic device for processing a video comprising an image signal and a mixed audio signal, the electronic device comprising: at least one processor; and a memory configured to store at least one program for processing the video; wherein, by executing the at least one program, the at least one processor is configured to: (see SATO [0033] “FIG. 1 is a diagram illustrating an example of a configuration of the audio signal processing apparatus according to the first embodiment. The audio signal processing apparatus 10 according to the first embodiment is realized, for example, by a computer or the like, which includes a read only memory (ROM), a random access memory (RAM), a central processing unit (CPU), and the like, reading a predetermined program and the CPU executing the predetermined program.”) and separate at least one sound source of the plurality of sound sources included in the mixed audio signal from the mixed audio signal, by applying the audio-related information to a second Al model; (see SATO [0034] As illustrated in FIG. 1, the audio signal processing apparatus 10 includes an audio signal processing unit 11, a first auxiliary feature conversion unit 12, a second auxiliary feature conversion unit 13, and an auxiliary information generation unit 14 (a generation unit). A mixed audio signal including audio from a plurality of sound sources is input to the audio signal processing apparatus 10. Further, an audio signal of a target speaker and video information of speakers at the time of recording the input mixed audio signal are input to the audio signal processing apparatus 10. Here, the audio signal of the target speaker is a signal obtained by recording what the target speaker utters independently in a different scene (place and time) from a scene in which the mixed audio signal is acquired. The audio signal of the target speaker does not include audio of other speakers, but may include background noise or the like. Further, the video information of speakers at the time of recording the mixed audio signal is a video containing at least the target speaker in the scene in which the mixed audio signal to be processed by the audio signal processing apparatus 10 is acquired, for example, a video capturing a state of the target speaker in the scene. The audio signal processing apparatus 10 estimates and outputs information regarding the audio signal of the target speaker included in the mixed audio signal.”)
SATO does not specifically teach generate, from the image signal and the mixed audio signal, audio- related information indicating a degree of overlap in a plurality of sound sources included in the mixed audio signal by using a first artificial intelligence (AI) model; However, Wexler does teach this limitation (see Wexler (U.S. Patent Number US 20210390957 A1) [0211] “In some embodiments, processor 210 may be configured to select between multiple active speakers to selectively condition audio signals. For example, individuals 2310 and 2410 may both be speaking at the same time or their speech may overlap during a conversation. Processor 210 may selectively condition audio associated with one speaking individual relative to others. This may include giving priority to a speaker who has started but not finished a word or sentence or has not finished speaking altogether when the other speaker started speaking. This determination may also be driven by the context of the speech, as described above.”) (see Wexler (U.S. Patent Number US 20210390957 A1) [0180-182] “In some embodiments, processor 210 may have access to one or more voiceprints of individuals, which may facilitate selective conditioning of voice 2012 of individual 2010 in relation to other sounds or voices. Having a speaker's voiceprint, and a high quality voiceprint in particular, may provide for fast and efficient speaker separation. A high quality voice print may be collected, for example, when the user speaks alone, preferably in a quiet environment. By having a voiceprint of one or more speakers, it is possible to separate an ongoing voice signal almost in real time, e.g. with a minimal delay, using a sliding time window. The delay may be, for example 10 ms, 20 ms, 30 ms, 50 ms, 100 ms, or the like. Different time windows may be selected, depending on the quality of the voice print, on the quality of the captured audio, the difference in characteristics between the speaker and other speaker(s), the available processing resources, the required separation quality, or the like. In some embodiments, a voice print may be extracted from a segment of a conversation in which an individual speaks alone, and then used for separating the individual's voice later in the conversation, whether the individual's is recognized or not Separating voices may be performed as follows: spectral features, also referred to as spectral attributes, spectral envelope, or spectrogram may be extracted from a clean audio of a single speaker and fed into a pre-trained first neural network, which generates or updates a signature of the speaker's voice based on the extracted features. The audio may be for example, of one second of clean voice. The output signature may be a vector representing the speaker's voice, such that the distance between the vector and another vector extracted from the voice of the same speaker is typically smaller than the distance between the vector and a vector extracted from the voice of another speaker. The speaker's model may be pre-generated from a captured audio. 
Alternatively or additionally, the model may be generated after a segment of the audio in which only the speaker speaks, followed by another segment in which the speaker and another speaker (or background noise) is heard, and which it is required to separate.”) [0182] Then, to separate the speaker's voice from additional speakers or background noise in a noisy audio, a second pre-trained neural network may receive the noisy audio and the speaker's signature and output an audio (which may also be represented as attributes) of the voice of the speaker as extracted from the noisy audio, separated from the other speech or background noise. It will be appreciated that the same or additional neural networks may be used to separate the voices of multiple speakers. For example, if there are two possible speakers, two neural networks may be activated, each with models of the same noisy output and one of the two speakers. Alternatively, a neural network may receive voice signatures of two or more speakers and output the voice of each of the speakers separately. Accordingly, the system may generate two or more different audio outputs, each comprising the speech of the respective speaker. In some embodiments, if separation is impossible, the input voice may only be cleaned from background noise.”) and provide the at least one sound source to a user through at least one sound device. (see Wexler (U.S. Patent Number US 20210390957 A1) [0225] In some embodiments, the disclosed system may include a microphone configured to capture sounds from an environment of a user. As discussed above, apparatus 110 may include one or more microphones to receive one or more sounds associated with an environment of user 100. By way of example, apparatus 110 may comprise microphones 443, 444, as described with respect to FIGS. 4F and 4G. Microphones 443 and 444 may be configured to obtain environmental sounds and voices of user 100 and various speakers communicating with user 100, and output one or more audio signals. As another example, apparatus 110 may comprise microphone 1720, as described with respect to FIG. 17B. Microphone 1720 may be configured to determine a directionality of sounds in the environment of user 100. For example, microphones 443, 444, 1720, etc., may comprise one or more directional microphones, a microphone array, a multi-port microphone, or the like. The microphones shown in FIGS. 4F, 4G, 17B, etc., are by way of example only, and any suitable number, configuration, or location of microphones may be used.”)
SATO and Wexler (U.S. Patent Number US 20210390957 A1) are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the device of SATO to incorporate generating, from the image signal and the mixed audio signal, audio-related information indicating a degree of overlap in a plurality of sound sources included in the mixed audio signal by using a first artificial intelligence (AI) model, and providing the at least one sound source to a user through at least one sound device, as taught by Wexler. This allows for improved processing efficiency and/or helps to preserve battery life, as recognized by Wexler at [0106].
Regarding Independent claim 14, claim 14 is a method claim with limitations similar to those of Claim 1 and is rejected under the same rationale.
Regarding Independent claim 20, claim 20 is a CRM claim with limitations similar to those of Claim 1 and is rejected under the same rationale. Additionally, SATO teaches 20. A non-transitory computer-readable recording medium storing computer program for processing a video including an image signal and a mixed audio signal, which, when executed by at least one processor, causes the at least one processor to execute: (see SATO [0091] “FIG. 5 is a diagram illustrating an example of a configuration of a training apparatus according to the second embodiment. The training apparatus 220 according to the second embodiment is realized, for example, by a computer or the like, which includes a ROM, a RAM, a CPU, and the like, reading a predetermined program and the CPU executing the predetermined program. As illustrated in FIG. 5, the training apparatus 220 includes a feature conversion unit 230, an audio signal processing unit 221, an auxiliary information generation unit 224, a training data selection unit 225, and an update unit 226.”)
As to Claim 7, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) teach 7. The electronic device of claim 1,
Furthermore, Wexler (U.S. Patent Number US 20210390957 A1) teaches wherein the at least one processor is further configured to: generate, from the mixed audio signal or from the mixed audio signal and visual information, number-of-speakers related information included in the mixed audio signal by using a third Al model; generate, from the image signal and the mixed audio signal, the audio-related information based on the number-of-speakers related information by using the first Al model; and separate, from the mixed audio signal, at least one of the plurality of sound sources included in the mixed audio signal by applying the number-of-speakers related information and the audio-related information to the second Al model, wherein the visual information comprises at least one key frame included in the image signal, and wherein the at least one key frame comprises a facial area including lips of at least one speaker corresponding to at least one sound source included in the mixed audio signal. (see Wexler (U.S. Patent Number US 20210390957 A1) [0240] “It is also contemplated that in some embodiments, processor 210 may be configured to identify the one or more words using a machine learning algorithm or neural network that may be trained using training examples. Examples of such models may include support vector machines, Fisher's linear discriminant, nearest neighbor, k nearest neighbors, decision trees, random forests, and so forth. By way of example, a set of training examples may include audio samples having, for example, identified words. For example, the training examples may include audio samples including one or more words spoken by a plurality of speakers. By way of another example, the training examples may include audio samples of the one or more words spoken in a variety of intonations. It is contemplated that the machine learning algorithm or neural network may be trained to identify one or more words based on these and/or other training examples. It is further contemplated that the trained machine learning algorithm may be configured to output one or more identified words when presented with one or more audio signals (e.g., audio signal 2802) as inputs. It is also contemplated that a trained neural network for identifying one or more words may be a separate and distinct neural network or may be an integral part of one or more other neural networks discussed above.”) (see Wexler (U.S. Patent Number US 20210390957 A1) [0241] “In some embodiments, the at least one processor may be programmed to generate statistical information associated with the identified at least one word or phrase. The statistical information may include at least one of a total count, an average count, or a frequency of occurrence of the at least one word or phrase in the at least one audio signal. For example, processor 210 may be configured to determine a number of times one or more speakers (e.g., user 100, individual 2710, individual 2720, etc.) in environment 2700 speaks a predetermined word. It is contemplated that processor 210 may be configured to generate various types of statistical information regarding one or more predetermined words. Such information may include total number of times one or more words is spoken, an average over time or over a number of speakers that one or more words is spoken, a frequency with which one or more speakers speaks the one or more predetermined words, etc. 
By way of example, user 100, individual 2710, and/or individual 2720 may have agreed to minimize a number of curse words that may be included in a conversation. Processor 210 may be configured to tally up the number of times user 100, individual 2710, and/or individual 2720 speaks a curse word. Processor 210 may also be configured to provide the generated statistical information to user 100, individual 2710, and/or individual 2720 by displaying the information on a device (e.g., smartphone, smartwatch, laptop, tablet, or other devices) associated with one or more of user 100, individual 2710, and/or individual 2720.”) (see Wexler (U.S. Patent Number US 20210390957 A1) [0276] “In step 2904, process 2900 may include analyzing the at least one audio signal to distinguish a plurality of voices in the at least one audio signal. For example, processor 210 may analyze audio signal 2802, including audio signals 103, 2714, 2724, etc. associated with, for example, sounds representing the voice of user 100 or individuals 2710, 2720, etc. Processor 210 may analyze the sounds received from microphones 443, 444, and/or 1720 to separate voices of user 100 and/or one or more of individuals 2710, 2720, and/or background noises using any known techniques or algorithms. In some embodiments, processor 210 may perform further analysis on one or more of audio signals 103, 2714, 2724, for example, by determining the identity of user 100 and/or individuals 2710, 2720 using available voiceprints thereof. Alternatively, or additionally, processor 210 may use speech recognition tools or algorithms to recognize the speech of the individuals.”) (see Wexler (U.S. Patent Number US 20210390957 A1) [0271] “In some embodiments, the at least one processor may be programmed to determine at least one facial expression of the identified at least one individual. By way of example, processor 210 may identify, based on analysis of the plurality of images, at least one movement of a face, one or more eyes, nose, forehead, cheeks, lips, etc., associated with the at least one identified individual. Processor 210 may identify the one or more movements of the face, eyes, nose, forehead, cheeks, lips, etc., based on an analysis of the plurality of images. For example, processor 210 may be configured to identify one or more points associated with one or more of a face, eyes, nose, forehead, cheeks, lips, etc. Processor 210 may track the points over multiple frames or images to identify the movements of the face, eyes, nose, forehead, cheeks, lips, etc. Accordingly, processor 210 may use various video tracking algorithms, as described above to determine a facial expression of an identified individual. In some embodiments, the analysis of the plurality of images may be performed by a computer-based model such as a trained neural network. For example, the trained neural network may be trained to receive an image and/or video data, facial expressions, and indications of the facial expressions associated with the received image and/or video data. The neural network may be trained to identify a facial expression (e.g., rolling of eyes, smirk, smile, etc.) when one or more images or video data is provided as an input to the neural network.”)
SATO and Wexler (U.S. Patent Number US 20210390957 A1) are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of SATO and Wexler to incorporate generating, from the mixed audio signal or from the mixed audio signal and visual information, number-of-speakers related information included in the mixed audio signal by using a third AI model; generating, from the image signal and the mixed audio signal, the audio-related information based on the number-of-speakers related information by using the first AI model; and separating, from the mixed audio signal, at least one of the plurality of sound sources included in the mixed audio signal by applying the number-of-speakers related information and the audio-related information to the second AI model, wherein the visual information comprises at least one key frame included in the image signal, and wherein the at least one key frame comprises a facial area including lips of at least one speaker corresponding to at least one sound source included in the mixed audio signal, as taught by Wexler. This allows for improved processing efficiency and/or helps to preserve battery life, as recognized by Wexler at [0106].
As to Claim 8, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) teach 8. The electronic device of claim 7,
Furthermore, Wexler (U.S. Patent Number US 20210390957 A1) teaches wherein the number-of-speakers related information included in the mixed audio signal comprises at least one of first number-of-speakers related information about the mixed audio signal or second number-of-speakers related information about the visual information. (see Wexler (U.S. Patent Number US 20210390957 A1) [0241] “In some embodiments, the at least one processor may be programmed to generate statistical information associated with the identified at least one word or phrase. The statistical information may include at least one of a total count, an average count, or a frequency of occurrence of the at least one word or phrase in the at least one audio signal. For example, processor 210 may be configured to determine a number of times one or more speakers (e.g., user 100, individual 2710, individual 2720, etc.) in environment 2700 speaks a predetermined word. It is contemplated that processor 210 may be configured to generate various types of statistical information regarding one or more predetermined words. Such information may include total number of times one or more words is spoken, an average over time or over a number of speakers that one or more words is spoken, a frequency with which one or more speakers speaks the one or more predetermined words, etc. By way of example, user 100, individual 2710, and/or individual 2720 may have agreed to minimize a number of curse words that may be included in a conversation. Processor 210 may be configured to tally up the number of times user 100, individual 2710, and/or individual 2720 speaks a curse word. Processor 210 may also be configured to provide the generated statistical information to user 100, individual 2710, and/or individual 2720 by displaying the information on a device (e.g., smartphone, smartwatch, laptop, tablet, or other devices) associated with one or more of user 100, individual 2710, and/or individual 2720.”)
SATO and Wexler (U.S. Patent Number US 20210390957 A1) are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of SATO and Wexler to incorporate the number-of-speakers related information included in the mixed audio signal comprising at least one of first number-of-speakers related information about the mixed audio signal or second number-of-speakers related information about the visual information, as taught by Wexler. This allows for improved processing efficiency and/or helps to preserve battery life, as recognized by Wexler at [0106].
As to Claim 9, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) teach 9. The electronic device of claim 8,
Furthermore, Wexler (U.S. Patent Number US 20210390957 A1) teaches wherein the first number-of-speakers related information comprises a probability distribution of a number of speakers corresponding to the plurality of sound sources included in the mixed audio signal, and wherein the second number-of-speakers related information comprises a probability distribution of the number of speakers included in the visual information. (see Wexler (U.S. Patent Number US 20210390957 A1) [0241] In some embodiments, the at least one processor may be programmed to generate statistical information associated with the identified at least one word or phrase. The statistical information may include at least one of a total count, an average count, or a frequency of occurrence of the at least one word or phrase in the at least one audio signal. For example, processor 210 may be configured to determine a number of times one or more speakers (e.g., user 100, individual 2710, individual 2720, etc.) in environment 2700 speaks a predetermined word. It is contemplated that processor 210 may be configured to generate various types of statistical information regarding one or more predetermined words. Such information may include total number of times one or more words is spoken, an average over time or over a number of speakers that one or more words is spoken, a frequency with which one or more speakers speaks the one or more predetermined words, etc. By way of example, user 100, individual 2710, and/or individual 2720 may have agreed to minimize a number of curse words that may be included in a conversation. Processor 210 may be configured to tally up the number of times user 100, individual 2710, and/or individual 2720 speaks a curse word. Processor 210 may also be configured to provide the generated statistical information to user 100, individual 2710, and/or individual 2720 by displaying the information on a device (e.g., smartphone, smartwatch, laptop, tablet, or other devices) associated with one or more of user 100, individual 2710, and/or individual 2720.”)
SATO and Wexler (U.S. Patent Number US 20210390957 A1) are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of SATO and Wexler to incorporate the first number-of-speakers related information comprising a probability distribution of a number of speakers corresponding to the plurality of sound sources included in the mixed audio signal, and the second number-of-speakers related information comprising a probability distribution of the number of speakers included in the visual information, as taught by Wexler. This allows for improved processing efficiency and/or helps to preserve battery life, as recognized by Wexler at [0106].
As to dependent Claim 18, Claim 18 is a parallel Method claim with limitations similar to those of Claim 7 and is rejected under the same rationale.
As to dependent Claim 19, Claim 19 is a parallel Method claim with limitations similar to those of Claim 8 and is rejected under the same rationale.
Claims 2 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over SATO (U.S. Patent Application Publication No. US 20220335965 A1) in view of Wexler (U.S. Patent Application Publication No. US 20210390957 A1), and further in view of MESGARANI (U.S. Patent Application Publication No. US 20190066713 A1).
As to Claim 2, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) teaches 2. The electronic device of claim 1,
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) do not specifically teach wherein the audio-related information comprises a map indicating the degree of overlap in the plurality of sound sources, and wherein each bin of the map has a probability value corresponding to a degree to which one of the plurality of sound sources overlap with another in a time-frequency domain. However, MESGARANI does teach this limitation (see MESGARANI [0226] As discussed herein, an ODAN-based implementation projects T-F bins into a high-dimensional embedding space that is optimal for source separation, meaning that T-F bins belonging to the same source should be placed closer to each other in the embedding space. To confirm that this situation is the case, the representation of two speakers were projected in both the spectrogram domain and embedding domain onto a 2-D space using principal component analysis to allow visualization. This improved separability of speakers is shown in FIG. 27A, where the representations are visualized using the first two principal components of the spectrogram (in graph 2710) and the embedding space (in graph 2720). The improved separation in the embedding space is evident from the decreased overlap in the embedding space. Each dot represents one T-F bin (in the left graph 2710) or one embedded T-F bin (the right graph 2720).”)
SATO, Wexler (U.S. Patent Number US 20210390957 A1), and MESGARANI are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of SATO and Wexler to incorporate the audio-related information comprising a map indicating the degree of overlap in the plurality of sound sources, wherein each bin of the map has a probability value corresponding to a degree to which one of the plurality of sound sources overlaps with another in a time-frequency domain, as taught by MESGARANI. This allows a user to communicate more easily with new speakers, as recognized by MESGARANI at [0236].
As to dependent Claim 15, Claim 15 is a parallel Method claim with limitations similar to those of Claim 2 and is rejected under the same rationale.
Claims 3 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over SATO (U.S. Patent Application Publication No. US 20220335965 A1) in view of Wexler (U.S. Patent Application Publication No. US 20210390957 A1), and further in view of Hogden (U.S. Patent No. 6,678,658 B1).
As to Claim 3, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) teaches 3. The electronic device of claim 1,
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) do not specifically teach wherein the first Al model comprises: a first submodel configured to generate, from the image signal, a plurality of pieces of mouth movement information representing temporal pronouncing information of a plurality of speakers corresponding to the plurality of sound sources; and a second submodel configured to generate, from the mixed audio signal, the audio-related information, based on the plurality of pieces of mouth movement information. However, Hogden does teach this limitation (see Hogden (3:22-48) “(4) If the data in each window of the acoustic speech signal are first converted to a discrete symbol such as a VQ code, then the process of speech recognition can be thought of as converting the sequence of VQ codes (representing speech acoustics) to phonemes, i.e., finding the most probable sequence of phonemes given a sequence of VQ codes. In the more general case, there may be more than one time-aligned sequence of symbols used as input, e.g., VQ codes representing acoustics and VQ codes representing video images of the mouth. Each of these sequences of symbols is referred to as a input data stream. For example, the sequence of VQ codes representing speech acoustics is an input data stream, and the sequence of VQ codes representing video images is a different input data stream. There may also be more than one set of sequences to output, e.g., a binary variable indicating whether a given segment of speech is voiced and a different binary variable indicating whether the speech segment in nasalized, etc. Each separate output sequence (e.g. the sequence of symbols representing voiced/unvoiced) will be referred to as an output data stream. In general, then, for each window there is a set of output symbols, which are called generally herein the output composite, or, more particularly, speech transcription symbols or word characteristics; and a set of input symbols, which are called generally herein the input composite, or, more particularly, speech codes or words.”)
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) and Hogden are in the same field of endeavor of signal processing, therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the device combination of SATO and Wexler (U.S. Patent Number US 20210390957 A1) to incorporate the first Al model comprises: a first submodel configured to generate, from the image signal, a plurality of pieces of mouth movement information representing temporal pronouncing information of a plurality of speakers corresponding to the plurality of sound sources; and a second submodel configured to generate, from the mixed audio signal, the audio-related information, based on the plurality of pieces of mouth movement information of Hogden. This allows for more context to be utilized as recognized by Hogden (4:23-24).
As to dependent Claim 16, Claim 16 is a parallel Method claim with limitations similar to those of Claim 3 and is rejected under the same rationale.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over SATO (U.S. Patent Application Publication No. US 20220335965 A1) in view of Wexler (U.S. Patent Application Publication No. US 20210390957 A1), further in view of Hogden (U.S. Patent No. 6,678,658 B1), and further in view of Joze (U.S. Patent No. 10,931,976 B1).
As to Claim 4, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1), and further in view of Hogden teaches 4. The electronic device of claim 3,
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1), and further in view of Hogden do not specifically teach wherein the first Al model is trained by comparing training audio-related information estimated from a training image signal and a training audio signal with a ground truth, and wherein the ground truth is generated by a product operation between a plurality of probability maps generated from a plurality of spectrograms generated based on each of a plurality of individual training sound sources included in the training audio signal. However, Joze does teach this limitation (see Joze (7:8-37) “(31) The bridging according to the present techniques is enabled via mutual autoencoders as discussed above, one autoencoder for video data and one autoencoder for audio data. The video autoencoder and the audio autoencoder are trained separately. For ease of description and explanation, the training data may be obtained from the Global Research Identifier Database (GRID) dataset, which includes thirty-four speakers with limited words. Other training data may be used. During training, speech synthesis parameters may be extracted from the training data. For example, the extracted speech parameters may be WORLD parameters as described by M. Morise, F. Yokomori, and K. Ozawa: WORLD: a vocoder-based high-quality speech synthesis system for real-time applications, IEICE transactions on information and systems, vol. E99-D, no. 7, pp. 1877-1884, 2016. In embodiments, the speech synthesis parameters enable neural speech synthesis from the input video. In the examples described herein, the audio sampling frequency for data input to the audio autoencoder may be, for example, 16 kHz. The video frame rate for video data input to the video autoencoder may be, for example, 25 fps. The video frames may be cropped to a size of 112×112 pixels and primarily contain a face of a human. The video autoencoder may be trained to extract facial landmarks using a library or toolkit with machine learning algorithms. In particular, a library such as DLib may be used to extract sixty-eight coordinates (x, y) that map facial points on a human face. DLib may be used to extract a 68×2 element matrix where each row of the element matrix corresponds to a coordinate of a particular feature point in the input image.”) (see Joze (9:6-38) “(43) FIG. 2 illustrates an exemplary encoder portion 200 of an audio autoencoder that enables face-speech bridging by cycle audio/video reconstruction as described herein. In particular, parameters extracted from the audio dataset includes a spectrogram of the audio data, a fundamental frequency (FO), and band aperiodicities (ap). In FIG. 2, the input audio is converted into trainable embedding representations 202. In embodiments, the embedding's 202 represent a relatively low dimensional space to which high dimensional vectors are translated. Thus, an embedding may be a compressed representation of input data. An embedding can be learned and reused across various models. In embodiments, embeddings may be used to map frames of data to low-dimensional real vectors in a way that similar items are close to each other according to a similarity metric. In particular, a frame of audio data and a frame of video data may be mapped in a way that similar audio data and video data are close to each other in the common space. As used herein, being close refers to satisfying a similarity metric. 
Thus, jointly embedding diverse data types such as audio and video can be accomplished by defining a similarity metric between the audio and video. This similarity metric may be obtained by minimizing a loss, such as the Mean Squared Error minus Correlation (MSE-Corr), with a lower bound value of −1. (44) The spectrogram 204 represents the frequency component of the audio data. In some cases, the spectrogram 204 may be a modulation spectrogram extracted from a speech spectrogram via a short-term spectral analysis. In embodiments, the spectrogram 204 may be a graph of all the frequencies that are present in a sound recording for a given amount of time or a given number of audio frames. The frequency data may be of a dimension 64×513×1.”) (see Joze (12:51-13:7) “(61) FIG. 6 is an illustration of an adversarial network 600. The adversarial network 600 includes dual neural networks a referred to as a generator and a discriminator. The generator may take as input landmarks and embeddings. The generator may map the landmarks into reconstructed frames through a set of convolutional layers, which are modulated by the embeddings. Corresponding audio data may be derived. Accordingly, there are four possible inputs to the adversarial network: the ground truth video data V and the ground truth audio data A; the reconstructed video data V′ and the ground truth audio data A; the ground truth video data V and the reconstructed audio data A′; and the reconstructed video data V′ and the reconstructed audio data A′. For each combination, the adversarial network can output a realism score for each combination. In embodiments, the realism score may be used as a feedback or penalty for the main autoencoding networks. Additionally, in embodiments, the adversarial network may have a generator takes in random numbers and returns an image. The generated image is fed into the discriminator alongside a stream of images taken from the actual, ground-truth dataset. The discriminator takes in both real and fake images and returns probabilities, a number between 0 and 1, with 1 representing a prediction of authenticity and 0 representing fake.”)
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) and further in view of Hogden and Joze are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the device combination of SATO and Wexler (U.S. Patent Number US 20210390957 A1) and Hogden to incorporate wherein the first AI model is trained by comparing training audio-related information estimated from a training image signal and a training audio signal with a ground truth, and wherein the ground truth is generated by a product operation between a plurality of probability maps generated from a plurality of spectrograms generated based on each of a plurality of individual training sound sources included in the training audio signal, as taught by Joze. This allows for improved correlation with the facial movement of the person's lips, as recognized by Joze (13:21-22).
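For purposes of illustration only, and not as a characterization of the claimed invention or of Joze, the following minimal sketch shows one way a ground truth could be formed as an element-wise product of per-source probability maps computed from spectrograms of individual training sound sources; all function names, array shapes, and the clipping value are the examiner's illustrative assumptions.

import numpy as np

def probability_map(spectrogram, clip_max=1.0):
    # Illustrative mapping of spectrogram magnitudes to [0, 1]
    # via log(1 + |F|^2), clipped at clip_max (cf. the claim 5 discussion below).
    return np.minimum(np.log1p(np.abs(spectrogram) ** 2), clip_max)

def ground_truth_from_sources(source_spectrograms):
    # Element-wise product of the per-source probability maps,
    # one map per individual training sound source.
    maps = [probability_map(s) for s in source_spectrograms]
    ground_truth = maps[0]
    for m in maps[1:]:
        ground_truth = ground_truth * m
    return ground_truth

# Hypothetical usage with two random (frequency x time) "spectrograms"
rng = np.random.default_rng(0)
sources = [rng.random((257, 100)), rng.random((257, 100))]
gt = ground_truth_from_sources(sources)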
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over SATO (U.S. Patent Number US 20220335965 A1), in view of Wexler (U.S. Patent Number US 20210390957 A1), and further in view of Hogden (U.S. Patent Number US 6678658 B1), and further in view of Joze (U.S. Patent Number US 10931976 B1), and further in view of Khoury (U.S. Patent Number US 11715460 B2).
As to Claim 5, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1), and further in view of Hogden and further in view of Joze teaches 5. The electronic device of claim 4,
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1), and further in view of Hogden and further in view of Joze do not specifically teach wherein each of the plurality of probability maps is generated by MaxClip(log(1 + ||Fi||²), 1), where ||Fi||² is a size of a corresponding spectrogram from among the plurality of spectrograms, and MaxClip(x, 1) is a function that outputs x when x is less than 1, and outputs 1 when x is equal to or greater than 1. However, Khoury does teach this limitation (see Khoury (10:41-53) “(51) As an example, the input audio signal fed into the audio clipping layer 302 contains two seconds of speech. The audio clipping layer 302 randomly selects from any random point in the two-second input audio signal a segment that is between 0 and 300 ms in duration. The audio clipping layer 302 then sets the energy values of the segment to an extreme high or low value (e.g., −1, 1). The audio clipping layer 302 outputs a simulated audio signal having the changes imposed on the input audio signal at the one or more clipped segments and/or the one or more clipped segments. In some cases, the clipping layer 302 may output the original input audio signal.”)
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) and further in view of Hogden and further in view of Joze and Khoury are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the device combination of SATO, Wexler (U.S. Patent Number US 20210390957 A1), Hogden, and Joze to incorporate wherein each of the plurality of probability maps is generated by MaxClip(log(1 + ||Fi||²), 1), where ||Fi||² is a size of a corresponding spectrogram from among the plurality of spectrograms, and MaxClip(x, 1) is a function that outputs x when x is less than 1, and outputs 1 when x is equal to or greater than 1, as taught by Khoury. This allows for improved speaker diarization, as recognized by Khoury (14:50).
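Solely as an aid in reading the notation recited in claim 5, the following sketch implements a MaxClip(x, 1) function and applies it to log(1 + ||Fi||²) computed element-wise from a spectrogram magnitude; the element-wise interpretation and the example array contents are the examiner's assumptions for illustration, not a claim construction.

import numpy as np

def max_clip(x, limit=1.0):
    # Outputs x when x is less than limit, and limit when x is equal to or greater than limit.
    return np.where(x < limit, x, limit)

def claimed_probability_map(spectrogram):
    # MaxClip(log(1 + ||Fi||^2), 1), with ||Fi||^2 taken element-wise over the
    # spectrogram magnitude (an illustrative assumption).
    return max_clip(np.log1p(np.abs(spectrogram) ** 2), 1.0)

# Worked example: log(1 + 0.25) ~= 0.223 passes through unchanged; log(1 + 100) ~= 4.615 clips to 1.
example = claimed_probability_map(np.array([[0.5, 2.0], [10.0, 0.1]]))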
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over SATO (U.S. Patent Number US 20220335965 A1), in view of Wexler (U.S. Patent Number US 20210390957 A1), and further in view of Joze (U.S. Patent Number US 10931976 B1).
As to Claim 6, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) teach 6. The electronic device of claim 1,
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) do not specifically teach wherein the second Al model comprises an input layer, an encoder including a plurality of feature layers, and a bottleneck layer, and wherein the applying of the audio-related information to the second Al model comprises at least one of applying of the audio-related information to the input layer, applying of the audio-related information to each of the plurality of feature layers included in the encoder, or applying of the audio-related information to the bottleneck layer. However Joze does teach this limitation (see Joze (7:7-37) “(31) The bridging according to the present techniques is enabled via mutual autoencoders as discussed above, one autoencoder for video data and one autoencoder for audio data. The video autoencoder and the audio autoencoder are trained separately. For ease of description and explanation, the training data may be obtained from the Global Research Identifier Database (GRID) dataset, which includes thirty-four speakers with limited words. Other training data may be used. During training, speech synthesis parameters may be extracted from the training data. For example, the extracted speech parameters may be WORLD parameters as described by M. Morise, F. Yokomori, and K. Ozawa: WORLD: a vocoder-based high-quality speech synthesis system for real-time applications, IEICE transactions on information and systems, vol. E99-D, no. 7, pp. 1877-1884, 2016. In embodiments, the speech synthesis parameters enable neural speech synthesis from the input video. In the examples described herein, the audio sampling frequency for data input to the audio autoencoder may be, for example, 16 kHz. The video frame rate for video data input to the video autoencoder may be, for example, 25 fps. The video frames may be cropped to a size of 112×112 pixels and primarily contain a face of a human. The video autoencoder may be trained to extract facial landmarks using a library or toolkit with machine learning algorithms. In particular, a library such as DLib may be used to extract sixty-eight coordinates (x, y) that map facial points on a human face. DLib may be used to extract a 68×2 element matrix where each row of the element matrix corresponds to a coordinate of a particular feature point in the input image.”) (see Joze (5:34-60) “(25) An autoencoder, such as the video autoencoder and/or the audio autoencoder described above, is a neural network with equal input and output sizes. During training, the neural network learns to reconstruct the input to derive the output according to an unsupervised learning model by minimizing a reconstruction error custom character. The autoencoder may have an internal, hidden layer that describes a code or common space used to represent the input. Thus, an autoencoder may contain an encoder that maps the input data into the code or common space, the particular common space with mid-level representations of the input data, and a decoder that maps the code or common space to a reconstruction of the input data. In embodiments, the autoencoder may also be further specialized to perform a dimensionality reduction by including a lower dimensional hidden layer. In particular, the common space may constrain the mid-level representations of the input data to be reduced to smaller dimensions than the input data. In some scenarios, this lower dimensional hidden layer may be referred to as a bottleneck. 
In order to minimize the error between the input data and the reconstructed output data, a training objective of the autoencoder effectively causes the model to learn a transformation from the input space to this lower-dimensional hidden layer and back to a reconstructed output space of the same dimensionality as the input space.”) (see Joze (3:17-42) “(15) The present techniques enable face-speech bridging by cycle audio/video reconstruction. In embodiments, a video and an audio of a speech utterance from a human are mutually autoencoded while maintaining a mid-level representation of each modality that corresponds to the mid-level representation of the remaining one or more modalities. The mid-level representation may be referred to as an embedding. Mutual autoencoding, as used herein, refers to converting information from one or more modalities of communication that share a same relation toward other modalities of the one or more modalities of communication. This same relation may be enforced by a bottleneck loss function as described below. In this manner, the one or more modalities are entangled with each other, such that a same representation of information across each of the one or more modalities of communication exists. This enables a mutual two-way bridge of information sharing across the one or more modalities. In the example of an audio/video modality pair, the present techniques enable a two-way bridge between these modalities. In particular, the audio data can be reconstructed from the video data, and the video data can be reconstructed from the audio data. This mutual two-way bridging via autoencoding has a number of use applications, such as video/audio quality enhancement, helping people with hearing or vision loss, improved audio-visual speech recognition, and improved emotion recognition.”)
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) and Joze are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the device combination of SATO and Wexler (U.S. Patent Number US 20210390957 A1) to incorporate wherein the second AI model comprises an input layer, an encoder including a plurality of feature layers, and a bottleneck layer, and wherein the applying of the audio-related information to the second AI model comprises at least one of applying of the audio-related information to the input layer, applying of the audio-related information to each of the plurality of feature layers included in the encoder, or applying of the audio-related information to the bottleneck layer, as taught by Joze. This allows for improved correlation with the facial movement of the person's lips, as recognized by Joze (13:21-22).
As to dependent Claim 17, Claim 17 is a parallel Method claim with limitations similar to those of Claim 6 and is rejected under the same rationale.
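For illustration of the layer structure discussed above for Claims 6 and 17, and without characterizing Applicant's disclosure or Joze, the following sketch shows an encoder with an input layer, a plurality of feature layers, and a bottleneck layer, where the audio-related information is applied (here, by concatenation) at the bottleneck layer; the dimensions, layer types, and conditioning mechanism are the examiner's illustrative assumptions.

import torch
import torch.nn as nn

class ConditionedEncoder(nn.Module):
    # Illustrative encoder: input layer -> feature layers -> bottleneck layer,
    # with side information concatenated at the bottleneck.
    def __init__(self, in_dim=257, feat_dim=128, bottleneck_dim=64, cond_dim=16):
        super().__init__()
        self.input_layer = nn.Linear(in_dim, feat_dim)
        self.feature_layers = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(2)]
        )
        self.bottleneck = nn.Linear(feat_dim + cond_dim, bottleneck_dim)

    def forward(self, x, audio_related_info):
        h = torch.relu(self.input_layer(x))
        for layer in self.feature_layers:
            h = torch.relu(layer(h))
        # Apply the audio-related information at the bottleneck layer.
        h = torch.cat([h, audio_related_info], dim=-1)
        return self.bottleneck(h)

# Hypothetical usage: a batch of 4 spectral frames and a 16-dimensional conditioning vector
encoder = ConditionedEncoder()
frames = torch.randn(4, 257)
audio_info = torch.randn(4, 16)
bottleneck_code = encoder(frames, audio_info)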
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over SATO (U.S. Patent Number US 20220335965 A1), in view of Wexler (U.S. Patent Number US 20210390957 A1), and further in view of Joze (U.S. Patent Number US 10931976 B1), and further in view of Zhao (U.S. Patent Number US 20200211569 A1).
As to Claim 10, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) teach 10. The electronic device of claim 7,
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) do not specifically teach wherein the second Al model comprises an input layer, an encoder including a plurality of feature layers, and a bottleneck layer, However Joze does teach this limitation (see Joze, (7:7-37) “(31) The bridging according to the present techniques is enabled via mutual autoencoders as discussed above, one autoencoder for video data and one autoencoder for audio data. The video autoencoder and the audio autoencoder are trained separately. For ease of description and explanation, the training data may be obtained from the Global Research Identifier Database (GRID) dataset, which includes thirty-four speakers with limited words. Other training data may be used. During training, speech synthesis parameters may be extracted from the training data. For example, the extracted speech parameters may be WORLD parameters as described by M. Morise, F. Yokomori, and K. Ozawa: WORLD: a vocoder-based high-quality speech synthesis system for real-time applications, IEICE transactions on information and systems, vol. E99-D, no. 7, pp. 1877-1884, 2016. In embodiments, the speech synthesis parameters enable neural speech synthesis from the input video. In the examples described herein, the audio sampling frequency for data input to the audio autoencoder may be, for example, 16 kHz. The video frame rate for video data input to the video autoencoder may be, for example, 25 fps. The video frames may be cropped to a size of 112×112 pixels and primarily contain a face of a human. The video autoencoder may be trained to extract facial landmarks using a library or toolkit with machine learning algorithms. In particular, a library such as DLib may be used to extract sixty-eight coordinates (x, y) that map facial points on a human face. DLib may be used to extract a 68×2 element matrix where each row of the element matrix corresponds to a coordinate of a particular feature point in the input image.”)
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) and Joze are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the device combination of SATO and Wexler (U.S. Patent Number US 20210390957 A1) to incorporate wherein the second AI model comprises an input layer, an encoder including a plurality of feature layers, and a bottleneck layer, as taught by Joze. This allows for improved correlation with the facial movement of the person's lips, as recognized by Joze (13:21-22).
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) and further in view of Joze do not specifically teach and wherein the applying of the number-of-speakers related information to the second Al model comprises at least one of applying of the number-of-speakers related information to the input layer, applying of the number-of-speakers related information to each of the plurality of feature layers included in the encoder, or applying of the number-of-speakers related information to the bottleneck layer. However Zhao does teach this limitation (see Zhao, [0030] “The classification/recognition mechanism 108 may include a linear layer and a Softmax layer (not shown). The Softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression. The output of the Softmax function can be used to represent a categorical distribution, that is, a probability distribution over a number of different possible outcomes. The utterance-level embedding extraction mechanism 106 may feed the utterance-level embedding/representation 116 to the linear layer and the Softmax layer. The linear layer may map the utterance-level embedding/representation 116 into a predetermined dimensional vector. For example, if the number of speakers were 1251, the linear layer would map the 512-D vector to a 1251-D vector. After passing the Softmax layer, each element of the 1251-D vector may have a value corresponding to a probability associated with a class. The element with the maximum value may be selected to determine to which class the input audio signal belongs. Each class may be associated with an ID of a speaker. As an example, if the R.sup.th element of the 1251-D vector were the maximum value, the input utterance/sentence would be determined as belonging to the R.sup.th class, which may correspond an ID of the R.sup.th speaker, where 1≤R≤1251. That is, the input utterance/sentence belongs to the R.sup.th speaker. Numbers and symbols discussed herein are uses for the sake of description without limiting the application thereto.”)
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) and further in view of Joze and Zhao are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the device combination of SATO, Wexler (U.S. Patent Number US 20210390957 A1), and Joze to incorporate wherein the applying of the number-of-speakers related information to the second AI model comprises at least one of applying of the number-of-speakers related information to the input layer, applying of the number-of-speakers related information to each of the plurality of feature layers included in the encoder, or applying of the number-of-speakers related information to the bottleneck layer, as taught by Zhao. This allows for improved performance and stability, as recognized by Zhao [0042].
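As a non-limiting illustration of the linear-layer-plus-Softmax mapping described in the quoted Zhao passage, the following sketch maps a 512-dimensional utterance-level embedding to per-speaker probabilities and selects the maximum-value class; the batch size and the random embedding are the examiner's assumptions.

import torch
import torch.nn as nn

# Illustrative classification head per the quoted Zhao passage: a linear layer maps a
# 512-D utterance-level embedding to a 1251-D vector (one element per speaker class),
# and Softmax converts that vector to a probability distribution over speakers.
num_speakers = 1251
head = nn.Sequential(nn.Linear(512, num_speakers), nn.Softmax(dim=-1))

embedding = torch.randn(1, 512)              # hypothetical utterance-level embedding
probabilities = head(embedding)              # one probability per speaker class
predicted_speaker = torch.argmax(probabilities, dim=-1)  # index of the maximum-value element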
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over SATO (U.S. Patent Number US 20220335965 A1), in view of Wexler (U.S. Patent Number US 20210390957 A1), and further in view of Wexler (U.S. Patent Number US 20220172736 A1).
As to Claim 11, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) teach 11. The electronic device of claim 1,
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) do not specifically teach wherein the at least one processor is further configured to: obtain a plurality of pieces of mouth movement information associated with the plurality of speakers from the image signal; and separate, from the mixed audio signal, at least one of the plurality of sound sources included in the mixed audio signal by applying the obtained plurality of pieces of mouth movement information to the second Al model. However, Wexler (U.S. Patent Number US 20220172736 A1) does teach this limitation (See Wexler (U.S. Patent Number US 20220172736 A1) “[0244] To separate the speaker's voice from additional speakers or background noise in a noisy audio, a second engine, such as a neural network may receive the noisy audio and the speaker's signature, and output audio (which may also be represented as attributes) of the voice of the speaker as extracted from the noisy audio, separated from the other speech or background noise. It will be appreciated that the same or additional neural networks may be used to separate the voices of multiple speakers. For example, if there are two possible speakers, two neural networks may be activated, each with models of the same noisy output and one of the two speakers. Alternatively, a neural network may receive voice signatures of two or more speakers and output the voice of each of the speakers separately. Accordingly, the system may generate two or more different audio outputs, each comprising the speech of a respective speaker. In some embodiments, if separation is impossible, the input voice may only be cleaned from background noise. Thus, as explained above, processor 210 may be configured to determine whether audio signal 2702 includes a voice of user 100 when, for example, a portion of audio signal 2702 matches with a voiceprint associated with user 100. As also discussed above, processor 210 may additionally or alternatively recognize the voices of individuals 2620, 2630, etc., by tracking lip movements of one or more of individuals 2620, 2630, etc., in the one or more images obtained using camera 1730. It will be appreciated, however, that an audio signal may be separated also if none, or only part of the speakers in the audio are recognized and a corresponding voice print is available.”)
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) and Wexler (U.S. Patent Number US 20220172736 A1) are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the device combination of SATO and Wexler (U.S. Patent Number US 20210390957 A1) to incorporate wherein the at least one processor is further configured to: obtain a plurality of pieces of mouth movement information associated with the plurality of speakers from the image signal; and separate, from the mixed audio signal, at least one of the plurality of sound sources included in the mixed audio signal by applying the obtained plurality of pieces of mouth movement information to the second AI model, as taught by Wexler (U.S. Patent Number US 20220172736 A1). This allows for improved processing efficiency and/or helps to preserve battery life, as recognized by Wexler (U.S. Patent Number US 20220172736 A1) [0099].
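Purely for illustration of applying mouth movement information to a separation model, as discussed for Claim 11, and not as a representation of Wexler's implementation, the following sketch conditions a simple mask-estimating network on per-speaker mouth-movement features; the feature dimensions (including the 68-landmark size mentioned in the Joze quotation above) and the network layout are the examiner's assumptions.

import torch
import torch.nn as nn

class LipConditionedSeparator(nn.Module):
    # Illustrative separator: takes one frame of the mixed-audio spectrogram plus
    # mouth-movement features for one speaker, and outputs a spectral mask for that speaker.
    def __init__(self, spec_dim=257, lip_dim=68 * 2, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(spec_dim + lip_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, spec_dim),
            nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, mixture_frame, mouth_movement_features):
        return self.net(torch.cat([mixture_frame, mouth_movement_features], dim=-1))

# Hypothetical usage: estimate one mask per speaker from the same mixture frame
separator = LipConditionedSeparator()
mixture = torch.randn(1, 257)
masks = [separator(mixture, torch.randn(1, 68 * 2)) for _ in range(2)]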
Claims 12 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over SATO (U.S. Patent Number US 20220335965 A1), in view of Wexler (U.S. Patent Number US 20210390957 A1), and further in view of WEXLER (U.S. Patent Number US 20220021985 A1).
As to Claim 12, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) teach 12. The electronic device of claim 1,
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) do not specifically teach further comprising: an input/output interface configured to display a screen on which the video is played back and receive, from a user, an input for selecting at least one speaker from among a plurality of speakers corresponding to the plurality of sound sources included in the mixed audio signal; and an audio output interface configured to output at least one sound source corresponding to the at least one speaker selected from among the plurality of sound sources included in the mixed audio signal. However, WEXLER (U.S. Patent Number US 20220021985 A1) does teach this limitation (see WEXLER (U.S. Patent Number US 20220021985 A1), [0447] In various embodiments, the hearing aid system may include user interface 3670 to allow user 100 to change performance characteristics of the hearing aid system. In some embodiments, the user interface 3670 may include an interface for receiving a visual, audio, tactile, or any other suitable signal from user 100. For example, the interface may include a display that may be part of a mobile device (e.g., a smartphone, laptop, tablet, etc.) In an example embodiment, the interface may include a touch screen, a graphical user interface (GUI) having GUI elements that may be manipulated by user gestures, or by appropriate physical or virtual (i.e., on screen) devices (e.g., keyboard, mouse, etc.). In some embodiments, interface 3670 may be an audio interface capable of receiving user 100 audio inputs (e.g., user 100 voice inputs) for adjusting one or more parameters of the hearing aid system. For example, user 100 may adjust the loudness of the audio signal produced by the hearing aid system using audio voice inputs, the pitch of the audio signal produced by the hearing aid system, tempo of the audio signal, and the like. In some embodiment, user interface 3670 may be configured to assist user 100 in identifying the data record for a speaker in conversation with user 100 and for facilitating separation of the voice of the speaker from the audio data captured by microphones of the hearing aid system. For example, interface 3670 may prompt user 100 to select a name for the speaker from a list of available names, to display an image of the speaker, to select an audio stream corresponding to the voice of the speaker, and the like.”)
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) and WEXLER (U.S. Patent Number US 20220021985 A1) are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the device combination of SATO and Wexler (U.S. Patent Number US 20210390957 A1) to incorporate an input/output interface configured to display a screen on which the video is played back and receive, from a user, an input for selecting at least one speaker from among a plurality of speakers corresponding to the plurality of sound sources included in the mixed audio signal; and an audio output interface configured to output at least one sound source corresponding to the at least one speaker selected from among the plurality of sound sources included in the mixed audio signal, as taught by WEXLER (U.S. Patent Number US 20220021985 A1). Doing so allows for improved processing efficiency and/or helps to preserve battery life, as recognized by WEXLER (U.S. Patent Number US 20220021985 A1) [0164].
As to Claim 13, SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) teach 13. The electronic device of claim 12,
Furthermore, WEXLER (U.S. Patent Number US 20220021985 A1) teaches wherein the at least one processor is further configured to: display, on the screen, a user interface for adjusting a volume of at least one sound source corresponding to the selected at least one speaker and receive, from the user, adjustment of the volume of the at least one sound source; and based on the adjustment of the volume of the at least one sound source, adjust the volume of the at least one sound source that is output through the audio output interface. (see WEXLER (U.S. Patent Number US 20220021985 A1), [0589] “Hearing aid system 4700 may also present the determined identity of sound emanating object 4710 to user 100 visually. FIG. 48 is an illustration showing an exemplary device displaying the name of a sound emanating object consistent with the present disclosure. As shown in FIG. 48, hearing aid system 4700 may display information about sound emanating object 4710 on a display of device 4801. In some embodiments, device 4801 may be a paired wearable device, such as a mobile phone, tablet, personal computer, smart watch, heads up display (HUD), or the like. In embodiments where sound emanating device 4710 is an individual, the at least one action performed by hearing aid system 4700 may include causing a name 4810 of the individual to be shown on the display. Various other information may also be presented on the display. For example, device 4801 may display an image 4811 of the object or individual, as shown in FIG. 48. Where sound emanating object is an individual, hearing aid system 4700 may display various other identification information associated with the individual (e.g., a phone number, address, title, company, relationship, age, etc.). The display may also include other functionality associated with the individual, such as contacting the individual (e.g., by phone, email, SMS, etc.), access an account associated with the individual (e.g., a social media page, file sharing account or location, etc.), or the like. In some instances, the display may also include functionality for confirming or editing the identification of sound emanating object 4710, for example, to improve a trained neural network or other machine learning system, as described above.”) (see WEXLER (U.S. Patent Number US 20220021985 A1) [0432] “If a voiceprint is available, (step 3604, Yes) the speaker's voice may be separated from the audio data and transmitted to user 100 at step 3606. If no voiceprint is available, and/or if the separation of the speaker's voice from the audio data is not successful, (step 3604, No) the hearing aid system may silence the output at step 3601. The output may be silenced using any of the approaches described above. In some embodiments, completely silencing the rest of the voices may create an uneasy and out of context feel, for example, when speaking to a person in a restaurant and seeing the waiter approaching and talking but not hearing anything. Therefore, providing a low but positive amplification for the other sound, for example, 10%, 20%, or any other suitable degree of the volume may feel more natural for user 100. Similarly, if no voice is recognized by the hearing aid system, instead of silencing everything, the loudness of the environmental noises can be reduced to a predetermined level. 
In such circumstances, the audio related to the environmental sounds may be transmitted at a low volume (e.g., 10% of the original volume) for a more natural feeling, enabling user 100, for example, to hear some background noise at a restaurant. The loudness level selected by the hearing aid system may be set by a user or predetermined, depending on an environmental situation, location of user 100, time of the day, and the like.”)
SATO in view of Wexler (U.S. Patent Number US 20210390957 A1) and WEXLER (U.S. Patent Number US 20220021985 A1) are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the device combination of SATO and Wexler (U.S. Patent Number US 20210390957 A1) to incorporate wherein the at least one processor is further configured to: display, on the screen, a user interface for adjusting a volume of at least one sound source corresponding to the selected at least one speaker and receive, from the user, adjustment of the volume of the at least one sound source; and based on the adjustment of the volume of the at least one sound source, adjust the volume of the at least one sound source that is output through the audio output interface, as taught by WEXLER (U.S. Patent Number US 20220021985 A1). Doing so allows for improved processing efficiency and/or helps to preserve battery life, as recognized by WEXLER (U.S. Patent Number US 20220021985 A1) [0164].
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KRISTEN MICHELLE MASTERS whose telephone number is (703)756-1274. The examiner can normally be reached M-F 8:30 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Louis Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KRISTEN MICHELLE MASTERS/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659