DETAILED ACTION
This Office action is in response to Applicant’s Request for Continued Examination (RCE), received on 01/23/2026. Claims 1, 22, and 28 have been amended. Claims 1-13 and 22-28 are pending and have been considered.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Election/Restrictions
Applicant’s election without traverse of Group 1, consisting of claims 1-13 and 22-28, in the reply filed on 05/21/2025, is acknowledged.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/23/2026 has been entered.
Response to Arguments
Applicant’s arguments, see pgs. 11-12, filed 12/30/2025, with respect to the rejection(s) of independent claim(s) 1, 22, and 28 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Zou et al. (CN-111754983-A), hereinafter Zou (with respect to the attached machine translation of Zou). Zou discloses “The invention claims a voice de-noising method, device, electronic device and storage medium, the method comprises: obtaining the voice data to be de-noised; extracting the spectrogram information of the voice data to be de-noised; inputting the spectrum map information into the pre-trained neural network model, obtaining the signal-to-noise ratio corresponding to the spectrum map information, wherein the neural network model is obtained based on sample spectrum information of sample voice data marked with known noise data and sample signal-to-noise ratio corresponding to sample spectrum information; de-noising the voice data to be de-noised based on the signal-to-noise ratio corresponding to the spectrum map information, and obtaining the de-noised voice data. when training the neural network model, the noise data in the sample voice data is known, so that the trained neural network model can accurately determine the signal-to-noise ratio corresponding to the frequency spectrum information of the voice data to be de-noised” (abstract). Zou will be used to replace Krishnamoorthy. See updated rejections below.
Applicant’s arguments with respect to claim(s) 2-13, 23-27 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: an "automatic speech recognition (ASR) engine" used for “processing” in claims 2, 23.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1-6, 10-13, 22-28 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zou et al. (CN-111754983-A), hereinafter Zou, in view of Jensen et al. (US-20210084407-A1), hereinafter Jensen, further in view of Sharma et al. (US-20210350804-A1), hereinafter Sharma.
Regarding claim 1, Zou discloses: a computer-implemented method ([pg. 3, last para] a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the speech denoising method according to the first aspect are implemented), comprising:
processing, using a trained neural network model ([pg. 5, para 7] neural network model is obtained by training based on sample spectrogram information of sample voice data), a first digital representation of first audio data as first input ([pg. 5, para 6] S130, inputting the spectrogram information into a pre-trained neural network model), to generate as output a first signal-to-noise ratio (SNR) value, wherein the first SNR is a prediction of the trained neural network model ([pg. 5, para 6] to obtain a signal-to-noise ratio corresponding to the spectrogram information, [Inputting a spectrogram into a neural network to obtain a signal-to-noise ratio indicates the SNR is a prediction of the trained neural network]), for the first audio data ([Voice data is first audio data]), wherein the first audio data captures a spoken utterance of a user ([pg. 5, para 1] after the voice data to be denoised is obtained, spectrogram information of the voice data to be denoised can be extracted, [Voice data tracks to a spoken utterance]).
Zou does not disclose:
wherein the first audio data is collected by one or more first microphones of a first computing device within an environment of a user; and,
processing, using the trained neural network model, the second digital representation of the second audio data as second input, to generate as output a second SNR value, wherein the second SNR is a prediction of the trained neural network model, for the second audio data, wherein the second audio data captures the spoken utterance of the user and is collected by one or more second microphones of a second computing device within the environment, and wherein the first and second computing devices are distinct from each other; and,
merging the first audio data and the second audio data to generate merged audio data for the spoken utterance.
Jensen discloses:
wherein the first audio data is collected by one or more first microphones of a first computing device within an environment of a user ([Fig. 1A, ML, MR in the environment of user 106], [0030] The listener 106 has microphones, M.sub.L (108) and M.sub.R (110), respectively positioned near the left ear and right ear of the listener 106… disposed on a head-worn device (e.g., headsets, glasses, earbuds, etc.), [The examiner asserts that a worn microphone must necessarily be connected to a computing device in order for the sound picked up by the microphone to be analyzed as disclosed in Jensen, i.e. smart glasses]); and,
processing, using the trained neural network model ([In view of the previously disclosed trained neural network model of Zou]), a second digital representation of second audio data as second input ([0075] the frequency domain representation of the second input signal comprises a first complex vector representing a spectrogram, [A representation comprising a vector which represents a spectrogram indicates the vector could be converted to a spectrogram using the spectrogram analysis of Zou, i.e. resulting in a second digital representation of second audio data]), to generate as output a second SNR value, wherein the second SNR is a prediction of the trained neural network model ([The examiner asserts that it would have been obvious to apply the SNR predicting of Zou (previously cited, see [pg. 5, para 6]) to the multiple signals to be denoised of Jensen, resulting in a second SNR which is a prediction of a trained neural network model for second audio data in view of the second audio data of Jensen, wherein Jensen explicitly discloses receiving first and second input signals characterized by SNRs as the first step of their method ([0065], Fig. 9) which could be gathered using the method of Zou without a change in functionality to Jensen]), for the second audio data ([0073] The second input signal can be characterized by a second SNR, [In view of the previously disclosed training model of Zou which can determine SNRs of spectrograms, i.e. input signals]);
wherein the second audio data captures the spoken utterance of the user ([Fig. 10, 1004], [Receiving a second input signal representative of the audio, in view of the first receiving step 1002 which receives a signal representative of audio indicates the audio received by the second input signal representation is the same as the first, in view of the voice data of Zou]) and is collected by one or more second microphones of a second computing device within the environment ([Fig. 1A, MP], [All representing microphones wherein MP is on a second computing device in the environment compared to the worn ML and MR]), wherein the first and second computing devices are distinct from each other ([Fig. 7, 702, 704], [Disclosing on-head mics and off-head mics indicates one worn device and one external device]); and,
merging the first audio data and the second audio data to generate merged audio data for the spoken utterance ([Fig. 9, Combining first input signal and second input signal 906], [In view of the voice data of Zou]).
Zou and Jensen are considered analogous art within speech signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Zou to incorporate the teachings of Jensen, because of the novel way to consider both directionality and SNR of received audio sources from a plurality of devices, improving the naturalness of reproduced sounds in terms of SNR and spatial perception (Jensen, [0010]).
Zou in view of Jensen does not disclose:
wherein the merging comprises using a first weight value for the first audio data in the merging and using a second weight value for the second audio data in the merging, and wherein the first weight value and the second weight value are determined based on the first and second SNR values that are predictions of the trained neural network; and,
providing the merged audio data for further processing by one or more additional components.
Sharma discloses:
wherein the merging comprises using a first weight value for the first audio data in the merging and using a second weight value for the second audio data in the merging ([0167] combining 1720 the first speech processing output with the second speech processing output based upon, at least in part, the first audio stream weight and the second audio stream weight, [In view of the first audio output of Zou, further in view of the second audio output of Jensen]), wherein the first weight value and the second weight value are determined based on the first and second SNR values that are predictions of the trained neural network ([0164] weighting 1712 the first audio stream and the second audio stream based upon, at least in part, a signal-to-noise ratio for the first audio stream and a signal-to-noise ratio for the second audio stream, [In view of the trained neural network determining SNRs of Zou, further in view of the multiple input audio signals of Jensen indicating a weighting operation based on first and second SNRs which are predictions of the trained neural network. Further, consider the previously cited “combining” two input signal operation of Jensen ([Fig. 9, 906]) using the SNRs of Zou]); and,
providing the merged audio data for further processing by one or more additional components ([0058] configured to provide visual information 110 and audio information 114… [0067] automated clinical documentation process 10 may provide the information associated with an acoustic environment (e.g., via the user interface) [Providing audio information, in view of the merged signal generated in Sharma, to a user device, i.e. component, via an interface indicates presentation to a user, wherein that user could perform any number of additional processing steps, i.e. listening, transcribing, volume adjustment, etc.]).
Zou, Jensen, and Sharma are considered analogous art within speech signal processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Zou in view of Jensen to incorporate the teachings of Sharma, because of the novel way to apply multi-channel noise reduction and de-reverberation methods which account for user location to combine channel signals, improving the accuracy of generated text transcriptions in environments with multiple microphones (Sharma, [0003]).
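For clarity of the weighting rationale relied on in the combination above, the following is a minimal illustrative sketch, in Python, of merging two time-aligned audio streams using weights derived from per-stream SNR predictions. The function names, the proportional weighting rule, and the placeholder SNR predictor are assumptions made solely for illustration; they are not drawn from Zou, Jensen, or Sharma and are not asserted to be the implementation of any cited reference or of the claims.
    import numpy as np

    def predict_snr(spectrogram):
        # Stand-in for the trained neural network that predicts an SNR value from a
        # spectrogram (the role attributed to Zou above); a simple heuristic is used
        # here purely as a placeholder.
        magnitudes = np.abs(np.asarray(spectrogram, dtype=float))
        mean = float(np.mean(magnitudes))
        std = float(np.std(magnitudes)) + 1e-8  # guard against division by zero
        return mean / std

    def merge_streams(audio_1, audio_2, spec_1, spec_2):
        # Merge two time-aligned audio streams captured by distinct devices, using
        # weights determined from the SNR predicted for each stream (the weighting
        # rationale attributed to Sharma above).
        snr_1, snr_2 = predict_snr(spec_1), predict_snr(spec_2)
        w_1 = snr_1 / (snr_1 + snr_2)  # first weight value, from the first predicted SNR
        w_2 = snr_2 / (snr_1 + snr_2)  # second weight value, from the second predicted SNR
        merged = w_1 * np.asarray(audio_1) + w_2 * np.asarray(audio_2)
        return merged, (w_1, w_2)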
Regarding claim 2, Zou in view of Jensen, further in view of Sharma discloses: the method of claim 1.
Sharma further discloses:
wherein the one or more additional components include an automatic speech recognition (ASR) engine ([0063] speech processing system 418 may be an automated speech recognition (ASR) system), and the method further comprising:
processing the merged audio data, using the ASR engine ([In view of the previously disclosed speech processing engine of Sharma]), to generate a recognition of the spoken utterance of the user ([Fig. 4, 418], [Fig. 16, 1716], [0167] process 1716 the first audio stream (e.g., represented as a solid line between device selection weighting module 410 and speech processing system 418) with a first speech processing system to generate a first speech processing output [Speech processing output from a speech processing, i.e. recognition, system 418, wherein the input is a weighted signal from two sources, see Fig. 4, 410 inputs, indicates the output is a recognition of the spoken utterance]).
Regarding claim 3, Zou in view of Jensen, further in view of Sharma discloses: the method of claim 1.
Sharma further discloses:
wherein the first and second computing devices have different orientations and/or distances with respect to the user ([Fig. 3, directional microphone array 200], [0087] Based on the orientation of each microphone, the properties of each microphone, the properties of the audio signals, etc., each microphone may receive a different version of the signal, [Disclosing different orientations for microphones in view of the microphone array 200 indicates the devices, i.e. microphones, have different orientations and distances with respect to the users 106. Further, consider the microphone array 200 in view of the smartphone held by user 226 indicating two devices with different orientations/distances]).
Regarding claim 4, Zou in view of Jensen, further in view of Sharma discloses: the method of claim 1.
Sharma further discloses:
wherein a ratio of the first weight value with respect to the second weight value is the same as a ratio of the predicted SNR for the first audio data with respect to the predicted SNR for the second audio data ([0128] weight multiple audio streams (e.g., from different microphone systems) based upon, at least in part, a signal-to-noise ratio for each audio stream… [0167] combining 1720 the first speech processing output with the second speech processing output based upon, at least in part, the first audio stream weight and the second audio stream weight, [Combining two audio signals based on respective weights tracks to a method of generating a ratio, i.e. the final output signal will be a ratio of the two input signals, e.g. as they both contribute to the final output, wherein the weights are based on their respective SNRs indicating the combination, i.e. ratio, of the weights is the same as the combination, i.e. ratio, of the SNRs would be]).
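As a hypothetical numeric illustration of the proportional-weighting reading applied in the preceding paragraph (the SNR values below are assumptions chosen for illustration only and are not taken from any cited reference):
    # Hypothetical predicted SNR values for the first and second audio streams.
    snr_1, snr_2 = 18.0, 6.0
    # Weight values assigned in proportion to the predicted SNRs.
    w_1 = snr_1 / (snr_1 + snr_2)   # 0.75
    w_2 = snr_2 / (snr_1 + snr_2)   # 0.25
    # The ratio of the weight values equals the ratio of the predicted SNRs (3:1 here).
    assert abs(w_1 / w_2 - snr_1 / snr_2) < 1e-9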
Regarding claim 5, Zou in view of Jensen, further in view of Sharma discloses: the method of claim 1.
Sharma further discloses:
storing the first weight value in association with the first computing device ([Fig. 1, 12, 16], [0128] weighting module 410 of ACD computer system 12, [0034] storage device 16 coupled to ACD computer system 12 [Performing weighting on a device which is connected to a memory indicates the weight values determined in computer system 12 could be sent to storage device 16 without a change in functionality]); and,
storing the second weight value in association with the second computing device ([Performing weighting on a device which is connected to a memory indicates the weight values determined in computer system 12 could be sent to storage device 16 without a change in functionality. Applying this to a second weight associated with a second computing device does not change the functionality of Sharma in view of the plurality of computing devices in Fig. 1, i.e. 30, 32, 34, 36]).
Regarding claim 6, Zou in view of Jensen, further in view of Sharma discloses: the method of claim 5.
Sharma further discloses:
receiving first additional audio data ([0155] receive 1700 audio encounter information from a first microphone system), from the first computing device ([In view of the plurality of computing devices 30-36 of Sharma, any of which are substitutable for that disclosed in Jensen to be used in Zou without a change in functionality]), that captures an additional spoken utterance ([In view of the previously disclosed spoken utterance of Zou, indicating the audio encounter information of Sharma to be an additional utterance. Further, in view of the portion/frame basis for analysis of Sharma indicating one input signal could be comprised of additional audio data as compared to a first portion/frame]);
receiving second additional audio data ([0155] Audio encounter information may be received 1702 from a second microphone system), from the second computing device ([In view of the second audio device of Jensen, further in view of the plurality of computing devices 30-36 of Sharma, any of which are substitutable for that disclosed in Jensen to be used in Zou without a change in functionality]), that captures the additional spoken utterance ([Fig. 8, 808], [In view of the previously disclosed second spoken utterance of Jensen, indicating the second audio encounter information of Sharma to be a second utterance. Further, in view of the portion/frame basis for analysis of Sharma indicating one input signal could comprise additional audio data as compared to a first portion/frame. Further, determining the time difference of arrival between microphones for the same received audio encounter information 800 indicates the same audio is received by those microphones, i.e. the additional spoken utterance is the same between the two devices]);
merging the first additional audio data and the second additional audio data to generate additional merged audio data ([0167] combining 1720 the first speech processing output with the second speech processing output, [In view of the first and second additional audio outputs of Sharma]), wherein the merging to generate the additional merged audio data comprises using the stored first weight value and the stored second weight value without using the trained neural network to re-compute the first and second weight values ([0128] In some implementations, device selection and weighting module 410 may provide a previously processed or weighted portion of the audio encounter information to define weighting for each audio stream from multiple microphone systems [Using a previously processed portion of audio for weighting, in view of the portion/frame first and second audio data of Sharma, indicates weight could be gathered for first and second audio data to be used for further portions/frames of first and second audio streams]); and,
processing the additional merged audio data to recognize the additional spoken utterance ([Fig. 4, 418], [Fig. 16, 1716], [0167] process 1716 the first audio stream (e.g., represented as a solid line between device selection weighting module 410 and speech processing system 418) with a first speech processing system to generate a first speech processing output [Speech processing output from a speech processing, i.e. recognition, system 418, wherein the input is a weighted signal from two sources, see Fig. 4, 410 inputs, indicates the output is a recognition of the additional spoken utterance]).
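To illustrate the weight-reuse reading applied to claim 6 above, a minimal sketch follows; the per-device weight cache, the function name, and the data layout are assumptions made for illustration only and do not represent the implementation of any cited reference.
    import numpy as np

    # Hypothetical weight values stored from an earlier merge, keyed by capturing device.
    stored_weights = {"first_device": 0.7, "second_device": 0.3}

    def merge_additional_audio(additional_1, additional_2):
        # Merge additional audio data from the two devices using the previously stored
        # weight values, without re-running the trained neural network.
        return (stored_weights["first_device"] * np.asarray(additional_1)
                + stored_weights["second_device"] * np.asarray(additional_2))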
Regarding claim 10, Zou in view of Jensen, further in view of Sharma discloses: the method of claim 1.
Sharma further discloses:
detecting a change in location and/or orientation of the first computing device ([0125] machine vision system 100 may detect and track location estimates for particular speaker representations by tracking the humanoid shapes within the acoustic environment [Tracking humanoid shapes, wherein those humanoid shapes are generally associated with devices, see Fig. 1 and also speaker 226 holding a smartphone in Fig. 3, indicates the tracking will detect positional changes of devices associated with the humanoid shapes as the humans move]);
subsequent to detecting the change in the location and/or the orientation of the first computing device:
receiving further first audio data ([0155] receive 1700 audio encounter information from a first microphone system), from the first computing device ([In view of the first/second audio devices of Jensen, further in view of the plurality of computing devices 30-36 of Sharma, any of which are substitutable for that disclosed in Jensen to be used in Zou without a change in functionality]), that captures a further spoken utterance ([In view of the previously disclosed spoken utterance of Zou, indicating the audio encounter information of Sharma to be a further utterance. Further, in view of the portion/frame based signal analysis of Sharma (see Fig. 17, 1704) indicating later portions of signals represent further spoken utterances. Also consider [0164] which discloses weighting on a portion/frame basis indicating further audio from further portions/frames]);
receiving further second audio data ([0155] Audio encounter information may be received 1702 from a second microphone system), from the second computing device ([In view of the second audio device of Jensen, further in view of the plurality of computing devices 30-36 of Sharma, any of which are substitutable for that disclosed in Jensen to be used in Zou without a change in functionality]), that captures the further spoken utterance ([Fig. 8, 808], [In view of the previously disclosed second spoken utterance of Jensen, indicating the second audio encounter information of Sharma to be a second, further utterance. Further, in view of the portion-based signal analysis of Sharma (see Fig. 17, 1706) indicating portions of signals represent further spoken utterances. Also consider [0164] which discloses weighting on a portion/frame basis indicating further audio from further portions/frames. Further, determining the time difference of arrival between microphones for the same received audio encounter information 800 indicates the same audio is received by those microphones, i.e. the further spoken utterance is the same between the two devices.]).
Zou further discloses:
processing the further first audio data to determine a further first digital representation of the further first audio data ([pg. 5, para 1] after the voice data to be denoised is obtained, spectrogram information of the voice data to be denoised can be extracted, wherein the spectrogram information can include the amplitude of the voice data to be denoised, the phase of the voice data to be denoised, and the like, [Wherein “further” first audio can be a different temporal portion of the same signal, indicating the same operations applied as disclosed in Zou using the portions defined in Sharma]); and,
processing, using the trained neural network ([In view of the previously disclosed trained neural network of Zou]), the further first digital representation of the further first audio data as input ([The examiner asserts that extending the operations of Zou to multiple distinct signals does not change the functionality of Zou in view of the multiple signals of Jensen, any of which could be “further”. Further, consider the temporal analysis of Sharma indicating the signals of Zou can be split into “further” portions]), to generate a further first output indicating an updated first SNR predicted for the further first audio data ([pg. 5, para 6] S130, inputting the spectrogram information into a pre-trained neural network model to obtain a signal-to-noise ratio corresponding to the spectrogram information).
Jensen further discloses:
processing the further second audio data to determine a further second digital representation of the further second audio data ([0075] in some implementations, the frequency domain representation of the second input signal comprises a first complex vector representing a spectrogram of a frame of the second input signal, [In view of the further second audio data of Sharma, further in view of the spectrogram generation based on signals of Zou, indicating the vector of Jensen representing a second audio spectrogram frame could be transformed using the spectrogram generation of Zou without a change in functionality]); and,
processing, using the trained neural network ([In view of the previously disclosed trained neural network of Zou]), the further second digital representation of the further second audio data as input ([In view of the spectrogram of Zou based upon a second input signal of Jensen, wherein that second input signal could be further second audio data as disclosed through the portion-based analysis of audio signals disclosed in Sharma, i.e. analyzing different portions/frames of audio indicates each portion/frame is “further” audio]), to generate a further second output indicating an updated second SNR predicted for the further second audio data ([0073] The second input signal can be characterized by a second SNR, [In view of the SNR determination based on spectrogram analysis of Zou. Further, in view of the further second audio data of Sharma]).
Sharma further discloses:
merging the further first audio data and the second audio data using an updated first weight value and an updated second weight value ([0167] combining 1720 the first speech processing output with the second speech processing output based upon, at least in part, the first audio stream weight and the second audio stream weight, [In view of the further first audio output of Sharma, further in view of the further second audio output of Sharma, wherein the weights are calculated based on the SNRs of the signals received, indicating a further, i.e. different, signal will have a different SNR, indicating it will also have a different, i.e. updated, associated weight]), to generate further merged audio data ([Fig. 17, 1720], [Combining first and second speech outputs based on their associated weights indicates the resultant combination is “merged audio data”]), wherein the updated first and second weight values are determined based on the further first output indicating the updated first SNR and based on the further second output indicating the updated second SNR ([0164] weighting 1712 the first audio stream and the second audio stream based upon, at least in part, a signal-to-noise ratio for the first audio stream and a signal-to-noise ratio for the second audio stream, [In view of the further first and second audio streams, i.e. portions, of Sharma indicating the SNR calculation could be applied to all portions individually as further audio]).
Regarding claim 11, Zou in view of Jensen, further in view of Sharma discloses: the method of claim 1.
Zou further discloses:
wherein the first digital representation of the first audio data is a first spectrogram showing variation of frequency along time for the first audio data ([pg. 5, para 1] after the voice data to be denoised is obtained, spectrogram information of the voice data to be denoised can be extracted, wherein the spectrogram information can include the amplitude of the voice data to be denoised, the phase of the voice data to be denoised, and the like, [Gathering spectrogram information for generating a spectrogram indicates the spectrogram to be frequency along time as is traditionally how spectrograms are presented]), and
the second digital representation of the second audio data is a second spectrogram showing variation of frequency along time for the second audio data ([The examiner asserts that the process described above for determining a SNR for one spectrogram of audio data could be extended to a second spectrogram using the multiple input signals of Jensen without a change in functionality to Zou as the two representations are generated separately, indicating two steps in time of repeated operations of Zou using the signals of Jensen]).
Regarding claim 12, Zou in view of Jensen, further in view of Sharma discloses: the method of claim 1.
Sharma further discloses:
wherein: the first weight value is greater than the second weight value, indicating that the first audio data contains less noise than does the second audio data ([Fig. 3, Users 240, 230/242, Microphones 210, 218], [0164] weighting 1712 the first audio stream and the second audio stream based upon, at least in part, a signal-to-noise ratio for the first audio stream and a signal-to-noise ratio for the second audio stream, [Determining SNRs and associated weights based on signal SNRs indicates that if a first signal has a higher SNR than a second signal, it will inherently have a greater weight than the second audio, which means it will contain less noise. Nothing is disclosed in Sharma to prevent the first signal from having a higher SNR than the second signal. Choosing first and second signals so that the first weight is larger than the second weight is an arbitrary decision/labelling of signals. Further, consider the environment and microphone array of Fig. 3. If user 226 speaks signal 220, microphones 210 and 218 will both receive the audio; however, the SNR of signal 220 received by microphone 210, directly in front of the speaker, will be higher than that of microphone 218, which faces a different direction and also picks up audio from other speakers 224, etc. Further, the signal would arrive at 210 earlier as it is closer to the speaker, indicating a first audio with a higher SNR received at microphone 210 compared to a second audio with a lower SNR received at 218, wherein the SNRs are used to determine weights, indicating that the first audio data (with a higher SNR) will contain less noise than the second audio data received at microphone 218]).
Regarding claim 13, Zou in view of Jensen, further in view of Sharma discloses: the method of claim 1.
Sharma further discloses:
wherein merging the first audio data and the second audio data comprises: merging the first audio data weighted with the first weight value with the second audio data weighted with the second weight value, to generate the merged audio data for the spoken utterance ([Fig. 17, 1720], [Combining first and second speeches based upon weight values of streams indicates the combining is a merging operation, in view of the previously disclosed weight values for first and second audio]).
Regarding claim 22, Zou discloses: a system comprising:
one or more processors ([pg. 9, para 2] a processor); and,
memory storing instructions that, when executed by the one or more processors ([pg. 9, para 2] computer program stored in the memory and executable on the processor, where the processor implements the steps of the speech denoising method according to the first aspect when executing the program), cause the one or more processors to:
process, using a trained neural network model ([pg. 5, para 7] neural network model is obtained by training based on sample spectrogram information of sample voice data), a first digital representation of first audio data as first input ([pg. 5, para 6] S130, inputting the spectrogram information into a pre-trained neural network model), to generate as output a first signal-to-noise ratio (SNR) value, wherein the first SNR is a prediction of the trained neural network model ([pg. 5, para 6] to obtain a signal-to-noise ratio corresponding to the spectrogram information, [Inputting a spectrogram into a neural network to obtain a signal-to-noise ratio indicates the SNR is a prediction of the trained neural network]), for the first audio data ([Voice data is first audio data]), wherein the first audio data captures a spoken utterance of a user ([pg. 5, para 1] after the voice data to be denoised is obtained, spectrogram information of the voice data to be denoised can be extracted, [Voice data tracks to a spoken utterance]).
Zou does not disclose:
wherein the first audio data is collected by one or more first microphones of a first computing device within an environment of a user; and,
process, using the trained neural network model, the second digital representation of second audio data as second input, to generate as output a second SNR value, wherein the second SNR is a prediction of the trained neural network model, for the second audio data, wherein the second audio data captures the spoken utterance of the user and is collected by one or more second microphones of a second computing device within the environment, and wherein the first and second computing devices are distinct from each other; and,
merge the first audio data and the second audio data to generate merged audio data for the spoken utterance.
Jensen discloses:
wherein the first audio data is collected by one or more first microphones of a first computing device within an environment of a user ([Fig. 1A, ML, MR in the environment of user 106], [0030] The listener 106 has microphones, M.sub.L (108) and M.sub.R (110), respectively positioned near the left ear and right ear of the listener 106… disposed on a head-worn device (e.g., headsets, glasses, earbuds, etc.), [The examiner asserts that a worn microphone must necessarily be connected to a computing device in order for the sound picked up by the microphone to be analyzed as disclosed in Jensen, i.e. smart glasses]); and,
process, using the trained neural network model ([In view of the previously disclosed trained neural network model of Zou]), a second digital representation of second audio data as second input ([0075] the frequency domain representation of the second input signal comprises a first complex vector representing a spectrogram, [A representation comprising a vector which represents a spectrogram indicates the vector could be converted to a spectrogram using the spectrogram analysis of Zou, i.e. resulting in a second digital representation of second audio data]), to generate as output a second SNR value, wherein the second SNR is a prediction of the trained neural network model ([The examiner asserts that it would have been obvious to apply the SNR predicting of Zou (previously cited, see [pg. 5, para 6]) to the multiple signals to be denoised of Jensen, resulting in a second SNR which is a prediction of a trained neural network model for second audio data in view of the second audio data of Jensen, wherein Jensen explicitly discloses receiving first and second input signals characterized by SNRs as the first step of their method ([0065], Fig. 9) which could be gathered using the method of Zou without a change in functionality to Jensen]), for the second audio data ([0073] The second input signal can be characterized by a second SNR, [In view of the previously disclosed training model of Zou which can determine SNRs of spectrograms, i.e. input signals]);
wherein the second audio data captures the spoken utterance of the user ([Fig. 10, 1004], [Receiving a second input signal representative of the audio, in view of the first receiving step 1002 which receives a signal representative of audio indicates the audio received by the second input signal representation is the same as the first, in view of the voice data of Zou]) and is collected by one or more second microphones of a second computing device within the environment ([Fig. 1A, MP], [All representing microphones wherein MP is on a second computing device in the environment compared to the worn ML and MR]), wherein the first and second computing devices are distinct from each other ([Fig. 7, 702, 704], [Disclosing on-head mics and off-head mics indicates one worn device and one external device]); and,
merge the first audio data and the second audio data to generate merged audio data for the spoken utterance ([Fig. 9, Combining first input signal and second input signal 906], [In view of the voice data of Zou]).
Zou and Jensen are considered analogous art within speech signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Zou to incorporate the teachings of Jensen, because of the novel way to consider both directionality and SNR of received audio sources from a plurality of devices, improving the naturalness of reproduced sounds in terms of SNR and spatial perception (Jensen, [0010]).
Zou in view of Jensen does not disclose:
wherein in merging one or more of the processors are to use a first weight value for the first audio data in the merging and using a second weight value for the second audio data in the merging, and wherein the first weight value and the second weight value are determined based on the first and second SNR values that are predictions of the trained neural network; and,
provide the merged audio data for further processing by one or more additional components.
Sharma discloses:
wherein in merging one or more of the processors are to use a first weight value for the first audio data in the merging and using a second weight value for the second audio data in the merging ([0167] combining 1720 the first speech processing output with the second speech processing output based upon, at least in part, the first audio stream weight and the second audio stream weight, [In view of the first audio output of Zou, further in view of the second audio output of Jensen]), wherein the first weight value and the second weight value are determined based on the first and second SNR values that are predictions of the trained neural network ([0164] weighting 1712 the first audio stream and the second audio stream based upon, at least in part, a signal-to-noise ratio for the first audio stream and a signal-to-noise ratio for the second audio stream, [In view of the trained neural network determining SNRs of Zou, further in view of the multiple input audio signals of Jensen indicating a weighting operation based on first and second SNRs which are predictions of the trained neural network. Further, consider the previously cited “combining” two input signal operation of Jensen ([Fig. 9, 906]) using the SNRs of Zou]); and,
provide the merged audio data for further processing by one or more additional components ([0058] configured to provide visual information 110 and audio information 114… [0067] automated clinical documentation process 10 may provide the information associated with an acoustic environment (e.g., via the user interface) [Providing audio information, in view of the merged signal generated in Sharma, to a user device, i.e. component, via an interface indicates presentation to a user, wherein that user could perform any number of additional processing steps, i.e. listening, transcribing, volume adjustment, etc.]).
Zou, Jensen, and Sharma are considered analogous art within speech signal processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Zou in view of Jensen to incorporate the teachings of Sharma, because of the novel way to apply multi-channel noise reduction and de-reverberation methods which account for user location to combine channel signals, improving the accuracy of generated text transcriptions in environments with multiple microphones (Sharma, [0003]).
Regarding claim 23, Zou in view of Jensen, further in view of Sharma discloses: the system of claim 22.
Sharma further discloses:
wherein the one or more additional components include an automatic speech recognition (ASR) engine ([0063] speech processing system 418 may be an automated speech recognition (ASR) system), and the processors are further operable to:
process the merged audio data, using the ASR engine ([In view of the previously disclosed speech processing engine of Sharma]), to generate a recognition of the spoken utterance of the user ([Fig. 4, 418], [Fig. 16, 1716], [0167] process 1716 the first audio stream (e.g., represented as a solid line between device selection weighting module 410 and speech processing system 418) with a first speech processing system to generate a first speech processing output [Speech processing output from a speech processing, i.e. recognition, system 418, wherein the input is a weighted signal from two sources, see Fig. 4, 410 inputs, indicates the output is a recognition of the spoken utterance]).
Regarding claim 24, Zou in view of Jensen, further in view of Sharma discloses: the system of claim 22.
Sharma further discloses:
wherein a ratio of the first weight value with respect to the second weight value is the same as a ratio of the predicted SNR for the first audio data with respect to the predicted SNR for the second audio data ([0128] weight multiple audio streams (e.g., from different microphone systems) based upon, at least in part, a signal-to-noise ratio for each audio stream… [0167] combining 1720 the first speech processing output with the second speech processing output based upon, at least in part, the first audio stream weight and the second audio stream weight, [Combining two audio signals based on respective weights tracks to a method of generating a ratio, i.e. the final output signal will be a ratio of the two input signals, e.g. as they both contribute to the final output, wherein the weights are based on their respective SNRs indicating the combination, i.e. ratio, of the weights is the same as the combination, i.e. ratio, of the SNRs would be]).
Regarding claim 25, Zou in view of Jensen, further in view of Sharma discloses: the system of claim 22.
Sharma further discloses:
wherein the at least one processor is further operable to:
store the first weight value in association with the first computing device ([Fig. 1, 12, 16], [0128] weighting module 410 of ACD computer system 12, [0034] storage device 16 coupled to ACD computer system 12 [Performing weighting on a device which is connected to a memory indicates the weight values determined in computer system 12 could be sent to storage device 16 without a change in functionality]); and,
store the second weight value in association with the second computing device ([Performing weighting on a device which is connected to a memory indicates the weight values determined in computer system 12 could be sent to storage device 16 without a change in functionality. Applying this to a second weight associated with a second computing device does not change the functionality of Sharma in view of the plurality of computing devices in Fig. 1, i.e. 30, 32, 34, 36]).
Regarding claim 26, Zou in view of Jensen, further in view of Sharma discloses: the system of claim 25.
Sharma further discloses:
wherein one or more of the processors are further operable to:
receive first additional audio data ([0155] receive 1700 audio encounter information from a first microphone system), from the first computing device ([In view of the plurality of computing devices 30-36 of Sharma, any of which are substitutable for that disclosed in Jensen to be used in Zou without a change in functionality]), that captures an additional spoken utterance ([In view of the previously disclosed spoken utterance of Zou, indicating the audio encounter information of Sharma to be an additional utterance. Further, in view of the portion/frame basis for analysis of Sharma indicating one input signal could be comprised of additional audio data as compared to a first portion/frame]);
receive second additional audio data ([0155] Audio encounter information may be received 1702 from a second microphone system), from the second computing device ([In view of the plurality of computing devices 30-36 of Sharma, any of which are substitutable for that disclosed in Jensen to be used in Zou without a change in functionality]), that captures the additional spoken utterance ([Fig. 8, 808], [In view of the previously disclosed second spoken utterance of Jensen, indicating the second audio encounter information of Sharma to be a second utterance. Further, in view of the portion/frame basis for analysis of Sharma indicating one input signal could comprise additional audio data as compared to a first portion/frame. Further, determining the time difference of arrival between microphones for the same received audio encounter information 800 indicates the same audio is received by those microphones, i.e. the additional spoken utterance is the same between the two devices]);
merge the first additional audio data and the second additional audio data to generate additional merged audio data ([0167] combining 1720 the first speech processing output with the second speech processing output, [In view of the first and second additional audio outputs of Sharma]), wherein in merging to generate the additional merged audio data one or more of the processors are to use the stored first weight value and the stored second weight value without using the trained neural network to re-compute the first and second weight values ([0128] In some implementations, device selection and weighting module 410 may provide a previously processed or weighted portion of the audio encounter information to define weighting for each audio stream from multiple microphone systems [Using a previously processed portion of audio for weighting, in view of the portion/frame first and second audio data of Sharma, indicates weight could be gathered for first and second audio data to be used for further portions/frames of first and second audio streams]); and,
process the additional merged audio data to recognize the additional spoken utterance ([Fig. 4, 418], [Fig. 16, 1716], [0167] process 1716 the first audio stream (e.g., represented as a solid line between device selection weighting module 410 and speech processing system 418) with a first speech processing system to generate a first speech processing output [Speech processing output from a speech processing, i.e. recognition, system 418, wherein the input is a weighted signal from two sources, see Fig. 4, 410 inputs, indicates the output is a recognition of the additional spoken utterance]).
Regarding claim 27, Zou in view of Jensen, further in view of Sharma discloses: the system of claim 22.
Sharma further discloses:
wherein the one or more processors are further operable to:
detect a change in location and/or orientation of the first computing device ([0125] machine vision system 100 may detect and track location estimates for particular speaker representations by tracking the humanoid shapes within the acoustic environment [Tracking humanoid shapes, wherein those humanoid shapes are generally associated with devices, see Fig. 1 and also speaker 226 holding a smartphone in Fig. 3, indicates the tracking will detect positional changes of devices associated with the humanoid shapes as the humans move]);
subsequent to detecting the change in the location and/or the orientation of the first computing device:
receive further first audio data ([0155] receive 1700 audio encounter information from a first microphone system), from the first computing device ([In view of the plurality of computing devices 30-36 of Sharma, any of which are substitutable for that disclosed in Jensen to be used in Zou without a change in functionality]), that captures a further spoken utterance ([In view of the previously disclosed spoken utterance of Zou, indicating the audio encounter information of Sharma to be a further utterance. Further, in view of the portion/frame based signal analysis of Sharma (see Fig. 17, 1704) indicating later portions of signals represent further spoken utterances. Also consider [0164] which discloses weighting on a portion/frame basis indicating further audio from further portions/frames]);
receive further second audio data ([0155] Audio encounter information may be received 1702 from a second microphone system), from the second computing device ([In view of the second audio device of Jensen, further in view of the plurality of computing devices 30-36 of Sharma, any of which are substitutable for that disclosed in Jensen to be used in Zou without a change in functionality]), that captures the further spoken utterance ([Fig. 8, 808], [In view of the previously disclosed second spoken utterance of Jensen, indicating the second audio encounter information of Sharma to be a second, further utterance. Further, in view of the portion-based signal analysis of Sharma (see Fig. 17, 1706) indicating portions of signals represent further spoken utterances. Also consider [0164] which discloses weighting on a portion/frame basis indicating further audio from further portions/frames. Further, determining the time difference of arrival between microphones for the same received audio encounter information 800 indicates the same audio is received by those microphones, i.e. the further spoken utterance is the same between the two devices.]).
Zou further discloses:
process the further first audio data to determine a further first digital representation of the further first audio data ([pg. 5, para 1] after the voice data to be denoised is obtained, spectrogram information of the voice data to be denoised can be extracted, wherein the spectrogram information can include the amplitude of the voice data to be denoised, the phase of the voice data to be denoised, and the like, [Wherein “further” first audio can be a different temporal portion of the same signal, indicating the same operations applied as disclosed in Zou using the portions defined in Sharma]); and,
process, using the trained neural network ([In view of the previously disclosed trained neural network of Zou]), the further first digital representation of the further first audio data as input ([The examiner asserts that extending the operations of Zou to multiple distinct signals does not change the functionality of Zou in view of the multiple signals of Jensen, any of which could be “further”. Further, consider the temporal analysis of Sharma indicating the signals of Zou can be split into “further” portions]), to generate a further first output indicating an updated first SNR predicted for the further first audio data ([pg. 5, para 6] S130, inputting the spectrogram information into a pre-trained neural network model to obtain a signal-to-noise ratio corresponding to the spectrogram information).
Jensen further discloses:
process the further second audio data to determine a further second digital representation of the further second audio data ([0075] in some implementations, the frequency domain representation of the second input signal comprises a first complex vector representing a spectrogram of a frame of the second input signal, [In view of the further second audio data of Sharma, further in view of the spectrogram generation based on signals of Zou, indicating the vector of Jensen representing a second audio spectrogram frame could be transformed using the spectrogram generation of Zou without a change in functionality]); and,
process, using the trained neural network ([In view of the previously disclosed trained neural network of Zou]), the further second digital representation of the further second audio data as input ([In view of the spectrogram of Zou based upon a second input signal of Jensen, wherein that second input signal could be further second audio data as disclosed through the portion-based analysis of audio signals disclosed in Sharma, i.e. analyzing different portions/frames of audio indicates each portion/frame is “further” audio]), to generate a further second output indicating an updated second SNR predicted for the further second audio data ([0073] The second input signal can be characterized by a second SNR, [In view of the SNR determination based on spectrogram analysis of Zou. Further, in view of the further second audio data of Sharma]).
Sharma further discloses:
merge the further first audio data and the further second audio data using an updated first weight value and an updated second weight value ([0167] combining 1720 the first speech processing output with the second speech processing output based upon, at least in part, the first audio stream weight and the second audio stream weight, [In view of the further first audio output of Sharma, further in view of the further second audio output of Sharma. Wherein the weights are calculated based on the SNRs of the received signals, indicating a further, i.e. different, signal will have a different SNR and therefore a different, i.e. updated, associated weight]), to generate further merged audio data ([Fig. 17, 1720], [Combining first and second speech outputs based on their associated weights indicates the resultant combination is “merged audio data”]), wherein the updated first and second weight values are determined based on the further first output indicating the updated first SNR and based on the further second output indicating the updated second SNR ([0164] weighting 1712 the first audio stream and the second audio stream based upon, at least in part, a signal-to-noise ratio for the first audio stream and a signal-to-noise ratio for the second audio stream, [In view of the further first and second audio streams, i.e. portions, of Sharma indicating the SNR calculation could be applied to all portions individually as further audio]).
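For illustration only, the following is a minimal Python sketch of the per-portion weighting mapped above; the function and variable names are hypothetical and are not taken from Zou, Jensen, or Sharma. It merely shows one way further (later) portions of two time-aligned streams could receive updated weights derived from model-predicted SNRs before merging.

    # Minimal sketch (hypothetical names): for each further portion of two
    # time-aligned streams, predict an updated SNR with a trained model and
    # recompute the weights before merging that portion.
    import numpy as np

    def merge_streams_framewise(stream_1, stream_2, snr_model, frame_len=400):
        merged = []
        usable = min(len(stream_1), len(stream_2))
        for start in range(0, usable - frame_len + 1, frame_len):
            f1 = stream_1[start:start + frame_len]
            f2 = stream_2[start:start + frame_len]
            # Updated first/second SNRs predicted for this further portion.
            snr_1 = float(snr_model(np.abs(np.fft.rfft(f1))))
            snr_2 = float(snr_model(np.abs(np.fft.rfft(f2))))
            w = np.exp(np.array([snr_1, snr_2]))
            w /= w.sum()  # updated first and second weight values
            merged.append(w[0] * f1 + w[1] * f2)
        return np.concatenate(merged) if merged else np.array([])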
Regarding claim 28, Zou discloses: a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform operations ([pg. 9, para 4] a computer-readable storage medium having stored thereon a computer program, which when executed by a processor), the operations comprising:
processing, using a trained neural network model ([pg. 5, para 7] neural network model is obtained by training based on sample spectrogram information of sample voice data), a first digital representation of first audio data as first input ([pg. 5, para 6] S130, inputting the spectrogram information into a pre-trained neural network model), to generate as output a first signal-to-noise ratio (SNR) value, wherein the first SNR is a prediction of the trained neural network model ([pg. 5, para 6] to obtain a signal-to-noise ratio corresponding to the spectrogram information, [Inputting a spectrogram into a neural network to obtain a signal-to-noise ratio indicates the SNR is a prediction of the trained neural network]), for the first audio data ([Voice data is first audio data]), wherein the first audio data captures a spoken utterance of a user ([pg. 5, para 1] after the voice data to be denoised is obtained, spectrogram information of the voice data to be denoised can be extracted, [Voice data tracks to a spoken utterance]).
Zou does not disclose:
a non-transitory computer-readable storage medium;
wherein the first audio data is collected by one or more first microphones of a first computing device within an environment of a user; and,
processing, using the trained neural network model, the second digital representation of the second audio data as second input, to generate as output a second SNR value, wherein the second SNR is a prediction of the trained neural network model, for the second audio data, wherein the second audio data captures the spoken utterance of the user and is collected by one or more second microphones of a second computing device within the environment, and wherein the first and second computing devices are distinct from each other; and,
merging the first audio data and the second audio data to generate merged audio data for the spoken utterance.
Jensen discloses:
a non-transitory computer-readable storage medium ([0087] computer program instructions encoded on a tangible non transitory storage medium);
wherein the first audio data is collected by one or more first microphones of a first computing device within an environment of a user ([Fig. 1A, ML, MR in the environment of user 106], [0030] The listener 106 has microphones, M.sub.L (108) and M.sub.R (110), respectively positioned near the left ear and right ear of the listener 106… disposed on a head-worn device (e.g., headsets, glasses, earbuds, etc.), [The examiner asserts that a worn microphone must necessarily be connected to a computing device in order for the sound picked up by the microphone to be analyzed as disclosed in Jensen, i.e. smart glasses]); and,
processing, using the trained neural network model ([In view of the previously disclosed trained neural network model of Zou]), a second digital representation of second audio data as second input ([0075] the frequency domain representation of the second input signal comprises a first complex vector representing a spectrogram, [A representation comprising a vector which represents a spectrogram indicates the vector could be converted to a spectrogram using the spectrogram analysis of Zou, i.e. resulting in a second digital representation of second audio data]), to generate as output a second SNR value, wherein the second SNR is a prediction of the trained neural network model ([The examiner asserts that it would have been obvious to apply the SNR predicting of Zou (previously cited, see [pg. 5, para 6]) to the multiple signals to be denoised of Jensen, resulting in a second SNR which is a prediction of a trained neural network model for second audio data in view of the second audio data of Jensen, wherein Jensen explicitly discloses receiving first and second input signals characterized by SNRs as the first step of their method ([0065], Fig. 9) which could be gathered using the method of Zou without a change in functionality to Jensen]), for the second audio data ([0073] The second input signal can be characterized by a second SNR, [In view of the previously disclosed training model of Zou which can determine SNRs of spectrograms, i.e. input signals]);
wherein the second audio data captures the spoken utterance of the user ([Fig. 10, 1004], [Receiving a second input signal representative of the audio, in view of the first receiving step 1002 which receives a signal representative of audio, indicates the audio received by the second input signal representation is the same as the first, in view of the voice data of Zou]) and is collected by one or more second microphones of a second computing device within the environment ([Fig. 1A, MP], [All of which represent microphones, wherein MP is on a second computing device in the environment as compared to the worn ML and MR]), wherein the first and second computing devices are distinct from each other ([Fig. 7, 702, 704], [Disclosing on-head mics and off-head mics indicates one worn device and one external device]); and,
merging the first audio data and the second audio data to generate merged audio data for the spoken utterance ([Fig. 9, Combining first input signal and second input signal 906], [In view of the voice data of Zou]).
Zou and Jensen are considered analogous art within speech signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Zou to incorporate the teachings of Jensen, because of the novel way to consider both directionality and SNR of received audio sources from a plurality of devices, improving the naturalness of reproduced sounds in terms of SNR and spatial perception (Jensen, [0010]).
Zou in view of Jensen does not disclose:
wherein the merging comprises using a first weight value for the first audio data in the merging and using a second weight value for the second audio data in the merging, and wherein the first weight value and the second weight value are determined based on the first and second SNR values that are predictions of the trained neural network; and,
providing the merged audio data for further processing by one or more additional components.
Sharma discloses:
wherein the merging comprises using a first weight value for the first audio data in the merging and using a second weight value for the second audio data in the merging ([0167] combining 1720 the first speech processing output with the second speech processing output based upon, at least in part, the first audio stream weight and the second audio stream weight, [In view of the first audio output of Zou, further in view of the second audio output of Jensen]), wherein the first weight value and the second weight value are determined based on the first and second SNR values that are predictions of the trained neural network ([0164] weighting 1712 the first audio stream and the second audio stream based upon, at least in part, a signal-to-noise ratio for the first audio stream and a signal-to-noise ratio for the second audio stream, [In view of the trained neural network determining SNRs of Zou, further in view of the multiple input audio signals of Jensen indicating a weighting operation based on first and second SNRs which are predictions of the trained neural network. Further, consider the previously cited “combining” two input signal operation of Jensen ([Fig. 9, 906]) using the SNRs of Zou]); and,
providing the merged audio data for further processing by one or more additional components ([0058] configured to provide visual information 110 and audio information 114… [0067] automated clinical documentation process 10 may provide the information associated with an acoustic environment (e.g., via the user interface) [Providing audio information, in view of the merged signal generated in Sharma, to a user device, i.e. component, via an interface indicates presentation to a user, wherein that user could perform any number of additional processing steps, i.e. listening, transcribing, volume adjustment, etc.]).
Zou, Jensen, and Sharma are considered analogous art within speech signal processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Zou in view of Jensen to incorporate the teachings of Sharma, because of the novel way to apply multi-channel noise reduction and de-reverberation methods which account for user location (Sharma, [0003]).
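For illustration only, a minimal Python sketch of the combination as mapped for claim 28 follows; all names are hypothetical and the sketch is not asserted to be the method of any cited reference or of the claims. It shows a spectrogram being extracted for each device's capture, an SNR predicted for each by a trained model, weights derived from the two predicted SNRs, and the two captures merged.

    # Minimal sketch (hypothetical names): spectrogram extraction, model-based
    # SNR prediction per capture, SNR-derived weights, and a weighted merge of
    # two time-aligned, equal-length captures of the same utterance.
    import numpy as np

    def spectrogram(audio, n_fft=512, hop=256):
        frames = [audio[i:i + n_fft] for i in range(0, len(audio) - n_fft + 1, hop)]
        return np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=1))

    def merge_captures(audio_1, audio_2, snr_model):
        snr_1 = float(snr_model(spectrogram(audio_1)))  # first SNR prediction
        snr_2 = float(snr_model(spectrogram(audio_2)))  # second SNR prediction
        w = np.exp(np.array([snr_1, snr_2]))
        w /= w.sum()                                    # first and second weight values
        merged = w[0] * audio_1 + w[1] * audio_2        # merged audio data
        return merged, (w[0], w[1])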
Claims 7-9 are rejected under 35 U.S.C. 103 as being unpatentable over Zou in view of Jensen, further in view of Sharma, further in view of Gunawan et al. (US-20190045312-A1), hereinafter Gunawan.
Regarding claim 7, Zou in view of Jensen, further in view of Sharma discloses: the method of claim 6.
Zou in view of Jensen, further in view of Sharma does not disclose:
prior to receiving the first additional audio data and the second additional audio data:
determining that, relative to generating the merged audio data, no change in location and orientation has been detected for the first computing device and no change in location and orientation has been detected for the second computing device;
Gunawan discloses:
determining that, relative to generating the merged audio data ([In view of the merged audio data of Sharma]), no change in location and orientation has been detected for the first computing device and no change in location and orientation has been detected for the second computing device ([0106] An audio capture system according to any one of the preceding EEEs wherein the mixing control signal includes input from one or more accelerometers mounted to a corresponding microphone and adapted to detect movement of that corresponding microphone [Detecting movement of a microphone, i.e. location and/or orientation, in view of the plurality of microphones/devices of Sharma, indicates there is also a determination that no movement is detected for first and second devices, i.e. when the accelerometer reads 0]).
Zou, Jensen, Sharma, and Gunawan are considered analogous art within speech signal processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Zou in view of Jensen, further in view of Sharma to incorporate the teachings of Gunawan, because of the novel way to consider location of microphone inputs when mixing multiple audio channels together, improving SNR balancing between the channels (Gunawan, [0008]).
Sharma further discloses:
wherein using the stored first weight value and the stored second weight value ([In view of the previously disclosed first and second weight values of Sharma]), in the merging to generate the additional merged audio data ([In view of the previously disclosed merging of Sharma]), is based on determining that no change in the location and the orientation has been detected for the first computing device and no change in the location and the orientation has been detected for the second computing device ([0128] In some implementations, device selection and weighting module 410 may provide a previously processed or weighted portion of the audio encounter information to define weighting for each audio stream from multiple microphone systems [Using a previously processed portion of audio for weighting, in view of the SNR weighting ([0122]) of Gunawan which controls mixing, i.e. merging, indicates weights could be gathered for first and second audio data to be used for further portions of first and second audio streams based on an associated location, i.e. based on what location parameters (accelerometer data) are within the mixing control signal. Further, consider the time difference of arrival calculation of Sharma, Fig. 8, 808. If the time difference of arrival between microphones for two portions of audio information remains the same, this indicates no movement detected, allowing the system of Sharma to use the same weights as those from previous portions of audio with the same location, in view of the location determinations of Gunawan]).
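For illustration only, a minimal Python sketch of the weight-reuse condition mapped above follows; the threshold and names are hypothetical and are not taken from Sharma or Gunawan. It shows stored weights being reused only when neither device's motion sensor reports a change since the prior merge.

    # Minimal sketch (hypothetical names and threshold): reuse stored merge
    # weights when neither device's accelerometer indicates movement since the
    # previously generated merged audio data; otherwise recompute the weights.
    import numpy as np

    MOTION_THRESHOLD = 0.01  # hypothetical accelerometer-magnitude threshold

    def device_moved(accel_samples):
        # accel_samples: (N, 3) accelerometer readings gathered since the last merge
        return bool(np.max(np.linalg.norm(accel_samples, axis=1)) > MOTION_THRESHOLD)

    def select_weights(stored_weights, accel_1, accel_2, recompute_weights):
        if stored_weights is not None and not device_moved(accel_1) and not device_moved(accel_2):
            return stored_weights      # no change in location/orientation detected
        return recompute_weights()     # movement detected: derive updated weights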
Regarding claim 8, Zou in view of Jensen, further in view of Sharma, further in view of Gunawan discloses: the method of claim 7.
Jensen further discloses:
wherein the location and orientation associated with the first computing device are relative to the user ([Fig. 1A, 104], [0030] a microphone array M.sub.P (104)), and the location and orientation associated with the second computing device are also relative to the user ([Fig. 1A, 108, 110], [0030] The listener 106 has microphones, M.sub.L (108) and M.sub.R (110), respectively positioned near the left ear and right ear of the listener 106, [Where Fig. 1A clearly discloses the devices to be relative to a user 106]).
Regarding claim 9, Zou in view of Jensen, further in view of Sharma, further in view of Gunawan discloses: the method of claim 7.
Gunawan further discloses:
wherein:
the first computing device includes a first motion sensor ([Fig. 3, devices 4-6 with microphones 9-11 and vibration sensors 13-15], [An accelerometer tracks to a motion sensor and/or a vibration sensor (see [0047] of Gunawan)]), the second computing device includes a second motion sensor ([In view of the plurality of devices of Sharma and Jensen, the accelerometer in a microphone of Gunawan could be applied to the multiple devices containing microphones of Sharma and Jensen. Further, consider the devices 4, 5, and 6 of Fig. 3 of Gunawan which contain motion sensors 13, 14, 15]); and,
determining that no change in location and orientation has been detected for the first computing device and for the second computing device comprises:
detecting, based on first sensor data from the first motion sensor ([In view of the first motion sensor of Gunawan]), no change in the location and orientation for the first computing device ([0106] the mixing control signal includes input from one or more accelerometers mounted to a corresponding microphone and adapted to detect movement of that corresponding microphone, [The ability to detect movements also indicates when there isn’t movement, i.e. when the accelerometer reads 0]); and,
detecting, based on second sensor data from the second motion sensor ([In view of the second motion sensor of Gunawan]), no change in the location and orientation for the second computing device ([Applying the movement detection for a first device as disclosed in [0106] of Gunawan, in view of the plurality of devices 4-6 of Gunawan, indicating a movement detection could also be applied to a second device without a change in functionality of Gunawan’s system]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Fu et al. (“SNR-Aware Convolutional Neural Network Modeling for Speech Enhancement”) discloses “a signal-to-noise-ratio (SNR) aware convolutional neural network (CNN) model for speech enhancement (SE). Because the CNN model can deal with local temporal-spectral structures of speech signals, it can effectively disentangle the speech and noise signals given the noisy speech signals. In order to enhance the generalization capability and accuracy, we propose two SNR-aware algorithms for CNN modeling. The first algorithm employs a multi-task learning (MTL) framework, in which restoring clean speech and estimating SNR level are formulated as the main and the secondary tasks, respectively, given the noisy speech input. The second algorithm is an SNR adaptive denoising, in which the SNR level is explicitly predicted in the first step, and then an SNR-dependent CNN model is selected for denoising. Experiments were carried out to test the two SNR-aware algorithms for CNN modeling” (abstract). See entire document, SNR estimator of Fig. 2.
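For illustration only, a minimal PyTorch-style Python sketch in the general spirit of the multi-task arrangement described by Fu follows; the layer sizes and names are arbitrary placeholders and the sketch is not asserted to reproduce Fu's model. It shows a shared encoder feeding a main enhancement head and a secondary SNR-estimation head.

    # Minimal sketch (arbitrary sizes): shared encoder with a main enhancement
    # head and a secondary SNR-estimation head, in the spirit of an SNR-aware
    # multi-task CNN.
    import torch
    import torch.nn as nn

    class SnrAwareNet(nn.Module):
        def __init__(self, n_bins=257):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(16, 16, kernel_size=5, padding=2), nn.ReLU(),
            )
            self.enhance_head = nn.Conv1d(16, 1, kernel_size=1)  # main task: clean magnitudes
            self.snr_head = nn.Linear(16 * n_bins, 1)            # secondary task: SNR estimate

        def forward(self, noisy_mag):
            # noisy_mag: (batch, 1, n_bins) magnitude spectrum of one frame
            h = self.encoder(noisy_mag)
            return self.enhance_head(h), self.snr_head(h.flatten(1))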
Kang (US-20210035594-A1) discloses “Disclosed herein is a method for RNN-based noise reduction in a real-time conference, comprising: performing frame-and-window for a speech signal to obtain a logarithmic spectrum of the speech signal, and placing the logarithmic spectrum into the RNN model to determine a noise reduction suppression coefficient, and then obtaining the denoised speech signal by applying the noise reduction suppression coefficient to the logarithmic spectrum of the original signal, thereby achieving utilization of the RNN noise reduction method in real-time conferences. In the present disclosure, when inputting the RNN model for estimation, only the logarithmic spectrum of the current frame needs to be inputted. The RNN model of the present disclosure has few requirements on inputted information, without performing huge preprocessing on the received speech signal, which in turn reduces computation burden, increases response speed, and enhances real-time performance” (abstract). See entire document. See [0034] for discussion of inputting frames of spectra into an RNN model for later calculating the SNR.
Zhu et al. (CN-113889091-A) discloses “The embodiment of the invention claims a voice recognition method, device, computer readable storage medium and electronic device, wherein the method comprises: obtaining the audio signal to be identified obtained by performing voice collection to the user; performing primary voice recognition to the audio signal to be identified, obtaining at least one primary voice recognition result; judging the scene type of at least one first-level voice recognition result, obtaining the scene type of the scene where the user is located; performing scene voice recognition corresponding to the scene type of the audio signal to be identified, obtaining the scene voice recognition result; based on at least one first-level voice recognition result and scene voice recognition result, determining the second-level voice recognition result. The embodiment of the invention aims at the scene of the user for the targeted voice recognition, under different scenes, it can fully avoid the influence of the corresponding noise to the identification, improves the accuracy of the voice recognition” (abstract). Zhu discloses using neural networks to analyze the a priori and a posteriori SNRs of input frequency spectra. See entire document.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to THEODORE JOHN WITHEY whose telephone number is (703)756-1754. The examiner can normally be reached Monday - Friday, 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/THEODORE WITHEY/Examiner, Art Unit 2655
/ANDREW C FLANDERS/Supervisory Patent Examiner, Art Unit 2655