Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1 and 15 are independent.
This Application was published as US 20240371386.
Apparent priority is 2 May 2023.
Applicant’s amendments and arguments have been considered but are either unpersuasive or moot in view of the new grounds of rejection, which were necessitated by the amendments to the claims.
This action is Final.
Response to Arguments
35 USC 103
Applicant’s arguments with respect to 35 USC 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 15, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Dyrholm (US 20200213726 A1) in view of Chen (US 20230095526 A1).
Regarding claim 1, Dyrholm discloses: 1. A method of processing audio signals, comprising: receiving an audio signal via a plurality of microphones; ("[0024] The first, the second and the third microphone unit 11, 12, 13 constitute a main microphone array 14, with an output in the form of a vector. The main microphone array 14 thus provides as output a main input vector MM=(X, Y, Q) comprising as components the first, the second and the third input audio signal X, Y, Q." )
generating, based on a neural network trained to detect a biometric voice ID unique to a known audio source, an inference about whether a first frame of the received audio signal includes the biometric voice ID; and ("[0035]... The auxiliary voice detector 35 may derive a user-voice activity signal VAD from the auxiliary voice measure VF such that the user-voice activity signal VAD indicates voice activity when the main input vector MM only, or mainly, contains voice sound V of the user 6, and the main beamformer controller 32 may determine one or more components dMX, dMV, dMQ of the steering vector dM from values of the main input vector MM collected during periods wherein the user-voice activity signal VAD indicates voice activity. ..." - The periods read on a first frame. See [0071] for further description of time periods.)
selectively steering a beam associated with a multi-channel beamformer toward a direction-of-arrival (DOA) of the first frame based at least in part on the inference about whether the first frame includes the biometric voice ID. ("[0035] Alternatively, or additionally, the main beamformer controller 32 may determine the steering vector dM in dependence on the auxiliary voice measure VF...")
Dyrholm does not explicitly disclose the voice activity detector is based on a neural network trained to detect a biometric voice ID unique to a known audio source.
Chen discloses: a neural network trained to detect a biometric voice ID unique to a known audio source (“ [0050] Target speaker VAD model 176 provides system functionality for determining whether an audio recording contains the target speaker's voice or not. In an embodiment, the target speaker VAD model 176 may process an audio recording that has been determined by the multi-speaker detection module 174 to only contain speech from a single user. The target speaker VAD model 176 may compare the audio recording to a target speaker's voiceprint to determine whether the audio recording contains the target speaker's voice or not… In an embodiment, target speaker VAD model 176 may comprise a ML model, such as one or more neural networks, CNNs, DNNs, or other ML models. Target speaker VAD model 176 may include one or more parameters, such as internal weights of a neural network, that may determine the operation of target speaker VAD model 176.”)
Dyrholm and Chen are considered analogous art to the claimed invention because they disclose methods of voice activity detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Dyrholm to use the neural network for detecting VAD by a target speaker as disclosed by Chen. Doing so would have been beneficial for faster processing. (Chen [0073]) This combination also falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Claim 15 is a system claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale. Additionally, “a processing system; and a memory storing instructions” of the Claim are taught by Dyrholm (“microcontrollers” [0079]; “firmware or software” [0079] – software would inherently be stored in some form of memory).
Regarding claim 21, Dyrholm discloses: 21. The method of claim 1, wherein: the plurality of microphones are coupled to a device; (Fig. 4 shows that Microphone Units 11 and 12 are coupled to a device Auxiliary Controller 40.)
the known audio source is a user of the device; and (“[0035] Alternatively, or additionally, the main beamformer controller 32 may determine the steering vector dM in dependence on the auxiliary voice measure VF. The auxiliary voice detector 35 may derive a user-voice activity signal VAD from the auxiliary voice measure VF such that the user-voice activity signal VAD indicates voice activity when the main input vector MM only, or mainly, contains voice sound V of the user 6…”)
the biometric voice ID comprises one or more unique biometric characteristics unique to a voice of the user. (not explicitly disclosed)
Dyrholm does not explicitly disclose a biometric voice ID.
Chen discloses: the biometric voice ID comprises one or more unique biometric characteristics unique to a voice of the user. (“[0048] Voiceprint extractor 172 provides system functionality for extracting a voiceprint from an audio recording. A voiceprint may comprise a digital representation of voice characteristics of a speaker…” – see also “[0062] Voiceprint extractor 172 receives and processes recorded voice 300 and generates voiceprint 310 based on the voice characteristics of recorded voice 300. The voice characteristics may comprise features of a person's voice that distinguish the voice from the voices of other people and may be dependent on physical features such as the shape and size of a speaker's vocal tract…”)
See motivation statement for claim 1.
Claims 2-3 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Dyrholm in view of Chen as applied to claim 1 above, and further in view of Rosener (US 20090252351 A1).
Regarding claim 2, Dyrholm discloses: 2. The method of claim 1, wherein the inference comprises a ternary value indicating that the first frame includes the biometric voice ID, that the first frame does not include the biometric voice ID, or that the neural network is undecided as to whether the first frame includes the biometric voice ID. ("[0078] Although the examples disclosed herein are based on a main beamformer 31 configured as a MVDR beamformer, the principles of the present disclosure may be adapted to other adaptive beamformer types that require a steering vector, a user-voice activity signal VAD and/or a no-user-voice activity signal NVAD for proper operation." Chen discloses the biometric voice ID as mapped in claim 1 above.)
Dyrholm does not explicitly disclose an undecided signal. Neither does Chen.
Rosener discloses: wherein the inference comprises a ternary value indicating that the first frame includes the biometric voice ID, that the first frame does not include the biometric voice ID, or that the neural network is undecided as to whether the first frame includes the biometric voice ID. ("[0031]… VAD processor 20 outputs an output signal 30 to processor 22 indicating voice activity, no voice activity, or an indeterminate status." )
Dyrholm, Chen and Rosener are considered analogous art to the claimed invention because they disclose methods of voice activity detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Dyrholm in view of Chen with an option for an unknown status. Doing so would have been beneficial so that an alternate VAD method could be used if the VAD cannot determine the state. (Rosener [0042].)
Regarding claim 3, Dyrholm discloses: 3. The method of claim 2, wherein the selective steering of the beam comprises: refraining from steering the beam toward the DOA of the first frame if the ternary value indicates that the first frame does not include the biometric voice ID or if the ternary value indicates that the neural network is undecided as to whether the first frame includes the biometric voice ID. ("[0035]… The main beamformer controller 32 may further restrict modification of the steering vector dM to periods wherein the user-voice activity signal VAD indicates voice activity. …" – modifying the steering vector only when there is voice activity means that when there is no voice activity the steering is restricted. This would also include undecided, as in an undecided state there would not be a voice activity signal. Chen discloses the biometric voice ID as mapped in claim 1 above.)
Claim 16 is a system claim with limitations corresponding to the limitations of Claim 2 and is rejected under similar rationale.
Claims 4-13 and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Dyrholm in view of Chen and Rosener as applied to claim 2 above, and further in view of Stoltze et al. (US 20190313187 A1).
Regarding claim 4, Dyrholm does not disclose the additional limitations.
Chen discloses: 4. The method of claim 2, further comprising: determining whether the first frame of the received audio signal includes speech associated with multiple audio sources, the selective steering of the beam being further based on whether the first frame includes speech associated with multiple audio sources. (“[0147] At step 1204, the multi-speaker detection model 174 analyzes the audio frame to determine whether the audio frame includes only a single-speaker or multiple speakers. Multi-speaker detection model 174 may analyze the audio frame and based on features of the audio frame, such as the consistency or distribution of characteristics of the speech in the audio frame, areas of silence or breaks in speech, overlapping speech, and other features, determine whether one speaker or multiple speakers are speaking in the audio frame. Output of the multi-speaker detection module 174 may comprise a binary classification of whether one speaker is speaking or multiple speakers are speaking.” )
Dyrholm, Chen, Rosener, and Stoltze are considered analogous art to the claimed invention because they disclose voice activity detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with the teaching of Chen to detect how many speakers are active, in order to steer the beam toward the DOA of the current frame only if the frame contains only the target user. Doing so would have been beneficial “to steer a microphone array beam such that it captures sound information of interest (voice information from a source located in a sound field of interest) and attenuates other sound, such as sound information from a source that is not located in the sound field of interest.” (Stoltze [0037])
Chen and Rosener do not disclose selective steering of the beam being further based on whether the first frame includes speech associated with multiple audio sources.
Stoltze discloses: the selective steering of the beam being further based on whether the first frame includes speech associated with multiple audio sources. (“[0014]… In the event that an audio signal is detected, and it is determined that the DOA of the signal does not correspond to a current, valid sound field of interest (i.e., the current camera field of view), then the beamformer can be prevented from updating the current beam direction…” – while Stoltze is directed to a sound field of interest based on location, it would have been obvious to prevent the beamformer from updating the current beam direction for any invalid sound field of interest.)
Dyrholm discloses: “[0030] … In the present context, the desired signal is the voice sound V, and the desired response thus equals the response of the main beamformer 31 when the main input vector MM only contains voice sound V of the user 6. …”
Dyrholm, Chen, Rosener, and Stoltze are considered analogous art to the claimed invention because they disclose voice activity detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with the teaching of Stoltze to only steer the beam toward the current frame if it contains only the target user. Doing so would have been beneficial “to steer a microphone array beam such that it captures sound information of interest (voice information from a source located in a sound field of interest) and attenuates other sound, such as sound information from a source that is not located in the sound field of interest.” (Stoltze [0037])
Regarding claim 5, Dyrholm, Chen, and Rosener do not disclose the additional limitations.
Stoltze discloses: 5. The method of claim 4, wherein the selective steering of the beam comprises: refraining from steering the beam toward the DOA of the first frame if the first frame includes speech associated with multiple audio sources. ("[0014]… In the event that an audio signal is detected, and it is determined that the DOA of the signal does not correspond to a current, valid sound field of interest (i.e., the current camera field of view), then the beamformer can be prevented from updating the current beam direction…" – see claim 4 regarding multiple audio sources)
See motivation statement for claim 4.
Regarding claim 6, Dyrholm discloses: 6. The method of claim 4, further comprising: determining a probability of speech in the first frame of the received audio signal, the selective steering of the beam being further based on the probability of speech in the first frame. ("[0052] The auxiliary beamformer controller 34 may e.g. compare the candidate voice measure VW to the auxiliary voice measure VF and update the auxiliary weight vector BF when the candidate voice measure VW exceeds the auxiliary voice measure VF. Alternatively, or additionally, the auxiliary beamformer controller 34 may compare the candidate voice measure VW to a voice measure threshold, update the auxiliary weight vector BF when the candidate voice measure VW exceeds the voice measure threshold and then also update the voice measure threshold to equal the candidate voice measure VW." )
Regarding claim 7, Dyrholm discloses: 7. The method of claim 6, wherein the selective steering of the beam comprises: refraining from steering the beam toward the DOA of the first frame if the probability of speech in the first frame is less than a threshold probability. ("[0052] The auxiliary beamformer controller 34 may e.g. compare the candidate voice measure VW to the auxiliary voice measure VF and update the auxiliary weight vector BF when the candidate voice measure VW exceeds the auxiliary voice measure VF. Alternatively, or additionally, the auxiliary beamformer controller 34 may compare the candidate voice measure VW to a voice measure threshold, update the auxiliary weight vector BF when the candidate voice measure VW exceeds the voice measure threshold and then also update the voice measure threshold to equal the candidate voice measure VW." )
Regarding claim 8, Dyrholm discloses: 8. The method of claim 6, wherein the selective steering of the beam comprises: steering the beam toward the DOA of the first frame if the probability of speech in the first frame is greater than or equal to a threshold probability, ("[0052] …update the auxiliary weight vector BF when the candidate voice measure VW exceeds the voice measure threshold …" – see claim 7)
the first frame does not include speech associated with multiple audio sources, (disclosed by Dyrholm in view of Chen, Rosener, and Stoltze, see claim 5)
and the ternary value indicates that the first frame includes speech associated with a known audio source. (“[0078] Although the examples disclosed herein are based on a main beamformer 31 configured as a MVDR beamformer, the principles of the present disclosure may be adapted to other adaptive beamformer types that require a steering vector, a user-voice activity signal VAD and/or a no-user-voice activity signal NVAD for proper operation.” – see claim 2 regarding an undecided value)
Regarding claim 9, Dyrholm discloses: 9. The method of claim 6, wherein the multi-channel beamformer comprises a minimum variance distortionless response (MVDR) beamformer that reduces a power of a noise component of the audio signal without distorting a speech component of the audio signal. ("[0029] The main beamformer controller 32 preferably operates according to the widely used Minimum Variance Distortionless Response (MVDR) beamformer algorithm." )
Regarding claim 10, Dyrholm discloses: 10. The method of claim 6, further comprising: calculating a filter associated with the MVDR beamformer based on a covariance of the noise component of the audio signal and a covariance of the speech component of the audio signal. ("[0029]... If the desired signal and the undesired noise are uncorrelated, then the variance of the beamformer output signal equals the sum of the variances of the desired signal and the noise. The MVDR beamformer algorithm seeks to minimize this sum, thereby reducing the effect of the noise, preferably by estimating a noise covariance matrix for the main input vector MM and using the estimated noise covariance matrix in the computation of the components BMX, BMY, BMQ of the main weight vector BM as well known in the art." )
Regarding claim 11, Dyrholm discloses: 11. The method of claim 10, further comprising: determining the covariance of the speech component of the audio signal ( “[0029]… If the desired signal and the undesired noise are uncorrelated, then the variance of the beamformer output signal equals the sum of the variances of the desired signal and the noise.”; see also “[0030] … The steering vector dM may thus easily be computed from the main input vector MM when it only contains voice sound V of the user 6….”)
when the probability of speech in the first frame is greater than or equal to a threshold probability, the first frame does not include speech associated with multiple audio sources, and the ternary value indicates that the first frame includes the biometric voice ID. (This portion is disclosed in claim 8)
Regarding claim 12, Dyrholm discloses: 12. The method of claim 10, further comprising: determining the covariance of the noise component of the audio signal (“[0029]… The MVDR beamformer algorithm seeks to minimize this sum, thereby reducing the effect of the noise, preferably by estimating a noise covariance matrix for the main input vector MM and using the estimated noise covariance matrix in the computation of the components BMX, BMY, BMQ of the main weight vector BM as well known in the art.”)
when the probability of speech in the first frame is less than a threshold probability, the first frame includes speech associated with multiple audio sources, or the ternary value indicates that the first frame does not include the biometric voice ID. (This portion is simply the inverse of claim 8. It would be obvious that the sounds that are not target sounds are noise.)
Regarding claim 13, Dyrholm discloses: 13. The method of claim 10, further comprising: refraining from determining the covariances of any of the speech component or the noise component of the audio signal when the probability of speech in the first frame is greater than a first threshold probability but less than a second threshold probability or the ternary value indicates that the neural network is undecided as to whether the first frame includes the biometric voice ID. (“[0060]… Preferably, the auxiliary voice detector 35 further provides a no-user-voice activity signal NVAD in dependence on the auxiliary beamformer score EF or the candidate beamformer score EW not exceeding a no-user-voice threshold EN, which is lower than the user-voice threshold EV. Using the auxiliary beamformer score EF or the candidate beamformer score EW for determination of a user-voice activity signal VAD and/or a no-user-voice activity signal NVAD may ensure improved stability of the signaling of user-voice activity, since the criterion used is in principle the same as the criterion for controlling the auxiliary beamformer. …” Dyrholm discloses a user-voice threshold EV, and a lower no-user-voice threshold EN, and that either can be used. Using the higher user-voice threshold would mean that if the signal is between the two thresholds, it would not steer the beamformer (determine covariances).)
Claim 17 is a system claim with limitations corresponding to the limitations of Claim 4 and is rejected under similar rationale.
Claim 18 is a system claim with limitations corresponding to the limitations of Claim 8 and is rejected under similar rationale.
Claims 14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Dyrholm in view of Chen as applied to claim 1 above, and further in view of Stoltze.
Regarding claim 14, Dyrholm in view of Chen does not disclose the additional limitations.
Stoltze discloses: 14. The method of claim 1, further comprising: storing the DOA of the first frame responsive to steering the beam toward the DOA of the first frame; ("CURRENT/MOST RECENT DOA 473" Fig. 4B)
determining a DOA of a second frame of the received audio signal; ("DOA CALC 471" Fig. 4B)
determining whether the DOA of the second frame is within a threshold range of the stored DOA; and ("[0036]...The logic 470 then examines the valid DOA store 472 to determine if the current DOA angle θ calculated by the function 471 falls within a valid DOA range..." )
selectively steering the beam toward the DOA of the second frame based at least in part on whether the DOA of the second frame is within the threshold range of the stored DOA. ("[0036]... and if the current angle θ is a valid DOA, and if the most recent angle θ calculated by the function 471 is different than the current or most recent DOA, then the angle of arrival module 468 sends the currently calculated DOA angle θ to the BF 120." )
Dyrholm, Chen, and Stoltze are considered analogous art to the claimed invention because they disclose voice activity detection. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination with the teaching of Stoltze to steer the beam toward the DOA of the current frame only if that DOA is within a valid range. Doing so would have been beneficial “to steer a microphone array beam such that it captures sound information of interest (voice information from a source located in a sound field of interest) and attenuates other sound, such as sound information from a source that is not located in the sound field of interest.” (Stoltze [0037])
Claim 20 is a system claim with limitations corresponding to the limitations of Claim 14 and is rejected under similar rationale.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JON C MEIS whose telephone number is (703)756-1566. The examiner can normally be reached Monday - Thursday, 8:30 am - 5:30 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JON CHRISTOPHER MEIS/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654