DETAILED ACTION
This office action is in response to Applicant’s Amendment/Request for Reconsideration, received on 09/18/2025. Claims 1, 3-9, 11, 14-17, and 19 have been amended. All claims are pending and have been considered.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement(s) submitted on 11/02/2023 is/are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Response to Arguments
Applicant’s arguments, see pg. 1, filed 09/18/2025, with respect to Claim Objections have been fully considered and are persuasive. The objections of claims 4 and 19 have been withdrawn.
Applicant’s arguments, see pgs. 1-3, filed 09/18/2025, with respect to “Claims 1-20 Recite Eligible Subject Matter” have been fully considered and are persuasive. The rejections of claims 1-20 under 35 U.S.C. 101 have been withdrawn. The examiner notes that the step of digitizing machine-generated voice audio samples, wherein the machine-generated voice audio sample includes human-imperceptible audio artifacts, indicates that this step cannot be performed mentally, as a listening user would be unable to accurately digitize unheard audio artifacts. Further, because the digitization is recited with no specific mathematical equations, it is unreasonable to assume that the digitization step is a generic mathematical operation; therefore, the independent claims, with all associated dependent claims, contain eligible subject matter.
Applicant's arguments filed 09/18/2025, see pgs. 3-5, with respect to “Claims 1, 3, 8, 11, 14, and 15 are Allowable” have been fully considered but they are not persuasive.
Applicant’s Representative asserts,
“Claims 1, 3, 8, 11, 14, and 15 were rejected under 35 U.S.C. §102 as being anticipated by Klingler. The Office Action compared the claimed features to the feature extractor and the sound classifier of Klingler that purportedly discriminated between artificial sound and natural sounds, and in particular the discernment of human speech (FIG. 5A), instrument-generated sound (FIG. 5B), and sound output from a television set (FIG. 5C). Applicant respectfully submits that the quantitative discrimination of human-imperceptible audio artifacts inherent in sound generated from a loudspeaker set forth in independent Claims 1, 11, and 14 represents a substantial departure over the spatial signature, sound class diversity, and optional historical database-matching evaluation described in Klingler.
As shown in FIG. 4 and described in paragraphs [0037]-[0045] of Klingler, the cited art is understood to disclose a sound class feature extractor (22a) that determines sound type, such as speech, music, laughter, cheering, explosions, sounds made by particular objects, and so on. Additionally, there is a directional feature extractor that determines spatial signatures or characteristics, such as the existence of a dynamic sound source or static sound source. This directional feature extractor appears to determine spatial correlation or spatial covariance based upon signals from multiple microphones and their time direction of arrival characteristics. Klingler also appears to disclose a distortional feature extractor that determines spectral characteristics of the audio signal.
Moreover, the natural versus artificial sound discriminator may optionally utilize a database of historical sound data, the metadata of which may be compared to classify an audio signal between natural versus artificial. The classifications may then be evaluated in their totality to determine artificial or natural sound sources: low sound class variance, dynamic location, and low distortion sound being classified as natural (FIG. 5A); low sound class variance, static location, and low distortion also being classified as natural (FIG. 5B); and high sound class variance, static location, and high distortion being classified as artificial (FIG. 5C). Klingler is understood to require a multi-mic array or multiple-device cooperation, and is heavy on computational resources. Further, a historical database and feature logging are needed to be effective.
In contrast, the embodiments of the present disclosure utilize intrinsic signal characteristics that are introduced by a loudspeaker, an artificial source. These include ringing, or residual oscillations from transducer mechanics; vibration artifacts, including low-frequency rumble or enclosure-induced coloring; distortion from digital-to-analog conversion, amplifiers, and/or speaker cones; non-flat frequency responses typical in budget loudspeakers, with sharp drop-offs at low or high frequencies; and compression artifacts. The methods and systems recited in Claims 1, 11, and 14 recite the quantitative discrimination of the audio artifacts present in the machine-generated voice audio digital samples, and do not depend on speaker placement, content type, or motion. The embodiments of the methods and systems are therefore operable with a single microphone on low-power edge devices, and are contemplated for the real-time filtering of commands that may appear to originate from a human utterance but are actually from a machine-generated source. Unlike Klingler, no spatial analysis or external data is necessary to be effective.
Applicant amends Claims 1, 11, and 14 to clarify these distinguishing features, with support for such amendments being found in, for example, paragraphs [0035] and [0037]-[0039]. Because Klingler is not understood to teach or suggest such features as clarified, Applicant respectfully submits that Claims 1, 11, and 14 are not anticipated thereby. To the extent Claims 3 and 8 depend from allowable base Claim 1 and recite additional features of the method, and to the extent Claim 15 depends from allowable base Claim 14 and recites additional features of the system, Applicant respectfully submits that such claims are also allowable.
As a further basis for distinction, the selectively activatable operating modes have been clarified in Claim 14. As described in paragraphs [0041]-[0042], the system may be configured to operate in a direct-only mode, where the system responds only to human voices; a machine-only mode, where the system responds only to playback voices; or a hybrid mode, where the system responds to both human voices and machine-generated voices. These may be set and switched dynamically, and a clarification to that effect is being submitted in amended Claim 14, which now recites the operating modes "including a direct voice action mode, a machine-generated voice action mode, and a hybrid mode." This feature represents a departure over the cited art, as it allows for the customization of the end product to different deployment settings, such as hospitals, homes, and retail kiosks, that involve different behaviors. Furthermore, end users are able to configure different behavior without firmware updates or deep system changes. A unitary configuration of the device may be deployed, and providers can support multiple behaviors with software mode selection. Privacy-minded users may enable the direct voice action mode to prevent speaker-initiated commands, while retail kiosk owners may prefer the machine-generated voice action mode for touchless interaction triggered by audio advertisements. The hybrid mode may be preferred for home automation purposes.
Applicant understands that Gopala has been cited against Claim 16 for its purported disclosure of the machine-only mode, while Klingler has been cited for the purported disclosure of the human voice-only mode. Further, Sieracki has been cited against Claim 17 for its purported disclosure of the hybrid mode. None of Klingler, Gopala, or Sieracki, however, teaches or suggests switchable operating modes, especially among the human voice-only, machine-generated only, and hybrid modes in which the system can be set to operate. This feature enables end-product customization to accommodate a wide range of use cases that may utilize different behaviors, empowers users to configure the necessary behavior, and offers deployment flexibility for developers because a single platform may be applied to multiple devices. For the foregoing reasons, Applicant respectfully submits that Claims 14 and 15 are also allowable over Klingler, as well as the other cited art.”
In response, the examiner respectfully disagrees with Applicant that the proposed amendments overcome the cited art of the previous rejections. Specifically, new sections of Klingler will be referenced for the amendments to independent claims 1, 11, and 14. The additional amendment made to claim 14 regarding “switchable operating modes” (as disclosed in Applicant’s remarks) will be addressed separately following the examiner’s arguments regarding where/how Klingler still teaches every element of the amended independent claims 1 and 11.
Regarding the amendments made to claims 1 and 11, Applicant asserts that the intrinsic signal characteristics introduced by a loudspeaker, as recited by the amended claims, distinguish the claims from Klingler. In response, the examiner refers to [0042] of Klingler, which discloses feature extractors, i.e. a distortional feature extractor, that determine spectral characteristics of received audio, and which also discloses that artificial sounds from loudspeakers contain “harmonic distortion patterns that may be detectable. In other cases, the recorded sound from a loudspeaker that is playing back a decoded audio program contains detectable distortion due to communication channel encoding and decoding, bit rate reduction compression and decompression, and certain noise signatures”. Therefore, it appears to the examiner that Klingler does disclose a machine-generated voice audio sample(s) including human-imperceptible audio artifacts intrinsically present in sound produced by a loudspeaker, as determined by the distortional feature extractor of Klingler.
Further, Applicant asserts that the “quantitative discrimination of the audio artifacts present in the machine-generated voice audio digital samples…does not depend on speaker placement, content type, or motion”, as opposed to that disclosed in Klingler. With respect to this argument, the amended claim does not introduce the specificity that would be required to overcome Klingler. Reciting a “quantitative discrimination of [the] audio artifacts”, wherein the audio artifacts represent things “intrinsically present in sound produced by a loudspeaker”, does not specify, indicate, or even suggest what the audio artifacts to be discriminated comprise. Because of this, an interpretation of the audio artifacts as including at least speaker placement, content type, or motion, i.e. characteristics that can all be quantitatively represented, is not unreasonable. Further, in view of the previously discussed distortional feature extractor responsible for identifying/extracting specific loudspeaker features, the examiner would like to introduce the sound classifier 26 of Klingler. [0045] of Klingler discloses, “The classifier 26 receives the plurality of features, as well as previously stored sound metadata. For example, these features and the historical data can be used as inputs to the neural network, which can make a decision on whether the sound is natural vs. artificial.” A comparison of extracted features of a current sound, which can be the distortion features produced intrinsically by loudspeakers, to historical features indicates a quantitative discrimination of audio artifacts, i.e. a decision from a neural network is quantitatively represented in that neural network and/or the overall sound discriminator. That is, a current machine-generated voice feature can be compared to a historical directly-generated voice feature for determining the decision for the current machine-generated audio sample without extending beyond the disclosure of Klingler. As such, the amendments made to the claims are still taught by Klingler, as respectfully asserted by the examiner. See updated rejections below.
Regarding the additional amendment to claim 14 concerning the operating modes of the discriminator, the examiner refers to the specific claim language of the proposed amendment in view of Applicant’s argument regarding the amendment. Specifically, there appears to be a disconnect between these two sets of text. The claim recites “selectively activat[ing] one…operating mode” of a plurality of operating modes. Applicant’s arguments refer to a “switchability” between modes, though this is not something claimed. Selectively activating one mode does not require an additional switch to another mode, nor does it suggest a switch from an original mode to the activated mode. As such, the claim element requires only one mode to be selectively activated. This is previously disclosed by at least Klingler, Gopala, and/or Sieracki, as discussed by Applicant on pg. 5 of the remarks. See updated rejection below.
In response to applicant's argument that the references fail to show certain features of the invention, it is noted that the features upon which applicant relies (i.e., switching between operating modes) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
Applicant's arguments filed 09/18/2025, see pgs. 5-6, with regard to “Dependent Claims 2, 4, 5, 6, 9, 10, 12, 13, 16, 17, 18, and 19 are Allowable Over the Cited Art” have been fully considered but they are not persuasive. In view of the examiner’s assertion that Klingler still teaches all elements of the amended claims, the dependent claims also maintain their status as rejected under Klingler and/or the additional combinations of previously cited art.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1, 3, 8, 11, 14, 15, and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Klingler et al. (US-20200090644-A1), hereinafter Klingler.
Regarding claim 1, Klingler discloses: a method for discriminating between direct and machine-generated human voice (Abstract, classifier that has a machine learning model which is configured to determine a sound classification, such as artificial versus natural for the sound [natural sounds track to “direct” and artificial sounds track to “machine-generated”]), the method comprising:
capturing on a microphone a directly-generated voice audio sample from a human utterance ([Fig. 5A, person speaking 302]);
capturing on the microphone a machine-generated voice audio sample outputted by a loudspeaker from a pre-recording of another human utterance ([Fig. 5C, sounds from speakers 306 of a television 308], where, [0028] audio signal is originating from a loudspeaker in another device [i.e. television], and, [0049] the classifier 26 in this case may recognize… a high variety of TV-like sounds (cars driving, humans speaking) [Recognizing humans speaking from TV speakers indicates a pre-recording of another human utterance as the audio sample]), the machine-generated voice audio sample including human-imperceptible audio artifacts intrinsically present in sound produced by a loudspeaker ([0042] loudspeakers often generate harmonic distortion patterns that may be detectable. In other cases, the recorded sound from a loudspeaker that is playing back a decoded audio program contains detectable distortion due to communication channel encoding and decoding, bit rate reduction compression and decompression, and certain noise signatures, [Disclosing harmonic distortion patterns of speakers in the same paragraph as the distortional feature extractor indicates the distortion(s) extracted as a feature(s) is an intrinsic distortion from the loudspeaker. Further, detecting this distortion using a feature extractor (responsible for determining spectral characteristics) indicates the feature representing the distortion to not be perceptible to humans, only the feature extractor, i.e. a human cannot determine spectral characteristics of a received audio through listening]);
digitizing the directly-generated voice audio sample and the machine-generated voice audio sample to corresponding digital samples ([0021] Each of the feature extractors 22 generally is configured to determine a specific feature, aspect, or characteristic of the audio signals by digitally processing the audio signals, [Digital processing indicates a required digitization of the received audio signals 18 before being sent into the feature extractors 22, wherein the received audio signals contain both real and fake audio samples as required for classification of received sounds into ‘natural vs. artificial’, see Fig. 4. Further, extracting features from digital signals indicates the extracted features to be digital samples]);
extracting, with a machine learning feature extractor ([0021] a feature extractor 22 can apply one or more algorithms or models (e.g., including machine learning models)), discriminative features between the directly-generated voice audio sample and the machine-generated voice audio sample ([Fig. 4, Natural vs Artificial Sound Discriminator], [0045] The natural versus artificial sound discriminator includes a classifier 26, employing a neural network or other suitable machine or supervised learning, whose output may be a natural vs. artificial decision. The classifier 26 receives the plurality of features, as well as previously stored sound metadata. For example, these features and the historical data can be used as inputs to the neural network [Discriminating between natural and artificial sound using historical classification data indicates a comparison between directly-generated, i.e. natural, and machine-generated, i.e. artificial, audio to create an output decision as to the appropriate classification]) based upon a quantitative discrimination of the audio artifacts present therein from a processing of the directly-generated voice audio digital samples and the machine-generated audio digital samples ([0026] The classifier 26 may also access or otherwise receive historical sound data or information (e.g., including previously stored sound metadata) from the database 30, and can determine a classification of the audio signals 18 based on this historical sound data or information (in addition to an input feature vector). For example, the historical data can be used as one or more inputs for the machine learning model 28. However, if the classifier 26 determines that an input feature vector is similar to one or more features of audio signals that were previously classified (e.g., as natural vs. 
artificial), then the classifier 26 can execute a shortcut (e.g., bypassing application of the machine learning model 28) and determine the classification directly based on the historical data, [Determining the classification of a signal to be natural v. artificial based on a feature vector comparison, wherein features are represented quantitatively in vectors, indicates the classification operation to be a quantitative discrimination of audio artifacts from processing a directly-generated voice and machine-generated voice audio samples, i.e. comparing a directly-generated voice feature vector to a historical machine-generated feature vector or any other combination of directly/machine generated feature comparison]); and,
selectively generating a response to a command in the captured directly-generated voice audio sample or the captured machine-generated voice audio sample ([0029] if it is determined that the sound is an artificial sound at action 112, the electronic device may be prevented from performing one or more actions or functions (at action 114). However, if it is determined that the sound is a natural sound, the electronic device may be allowed to take one or more actions or execute a function (at action 116)).
Regarding claim 3, Klingler discloses: the method of claim 1.
Klingler further discloses:
training the machine learning feature extractor with an audio classifier using a first class of voice data from audio captured directly from a human and a second class of voice data from audio captured from the loudspeaker ([Fig. 3, 206 training the machine learning model to classify the data as natural, i.e. directly from a human (first class), or artificial, i.e. data from audio captured from the loudspeaker (second class)], [In view of the loudspeaker 306]).
Regarding claim 8, Klingler discloses: the method of claim 1.
Klingler further discloses:
wherein one of the discriminative features of the machine-generated voice audio frequency response is distortion ([0022] The features determined by the feature extractors 22 include… distortion features (e.g., whether an audio signal has been subjected to dynamic range compression, or whether any spectral characteristics show some type of artificial signature, etc.) [Wherein features from the feature extractors 22 are used by the classifier 26 to classify, i.e. discriminate, the sound as natural or artificial]);
Regarding claim 11, Klingler discloses: a system for discriminating between direct and machine-generated human voices (Abstract, classifier that has a machine learning model which is configured to determine a sound classification, such as artificial versus natural for the sound [natural sounds track to “direct” and artificial sounds track to “machine-generated”]), the system comprising:
a microphone capturing both directly-generated voice audio samples from a human utterance and a machine-generated voice audio samples outputted by a loudspeaker from a pre-recording of the human utterance ([0017] one or more microphones 14, where, [Fig. 5A, 5C] represent a user 302, i.e. directly-generated voice audio sample from a human utterance, and machine-generated voice audio sample output by a loudspeaker 306 ([0005] loudspeakers of a television) from a pre-recording of a human utterance, indicating the one microphone 14 can be used to capture both audio types]), the machine-generated voice audio sample including human-imperceptible audio artifacts intrinsically present in sound produced by a loudspeaker ([0042] loudspeakers often generate harmonic distortion patterns that may be detectable. In other cases, the recorded sound from a loudspeaker that is playing back a decoded audio program contains detectable distortion due to communication channel encoding and decoding, bit rate reduction compression and decompression, and certain noise signatures, [Disclosing harmonic distortion patterns of speakers in the same paragraph as the distortional feature extractor indicates the distortion extracted as a feature is an intrinsic distortion from the loudspeaker. Further, detecting this distortion using a feature extractor (responsible for determining spectral characteristics) indicates the feature representing the distortion to not be perceptible to humans, only the feature extractor, i.e. a human cannot determine spectral characteristics of a received audio through listening]);
an analog-to-digital converter ([0017] The electronic device 12 includes one or more microphones 14 (e.g., an array of microphones as shown) which are transducers configured to receive a sound field that is in the ambient environment of the device 12, and in response provide one or more audio signals 18 corresponding thereto, [Transducers transforming a sound field (generally represented as analog or as digital through ADC conversion) into audio signals 18, wherein the audio signals are clearly digitized (see below mapping), indicates the microphones/transducers contain some form of ADC conversion for the following digital signal processing, otherwise the signals received by the microphones could not be directly analyzed by the feature extractors]) digitizing the directly-generated voice audio sample and the machine-generated voice audio sample to corresponding digital samples ([0021] Each of the feature extractors 22 generally is configured to determine a specific feature, aspect, or characteristic of the audio signals by digitally processing the audio signals, [Digital processing indicates a required digitization of the received audio signals (indicating an inherent analog-to-digital conversion) 18 before being sent into the feature extractors 22, wherein the received audio signals contain both real and fake audio samples as required for classification of received sounds into ‘natural vs. artificial’, see Fig. 4. Further, extracting features from digital signals indicates the extracted features to be digital samples]);
a machine learning classifier receptive to the directly-generated voice audio digital samples and the machine-generated voice audio digital samples ([Fig. 1, Sound Classifier 26 containing machine learning model 28, in view of the previously received directly-generated and machine-generated audio signals 18]), the machine learning classifier deriving discriminative features between the directly-generated voice audio samples and the machine-generated voice audio samples and classifying as either directly generated or machine generated ([0045] The natural versus artificial sound discriminator includes a classifier 26, employing a neural network or other suitable machine or supervised learning, whose output may be a natural vs. artificial decision) based upon a quantitative discrimination of the audio artifacts present therein from a processing of the directly-generated voice audio digital samples and the machine-generated audio digital samples ([0026] The classifier 26 may also access or otherwise receive historical sound data or information (e.g., including previously stored sound metadata) from the database 30, and can determine a classification of the audio signals 18 based on this historical sound data or information (in addition to an input feature vector). For example, the historical data can be used as one or more inputs for the machine learning model 28. However, if the classifier 26 determines that an input feature vector is similar to one or more features of audio signals that were previously classified (e.g., as natural vs. artificial), then the classifier 26 can execute a shortcut (e.g., bypassing application of the machine learning model 28) and determine the classification directly based on the historical data, [Determining the classification of a signal to be natural v. 
artificial based on a feature vector comparison, wherein features are represented quantitatively in vectors, indicates the classification operation to be a quantitative discrimination of audio artifacts from processing a directly-generated voice and machine-generated voice audio samples, i.e. comparing a directly-generated voice feature vector to a historical machine-generated feature vector or any other combination of directly/machine generated feature comparison]).
Regarding claim 14, Klingler discloses: a system for discriminating between direct and machine-generated human voices (Abstract, classifier that has a machine learning model which is configured to determine a sound classification, such as artificial versus natural for the sound [natural sounds track to “direct” and artificial sounds track to “machine-generated”]), the system comprising:
a microphone capturing both directly-generated voice audio samples from a human utterance and a machine-generated voice audio samples outputted by a loudspeaker from a pre-recording of the human utterance as input audio samples ([0017] one or more microphones 14, where, [Fig. 5A, 5C] represent a user 302, i.e. directly-generated voice audio sample from a human utterance, and machine-generated voice audio sample output by a loudspeaker 306 ([0005] loudspeakers of a television) from a pre-recording of a human utterance, indicating the one microphone 14 can be used to capture both audio types]), the machine-generated voice audio sample including human-imperceptible audio artifacts intrinsically present in sound produced by a loudspeaker ([0042] loudspeakers often generate harmonic distortion patterns that may be detectable. In other cases, the recorded sound from a loudspeaker that is playing back a decoded audio program contains detectable distortion due to communication channel encoding and decoding, bit rate reduction compression and decompression, and certain noise signatures, [Disclosing harmonic distortion patterns of speakers in the same paragraph as the distortional feature extractor indicates the distortion extracted as a feature is an intrinsic distortion from the loudspeaker. Further, detecting this distortion using a feature extractor (responsible for determining spectral characteristics) indicates the feature representing the distortion to not be perceptible to humans, only the feature extractor, i.e. a human cannot determine spectral characteristics of a received audio through listening]);
an analog-to-digital converter ([0017] microphones 14 (e.g., an array of microphones as shown) which are transducers configured to receive a sound field that is in the ambient environment of the device 12, and in response provide one or more audio signals 18 corresponding thereto, [Transducers transforming a sound field (generally represented as analog or as digital through ADC conversion) into audio signals 18, wherein the audio signals are clearly digitized (see below mapping), indicates the microphones/transducers contain some form of ADC conversion for the following digital signal processing, otherwise the signals received by the microphones could not be directly analyzed by the feature extractors]) digitizing the directly-generated voice audio sample and the machine-generated voice audio sample to corresponding digital samples ([0021] Each of the feature extractors 22 generally is configured to determine a specific feature, aspect, or characteristic of the audio signals by digitally processing the audio signals, [Digital processing indicates a required digitization of the received audio signals (indicating an inherent analog-to-digital conversion) 18 before being sent into the feature extractors 22, wherein the received audio signals contain both real and fake audio samples as required for classification of received sounds into ‘natural vs. artificial’, see Fig. 4. Further, extracting features from digital signals indicates the extracted features to be digital samples]);
a machine learning classifier receptive to the input audio digital samples ([Fig. 1, Sound Classifier 26 containing machine learning model 28, in view of the previously received directly-generated and machine-generated audio signals 18]), the machine learning classifier deriving discriminative features of the audio artifacts between the directly-generated voice audio digital samples and the machine-generated voice audio digital samples and identifying the input audio samples as either directly generated or machine generated based upon the derived discriminative features of the audio artifacts ([0045] The natural versus artificial sound discriminator includes a classifier 26, employing a neural network or other suitable machine or supervised learning, whose output may be a natural vs. artificial decision); and,
a command processor connected to the machine learning classifier ([0020] The various components or modules shown in FIG. 1, e.g., the feature extractors 22, the classifier 26, etc. of the system can include computer programmable instructions, workflows, etc. that can be stored in memory and executed or accessed by one or more processors [Processors responsible for classification 26 containing machine learning model 28 indicates a connection]), the command processor selectively generating responses to commands in the input audio samples depending upon selectively activated one of operating modes including a direct voice action mode, a machine-generated voice action mode, and a hybrid mode ([0029] if it is determined that the sound is an artificial sound at action 112, the electronic device may be prevented from performing one or more actions or functions (at action 114). However, if it is determined that the sound is a natural sound, the electronic device may be allowed to take one or more actions or execute a function (at action 116), [The examiner would like to note that due to the disjunctive nature of the operating modes, they do not all require a mapping. Further, operating to determine artificial sound and natural sound indicates the system of Klingler to be operating in at least a hybrid mode inherently containing all other modes]).
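As an illustrative aside (not part of Klingler, the claims, or the record of this application), the disjunctive mode gating discussed in the mapping above can be sketched as follows; the mode names and the `process_command` helper are hypothetical:

```python
# Hypothetical sketch of mode-gated command processing: a classifier decision
# ("natural" or "artificial") is checked against a selectively activated
# operating mode before any action is allowed.
from enum import Enum

class Mode(Enum):
    DIRECT = "direct"    # respond only to directly generated (natural) voice
    MACHINE = "machine"  # respond only to machine-generated (artificial) voice
    HYBRID = "hybrid"    # respond to both classes

def process_command(classification: str, mode: Mode) -> bool:
    """Return True if the command should be executed.

    `classification` is the classifier output: "natural" or "artificial".
    """
    if mode is Mode.HYBRID:
        return True
    if mode is Mode.DIRECT:
        return classification == "natural"
    return classification == "artificial"  # Mode.MACHINE
```

Under this sketch, a hybrid mode executes every command regardless of classification, consistent with the examiner's note that a system determining both artificial and natural sound operates in at least a hybrid mode.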
Regarding claim 15, Klingler discloses: the system of claim 14.
Klingler further discloses:
wherein in the direct voice action mode the command processor generates a response to the command when the input audio sample is identified as directly generated ([0029] if it is determined that the sound is a natural sound, the electronic device may be allowed to take one or more actions or execute a function (at action 116) [Natural sounds track to those directly generated by a human]).
Regarding claim 19, Klingler discloses: the system of claim 14.
Klingler further discloses:
an audio sample classifier training the machine learning classifier using a first class of voice data corresponding to directly-generated voice audio samples and a second class of voice data corresponding to machine-generated voice audio samples ([Fig. 3, 206 training the machine learning model to classify the data as natural, i.e. directly from a human (first class), or artificial, i.e. data from audio captured from the loudspeaker (second class)]).
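For illustration only (a hypothetical sketch, not the neural network of Klingler's Fig. 3), the two-class training mapped above can be shown with a toy nearest-centroid stand-in; all names and sample values are invented:

```python
# Hypothetical two-class training sketch: one class for directly generated
# (natural) voice data, one class for machine-generated (artificial) voice
# data captured from a loudspeaker. A nearest-centroid model stands in for
# the machine learning classifier.
from statistics import fmean

def train(features_natural, features_artificial):
    # Compute one centroid (mean feature vector) per class.
    centroid = lambda rows: [fmean(col) for col in zip(*rows)]
    return {"natural": centroid(features_natural),
            "artificial": centroid(features_artificial)}

def classify(model, feature_vec):
    # Return the label of the nearest centroid (squared Euclidean distance).
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(feature_vec, c))
    return min(model, key=lambda label: dist(model[label]))
```

A real system would replace the centroids with a trained neural network, but the two-class structure of the training data is the same.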
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 2, 12, 16, 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Klingler in view of Gopala et al. (US-20210074305-A1), hereinafter Gopala.
Regarding claim 2, Klingler discloses: the method of claim 1.
Klingler does not disclose:
wherein the machine learning feature extractor is selected from a group consisting of: a multilayer perceptron (MLP), a convolutional neural network (CNN), and a recurrent neural network (RNN).
Gopala discloses:
wherein the machine learning feature extractor is selected from a group consisting of: a multilayer perceptron (MLP), a convolutional neural network (CNN), and a recurrent neural network (RNN) ([0072] identification engine 124 may analyze the representation using a predetermined neural network (such as a convolutional neural network, a recurrent neural network, one or more multi-layer perceptrons) [Wherein the identification engine is used to classify audio content, in view of the discriminator of Klingler]).
Klingler and Gopala are considered analogous art within real/synthetic speech discrimination. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Klingler to incorporate the teachings of Gopala, because Gopala teaches specific neural network architectures for performing the audio discrimination process, resulting in improved quality of generated speech and thus improving the ability of the system to identify synthetic speech as that speech becomes closer to natural (Gopala, [0111]).
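As a purely illustrative aside (not Gopala's implementation), a multilayer perceptron, the first member of the claimed Markush group, can be sketched as a single-hidden-layer forward pass; the weights shown are arbitrary placeholders:

```python
# Hypothetical MLP forward pass: one hidden layer with sigmoid activations
# and a sigmoid output score. Weights and biases here are placeholders,
# not trained values from any cited reference.
import math

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    # Hidden-layer activations: one sigmoid unit per weight row.
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # Output score, e.g., a probability that the input is machine generated.
    z = sum(wo * h for wo, h in zip(w_out, hidden)) + b_out
    return sigmoid(z)
```

CNN and RNN members of the group differ only in how the features are combined (convolutions over time-frequency patches, or recurrence over frames) before a similar output score is produced.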
Regarding claim 12, Klingler discloses: the system of claim 11.
Klingler does not disclose:
wherein the machine learning classifier is selected from a group consisting of: a multilayer perceptron (MLP), a convolutional neural network (CNN), and a recurrent neural network (RNN).
Gopala discloses:
wherein the machine learning classifier is selected from a group consisting of: a multilayer perceptron (MLP), a convolutional neural network (CNN), and a recurrent neural network (RNN) ([0072] identification engine 124 may analyze the representation using a predetermined neural network (such as a convolutional neural network, a recurrent neural network, one or more multi-layer perceptrons) [Wherein the identification engine is used to classify audio content, in view of the discriminator of Klingler]).
Klingler and Gopala are considered analogous art within real/synthetic speech discrimination. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Klingler to incorporate the teachings of Gopala, because Gopala teaches specific neural network architectures for performing the audio discrimination process, resulting in improved quality of generated speech and thus improving the ability of the system to identify synthetic speech as that speech becomes closer to natural (Gopala, [0111]).
Regarding claim 16, Klingler discloses: the system of claim 14.
Klingler does not disclose:
wherein in the machine generated voice action mode the command processor generates a response to the command when the input audio sample is identified as machine generated.
Gopala discloses:
wherein in the machine generated voice action mode the command processor generates a response to the command when the input audio sample is identified as machine generated ([0100] Furthermore, the computer system may selectively perform a remedial action (operation 218) based at least in part on the classification. For example, the remedial action may include one or more of: providing a warning associated with the audio content [In view of the real/fake audio determination 216 of Gopala, the option to provide a warning indicates an acknowledgement that the input audio is fake, i.e. output a warning that the audio detected is not real in response to a machine-generated classification]).
Klingler and Gopala are considered analogous art within real/synthetic speech discrimination. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Klingler to incorporate the teachings of Gopala, because Gopala teaches specific neural network architectures for performing the audio discrimination process, resulting in improved quality of generated speech and thus improving the ability of the system to identify synthetic speech as that speech becomes closer to natural (Gopala, [0111]).
Regarding claim 20, Klingler discloses: the system of claim 14.
Klingler does not disclose:
wherein the machine learning classifier is selected from a group consisting of: a multilayer perceptron (MLP), a convolutional neural network (CNN), and a recurrent neural network (RNN).
Gopala discloses:
wherein the machine learning classifier is selected from a group consisting of: a multilayer perceptron (MLP), a convolutional neural network (CNN), and a recurrent neural network (RNN) ([0072] identification engine 124 may analyze the representation using a predetermined neural network (such as a convolutional neural network, a recurrent neural network, one or more multi-layer perceptrons) [Wherein the identification engine is used to classify audio content, in view of the discriminator of Klingler]).
Klingler and Gopala are considered analogous art within real/synthetic speech discrimination. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Klingler to incorporate the teachings of Gopala, because Gopala teaches specific neural network architectures for performing the audio discrimination process, resulting in improved quality of generated speech and thus improving the ability of the system to identify synthetic speech as that speech becomes closer to natural (Gopala, [0111]).
Claim(s) 4, 5, 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Klingler in view of Binkowski et al. (US-20210089909-A1), hereinafter Binkowski.
Regarding claim 4, Klingler discloses: the method of claim 3.
Klingler does not disclose:
wherein training the machine learning feature extractor includes adding one or more types of noise signals to either or both the audio captured directly from a human and audio captured from the loudspeaker to enhance the machine learning feature extractor to operate over diverse environmental conditions.
Binkowski discloses:
wherein training the machine learning feature extractor includes adding one or more types of noise signals to either or both the audio captured directly from a human and audio captured from the loudspeaker to enhance the machine learning feature extractor to operate over diverse environmental conditions ([Fig. 1, Noise input 104 to generate synthesized audio output 112, i.e. generated audio captured from a loudspeaker in view of the loudspeaker of Klingler, all in training system 100], [0025] The noise input 104 can ensure variability in the audio output 112, [Variability in the audio output tracks to diverse environmental conditions, as the noise is the varying feature. The examiner would like to note that, due to the disjunctive nature of the claims, the “both” and “either” conditions (in terms of applying noise to the real audio sample 108, i.e. the audio directly from a human) do not have to be met]).
Klingler and Binkowski are considered analogous art within real/synthetic speech discrimination. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Klingler to incorporate the teachings of Binkowski, because Binkowski teaches updating the parameters of the audio output through added noise to improve the realism of synthesized speech, thereby increasing the accuracy of the discriminator (Binkowski, [0036]).
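For illustration only (a hypothetical sketch, not Binkowski's training system 100), the noise-augmentation rationale mapped above can be shown as adding random noise to clean training samples; the function name, scale, and sample values are invented:

```python
# Hypothetical noise-augmentation sketch: zero-mean Gaussian noise is added
# to each training sample so the feature extractor is exposed to varied
# (i.e., more diverse) signal conditions during training.
import random

def augment_with_noise(samples, noise_scale=0.01, seed=None):
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, noise_scale) for s in samples]
```

In practice such augmentation is applied to the training set of either class (or both) before feature extraction, consistent with the disjunctive claim language.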
Regarding claim 5, Klingler discloses: the method of claim 1.
Klingler does not disclose:
wherein one of the discriminative features of the machine-generated voice audio frequency response is a non-flat frequency response in an audible frequency band.
Binkowski discloses:
wherein one of the discriminative features of the machine-generated voice audio frequency response is a non-flat frequency response in an audible frequency band ([0025] the noise input 104 can be randomly sampled from a predetermined distribution, e.g., a normal distribution [Randomly sampling a noise input indicates an extremely low likelihood of a flat frequency response. Comparing real audio sample 108 to audio output 112, in view of the speech input of Klingler, indicates that the noise will be in an audible frequency band so as to affect the speech, which is inherently in an audible frequency band. Further, updating parameters based on prediction results indicates that the random, non-flat frequency response serves as a discriminative feature to improve predictions]).
Klingler and Binkowski are considered analogous art within real/synthetic speech discrimination. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Klingler to incorporate the teachings of Binkowski, because Binkowski teaches updating the parameters of the audio output through added noise to improve the realism of synthesized speech, thereby increasing the accuracy of the discriminator (Binkowski, [0036]).
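As an illustrative aside (not drawn from Binkowski), the "non-flat frequency response" feature discussed above is commonly quantified with spectral flatness, the ratio of the geometric to the arithmetic mean of the power spectrum; this naive-DFT sketch is hypothetical:

```python
# Hypothetical spectral-flatness sketch: flatness near 1.0 indicates a
# noise-like (flat) spectrum; flatness near 0.0 indicates a tonal (peaky),
# i.e. non-flat, spectrum. The naive DFT is O(n^2) and for illustration only.
import cmath
import math
from statistics import fmean

def dft_power(signal):
    # Power at bins 1 .. n//2 - 1 (DC bin excluded).
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(1, n // 2)]

def spectral_flatness(power, eps=1e-12):
    # Geometric mean over arithmetic mean; eps guards log(0) at empty bins.
    geo = math.exp(fmean(math.log(p + eps) for p in power))
    return geo / fmean(p + eps for p in power)
```

A pure tone yields flatness near zero while broadband noise yields a markedly higher value, so the measure can serve as one discriminative feature of the kind the rejection describes.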
Regarding claim 9, Klingler discloses: the method of claim 1.
Klingler does not disclose:
wherein one of the discriminative features of the machine-generated voice audio sample is added noise.
Binkowski discloses:
wherein one of the discriminative features of the machine-generated voice audio sample is added noise ([Fig. 1, Noise Input 104], [Having noise 104 added to synthesized speech 112, i.e. machine-generated, to be sent into a discriminator network indicates the added noise is used as a discriminative feature to make a prediction 122 as to whether the audio is real or synthetic]).
Klingler and Binkowski are considered analogous art within real/synthetic speech discrimination. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Klingler to incorporate the teachings of Binkowski, because Binkowski teaches updating the parameters of the audio output through added noise to improve the realism of synthesized speech, thereby increasing the accuracy of the discriminator (Binkowski, [0036]).
Claim(s) 6, 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Klingler in view of Moffat (US-20160044429-A1).
Regarding claim 6, Klingler discloses: the method of claim 1.
Klingler does not disclose:
wherein one of the discriminative features of the machine-generated voice audio frequency response is a ringing.
Moffat discloses:
wherein one of the discriminative features of the machine-generated voice audio frequency response is a ringing ([0126] the playing of an audio waveform that accelerates and then decelerates the sound-producing portion of a speaker may do so to such a degree that the speaker exhibits a ringing output that deviates from the sound specified in the waveform, and, the recording of the speaker's ringing output with a microphone. Such a recorded pattern of ringing may form a signature unique to the speaker which generated the sound, and to the microphone that recorded it, [In view of Moffat’s system as a “computing device identification”, indicating that the ringing is used as a discriminative feature to determine which device produced audio based on the ringing pattern, in view of the loudspeaker 306 producing audio of Klingler]).
Klingler and Moffat are considered analogous art within audio device discrimination. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Klingler to incorporate the teachings of Moffat, because Moffat teaches processing audio waveforms for specific characteristics and attributes, such as ringing, increasing the accuracy of device identification in a multi-device recording environment (Moffat, [0080]).
Regarding claim 10, Klingler discloses: the method of claim 3.
Klingler does not disclose:
wherein the machine learning feature extractor is trained using voice data from audio captured from a plurality of different loudspeakers, each having a unique set of sound reproduction characteristics.
Moffat discloses:
wherein the machine learning feature extractor is trained using voice data from audio captured from a plurality of different loudspeakers, each having a unique set of sound reproduction characteristics ([0164] FIGS. 6A through 6C illustrate examples of audio generated by sound-generation systems of different computer systems, according to examples. The pattern, magnitude and extent of ringing may vary between individual personal computing devices [In view of the feature extractors and discrimination training of Klingler, further in view of the loudspeaker of Klingler, a system is indicated which can identify the variations between audio devices, i.e. loudspeakers, each having unique sound reproduction characteristics to be used for discrimination training, i.e. using the device identification of Moffat with the audio generated by a plurality of different audio devices of Moffat]).
Klingler and Moffat are considered analogous art within audio device discrimination. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Klingler to incorporate the teachings of Moffat, because Moffat teaches processing audio waveforms for specific characteristics and attributes, such as ringing, increasing the accuracy of device identification in a multi-device recording environment (Moffat, [0080]).
Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Klingler in view of Vondersaar et al. (US-20240153518-A1), hereinafter Vondersaar.
Regarding claim 7, Klingler discloses: the method of claim 1.
Klingler does not disclose:
wherein one of the discriminative features of the machine-generated voice audio frequency response is a vibration.
Vondersaar discloses:
wherein one of the discriminative features of the machine-generated voice audio frequency response is a vibration ([0108] The digital vibration signal 102 or gating or flag signal 122 may trigger or otherwise facilitate the discrimination of the sound voiced by the user 12 and the sound voiced by others in the digital voice stream 130 by the user voice/distractor discriminator 122 [Using vibration to discriminate two audio signals, in view of the human and machine generated audio of Klingler, indicates that the generated vibration is used as a discriminative feature of the machine-generated voice audio]).