DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The analysis of the claims’ subject matter eligibility will follow the 2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50-57 (January 7, 2019) (“2019 PEG”).
Claim 1.
Step 1: Is the claim to a process, machine, manufacture, or composition of matter? Yes—claim 1 recites a method, which is a process.
Step 2A, prong one: Does the claim recite an abstract idea, law of nature, or natural phenomenon? Yes—each of the limitations identified below, under its broadest reasonable interpretation, falls within the mental-processes grouping of abstract ideas (concepts performed in the human mind, including an observation, evaluation, judgment, or opinion), but for the recitation of generic computer components. See MPEP § 2106.04(a)(2), subsection III, and the 2019 PEG:
“executing the trained regression prediction model on the measurable sound quality of the unrated sound to yield a plurality of predicted pleasantness difference ratings, each predicted pleasantness difference rating corresponding to a respective pairwise comparison between the unrated sound and a respective one of the plurality of sounds” (a mental process/mathematical concept: comparison, mathematical evaluation, evaluation of a statistical formula, and human judgment/prediction).
Step 2A, prong two: Does the claim recite additional elements that integrate the judicial exception into a practical application? No—the judicial exception is not integrated into a practical application.
“receiving a plurality of pleasantness ratings from one or more human jurors, each pleasantness rating corresponding to a respective one of a plurality of sounds emitted by one or more devices” involves the mere gathering of data, which is insignificant extra-solution activity. See MPEP § 2106.05(g).
“detecting, via a microphone system, a plurality of measurable sound qualities, each measurable sound quality associated with a respective one of the plurality of sounds” involves the mere gathering of data, which is insignificant extra-solution activity. See MPEP § 2106.05(g).
“training a regression prediction model based on, for each respective sound, its pleasantness rating and its corresponding measurable sound quality until convergence yields a trained regression prediction model”: The additional elements are directed to training a machine learning model at a high level of generality; therefore the limitations amount to mere instructions to implement an abstract idea on a computer or a recitation of the words "apply it" (or an equivalent). Therefore, the additional element(s) do not integrate the judicial exception into a practical application. See MPEP 2106.05(f).
“detecting, via the microphone system, a measurable sound quality of an unrated sound, wherein the unrated sound has not been rated by the one or more human jurors” involves the mere gathering of data, which is insignificant extra-solution activity. See MPEP § 2106.05(g).
The generic computer components in these steps are recited at a high level of generality (i.e., as generic computer components performing generic computer functions), such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? No—there are no additional limitations beyond the mental processes identified above. The limitations treated above are directed to the well-understood, routine, and conventional activity of storing and retrieving information in memory. See MPEP § 2106.05(d)(II); Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015). The claim also merely recites the words "apply it" (or an equivalent) with the judicial exception, merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f). The remaining additional elements amount to insignificant extra-solution activity, similar to examples of activities that the courts have found to be insignificant extra-solution activity, in accordance with MPEP § 2106.05(g). Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.
Thus, considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. The claim is not patent eligible.
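Purely to illustrate the level of generality at which these steps are recited, the claimed workflow can be implemented end to end with generic, off-the-shelf components. The sketch below is illustrative only and is not part of the record; all data, values, and component choices (Python with scikit-learn) are hypothetical and are not drawn from the application:

```python
# Illustrative sketch only; hypothetical data, generic library components.
import numpy as np
from sklearn.linear_model import LinearRegression

# "receiving a plurality of pleasantness ratings from one or more human jurors"
juror_ratings = np.array([4.2, 3.1, 2.8, 4.9])      # one rating per rated sound

# "detecting ... a plurality of measurable sound qualities" (columns could be
# loudness, tonality, sharpness; values hypothetical)
sound_qualities = np.array([[0.7, 0.2, 0.5],
                            [0.9, 0.6, 0.4],
                            [0.8, 0.8, 0.9],
                            [0.3, 0.1, 0.2]])

# "training a regression prediction model ... until convergence"
model = LinearRegression().fit(sound_qualities, juror_ratings)

# "detecting ... a measurable sound quality of an unrated sound"
unrated_quality = np.array([[0.6, 0.3, 0.4]])

# "executing the trained regression prediction model" and forming pairwise
# difference ratings against each rated sound
predicted = model.predict(unrated_quality)[0]
difference_ratings = predicted - juror_ratings       # one difference per rated sound
```

Each recited step maps onto a single generic library call or arithmetic operation, consistent with the mere-instructions and extra-solution analysis above.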
Claim 2.
Step 1: A method, as above.
Step 2A Prong 1: The claim recites "wherein the plurality of measurable sound qualities includes at least one of loudness, tonality, and sharpness." This limitation merely further specifies the mental process: concepts of observation and evaluation of the labeled data.
Step 2A Prong 2, Step 2B: This judicial exception is not integrated into a practical application. Mere recitation of generic computer components neither integrates the judicial exception into a practical application nor provides an inventive concept.
Claim 3.
Step 1: A method, as above.
Step 2A Prong 1: The claim recites "for each pairwise comparison, combining the predicted pleasantness difference rating with a respective one of the pleasantness ratings to yield a respective summed rating." This limitation merely further specifies the mental process: concepts of observation and evaluation.
Step 2A Prong 2, Step 2B: This judicial exception is not integrated into a practical application. Mere recitation of generic computer components neither integrates the judicial exception into a practical application nor provides an inventive concept.
Claim 4.
Step 1: A method, as above.
Step 2A Prong 1: The claim recites the abstract idea of claim 1.
Step 2A Prong 2, Step 2B: The claim recites "outputting an overall predicted pleasantness rating of the unrated sound based upon an average of the summed ratings," which amounts to mere data output, an insignificant extra-solution activity. See MPEP § 2106.05(g). This judicial exception is not integrated into a practical application. Mere recitation of generic computer components neither integrates the judicial exception into a practical application nor provides an inventive concept.
Claim 5.
Step 1: A method, as above.
Step 2A Prong 1: The claim recites the abstract idea of claim 1.
Step 2A Prong 2, Step 2B: The claim recites "outputting an overall predicted pleasantness rating of the unrated sound based upon a weighted average of the summed ratings," which amounts to mere data output, an insignificant extra-solution activity. See MPEP § 2106.05(g). This judicial exception is not integrated into a practical application. Mere recitation of generic computer components neither integrates the judicial exception into a practical application nor provides an inventive concept.
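For illustration only, the limitations of claims 3-5 (combining each predicted difference with the corresponding juror rating to yield a summed rating, then outputting a simple or weighted average) reduce to elementary arithmetic on a generic computer. The sketch below uses hypothetical names and values, not data from the application:

```python
# Hypothetical illustration of claims 3-5; all values are invented.
import numpy as np

juror_ratings = np.array([4.2, 3.1, 2.8, 4.9])           # ratings of the rated sounds
predicted_differences = np.array([-0.5, 0.6, 0.9, -1.2])  # unrated vs. each rated sound

summed_ratings = juror_ratings + predicted_differences    # claim 3: combine per pair
overall = summed_ratings.mean()                           # claim 4: simple average
weights = np.array([0.4, 0.3, 0.2, 0.1])                  # e.g., per-comparison weights
overall_weighted = np.average(summed_ratings, weights=weights)  # claim 5
```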
Claim 6.
Step 1: A method, as above.
Step 2A Prong 1: The claim recites "determining, via pairwise comparisons, differences between each of the plurality of pleasantness ratings and every other of the plurality of pleasantness ratings; wherein the training of the regression prediction model uses the differences as inputs." This limitation merely further specifies the mental process: concepts of observation and evaluation.
Step 2A Prong 2, Step 2B: This judicial exception is not integrated into a practical application. Mere recitation of generic computer components neither integrates the judicial exception into a practical application nor provides an inventive concept.
Claim 7.
Step 1: A method, as above.
Step 2A Prong 1: The claim recites "wherein the plurality of measurable sound qualities is on a temporal spectrum." This limitation merely further specifies the mental process: concepts of observation and evaluation.
Step 2A Prong 2, Step 2B: This judicial exception is not integrated into a practical application. Mere recitation of generic computer components neither integrates the judicial exception into a practical application nor provides an inventive concept.
Claim 8.
Step 1: A method, as above.
Step 2A Prong 1: The claim recites "wherein the plurality of measurable sound qualities is input into the regression prediction model in a two-dimensional spectra." This limitation merely further specifies the mental process: concepts of observation and evaluation.
Step 2A Prong 2, Step 2B: This judicial exception is not integrated into a practical application. Mere recitation of generic computer components neither integrates the judicial exception into a practical application nor provides an inventive concept.
Claims 9-16.
Step 1: The claims recite a system; therefore, they fall into the statutory category of machines.
Step 2A Prong 1: Claims 9-16 recite the same mental processes as claims 1-8, respectively.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. Claims 9-16 recite generic computer components, namely “a processor programmed to process the plurality of sounds; and a memory storing instructions that, when executed by the processor, cause the processor”. As before, the mere recitation that the method is to be performed on a generic computer amounts to a mere instruction to apply the exception on the computer. See MPEP § 2106.05(f). With that exception, the analysis mirrors that of claims 1-8, respectively.
Step 2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The analysis, with the one exception noted above, mirrors that of claims 1-8, respectively.
Claims 17-20.
Step 1: The claims recite a method; therefore, they fall into the statutory category of processes.
Step 2A Prong 1: Claims 17-20 recite the same mental processes identified for claims 1-8 above.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. As before, the mere recitation that the method is to be performed on a generic computer amounts to a mere instruction to apply the exception on the computer. See MPEP § 2106.05(f). With that exception, the analysis mirrors that of claims 1-8, respectively.
Step 2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The analysis, with the one exception noted above, mirrors that of claims 1-8.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3-9, 11-17, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Serra et al. (US 20230245674 A1) in view of Valentini-Botinhao et al. (“Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks”, 22 Sep 2022, Division of Speech, Music and Hearing).
Regarding claim 1.
Serra teaches a method of predicting a pleasantness of a sound emitted from a device utilizing machine learning, the method comprising: receiving a plurality of pleasantness ratings from one or more human jurors, each pleasantness rating corresponding to a respective one of a plurality of sounds emitted by one or more devices (see ¶ 52, “the proposed method is generally semi-supervised, meaning that it may leverage both ratings obtained from human listeners (e.g., embedded in human annotated data, sometimes also referred to as labeled data) and raw (non-rated) audio as input data (sometimes also referred to as unlabeled data).”);
detecting, via a microphone system, a plurality of measurable sound qualities, each measurable sound quality associated with a respective one of the plurality of sounds (see ¶ 49, “The purpose of an automatic tool (or algorithm) to measure audio quality is to obtain a reliable proxy of human ratings that overcomes the aforementioned investment. There are several automatic tools to measure speech quality of an audio file. Given some input audio, such tools yield a score, typically between 1 and 5, that correlates to some subjective rating of audio quality.”, also see ¶ 53, “Referring to FIG. 1A, a schematic illustration of a (simplified) block diagram of a system 100 for audio quality assessment according to an embodiment of the present disclosure is shown. The system 100 may be composed of an encoding stage (or simply referred to as an encoder) 1010 and an assessment stage 1020. As shown in the example of FIG. 1A, the assessment stage 1020 may comprise a series of “heads” 1021, 1022 and 1023, sometimes (collectively) denoted as H. The different heads will be described in detail below with reference to FIG. 1B.”, also see ¶ 165 and 168, using a smartphone and microphone, also see ¶ 109, “training a deep-learning-based (e.g., neural-network-based) system for determining an indication of an audio quality of an audio input according to an embodiment of the disclosure.”);
training a regression prediction model based on, for each respective sound, its pleasantness rating and its corresponding measurable sound quality until convergence yields a trained regression prediction model (see ¶ 100, “this degradation strength head 1127 (sometimes also referred to as the degradation head to be distinguishable from the classification head 1126 illustrated above) may take latent vectors z and further process them (e.g., through an MLP 1135) to produce an output, e.g., a value between 1 and 5. It may then compute a regression-based loss with the level of degradation that has been introduced to the audio, if available (e.g., from the available label information). In some implementations, this level of degradation may be logged (stored) from an (automatic) degradation algorithm that has been applied prior to training the network/system. In other words, broadly speaking, it may be considered that the loss functions may comprise a seventh loss function indicative of a degradation strength metric, and that the seventh loss function may be calculated based on difference between the label information comprising the respective degradation strength information and the prediction thereof.”, also see ¶ 113, “the method 200 performs step S230 of iteratively training the system to predict the respective label information of the audio samples in the training set. In particular, the training may be performed based on a plurality of loss functions and the plurality of loss functions may be generated to reflect differences between the label information of the audio samples in the training set and the respective predictions thereof, as illustrated above with reference to FIG. 1B.”, also see ¶ 101-106, which teach regression metrics and loss functions, also see ¶ 109, “training a deep-learning-based (e.g., neural-network-based) system for determining an indication of an audio quality of an audio input according to an embodiment of the disclosure.”);
detecting, via the microphone system, a measurable sound quality of an unrated sound, wherein the unrated sound has not been rated by the one or more human jurors (see ¶ 140, “the trained system may then be used or operated for determining a quality indication metric for an input audio. Reference is now made to FIG. 3, where a flowchart illustrating an example of a method 300 of training a deep-learning-based (e.g., neural-network-based) system for determining an indication of an audio quality of an audio input according to an embodiment of the disclosure is shown.”);
and executing the trained regression prediction model on the measurable sound quality of the unrated sound (see ¶ 68-69, “[i]t may be considered that the loss functions may comprise a second loss function indicative of a pairwise ranking metric, and that the second loss function may be calculated based on the difference between the label information (e.g., ranking established by the label information) comprising the relative degradation information and the prediction thereof… under the notion of pairwise ranking, if a speech signal x.sub.j is a programmatically (algorithmically) degraded version of the same (originally ‘clean’, or ‘cleaner’) utterance x.sub.i, then their scores should reflect such relation, that is, s.sub.i≥s.sub.j. This notion may then be introduced in a training schema by considering learning-to-rank strategies.”, also see ¶ 84, “the pairs of audio frames/signals {x.sub.i,x.sub.j} 1142 may be generated as illustrated above during the calculation of pairwise ranking or in any other suitable means”).
Serra does not specifically teach yielding a plurality of predicted pleasantness difference ratings, each predicted pleasantness difference rating corresponding to a respective pairwise comparison between the unrated sound and a respective one of the plurality of sounds.
Valentini-Botinhao teaches yielding a plurality of predicted pleasantness difference ratings, each predicted pleasantness difference rating corresponding to a respective pairwise comparison between the unrated sound and a respective one of the plurality of sounds (see page 2, “converting MUSHRA scores derived from several listening evaluations to pairwise preference scores. We evaluate our system using unseen data (different voices and synthesizers) and compare it to MOS Net [14]… To convert MUSHRA scores into pairwise preference scores we compared the scores of every pair of stimuli belonging to the same MUSHRA screen as shown to the left in Fig. 1.”, also see page 4, “These represent the state of the art in publicly available models for MOS score prediction on synthetic speech. The MOS scores predicted by MOS Net were converted to pairwise preferences by checking which of two paired stimuli had the higher predicted MOS score.”, also see page 4, conclusion).
Both Serra and Valentini-Botinhao pertain to the problem of sound quality assessment and are therefore analogous art. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Serra and Valentini-Botinhao to teach the above limitations. The motivation for doing so would be: “This paper has introduced PrefNet, an approach to predicting pairwise preferences between synthetic speech stimuli. We demonstrated how data from side-by-side evaluations using numerical ratings can be leveraged to create training data for pairwise preference prediction. We empirically investigated several architectures and described a design rooted in twin neural nets that ensures consistent pairwise preferences if the order of the inputs is reversed. Results showed that GRU-based architectures outperformed those using attention to align and score stimulus pairs, and that our anti-symmetric network design also improved accuracy.” (see Valentini-Botinhao, page 4, conclusion).
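For context on the cited anti-symmetric twin-network design, the following is a simplified PyTorch sketch of the general idea described in the Valentini-Botinhao conclusion (a shared encoder whose pairwise preference score flips sign when the two inputs are swapped). It is a stand-in for discussion only, not the reference's actual PrefNet architecture; the feature dimension and layer sizes are hypothetical:

```python
# Simplified twin network with anti-symmetric pairwise output: f(a, b) = -f(b, a).
import torch
import torch.nn as nn

class TwinPreferenceNet(nn.Module):
    def __init__(self, n_features: int = 64):
        super().__init__()
        # One shared ("twin") encoder scores each stimulus representation.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Difference of shared scores is anti-symmetric by construction,
        # giving consistent preferences if the input order is reversed.
        return self.encoder(a) - self.encoder(b)

net = TwinPreferenceNet()
x_unrated, x_rated = torch.randn(1, 64), torch.randn(1, 64)
diff = net(x_unrated, x_rated)            # predicted preference/difference rating
assert torch.allclose(diff, -net(x_rated, x_unrated))
```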
Regarding claim 3.
Serra and Valentini-Botinhao teach the method of claim 1,
Valentini-Botinhao further teaches, for each pairwise comparison, combining the predicted pleasantness difference rating with a respective one of the pleasantness ratings to yield a respective summed rating (see page 4, “The results are presented in Table 3. Rows refer to training material and columns to the test set. We present scores at both stimulus level and system level. For system-level results we calculated the accuracy across all sentences of each system pair… These represent the state of the art in publicly available models for MOS score prediction on synthetic speech. The MOS scores predicted by MOSNet were converted to pairwise preferences by checking which of two paired stimuli had the higher predicted MOS score.”).
The motivation utilized in the combination of claim 1, supra, applies equally to claim 3.
Regarding claim 4.
Serra and Valentini-Botinhao teach the method of claim 3,
Valentini-Botinhao further teaches outputting an overall predicted pleasantness rating of the unrated sound based upon an average of the summed ratings (see page 4, “The results are presented in Table 3. Rows refer to training material and columns to the test set. We present scores at both stimulus level and system level. For system-level results we calculated the accuracy across all sentences of each system pair… These represent the state of the art in publicly available models for MOS score prediction on synthetic speech. The MOS scores predicted by MOSNet were converted to pairwise preferences by checking which of two paired stimuli had the higher predicted MOS score.”).
The motivation utilized in the combination of claim 1, supra, applies equally to claim 4.
Regarding claim 5.
Serra and Valentini-Botinhao teach the method of claim 3,
Valentini-Botinhao further teaches outputting an overall predicted pleasantness rating of the unrated sound based upon a weighted average of the summed ratings (see page 4, “The results are presented in Table 3. Rows refer to training material and columns to the test set. We present scores at both stimulus level and system level. For system-level results we calculated the accuracy across all sentences of each system pair… These represent the state of the art in publicly available models for MOS score prediction on synthetic speech. The MOS scores predicted by MOSNet were converted to pairwise preferences by checking which of two paired stimuli had the higher predicted MOS score.”).
The motivation utilized in the combination of claim 1, supra, applies equally to claim 5.
Regarding claim 6.
Serra and Valentini-Botinhao teach the method of claim 1,
Serra further teaches determining, via pairwise comparisons, differences between each of the plurality of pleasantness ratings and every other of the plurality of pleasantness ratings, wherein the training of the regression prediction model uses the differences as inputs (see ¶ 68-69, “[i]t may be considered that the loss functions may comprise a second loss function indicative of a pairwise ranking metric, and that the second loss function may be calculated based on the difference between the label information (e.g., ranking established by the label information) comprising the relative degradation information and the prediction thereof… under the notion of pairwise ranking, if a speech signal x.sub.j is a programmatically (algorithmically) degraded version of the same (originally ‘clean’, or ‘cleaner’) utterance x.sub.i, then their scores should reflect such relation, that is, s.sub.i≥s.sub.j. This notion may then be introduced in a training schema by considering learning-to-rank strategies.”, also see ¶ 84, “the pairs of audio frames/signals {x.sub.i,x.sub.j} 1142 may be generated as illustrated above during the calculation of pairwise ranking or in any other suitable means”, also see ¶ 100, “It may then compute a regression-based loss with the level of degradation that has been introduced to the audio, if available (e.g., from the available label information). In some implementations, this level of degradation may be logged (stored) from an (automatic) degradation algorithm that has been applied prior to training the network/system. In other words, broadly speaking, it may be considered that the loss functions may comprise a seventh loss function indicative of a degradation strength metric, and that the seventh loss function may be calculated based on difference between the label information comprising the respective degradation strength information and the prediction thereof.”).
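To clarify the learning-to-rank notion quoted from Serra (if x.sub.j is a degraded version of x.sub.i, the predicted scores should satisfy s.sub.i ≥ s.sub.j), a standard pairwise ranking loss can encode that constraint during training. The sketch below uses torch.nn.MarginRankingLoss as one common formulation; Serra's exact loss may differ, and all values are hypothetical:

```python
# One standard pairwise ranking loss; values are hypothetical.
import torch

s_i = torch.tensor([4.1], requires_grad=True)  # predicted score, cleaner signal x_i
s_j = torch.tensor([4.5], requires_grad=True)  # predicted score, degraded signal x_j
target = torch.ones(1)                         # +1 encodes "s_i should exceed s_j"

ranking_loss = torch.nn.MarginRankingLoss(margin=0.5)
loss = ranking_loss(s_i, s_j, target)  # positive here, since s_i < s_j + margin
loss.backward()                        # gradients push s_i up and s_j down
```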
Regarding claim 7.
Serra and Valentini-Botinhao teach the method of claim 1,
Valentini-Botinhao further teaches wherein the plurality of measurable sound qualities is on a temporal spectrum (see page 2, “Fig. 2 shows the general architecture of PrefNet. Each input waveform is processed by an encoder network made of a mel spectogram extraction layer (with fixed coefficients) followed by several convolutional layers (which are shared across the different inputs).”; i.e., the mel spectrogram measures sound over time).
The motivation utilized in the combination of claim 1, supra, applies equally to claim 7.
Regarding claim 8.
Serra and Valentini-Botinhao teach the method of claim 1,
Valentini-Botinhao further teaches wherein the plurality of measurable sound qualities is input into the regression prediction model in a two-dimensional spectra (see page 2, “Fig. 2 shows the general architecture of PrefNet. Each input waveform is processed by an encoder network made of a mel spectogram extraction layer (with fixed coefficients) followed by several convolutional layers (which are shared across the different inputs).”; i.e., the mel spectrogram represents sound as a two-dimensional time-frequency array that is processed by convolutional layers).
The motivation utilized in the combination of claim 1, supra, applies equally to claim 8.
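For clarity on this reading of claims 7-8: a mel spectrogram is a two-dimensional (mel-frequency by time) array, i.e., sound qualities on a temporal spectrum presented as two-dimensional spectra suitable for convolutional layers. The sketch below uses standard librosa calls with hypothetical parameters that are not taken from the cited reference:

```python
# A mel spectrogram is a 2D time-frequency input; parameters hypothetical.
import numpy as np
import librosa

y = np.random.randn(16000).astype(np.float32)          # 1 s of placeholder audio
mel = librosa.feature.melspectrogram(y=y, sr=16000, n_mels=64)
print(mel.shape)   # (64, n_frames): a two-dimensional input for conv layers
```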
Regarding claim 9.
Serra teaches a system for predicting a pleasantness of a sound emitted from a device utilizing machine learning, the system comprising: a microphone configured to detect a plurality of sounds emitted by one or more devices (see ¶ 49, “The purpose of an automatic tool (or algorithm) to measure audio quality is to obtain a reliable proxy of human ratings that overcomes the aforementioned investment. There are several automatic tools to measure speech quality of an audio file. Given some input audio, such tools yield a score, typically between 1 and 5, that correlates to some subjective rating of audio quality.”, also see ¶ 53, “Referring to FIG. 1A, a schematic illustration of a (simplified) block diagram of a system 100 for audio quality assessment according to an embodiment of the present disclosure is shown. The system 100 may be composed of an encoding stage (or simply referred to as an encoder) 1010 and an assessment stage 1020. As shown in the example of FIG. 1A, the assessment stage 1020 may comprise a series of “heads” 1021, 1022 and 1023, sometimes (collectively) denoted as H. The different heads will be described in detail below with reference to FIG. 1B.”, also see ¶ 165 and 168, using a smartphone and microphone, also see ¶ 109, “training a deep-learning-based (e.g., neural-network-based) system for determining an indication of an audio quality of an audio input according to an embodiment of the disclosure.”); a processor programmed to process the plurality of sounds; and a memory storing instructions that, when executed by the processor (see ¶ 37, “an apparatus including a processor and a memory coupled to the processor is provided. The processor may be adapted to cause the apparatus to carry out all steps of the example methods described throughout the disclosure.”), cause the processor to:
receive a plurality of pleasantness ratings from one or more human jurors, each pleasantness rating corresponding to a respective one of the plurality of sounds, detect a plurality of measurable sound qualities, each measurable sound quality associated with a respective one of the plurality of sounds detected by the microphone (see ¶ 52, “the proposed method is generally semi-supervised, meaning that it may leverage both ratings obtained from human listeners (e.g., embedded in human annotated data, sometimes also referred to as labeled data) and raw (non-rated) audio as input data (sometimes also referred to as unlabeled data).”, see ¶ 49, “The purpose of an automatic tool (or algorithm) to measure audio quality is to obtain a reliable proxy of human ratings that overcomes the aforementioned investment. There are several automatic tools to measure speech quality of an audio file. Given some input audio, such tools yield a score, typically between 1 and 5, that correlates to some subjective rating of audio quality.”, also see ¶ 53, “Referring to FIG. 1A, a schematic illustration of a (simplified) block diagram of a system 100 for audio quality assessment according to an embodiment of the present disclosure is shown. The system 100 may be composed of an encoding stage (or simply referred to as an encoder) 1010 and an assessment stage 1020. As shown in the example of FIG. 1A, the assessment stage 1020 may comprise a series of “heads” 1021, 1022 and 1023, sometimes (collectively) denoted as H. The different heads will be described in detail below with reference to FIG. 1B.”, also see ¶ 165 and 168, using a smartphone and microphone, also see ¶ 109, “training a deep-learning-based (e.g., neural-network-based) system for determining an indication of an audio quality of an audio input according to an embodiment of the disclosure.”),
train a regression prediction model based on, for each respective sound, its pleasantness rating and its corresponding measurable sound quality until convergence yields a trained regression prediction model (see ¶ 100, “this degradation strength head 1127 (sometimes also referred to as the degradation head to be distinguishable from the classification head 1126 illustrated above) may take latent vectors z and further process them (e.g., through an MLP 1135) to produce an output, e.g., a value between 1 and 5. It may then compute a regression-based loss with the level of degradation that has been introduced to the audio, if available (e.g., from the available label information). In some implementations, this level of degradation may be logged (stored) from an (automatic) degradation algorithm that has been applied prior to training the network/system. In other words, broadly speaking, it may be considered that the loss functions may comprise a seventh loss function indicative of a degradation strength metric, and that the seventh loss function may be calculated based on difference between the label information comprising the respective degradation strength information and the prediction thereof.”, also see ¶ 113, “the method 200 performs step S230 of iteratively training the system to predict the respective label information of the audio samples in the training set. In particular, the training may be performed based on a plurality of loss functions and the plurality of loss functions may be generated to reflect differences between the label information of the audio samples in the training set and the respective predictions thereof, as illustrated above with reference to FIG. 1B.”, also see ¶ 101-106, which teach regression metrics and loss functions, also see ¶ 109, “training a deep-learning-based (e.g., neural-network-based) system for determining an indication of an audio quality of an audio input according to an embodiment of the disclosure.”), detect a measurable sound quality of an unrated sound, wherein the unrated sound has not been rated by the one or more jurors (see ¶ 140, “the trained system may then be used or operated for determining a quality indication metric for an input audio. Reference is now made to FIG. 3, where a flowchart illustrating an example of a method 300 of training a deep-learning-based (e.g., neural-network-based) system for determining an indication of an audio quality of an audio input according to an embodiment of the disclosure is shown.”), and execute the trained regression prediction model on the measurable sound quality of the unrated sound (see ¶ 68-69, “[i]t may be considered that the loss functions may comprise a second loss function indicative of a pairwise ranking metric, and that the second loss function may be calculated based on the difference between the label information (e.g., ranking established by the label information) comprising the relative degradation information and the prediction thereof… under the notion of pairwise ranking, if a speech signal x.sub.j is a programmatically (algorithmically) degraded version of the same (originally ‘clean’, or ‘cleaner’) utterance x.sub.i, then their scores should reflect such relation, that is, s.sub.i≥s.sub.j. This notion may then be introduced in a training schema by considering learning-to-rank strategies.”, also see ¶ 84, “the pairs of audio frames/signals {x.sub.i,x.sub.j} 1142 may be generated as illustrated above during the calculation of pairwise ranking or in any other suitable means”).
Serra does not specifically teach yielding a plurality of predicted pleasantness difference ratings, each predicted pleasantness difference rating corresponding to a respective pairwise comparison between the unrated sound and a respective one of the plurality of sounds.
Valentini-Botinhao teaches yielding a plurality of predicted pleasantness difference ratings, each predicted pleasantness difference rating corresponding to a respective pairwise comparison between the unrated sound and a respective one of the plurality of sounds (see page 2, “converting MUSHRA scores derived from several listening evaluations to pairwise preference scores. We evaluate our system using unseen data (different voices and synthesizers) and compare it to MOS Net [14]… To convert MUSHRA scores into pairwise preference scores we compared the scores of every pair of stimuli belonging to the same MUSHRA screen as shown to the left in Fig. 1.”, also see page 4, “These represent the state of the art in publicly available models for MOS score prediction on synthetic speech. The MOS scores predicted by MOS Net were converted to pairwise preferences by checking which of two paired stimuli had the higher predicted MOS score.”, also see page 4, conclusion).
Both Serra and Valentini-Botinhao pertain to the problem of sound quality, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Serra and Valentini-Botinhao to teach the above limitations. The motivation for doing so would be “This paper has introduced PrefNet, an approach to predicting pairwise preferences between synthetic speech stimuli. We demonstrated how data from side-by-side evaluations using numerical ratings can be leveraged to create training data for pairwise preference prediction. We empirically investigated several architectures and described a design rooted in twin neural nets that ensures consistent pairwise preferences if the order of the inputs is reversed. Results showed that GRU-based architectures outperformed those using attention to align and score stimulus pairs, and that our anti-symmetric network design also improved accuracy.” (see Valentini-Botinhao page 4, conclusion).
Claims 11-16 recite a system to perform the methods recited in claims 3-8, respectively. Therefore, the rejections of claims 3-8 above apply equally here.
Regarding claim 17.
Serra teaches a method of predicting a pleasantness of a sound emitted from a device utilizing machine learning, the method comprising:
receiving a plurality of pleasantness ratings from one or more human jurors, each pleasantness rating corresponding to a respective one of a plurality of sounds emitted by one or more devices (see ¶ 52, “the proposed method is generally semi-supervised, meaning that it may leverage both ratings obtained from human listeners (e.g., embedded in human annotated data, sometimes also referred to as labeled data) and raw (non-rated) audio as input data (sometimes also referred to as unlabeled data).”);
detecting, via a microphone system, a plurality of measurable sound qualities, each measurable sound quality associated with a respective one of the plurality of sounds (see ¶ 49, “The purpose of an automatic tool (or algorithm) to measure audio quality is to obtain a reliable proxy of human ratings that overcomes the aforementioned investment. There are several automatic tools to measure speech quality of an audio file. Given some input audio, such tools yield a score, typically between 1 and 5, that correlates to some subjective rating of audio quality.”, also see ¶ 53, “Referring to FIG. 1A, a schematic illustration of a (simplified) block diagram of a system 100 for audio quality assessment according to an embodiment of the present disclosure is shown. The system 100 may be composed of an encoding stage (or simply referred to as an encoder) 1010 and an assessment stage 1020. As shown in the example of FIG. 1A, the assessment stage 1020 may comprise a series of “heads” 1021, 1022 and 1023, sometimes (collectively) denoted as H. The different heads will be described in detail below with reference to FIG. 1B.”, also see ¶ 165 and 168, using a smartphone and microphone, also see ¶ 109, “training a deep-learning-based (e.g., neural-network-based) system for determining an indication of an audio quality of an audio input according to an embodiment of the disclosure.”);
detecting, via the microphone system, a measurable sound quality of an unrated sound, wherein the unrated sound has not been rated by the one or more human jurors (see ¶ 140, “the trained system may then be used or operated for determining a quality indication metric for an input audio. Reference is now made to FIG. 3, where a flowchart illustrating an example of a method 300 of training a deep-learning-based (e.g., neural-network-based) system for determining an indication of an audio quality of an audio input according to an embodiment of the disclosure is shown.”);
executing a regression prediction model on the measurable sound quality of the unrated sound (see ¶ 68-69, “[i]t may be considered that the loss functions may comprise a second loss function indicative of a pairwise ranking metric, and that the second loss function may be calculated based on the difference between the label information (e.g., ranking established by the label information) comprising the relative degradation information and the prediction thereof… under the notion of pairwise ranking, if a speech signal x.sub.j is a programmatically (algorithmically) degraded version of the same (originally ‘clean’, or ‘cleaner’) utterance x.sub.i, then their scores should reflect such relation, that is, s.sub.i≥s.sub.j. This notion may then be introduced in a training schema by considering learning-to-rank strategies.”, also see ¶ 84, “the pairs of audio frames/signals {x.sub.i,x.sub.j} 1142 may be generated as illustrated above during the calculation of pairwise ranking or in any other suitable means”, see ¶ 100, “this degradation strength head 1127 (sometimes also referred to as the degradation head to be distinguishable from the classification head 1126 illustrated above) may take latent vectors z and further process them (e.g., through an MLP 1135) to produce an output, e.g., a value between 1 and 5. It may then compute a regression-based loss with the level of degradation that has been introduced to the audio, if available (e.g., from the available label information). In some implementations, this level of degradation may be logged (stored) from an (automatic) degradation algorithm that has been applied prior to training the network/system. In other words, broadly speaking, it may be considered that the loss functions may comprise a seventh loss function indicative of a degradation strength metric, and that the seventh loss function may be calculated based on difference between the label information comprising the respective degradation strength information and the prediction thereof.”, also see ¶ 113, “the method 200 performs step S230 of iteratively training the system to predict the respective label information of the audio samples in the training set. In particular, the training may be performed based on a plurality of loss functions and the plurality of loss functions may be generated to reflect differences between the label information of the audio samples in the training set and the respective predictions thereof, as illustrated above with reference to FIG. 1B.”, also see ¶ 101-106, which teach regression metrics and loss functions, also see ¶ 109, “training a deep-learning-based (e.g., neural-network-based) system for determining an indication of an audio quality of an audio input according to an embodiment of the disclosure.”).
Serra does not specifically teach yielding a plurality of predicted pleasantness difference ratings, each predicted pleasantness difference rating corresponding to a respective pairwise comparison between the unrated sound and a respective one of the plurality of sounds; for each pairwise comparison, combining the predicted pleasantness difference rating with a respective one of the pleasantness ratings to yield a respective summed rating; and outputting an overall predicted pleasantness rating of the unrated sound based upon an average of the summed ratings.
Valentini-Botinhao teaches yielding a plurality of predicted pleasantness difference ratings, each predicted pleasantness difference rating corresponding to a respective pairwise comparison between the unrated sound and a respective one of the plurality of sounds (see page 2, “converting MUSHRA scores derived from several listening evaluations to pairwise preference scores. We evaluate our system using unseen data (different voices and synthesizers) and compare it to MOS Net [14]… To convert MUSHRA scores into pairwise preference scores we compared the scores of every pair of stimuli belonging to the same MUSHRA screen as shown to the left in Fig. 1.”, also see page 4, “These represent the state of the art in publicly available models for MOS score prediction on synthetic speech. The MOS scores predicted by MOS Net were converted to pairwise preferences by checking which of two paired stimuli had the higher predicted MOS score.”, also see page 4, conclusion); for each pairwise comparison, combining the predicted pleasantness difference rating with a respective one of the pleasantness ratings to yield a respective summed rating (see page 4, “The results are presented in Table 3. Rows refer to training material and columns to the test set. We present scores at both stimulus level and system level. For system-level results we calculated the accuracy across all sentences of each system pair… These represent the state of the art in publicly available models for MOS score prediction on synthetic speech. The MOS scores predicted by MOSNet were converted to pairwise preferences by checking which of two paired stimuli had the higher predicted MOS score.”);
and outputting an overall predicted pleasantness rating of the unrated sound based upon an average of the summed ratings (see page 4, “The results are presented in Table 3. Rows refer to training material and columns to the test set. We present scores at both stimulus level and system level. For system-level results we calculated the accuracy across all sentences of each system pair… These represent the state of the art in publicly available models for MOS score prediction on synthetic speech. The MOS scores predicted by MOSNet were converted to pairwise preferences by checking which of two paired stimuli had the higher predicted MOS score.”).
Both Serra and Valentini-Botinhao pertain to the problem of sound quality assessment and are therefore analogous art. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Serra and Valentini-Botinhao to teach the above limitations. The motivation for doing so would be: “This paper has introduced PrefNet, an approach to predicting pairwise preferences between synthetic speech stimuli. We demonstrated how data from side-by-side evaluations using numerical ratings can be leveraged to create training data for pairwise preference prediction. We empirically investigated several architectures and described a design rooted in twin neural nets that ensures consistent pairwise preferences if the order of the inputs is reversed. Results showed that GRU-based architectures outperformed those using attention to align and score stimulus pairs, and that our anti-symmetric network design also improved accuracy.” (see Valentini-Botinhao, page 4, conclusion).
Claims 19-20 recite limitations to perform the methods recited in claims 5 and 8, respectively. Therefore, the rejections of claims 5 and 8 above apply equally here.
Claims 2, 10, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Serra et al. (US 20230245674 A1) in view of Valentini-Botinhao et al. (“Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks”, 22 Sep 2022, Division of Speech, Music and Hearing), and further in view of Tsunoda et al. (JP 2005037559 A). (Note: See the attached description for paragraph numbers.)
Regarding claim 2.
Serra and Valentini-Botinhao teach the method of claim 1,
Serra and Valentini-Botinhao do not teach the limitations of claim 2.
Tsunoda teaches wherein the plurality of measurable sound qualities includes at least one of loudness, tonality, and sharpness (see ¶ 14, “A sound pressure level value, a loudness value, a sharpness value, a tonality value, an impulsiveness value of an acoustic physical quantity obtained from the sound emitted from the image forming apparatus when image formation is performed on the image forming target sheet”).
Serra, Valentini-Botinhao, and Tsunoda all pertain to the problem of sound quality assessment and are therefore analogous art. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine Serra, Valentini-Botinhao, and Tsunoda to teach the above limitations. The motivation for doing so would be to collect the loudness, sharpness, and tonality of a sound, because those metrics can be used to help predict subjective quality metrics such as psychological discomfort (Tsunoda, ¶ 14).
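For context on the Tsunoda metrics, the sketch below computes rough, hypothetical proxies for loudness, sharpness, and tonality from a recorded signal. The standardized psychoacoustic definitions (e.g., Zwicker loudness) are substantially more involved; these stand-ins only illustrate that such qualities are measurable quantities:

```python
# Rough proxies only; not the standardized psychoacoustic metrics.
import numpy as np
import librosa

y = np.random.randn(16000).astype(np.float32)   # placeholder recording
sr = 16000

loudness_proxy = 20 * np.log10(np.sqrt(np.mean(y**2)) + 1e-12)          # dB RMS level
sharpness_proxy = librosa.feature.spectral_centroid(y=y, sr=sr).mean()  # high-frequency emphasis
tonality_proxy = 1.0 - librosa.feature.spectral_flatness(y=y).mean()    # ~1 tonal, ~0 noise-like
```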
Claims 10 and 18 recite limitations to perform the method recited in claim 2. Therefore, the rejection of claim 2 above applies equally here.
Related art:
Taniai et al. (US 12374095 B2) teaches a neural network module that includes an extraction operation to extract an element satisfying a predetermined condition from a set of targets. In the machine learning, the model generation apparatus performs the extraction operation in a phase of forward propagation with the neural network module, and replaces, in a phase of backpropagation, the extraction operation with a differentiable alternative operation and differentiates the alternative operation to compute an approximate gradient corresponding to differentiation for the extraction operation.
Li et al. (US 20230282188 A1) teaches generating a beatbox transcript. Some examples may include: receiving an audio signal having a plurality of beatbox sounds, generating a spectrogram of the audio signal, processing the spectrogram of the audio signal with a neural network model trained on training samples including beatbox sounds, generating, by the neural network model, a beatbox sound activation map including a plurality of activation times for a plurality of beatbox sounds, decoding the beatbox sound activation map into a beatbox transcript, and providing the beatbox transcript as an output.
Yao et al. (US 20230056955 A1) teaches obtaining data characteristics of audio data to be processed by extracting features from user preference data including the audio data to be processed; and, based on the data characteristics, generating a sound quality processing result of the audio to be processed by using a trained baseline model, wherein the baseline model is a neural network model trained by using audio data, behavioral data, and other relevant data from multiple users or a single user.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IMAD M KASSIM whose telephone number is (571)272-2958. The examiner can normally be reached 10:30AM-5:30PM, M-F (E.S.T.).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J. Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/IMAD KASSIM/Primary Examiner, Art Unit 2129