DETAILED ACTION
Introduction
Applicant's submission filed on 12/10/2025 has been entered. Claims 1-3 and 5-7 are pending in the application and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The response filed on 12/10/2025 has been entered and considered in this Office Action. Claims 1-3 and 5-7 have been examined. Claim 4 has been cancelled.
Applicant's amendments to claim 1, incorporating the limitations of cancelled claim 4, overcome the 35 U.S.C. 101 rejections previously set forth in the Non-Final Office Action mailed 9/15/2025. Dependent claims 3 and 7 likewise overcome those rejections based on their dependency from amended claim 1. Therefore, the above-referenced rejections under 35 U.S.C. 101 are withdrawn.
The change to the Specification title as filed on 12/10/2025 overcomes the objection to the Title of the invention.
Response to Arguments
Applicant's arguments filed 12/10/2025 have been fully considered as follows:
Applicant’s arguments with respect to claim 1 (also representative of claim 7) state that
“In the present application, the examiner primarily relies on Zhang to teach the claimed invention. The examiner concedes that Zhang fails to disclose "a point is defined as an initial value point, the point being in the vector space and representing a series of feature amounts of the input data for learning, a function is defined as a score function, the function having a point x in the vector space as an independent variable and indicating a gradient of a path from the point x to a nearest stationary point that is a stationary point on the target feature amount distribution function and is a stationary point nearest to the initial value point". The examiner relies on Chen to teach this aspect.... Rather, Chen generates speech signals from random numbers. ”
The examiner respectfully disagrees. Chen teaches: "WaveGrad network architecture. The inputs consists of the mel-spectrogram conditioning signal x, the noisy waveform generated from the previous iteration yn, and the noise level √ᾱ. The model produces εn at each iteration, which can be interpreted as the direction to update yn." (Chen, Fig. 3, pg. 4). The input data signal is a mel-spectrogram, which is based on a voice signal and is processed further together with the noisy signal yn using a denoising method, as indicated in Chen, section 2. Therefore, Chen teaches "a point is defined as an initial value point, the point being in the vector space and representing a series of feature amounts of the input data for learning, a function is defined as a score function, the function having a point x in the vector space as an independent variable and indicating a gradient of a path from the point x to a nearest stationary point that is a stationary point on the target feature amount distribution function and is a stationary point nearest to the initial value point," and accordingly the rejections of claims 1, 5, 6 and 7 under 35 U.S.C. 103 are sustained and further updated below. The examiner recommends including details on the score approximation such that it does not necessarily have to acquire the form of the target feature amount distribution function p(x) in advance as prior information at the time of learning.
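For clarity of the record, the Stein score function relied upon in Chen (pg. 2) can be sketched in general form (the examiner's illustrative notation, not a quotation from Chen): the score is the gradient of the data log-density with respect to the data point, and Langevin dynamics moves a sample along this gradient toward a nearby mode (stationary point) of the density.

```latex
% Stein score function (general form; notation illustrative)
s(y) = \nabla_{y} \log p(y)
% Langevin dynamics update, moving y toward a stationary point of p(y)
y_{t+1} = y_t + \frac{\eta}{2}\, s(y_t) + \sqrt{\eta}\, z_t,
\qquad z_t \sim \mathcal{N}(0, I)
```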
In response to the art rejections of the remaining dependent claims under 35 U.S.C. 103: to the extent those claims are traversed for the same reasons presented in the Remarks filed 12/10/2025 with respect to independent claims 1, 5, 6 and 7, the examiner respectfully directs Applicant to the responses provided above for claims 1, 5, 6 and 7. For at least the same reasons, the examiner respectfully disagrees; Applicant's arguments have been fully considered but are not persuasive.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 5 and 6 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
According to USPTO guidelines, a claim is directed to non-statutory subject matter if:
STEP 1: the claim does not fall within one of the four statutory categories of invention (process, machine, manufacture, or composition of matter), or
STEP 2: the claim recites a judicial exception (e.g. an abstract idea) without reciting additional elements that amount to significantly more than the judicial exception, as determined using the following analysis:
STEP 2A (Prong 1): Does the claim recite an abstract idea, law of nature, or natural phenomenon? The guidelines provide three groupings of subject matter that are considered abstract ideas:
Mathematical concepts- mathematical relationships, formulas or equations, calculations
Certain methods of organizing human activity- fundamental economic principles or practices, commercial or legal interactions, managing personal behavior or relationships or interactions between people
Mental processes- concepts that are practicably performed in the human mind (including an observation, evaluation, judgement, or opinions)
STEP 2A (Prong 2): Does the claim recite additional elements that integrate the judicial exception into a practical application? The guidelines provide the following exemplary considerations that are indicative that an additional element (or combination of elements) may have integrated the judicial exception into a practical application:
an additional element reflects an improvement in the functioning of a computer, or an improvement to other technology or technical field;
an additional element that applies or uses a judicial exception to affect a particular treatment or prophylaxis for a disease or medical condition;
an additional element implements a judicial exception with, or uses a judicial exception in conjunction with, a particular machine or manufacture that is integral to the claim;
an additional element effects a transformation or reduction of a particular article to a different state or thing; and
an additional element applies or uses the judicial exception in some other meaningful way beyond generally linking the use of the judicial exception to a particular technological environment, such that the claim as a whole is more than a drafting effort designed to monopolize the exception.
While the guidelines further state that the exemplary considerations are not an exhaustive list and that there may be other examples of integrating the exception into a practical application, the guidelines also list examples in which a judicial exception has not been integrated into a practical application:
an additional element merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea;
an additional element adds insignificant extra-solution activity to the judicial exception; and
an additional element does no more than generally link the use of a judicial exception to a particular technological environment or field of use.
STEP 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? Consider whether an additional element or combination of elements:
adds a specific limitation or combination of limitations that are not well-understood, routine, or conventional activity in the field, which is indicative that an inventive concept may be present; or
simply appends well-understood, routine and conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception, which is indicative that an inventive concept may not be present.
Using the two-step inquiry, claim 6 is directed to an abstract idea as shown below:
STEP 1: Does the claim fall within one of the four statutory categories of invention?
YES. Claim 6 is directed to a method (a process).
STEP 2A (Prong 1): Does the claim recite an abstract idea, law of nature, or natural phenomenon?
YES. The claim recites an abstract idea:
The limitation of executing a conversion learning model that is a model of machine learning that converts the input data for learning into learning stage conversion destination data that is a voice signal of a conversion destination, as drafted, is a process that, under its broadest reasonable interpretation, constitutes data gathering but for the recitation of generic computer components.
The limitation of updating the conversion learning model by learning, wherein a probability density function is defined as a target feature amount distribution function, the probability density function being a function on a vector space representing a series of voice feature amounts that are feature amounts obtained from a voice signal and representing a distribution of a series of voice feature amounts of a target voice signal that is a voice signal having a predetermined attribute, as drafted, is a process that, under its broadest reasonable interpretation, recites a mathematical formula or calculation but for the recitation of generic computer components.
The limitation of a point is defined as an initial value point, the point being in the vector space and representing a series of feature amounts of the input data for learning, a function is defined as a score function, the function having a point x in the vector space as an independent variable and indicating a gradient of a path from the point x to a nearest stationary point that is a stationary point on the target feature amount distribution function and is a stationary point nearest to the initial value point, the input data for learning is converted into the learning stage conversion destination data on a basis of the score function in the executing, and the score function in updating the conversion learning model in the updating, as drafted, is a process that, under its broadest reasonable interpretation, recites a mathematical formula or calculation but for the recitation of generic computer components.
STEP 2A (Prong 2): Does the claim recite additional elements that integrate the judicial exception into a practical application?
NO.
Claim 6 recites the additional element of generating and executing through a "model of machine learning," which is recited at a high level of generality and amounts to merely using a computer as a tool to perform an abstract idea, or mere instructions to apply the exception using a generic computer component. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the human mind but for the recitation of generic computer components, then it falls within the "Mental Processes" grouping of abstract ideas. See MPEP 2106.04(d). This judicial exception is not integrated into a practical application.
STEP 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception?
NO.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of using a model of machine learning, processors, and a storage medium amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Claim 6 is not patent eligible.
Claim 5 is analogous to claim 6, being directed to a device comprising a processor that performs the operations set forth in claim 6, and is subject to the same rejection as claim 6.
Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1-3 and 5-7 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (US PgPub. 2020/0365166) in view of Chen, Nanxin, et al., "WaveGrad: Estimating gradients for waveform generation," arXiv preprint arXiv:2009.00713 (2020).
Regarding claim 1, Zhang teaches a voice signal conversion model learning device comprising: a processor; and a storage medium having computer program instructions stored thereon (see Zhang, Fig. 11), wherein the computer program instructions, when executed by the processor, perform processing of: acquiring input data for learning, the input data being a voice signal input (see Zhang, [0005], source speaker data as input (voice signal input)); executing a conversion learning model that is a model of machine learning that converts the input data for learning into learning stage conversion destination data that is a voice signal of a conversion destination (see Zhang, Fig. 3, [0035], describing the conversion of the content embedding to the target style embedding using the autoencoder (machine learning model) style transfer for voice conversion); and updating the conversion learning model by learning, wherein a probability density function is defined as a target feature amount distribution function, the probability density function being a function on a vector space representing a series of voice feature amounts that are feature amounts obtained from a voice signal and representing a distribution of a series of voice feature amounts of a target voice signal that is a voice signal having a predetermined attribute (see Zhang, [0028], describing in Fig. 2 the speech content Z, speaker identity U, and speech X as variables for speech generation using the probability function, and the ideal converter, which is a probability function of the distribution of the speech content for the selected speaker identity (predetermined attribute)), the input data for learning is converted into the learning stage conversion destination data on a basis of the score function in the executing (see Zhang, [0036], discussing speech loss and content reconstruction loss (score function)), and the score function in updating the conversion learning model in the updating (see Zhang, [0037], the loss function to minimize the weighted combination of the self-reconstruction error and the content code reconstruction error 302 shown in Fig. 3 for the training of the autoencoder voice conversion).
However, Zhang fails to teach a point is defined as an initial value point, the point being in the vector space and representing a series of feature amounts of the input data for learning, a function is defined as a score function, the function having a point x in the vector space as an independent variable and indicating a gradient of a path from the point x to a nearest stationary point that is a stationary point on the target feature amount distribution function and is a stationary point nearest to the initial value point; wherein a neural network is defined as a score approximator, the neural network representing a function that includes a parameter θ and in which a result of predetermined optimization processing of updating the parameter θ is substantially identical to a score function, a neural network representing the conversion learning model includes a single piece of the score approximator, and the score function is updated in the updating on a basis of a sum of a plurality of differences included in the score approximator, wherein each of the differences is a difference between a value of the score function and a difference between data of the point x to which noise is added and data of the point x of the space before the noise is added.
However, Chen teaches a point is defined as an initial value point, the point being in the vector space and representing a series of feature amounts of the input data for learning (see Chen, pg. 2: "The Stein score function (Hyvärinen, 2005) is the gradient of the data log-density log p(y) with respect to the datapoint y" (the initial data point); Chen, pg. 3, Figure 2: the WaveGrad directed graphical model for training, conditioned on iteration index, where q(yn+1 | yn) (series of feature points) iteratively adds Gaussian noise to the signal starting from the waveform y0 (initial value point); the diffusion process based on the iteration index and the index-dependent Gaussian noise constitutes the series of feature amounts of the learning data), a function is defined as a score function, the function having a point x in the vector space as an independent variable and indicating a gradient of a path from the point x to a nearest stationary point that is a stationary point on the target feature amount distribution function and is a stationary point nearest to the initial value point (see Chen, pg. 3: "A generative model can be built by training a neural network to learn the Stein score function directly, using Langevin dynamics for inference. The denoising score matching framework relies on a noise distribution to provide support for learning the gradient of the data log density (i.e., q in Equation 3, and N(·, σ) in Equation 4)"), the input data for learning is converted into the learning stage conversion destination data on a basis of the score function in the executing, and the score function in updating the conversion learning model in the updating (see Chen, sections 2.1 and 2.2, teaching Algorithm 1 for training the WaveGrad model, which iteratively adds Gaussian noise to the signal starting from the waveform y0; the inference denoising process progressively removes noise, starting from Gaussian noise yN, akin to Langevin dynamics), wherein a neural network is defined as a score approximator, the neural network representing a function that includes a parameter θ and in which a result of predetermined optimization processing of updating the parameter θ is substantially identical to a score function (see Chen, pg. 2: WaveGrad combines recent techniques from score matching (Song et al., 2020; Song & Ermon, 2020) and diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020) to address conditional speech synthesis; pg. 4, Algorithm 1 for training based on a predefined noise schedule (parameter θ)), a neural network representing the conversion learning model includes a single piece of the score approximator (see Chen, pg. 4, Algorithm 1, which conditions on the continuous noise level √ᾱ drawn from a predefined noise schedule), the score function is updated in the updating on a basis of a sum of a plurality of differences included in the score approximator (see Chen, pg. 4: Ho et al. (2020) proposed to train on pairs (y0, yn) and to reparametrize the neural network to model εθ; this objective resembles denoising score matching as in Equation 3; Chen, pg. 4, Equation 8 discusses the diffusion process (plurality of differences), which is further processed by Equation 10 (denoising score matching) (score approximator)), wherein each of the differences is a difference between a value of the score function and a difference between data of the point x to which noise is added and data of the point x of the space before the noise is added (see Chen, pg. 4, Equation 7 (adding the noise to the initial point yN), Equation 8 (the diffusion process, i.e., the difference), and Equation 10 (denoising score matching, similar to the specification's Equation 14); Chen, Fig. 4 shows the convolution structure).
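For completeness, the denoising score matching objective relied upon above (Chen, Equation 3, after Vincent, 2011) can be sketched in general form; this is the examiner's illustrative restatement, not a quotation from Chen. With Gaussian corruption q_σ(ỹ | y) = N(ỹ; y, σ²I), the gradient of the log noise distribution is −(ỹ − y)/σ², so the objective penalizes the difference between the value of the score approximator and the (scaled) difference between the noisy data point and the data point before the noise is added:

```latex
% Denoising score matching objective (general form, Gaussian corruption)
\mathbb{E}_{p(y)}\,\mathbb{E}_{q_\sigma(\tilde{y} \mid y)}
\left[ \left\| s_\theta(\tilde{y}) + \frac{\tilde{y} - y}{\sigma^{2}} \right\|_2^2 \right]
```

This general form corresponds to the claimed "difference between a value of the score function and a difference between data of the point x to which noise is added and data of the point x of the space before the noise is added."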
Zhang and Chen are considered analogous to the claimed invention because both relate to voice conversion using one or more neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Zhang on zero-shot voice conversion with non-parallel data with the score matching and diffusion probabilistic model techniques of Chen, in order to generate high-fidelity samples using as few as six refinement steps (see Chen, pg. 2).
Regarding claim 2, Zhang in view of Chen teaches the voice signal conversion model learning device according to claim 1. Chen teaches wherein a neural network is defined as a score approximator, the neural network representing a function that includes a parameter θ and in which a result of predetermined optimization processing of updating the parameter θ is substantially identical to a score function (see Chen, pg. 2: WaveGrad combines recent techniques from score matching (Song et al., 2020; Song & Ermon, 2020) and diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020) to address conditional speech synthesis; pg. 4, Algorithm 1 for training based on a predefined noise schedule (parameter θ)), a neural network representing the conversion learning model includes a plurality of the score approximators (see Chen, Fig. 3: the model produces εn at each iteration, which can be interpreted as the direction to update yn), and the score function is updated in the updating on a basis of a sum of differences for the respective score approximators, wherein each of the differences is a difference between a value of the score function and a difference between data of the point x to which noise is added and data of the point x of the space before the noise is added (see Chen, pg. 4, Equation 7 (adding the noise to the initial point yN), Equation 8 (the diffusion process, i.e., the difference), and Equation 10 (denoising score matching, similar to the specification's Equation 14); Chen, Fig. 4 shows the convolution structure). The same motivation to combine as in claim 1 applies here.
Regarding claim 3, Zhang in view of Chen teaches the voice signal conversion model learning device according to claim 2. Chen further teaches that the method for updating the score function on a basis of the sum is weighted denoising score matching (DSM) (see Chen, pg. 4, Equation 10).
Regarding cancelled claim 4, whose limitations are now incorporated into claim 1, Chen further teaches a single piece of the score approximator (see Chen, pg. 4, Algorithm 1, which conditions on the continuous noise level √ᾱ drawn from a predefined noise schedule), and that the score function is updated in the updating on a basis of a sum of a plurality of differences included in the score approximator (see Chen, pg. 4: Ho et al. (2020) proposed to train on pairs (y0, yn) and to reparametrize the neural network to model εθ; this objective resembles denoising score matching as in Equation 3; Chen, pg. 4, Equation 8 discusses the diffusion process (plurality of differences), which is further processed by Equation 10 (denoising score matching) (score approximator)), as addressed in the rejection of claim 1 above.
Regarding claim 5, the claim is directed to performing conversion of the conversion target by using a learned conversion learning model obtained by a voice signal conversion model learning device corresponding to the device of claim 1, without the details regarding the neural network defined as a score approximator per the amendment dated 12/10/2025, and is rejected under the same grounds stated above regarding claim 1. Zhang further teaches a voice signal conversion device comprising: a processor; and a storage medium having computer program instructions stored thereon (see Zhang, Fig. 11), wherein the computer program instructions, when executed by the processor, perform processing of: acquiring a voice signal of a conversion target (see Zhang, [0005], target speaker input speech is received as input data into a target speaker encoder).
Regarding claim 6, the claim is directed to a method corresponding to the device claim presented in claim 5 and is rejected under the same grounds stated above regarding claim 5.
Regarding claim 7, the claim is directed to a non-transitory computer readable medium (Zhang, [0006]) corresponding to the device claim presented in claim 1 and is rejected under the same grounds stated above regarding claim 1.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Kameoka, H., et al., "VoiceGrad: Non-parallel any-to-many voice conversion with annealed Langevin dynamics," arXiv preprint arXiv:2010.02977 (2020) (cited in IDS) teaches a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad (see Kameoka, abstract).
Pang et al., US Patent 12,027,151, teaches linguistic content and speaking style disentanglement models to generate the utterance in the target style (see Pang, Fig. 1).
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NANDINI SUBRAMANI whose telephone number is (571)272-3916. The examiner can normally be reached Monday - Friday 12:00pm - 5:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh M Mehta can be reached at (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NANDINI SUBRAMANI/ Examiner, Art Unit 2656
/BHAVESH M MEHTA/ Supervisory Patent Examiner, Art Unit 2656