DETAILED ACTION
1. This action is responsive to remarks filed 1/16/2026.
Notice of Pre-AIA or AIA Status
2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
3. Claims 1, 8, and 15 have been amended.
Response to Arguments
4. Applicant’s arguments filed 1/16/2026 have been fully considered but are moot in view of the new grounds of rejection necessitated by the amendments, where Li teaches:
Abstract: We present an unsupervised non-parallel many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2. Using a combination of adversarial source classifier loss and perceptual loss, our model significantly outperforms previous VC models. Although our model is trained only with 20 English speakers, it generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion. Using a style encoder, our framework can also convert plain reading speech into stylistic speech, such as emotional and falsetto speech.
Introduction: Furthermore, when trained on a corpus with diverse speech styles, our model shows the ability to convert into stylistic speech, such as converting a plain reading voice into an emotive acting voice and converting a chest voice into a falsetto voice.
2.1: We have adopted the same architecture to voice conversion, treated each speaker as an individual domain, and added a pre-trained joint detection and classification (JDC) F0 extraction network.
Mapping network. The mapping network M generates a style vector hM = M(z,y) with a random latent code z ∈ Z in a domain y ∈ Y. The latent code is sampled from a Gaussian distribution to provide diverse style representations in all domains. The style vector representation is shared for all domains until the last layer, where a domain-specific projection is applied to the shared representation.
Style encoder. Given a reference mel-spectrogram Xref, the style encoder S extracts the style code hsty = S(Xref,y) in the domain y ∈ Y. Similar to the mapping network M, S first processes an input through shared layers across all domains. A domain-specific projection then maps the shared features into a domain-specific style code.
Generator. The generator G converts an input mel-spectrogram Xsrc into G(Xsrc,hsty,hf0) that reflects the style in hsty, which is given either by the mapping network or the style encoder, and the fundamental frequency in hf0, which is provided by the convolution layers in the F0 extraction network F.
And Figure 1:
[media_image1.png: Figure 1 of Li, greyscale]
which still reads on the limitations as currently recited.
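As illustrative technical context only, and not as part of the grounds of rejection, the domain-conditioned style generation described in the cited passages of Li (shared layers followed by a domain-specific projection in the mapping network M) can be sketched as follows. All dimensions, layer choices, and weights below are hypothetical and do not reproduce the reference's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM, SHARED_DIM, STYLE_DIM, NUM_DOMAINS = 16, 32, 8, 20

# Hypothetical shared layer (common to all domains y).
W_shared = rng.standard_normal((SHARED_DIM, LATENT_DIM)) * 0.1
# One domain-specific projection per domain y (the "last layer").
W_domain = rng.standard_normal((NUM_DOMAINS, STYLE_DIM, SHARED_DIM)) * 0.1

def mapping_network(z, y):
    """h_M = M(z, y): shared representation, then domain-y projection."""
    h_shared = np.tanh(W_shared @ z)   # shared across all domains
    return W_domain[y] @ h_shared      # domain-specific style vector

z = rng.standard_normal(LATENT_DIM)    # latent code z sampled from a Gaussian
h_style = mapping_network(z, y=3)
print(h_style.shape)  # (8,)
```

The style encoder S(Xref, y) cited above follows the same shared-then-projected pattern, with a reference mel-spectrogram in place of the latent code.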
Claim Rejections - 35 USC § 102
5. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
6. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
7. Claims 1-2 and 5 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Li et al, “StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion” (provided in IDS).
Regarding claim 1 Li teaches A method of voice conversion (abstract; introduction; 2.1: voice conversion), the method comprising:
receiving, via a data interface, a source utterance of a first style and a target utterance of a second style (figure 1: Xsrc is the source input; encoder; Xref is the reference input that contains the style information; style encoder – corresponding to target; 2.1 Generator: the generator converts an input mel-spectrogram into G that reflects the style…which is given either by the mapping network or the style encoder; Style encoder: given a reference mel-spectrogram…the style encoder extracts the style code);
generating, via a first encoder, a vector representation of the target utterance, the first encoder comprising a pretrained speaker encoder that outputs a target-speaker embedding (fig 1 style encoder; pg 1350 style encoder: style code hsty; abstract; intro 2nd col 1st para; 2.1; 2.1:mapping network);
generating, via a second encoder, a vector representation of the source utterance (fig 1 encoder – hx; 2.1 Generator: converts an input mel-spectrogram Xsrc into G(Xsrc, hsty, hf0));
generating, via a filter generator, a generated filter based on the vector representation of the target utterance, the filter generator being configured to transform the target-speaker embedding into a dynamic style embedding (section 2 Method: Generator: generator converts an input…into G that reflects the style (of the target); fig 1 generator); and
generating, via a decoder, a generated utterance based on the vector representation of the source utterance and the generated filter, the decoder being conditioned by the dynamic style embedding to apply the second style to the source utterance (figure 1: decoder, X^; decoder which receives source, reference/target, and frequency information at generator).
Regarding claim 2 Li teaches The method of claim 1, wherein the generating the generated utterance includes applying the generated filter to an adaptive instance normalization layer of the decoder (figure 1: injected into the decoder by the adaptive instance normalization (AdaIN)).
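As illustrative technical context only, and not as part of the grounds of rejection, the adaptive instance normalization (AdaIN) cited for claim 2 can be sketched as follows: per-channel normalization of content features, then a scale and shift derived from the style. All names, shapes, and values are hypothetical:

```python
import numpy as np

def adain(content, gamma, beta, eps=1e-5):
    """Adaptive instance normalization: normalize content features per
    channel over time, then scale/shift with style-derived gamma/beta."""
    mean = content.mean(axis=-1, keepdims=True)   # per-channel mean
    std = content.std(axis=-1, keepdims=True)     # per-channel std
    normalized = (content - mean) / (std + eps)
    return gamma[:, None] * normalized + beta[:, None]

rng = np.random.default_rng(1)
feats = rng.standard_normal((4, 10))   # (channels, time) content features
gamma = rng.standard_normal(4)         # style-derived scale per channel
beta = rng.standard_normal(4)          # style-derived shift per channel
out = adain(feats, gamma, beta)
print(out.shape)  # (4, 10)
```

In this sketch the claimed "generated filter" would correspond to the (gamma, beta) pair injected into the decoder's normalization layer.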
Regarding claim 5 Li teaches The method of claim 1, further comprising:
generating, via a discriminator, a first prediction of real or fake based on the generated utterance or the source utterance (figure 1: the discriminators that determine whether a generated sample is real or fake; Discriminators: The discriminator D in [10] has shared layers that learns the common features between real and fake samples in all domains);
computing a first loss function based on the first prediction and an indication of real or fake (1350 Adversarial loss: generator takes an input…and style vector s and learn to generate a new mel-spectrogram via the adversarial loss…where D(.,y) denotes the output of real/fake classifier for the domain); and
updating parameters of at least one of the filter generator, the second encoder, or the decoder based on the first loss function (2.2 Training Objectives: we train our model with the following loss functions; Full Objective).
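As illustrative technical context only, and not as part of the grounds of rejection, the adversarial real/fake prediction, loss computation, and parameter update cited for claim 5 can be sketched as follows. The discriminator score, learning rate, and gradient value are hypothetical placeholders:

```python
import numpy as np

def bce(prediction, is_real):
    """Binary cross-entropy between a discriminator's real/fake score
    in (0, 1) and the real/fake indication."""
    label = 1.0 if is_real else 0.0
    p = np.clip(prediction, 1e-7, 1 - 1e-7)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Hypothetical discriminator score for a generated utterance.
d_fake = 0.3
loss_gen = bce(d_fake, is_real=True)   # generator wants fakes judged real
# One illustrative gradient step on a hypothetical decoder parameter.
theta, lr, grad = 0.5, 0.01, 0.2
theta -= lr * grad
print(round(loss_gen, 4), theta)  # 1.204 0.498
```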
Claim Rejections - 35 USC § 103
8. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
9. Claims 3-4, 8-12, 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Li in view of Zhao et al (2021/0082438).
Regarding claim 3, Li does not specifically teach, but Zhao teaches, The method of claim 1, wherein the generated filter includes a weight vector and a bias vector (0058: weights and bias vectors).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Zhao’s network with attentive pooling and weight and bias vectors for more efficient representation, classification, and generation of the utterance/speech and speaker features using the target and source information.
Li teaches voice conversion, and the use of networks and loss functions for continuous training and network/model optimization. One could thus look to Zhao to further incorporate additional network information to modify parameters…and repeat the process until the training objectives are sufficiently met (0064), and in which all components are jointly optimized using a … loss function (0066).
Regarding claim 4, Li does not specifically teach, but Zhao teaches, The method of claim 3, wherein:
the weight vector is generated via a first attentive pooling model based on the vector representation of the target utterance (0030-31; 58), and
the bias vector is generated via a second attentive pooling model based on the vector representation of the target utterance (0019: An attentive pooling layer is employed to extract speaker-discriminative information from the augmented input features based on learned attention weights and to generate a speaker-discriminative embedding.; 0030-31; 58).
Rejected for similar rationale and reasoning as claim 3.
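As illustrative technical context only, and not as part of the grounds of rejection, the attentive pooling cited from Zhao (a weighted average of frame-level features using learned attention weights, yielding a speaker-discriminative embedding) can be sketched as follows. All names, shapes, and parameters are hypothetical:

```python
import numpy as np

def attentive_pooling(frames, w_att):
    """Pool frame-level features into one utterance-level embedding
    using softmax attention weights over the frames."""
    scores = frames @ w_att                 # one attention score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over frames
    return weights @ frames                 # weighted average of frames

rng = np.random.default_rng(2)
frames = rng.standard_normal((50, 16))  # (time, feature) reference frames
w_att = rng.standard_normal(16)         # hypothetical attention parameter
embedding = attentive_pooling(frames, w_att)
print(embedding.shape)  # (16,)
```

In this sketch, two such pooling models with separate parameters would produce the claimed weight vector and bias vector from the same target-utterance representation.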
Regarding claim 8 Li teaches
A system for voice conversion, the system comprising:
a memory that stores a plurality of processor executable instructions;
a data interface that receives a source utterance of a first style and a target utterance of a second style; and
one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising:
generating, via a first encoder, a vector representation of the target utterance, the first encoder comprising a pretrained speaker encoder that outputs a target-speaker embedding;
generating, via a second encoder, a vector representation of the source utterance;
generating, via a filter generator, a generated filter based on the vector representation of the target utterance, the filter generator being configured to transform the target-speaker embedding into a dynamic style embedding; and
generating, via a decoder, a generated utterance based on the vector representation of the source utterance and the generated filter, the decoder being conditioned by the dynamic style embedding.
Rejected for similar rationale and reasoning as claim 1.
While Li necessarily incorporates such components for execution, for purposes of advancing prosecution, Li does not explicitly teach, but Zhao teaches, the system:
A system for voice conversion, the system comprising:
a memory that stores a plurality of processor executable instructions;
a data interface; and
one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations (fig 12; 0085-0089).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate the system to allow for execution of voice conversion.
Claim 9 recites limitations similar to claim 2 and is rejected for similar rationale and reasoning.
Claim 10 recites limitations similar to claim 3 and is rejected for similar rationale and reasoning.
Claim 11 recites limitations similar to claim 4 and is rejected for similar rationale and reasoning.
Claim 12 recites limitations similar to claim 5 and is rejected for similar rationale and reasoning.
Regarding claim 15 Li and Zhao teach A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising:
receiving, via a data interface, a source utterance of a first style and a target utterance of a second style;
generating, via a first encoder, a vector representation of the target utterance;
generating, via a second encoder, a vector representation of the source utterance;
generating, via a filter generator, a generated filter based on the vector representation of the target utterance, the filter generator being configured to transform the target-speaker embedding into a dynamic style embedding; and
generating, via a decoder, a generated utterance based on the vector representation of the source utterance and the generated filter, the decoder being conditioned by the dynamic style embedding.
Claim 15 recites limitations similar to claim 8 and is rejected for similar rationale and reasoning.
Claim 16 recites limitations similar to claim 2 and is rejected for similar rationale and reasoning.
Claim 17 recites limitations similar to claim 3 and is rejected for similar rationale and reasoning.
Claim 18 recites limitations similar to claim 4 and is rejected for similar rationale and reasoning.
Claim 19 recites limitations similar to claim 5 and is rejected for similar rationale and reasoning.
10. Claims 6-7 are rejected under 35 U.S.C. 103 as being unpatentable over Li in view of Suzuki et al (2021/0280169).
Regarding claim 6 Li teaches The method of claim 5, further comprising:
generating, via a source classifier, a second prediction of utterance source based on the generated utterance (figure 1: Two classifiers form the discriminators that determine whether a generated sample is real or fake and who the source speaker of X is);
computing a second loss function based on the second prediction, the second loss function [being an additive angular margin loss] (Section 2.2 loss functions); and
updating parameters of at least one of the filter generator, the second encoder, or the decoder further based on the second loss function (2.2 Training Objectives: we train our model with the following loss functions; Full Objective);
However, Li does not specifically teach an additive angular margin loss, where Suzuki teaches the additive angular margin loss (0027: an Additive Angular Margin Loss (ArcFace) method is adapted as a loss function).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate the loss for improved optimization, with a reasonable expectation of success. Li already teaches determining loss for training and optimizing the model, and one could look to Suzuki to implement the specific loss function for improved optimization while still allowing for model training and optimization.
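As illustrative technical context only, and not as part of the grounds of rejection, the additive angular margin (ArcFace-style) loss cited from Suzuki penalizes by adding a margin m to the angle between an embedding and its class weight before rescaling. A minimal sketch of the margin-adjusted logit, with hypothetical margin, scale, and vectors:

```python
import numpy as np

def additive_angular_margin_logit(embedding, class_weight,
                                  margin=0.5, scale=30.0):
    """ArcFace-style logit: s * cos(theta + m), where theta is the angle
    between the L2-normalized embedding and its class weight."""
    e = embedding / np.linalg.norm(embedding)
    w = class_weight / np.linalg.norm(class_weight)
    theta = np.arccos(np.clip(e @ w, -1.0, 1.0))
    return scale * np.cos(theta + margin)

rng = np.random.default_rng(3)
emb = rng.standard_normal(8)   # hypothetical source-classifier embedding
w = rng.standard_normal(8)     # hypothetical class-weight vector
logit = additive_angular_margin_logit(emb, w)
print(abs(logit) <= 30.0)  # True: |s * cos(.)| <= s
```

The margin-adjusted logits would then feed an ordinary softmax cross-entropy, making same-class angles harder to satisfy and thus more discriminative.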
Regarding claim 7 Li teaches The method of claim 6, wherein the vector representation of the source utterance is a first vector representation of the source utterance, further comprising:
generating, via a pretrained encoder model, a second vector representation of the source utterance (figure 2);
computing a third loss function based on a comparison of the first and second vector representations of the source utterance (Adversarial source classifier loss); and
updating parameters of at least one of the filter generator, the second encoder, or the decoder further based on the third loss function (2.2 Training Objectives: we train our model with the following loss functions; Full Objective).
11. Claims 13-14, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Li in view of Zhao et al (2021/0082438) in further view of Suzuki.
Claims 13-14, 20 recite limitations similar to claims 6-7 and are rejected for similar rationale and reasoning.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541. The examiner can normally be reached Monday-Friday 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached on 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SHAUN ROBERTS/Primary Examiner, Art Unit 2655