Last updated: May 29, 2026

Application No. 18/794,588

VOICE DATA PROCESSING METHOD, APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM

Non-Final OA §102§103

Filed

Aug 05, 2024

Priority

Aug 21, 2023 — CN 202311058124.0

Examiner

PATEL, SHREYANS A

Art Unit

2659

Tech Center

2600 — Communications

Assignee

Lemon Inc.

OA Round

1 (Non-Final)

Interview Optional

— +7.7% interview lift. Interview lift (+7.7%) is below the 15.0% threshold. A written response is recommended.

Based on 406 resolved cases, 2023–2026

Examiner Intelligence

PATEL, SHREYANS A View full profile →

Grants 89% — above average

Career Allowance Rate

361 granted / 406 resolved

+26.9% vs TC avg

Moderate +8% lift

Without

With

+7.7%

Interview Lift

resolved cases with interview

Fast prosecutor

2y 0m

Avg Prosecution

26 currently pending

Career history

449

Total Applications

across all art units

Statute-Specific Performance

§101

11.0%

-29.0% vs TC avg

§103

67.1%

+27.1% vs TC avg

§102

11.6%

-28.4% vs TC avg

§112

0.8%

-39.2% vs TC avg

Black line = Tech Center average estimate • Based on career data from 406 resolved cases

Office Action

§102 §103

DETAILED ACTION

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1, 4, 8-9, 12, 16-17 and 20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Wang et al. (“Non-Parallel Voice Conversion for ASR Augmentation”; Sept. 15, 2022).

Claims 1, 9 and 17,
Wang teaches a method of processing voice data, comprising: obtaining voice data to be processed ([1.] [pg. 1] convert input speech), and inputting the voice data to be processed into a pre-trained first voice processing model for feature extraction to obtain feature data to be processed corresponding to the voice data to be processed ([1.] [2.2] [pg. 1] initialize the VC encoder with a pre-trained ASR encoder; pre-trained speech encoder to extract representation); 
inputting the feature data to be processed into a trained second voice processing model for reprocessing to obtain discretized feature data corresponding to the voice data to be processed ([Fig. 1] [3.] [VQVAE Bottleneck] VQVAE Bottleneck is used to quantize encoder outputs into discrete codebook entries), 
wherein the second voice processing model comprises a feature encoder ([3.] [pg. 2] [Fig. 1] architecture consisting of an encoder) and a vector quantizer connected to the feature encoder ([3.] [Fig. 1] [pg. 2] VQVAE Bottleneck used to quantize encoder outputs), 
the second voice processing model is obtained by training a model to be trained that is pre-created based on sample feature data corresponding to sample voice data ([3.] [pg. 2] [Objective Function] we train on non-parallel data, utilize speech features; log-mel features are used as input and target), and 
the model to be trained comprises the second voice processing model and a feature decoder connected to the vector quantizer in the second voice processing model ([Fig. 1] [pg. 2] [3.] architecture consisting of an encoder, a bottleneck layer and a non-auto-regressive decoder; reconstruction loss is computed on the decoder; training the decoder, VQVAE).

Claims 4, 12 and 20,
Wang further teaches the method of claim 1, wherein before inputting the feature data to be processed into the trained second voice processing model for reprocessing, the method further comprises: obtaining the sample voice data, and inputting the sample voice data into the pre-trained first voice processing model for feature extraction to obtain the sample feature data corresponding to the sample voice data ([3.] sub-sampling convolution layer that reduces the time dimension of the input speech by 4x; a frozen encoder, where the weights are obtained from an ASR mode trained on 960 hours of Librispeech; feature extraction result is evidenced by given encoder outputs menc(x)); 
inputting the sample feature data into the second voice processing model in the model to be trained that is pre-created for re-encoding to obtain prediction feature data corresponding to the sample voice data, wherein the model to be trained comprises the feature encoder and the vector quantizer ([3.] VQVAE bottleneck is used to quantize encoder outputs into discrete codebook entries during auto-encoding training); 
inputting the prediction feature data into the feature decoder in the model to be trained for data reconstruction processing to obtain voice reconstruction data corresponding to the prediction feature data ([3.] a decoder receiving encoder output (through the bottleneck) and training reconstruction a the decoder (decoder speaker embedding concatenated with encoder output to be fed to the decoder) and reconstruction loss is computed on the decoder); 
optimizing the second voice processing model in the model to be trained based on the voice reconstruction data, the prediction feature data and the sample feature data, to obtain the second voice processing model ([3.] optimization using reconstruction (target vs predicted) and VQ losses (during VC training, the same log-mel features are used as input and target; the model is jointly optimized by minimizing Ltotal); see Lcodebook and Lcommit equations).

Claims 8 and 16,
Wang further teaches the method of claim 1, wherein the first voice processing model comprises at least one of encoder of: a HuBERT model, a data2vec model, a wav2vec model, or a Whisper model ([Abstract] ASR encoder within the VC model).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 2, 10 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (“Non-Parallel Voice Conversion for ASR Augmentation”; Sept. 15, 2022) and further in view of Ling et al. (“Decoar 2.0…”; Dec. 11, 2020).

Claims 2, 10 and 18,
Gray further teaches the method of claim 1, wherein inputting the feature data to be processed into the trained second voice processing model for reprocessing to obtain the discretized feature data corresponding to the voice data to be processed comprises: inputting the feature data to be processed into a trained feature encoder for encoding processing to obtain a plurality of feature data to be quantized corresponding to the voice data to be processed ([3.] [VQVAE Bottleneck] given encoder outputs m_enc(x) (encoder outputs provided for downstream quantization); VQVAE Bottleneck is used to quantize encoder outputs into discrete codebook entries); and 
The difference between the prior art and the claimed invention is that Gray does not explicitly teach wherein the number of output channels of the trained feature encoder is the same as the number of candidate feature clusters in the vector quantizer; and inputting the feature data to be quantized into the vector quantizer to convert feature data to be quantized for each output channel into a cluster identifier of a candidate feature cluster corresponding to each output channel based on a code table stored in the vector quantizer, and using the converted feature data to be quantized as the discretized feature data corresponding to the voice data to be processed.
Ling further teaches wherein the number of output channels of the trained feature encoder is the same as the number of candidate feature clusters in the vector quantizer ([3.2] a set of codebooks C1,··· ,CG where G is the number of codebooks, V entries in each codebook, we select one variable from each codebook and stack the resulting vectors followed by a linear transformation to obtain new representation v; we map the encoder output z to logits l ∈ RG× V via a linear layer); and 
inputting the feature data to be quantized into the vector quantizer to convert feature data to be quantized for each output channel into a cluster identifier of a candidate feature cluster corresponding to each output channel based on a code table stored in the vector quantizer ([3.2] selecting one entry from a fixed-size codebook C = = {c1,··· ,cV }; in inference, the index with largest value in logits l is selected from each codebook), and 
using the converted feature data to be quantized as the discretized feature data corresponding to the voice data to be processed ([3.2] map it to a new representation from a fixed size codebook).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Wang with teachings of Ling by modifying non-parallel voice conversion for ASR augmentation as taught by Wang to include wherein the number of output channels of the trained feature encoder is the same as the number of candidate feature clusters in the vector quantizer; and inputting the feature data to be quantized into the vector quantizer to convert feature data to be quantized for each output channel into a cluster identifier of a candidate feature cluster corresponding to each output channel based on a code table stored in the vector quantizer, and using the converted feature data to be quantized as the discretized feature data corresponding to the voice data to be processed as taught by Ling for the benefit of combining the reconstructive loss with vector quantization diversity loss to train speech representations (Ling [Abstract]).

Claim(s) 3, 11 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (“Non-Parallel Voice Conversion for ASR Augmentation”; Sept. 15, 2022) and further in view of Defossez et al. (“High Fidelity Neural Audio Compression”; Oct. 24, 2022).

Claims 3, 11 and 19,
Wang further teaches the method of claim 1, wherein the feature encoder comprises an encoder input convolutional layer, and at least one encoding block connected to the encoder input convolutional layer ([3.] the encoder consists of sub-sampling convolution layer).
The difference between the prior art and the claimed invention is that an encoder output convolutional layer connected to the last encoding block, and each encoding block comprises at least one residual unit and a unit output convolutional layer connected to the last residual unit.
Defossez teaches an encoder output convolutional layer connected to the last encoding block, and each encoding block comprises at least one residual unit and a unit output convolutional layer connected to the last residual unit ([3.2] [pg. 2] convolution blocks are followed by a final 1D convolution layer; each convolution block is composed on a single residual unit; residual unit followed by strided convolution).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Wang with teachings of Ling by modifying non-parallel voice conversion for ASR augmentation as taught by Wang to include an encoder output convolutional layer connected to the last encoding block, and each encoding block comprises at least one residual unit and a unit output convolutional layer connected to the last residual unit as taught by Defossez for the benefit of introducing a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks (Ling [Abstract]).

Claim(s) 7 and 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (“Non-Parallel Voice Conversion for ASR Augmentation”; Sept. 15, 2022) and further in view Oord et al (“Neural Discrete Representation Learning”; 2017).

Claims 7 and 15,
Wang teaches all the limitations in claim 4. The difference between the prior art and the claimed invention is that Wang does not explicitly teach determining an encoding loss corresponding to the feature encoder of the second voice processing model in the model to be trained based on the voice reconstruction data and the feature encoding data, and optimizing parameters to be optimized of the feature encoder based on the encoding loss; and determining a quantization loss corresponding to the vector quantizer in the model to be trained based on the prediction feature data and the feature encoding data, and optimizing the parameters to be optimized of the vector quantizer based on the quantization loss.
Oord teaches determining an encoding loss corresponding to the feature encoder of the second voice processing model in the model to be trained based on the voice reconstruction data and the feature encoding data, and optimizing parameters to be optimized of the feature encoder based on the encoding loss ([3.2] [eq. 3] the decoder optimizes the first loss term only, the encoder optimizes the first and last loss terms); and 
determining a quantization loss corresponding to the vector quantizer in the model to be trained based on the prediction feature data and the feature encoding data, and optimizing the parameters to be optimized of the vector quantizer based on the quantization loss ([3.2] assigning the middle (codebook/quantization) terms to the embeddings (i.e. code table/vector quantizer); the embeddings are optimized by the middle loss term).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Wang with teachings of Oord by modifying non-parallel voice conversion for ASR augmentation as taught by Wang to include determining an encoding loss corresponding to the feature encoder of the second voice processing model in the model to be trained based on the voice reconstruction data and the feature encoding data, and optimizing parameters to be optimized of the feature encoder based on the encoding loss; and determining a quantization loss corresponding to the vector quantizer in the model to be trained based on the prediction feature data and the feature encoding data, and optimizing the parameters to be optimized of the vector quantizer based on the quantization loss as taught by Oord for the benefit of generate high quality images, videos, and speech (Oord [Abstract]).

Allowable Subject Matter
Claims 5-6 and 13-14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims AND overcoming the 101 Abstract Idea set forth.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Lim et al. (US 11,664,037) – Methods of encoding and decoding a speech signal using a neural network model that recognizes sound sources, and encoding and decoding apparatuses for performing the methods are provided. A method of encoding a speech signal includes identifying an input signal for a plurality of sound sources; generating a latent signal by encoding the input signal; obtaining a plurality of sound source signals by separating the latent signal for each of the plurality of sound sources; determining a number of bits used for quantization of each of the plurality of sound source signals according to a type of each of the plurality of sound sources; quantizing each of the plurality of sound source signals based on the determined number of bits; and generating a bitstream by combining the plurality of quantized sound source signals.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Primary Examiner
Art Unit 2653



/SHREYANS A PATEL/               Examiner, Art Unit 2659

Read full office action

Prosecution Timeline

Aug 05, 2024

Application Filed

Feb 23, 2026

Non-Final Rejection mailed — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/132,165

Patent 12608559

METHOD AND SYSTEM FOR ENHANCING A MUTIMODAL INPUT CONTENT

3y 0m to grant Granted Apr 21, 2026

18/696,802

Patent 12609128

METHOD FOR IMPROVING FAR-FIELD SPEECH INTERACTION PERFORMANCE, AND FAR-FIELD SPEECH INTERACTION SYSTEM

2y 0m to grant Granted Apr 21, 2026

17/934,906

Patent 12586597

ENHANCED AUDIO FILE GENERATOR

3y 6m to grant Granted Mar 24, 2026

18/744,449

Patent 12586561

TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM, AND A METHOD OF CALCULATING AN EXPRESSIVITY SCORE

1y 9m to grant Granted Mar 24, 2026

17/983,671

Patent 12548549

ON-DEVICE PERSONALIZATION OF SPEECH SYNTHESIS FOR TRAINING OF SPEECH RECOGNITION MODEL(S)

3y 3m to grant Granted Feb 10, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.

Typically takes 5-10 seconds — AI-generated, attorney review required before filing

Prosecution Projections

1-2

Expected OA Rounds

89%

Grant Probability

97%

With Interview (+7.7%)

2y 0m (~2m remaining)

Median Time to Grant

Low

PTA Risk

Based on 406 resolved cases by this examiner. Grant probability derived from career allowance rate.