Last updated: May 29, 2026

Application No. 18/438,225

ELECTRONIC DEVICE FOR IDENTIFYING SYNTHETIC VOICE AND CONTROL METHOD THEREOF

Final Rejection §103

Filed

Feb 09, 2024

Priority

Dec 13, 2022 — RE 10-2022-0174206 +2 more

Examiner

MCLEAN, IAN SCOTT

Art Unit

2654

Tech Center

2600 — Communications

Assignee

Samsung Electronics Co., Ltd.

OA Round

2 (Final)

Interview Optional

— +31.4% interview lift. Interview already conducted in this application's prosecution history. This examiner has a 45% grant rate with +31.4% interview lift. Since an interview has already been tried, recommend written response with narrowed claims based on precedent claim evolution patterns.

Based on 47 resolved cases, 2023–2026

Examiner Intelligence

MCLEAN, IAN SCOTT View full profile →

Grants 45% of resolved cases

Career Allowance Rate

21 granted / 47 resolved

-17.3% vs TC avg

Strong +31% interview lift

Without

With

+31.4%

Interview Lift

resolved cases with interview

Typical timeline

3y 0m

Avg Prosecution

22 currently pending

Career history

Total Applications

across all art units

Statute-Specific Performance

§103

88.4%

+48.4% vs TC avg

§102

11.6%

-28.4% vs TC avg

Black line = Tech Center average estimate • Based on career data from 47 resolved cases

Office Action

§103

Notice of Pre-AIA  or AIA  Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Response to Arguments
2.	Applicant's arguments filed 12/30/2025 have been fully considered but they are not persuasive. 

a.	Applicant’s argument that Lakhdhar fails to teach obtaining multiple segmentation voices from different sample voices  is not persuasive. Lakhdhar explicitly discloses parsing audio into multiple subframes and providing each subframe as an input to the feature extraction block (Lakhdhar ¶[0058]). The reference imposes no limitation on how many segments originate from any given sample voice. Selecting two segments from one sample and one from another represents a choice within the disclosed segmentation framework.

b.	Applicant’s reliance on ¶[0103] of the present a specification is not persuasive. The claims do not recite any limitation requiring segments to possess distinct characteristics beyond the differences from the origin of their subframe, naturally speaking, subframes all contain different characteristics, hence the need to take multiple subframes from an acoustic sample. Therefore, the claimed segmentation arrangement remains a variation that is anticipated by Lakhdhar’s subframe processing

c.	Applicant further argues that Lakhdhar only discloses triplet loss. Lakhdhar’s  ¶[0066] discloses triplet architecture already computes similarity relationships among multiple vectors. Gopala teaches training an emotion classifier using feature vectors and corresponding loss functions. Combining an emotion based classification loss with similarity based training is a predictable use of known training objectives within neural network training and would have been obvious to one of ordinary skill in the art.

d.	Applicant further asserts that Gopala fails to disclose the specific configuration of using multiple feature vectors to obtain predicted emotions and update classifier weights is not persuasive. Gopala ¶[0040]  and ¶[0137]-[0138] teaches inputting feature vectors into an emotion classifier and associated loss functions. The recited sequence of generating predictions, computing loss and updating weights represents conventional supervised learning operations.

e.	Applicant’s arguments directed to dependent claims 7-8 and 10-18 rely on alleged patentability of claim 1. Because claim 1 remains unpatentable for the reasons set forth above, the dependent claims are maintained.

Claim Rejections - 35 USC § 103
3.	In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

4.	Claims 1, 4-6, 9, 11-12, 15-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lakhdhar (US 2020/0322377) in view of Gopala (US 2021/0074305).

Regarding Claim 1:
Lakhdhar discloses an electronic device comprising: 
a microphone (Lakhdhar: p[0046] discloses a microphone);
and at least one processor (Lakhdhar: p[0030] ) configured to: 
based on receiving voice data through the microphone (Lakhdhar: p[0046] discloses receiving a voice),
input the voice data into a non-semantic feature extractor model and acquire a non-semantic feature included in the voice data using the non-semantic feature extractor model (Lakhdhar: p[0052] discloses the normalized audio signal is sent to a CONV-Net Feature Extraction Block model herein interpreted as a non-semantic feature extractor model, in some embodiments such as p[0066]-[0077] its disclosed that the convolutional network feature extractor block outputs a vector representation of the audio signal, these are non-semantic vectors because the goal of this process as disclosed in p[0007]-[0008] is to infer spoofed or genuine audio through acoustic features, not linguistic meaning), 
input the non-semantic feature into a synthetic voice classifier model, and classify the voice data into a synthetic voice or a user voice using the synthetic voice classifier model (Lakhdhar: Fig. 2A the output of the extraction block is placed into a classifier), 
and provide a result of the classification (Lakhdhar: Fig. 2A outputs a classification result (i.e. spoofed or genuine)), 
wherein the non-semantic feature comprises a feature vector corresponding to the voice data (Lakhdhar: ¶[0058] and ¶[0066] discloses segmenting/slicing audio into multiple subframes and using them as inputs to the feature extractor. Lakhdhar states the system can parse/slice PCEN audio signals into subframes and provide them to the feature extraction block and that each subframe can form an input to the feature extraction block, these are feature vectors), wherein the at least one processor is further configured to:
acquire a first sample user voice and a second sample user voice among a plurality of sample user voices, acquire a first segmentation voice and a second segmentation voice from the first sample user voice, acquire a third segmentation voice from the second sample user voice, input each of the first to third segmentation voices into the non-semantic feature extractor model and acquire first to third feature vectors corresponding to the first to third segmentation voices (Lakhdhar: ¶[0066] discloses using three parallel networks inputs during training, consistent with feeding first/second/third segmented inputs into a feature extractor to produce representations, triplet network architecture 412/422/432, each producing an embedding vector feature),
acquire an emotion classification loss based on the first feature vector and the second feature vector among the first to third feature vectors and acquire a similarity loss based on the first to third feature vectors, and update the non-semantic feature extractor model based on the emotion classification loss and the similarity loss (Lakhdhar: ¶[0066]-[0069] expressly uses the triplet network architecture for training, triplet loss uses distance based objective over the three embeddings (anchor/positive/negative) ¶[0076] specifically states after a given batch of training samples are processed, a loss function may be calculated based on the respective outputs 414, 424, 434 of the first, second, and third feed-forward neural networks 412, 422, 432. The computed loss function may be used to train the respective neural networks 412, 422, 432 of the feature extraction block), and wherein the at least one processor is further configured to:


	Lakhdhar does not explicitly disclose:
wherein the synthetic voice classifier model is a model that is transfer-learned based on the non-semantic feature extractor model;
input the first feature vector and the second feature vector into an emotion classifier and acquire a first predicted emotion corresponding to the first feature vector and a second predicted emotion corresponding to the second feature vector,
acquire the emotion classification loss based on the first predicted emotion, the second predicted emotion, and a first true emotion corresponding to the first sample user voice, and update the emotion classifier based on a weight corresponding to the emotion classification loss.
However, Gopala discloses:
wherein the synthetic voice classifier model is a model that is transfer-learned based on the non-semantic feature extractor model (Gopala: ¶[0081] and ¶[0143] discloses that transfer learning may be used on classifiers that determine if audio is synthetic, base models can be used to transfer train a new model for a similar task with slightly different goals);
input the first feature vector and the second feature vector into an emotion classifier and acquire a first predicted emotion corresponding to the first feature vector and a second predicted emotion corresponding to the second feature vector (Gopala: ¶[0137]-[0138] discloses audio features may be extracted by audio feature extractors from voice clips, and that a gated recurrent unit may provide representation over time, these are feature vectors/embeddings. The segmentation voices are processed by the neural network pipeline to acquire corresponding feature vectors/representation),
acquire the emotion classification loss based on the first predicted emotion, the second predicted emotion, and a first true emotion corresponding to the first sample user voice, and update the emotion classifier based on a weight corresponding to the emotion classification loss (Gopala: ¶[0040] teaches determining emotional state outputs, ¶[0078]-[0079] discloses training models using supervised learning and training data which is based off the loss from ground truth labels. Gopala further teaches generating multiple learned representations from voice clips via fusion in ¶[0137]-[0138], which would be optimized using a similarity objective across representations (similarity loss))
Lakhdhar in view of Gopala are combinable because they are both from the same field of endeavor, identifying fake audio. Both disclose methods for determining whether fake audio content was generated. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to disclose a transfer-learned classifier trained on a previous classifier model and classifying emotional states from audio subframe vectors. The motivation for doing so is “Because there are lexical and phonetic invariants within a language family, the structure of language can be exploited. For example: German is an agglutinative language. In transfer learning, good representations (such as nonlinear combinations of input features) may be used as seeds to accelerate training” as disclosed in paragraph [0143] of Gopala and “the audio analysis techniques may provide an improved user experience when listening to audio content or viewing images and videos that include associated audio content.” In ¶[0053] of Gopala.

Regarding Claim 4:
The combination of Lakhdhar and Gopala further discloses the electronic device of claim 1, wherein the first predicted emotion and the second predicted emotion are identical (Gopala: p[0140] discloses classifying emotional states based on speech. When two inputs (two feature vectors) correspond to audio expressing the same emotional state, it is expected and routine for the classifier to output the same predicted label, this is a natural result of running two highly similar identical feature vectors through the same classifier model, which would generate the same output class when they both represent the same underlying emotion).
It would have been obvious to one of ordinary skill in the art to disclose that the first predicted emotion and the second predicted emotion are identical as disclosed in Gopala. Lakhdhar teaches feature extraction and a framework for generating feature vectors, Gopala provides the emotion classifier and explicitly describes how emotional states are predicted from audio. When both feature vectors represent the same emotional state, Gopala’s classifier would naturally output identical predictions. This is an expected result and a design choice with no unexpected technical benefit. Therefore, it would have been obvious to combine Lakhdhar and Gopala.

Regarding Claim 5:
The combination of Lakhdhar and Gopala further discloses the electronic device of claim 1, wherein the at least one processor is further configured to: acquire the similarity loss based on distance information among the first to third feature vectors (Lakhdhar: p[0066] directly corresponds to calculating a similarity loss based on distance information among three feature vectors. Lakhdhar explicitly discloses three-segment configurations and computing distances to determine loss values), and update the non-semantic feature extractor model based on a weight corresponding to an aggregation of the emotion classification loss and the similarity loss (Lakhdhar: p[0077] discloses updating the model weights using its loss function, this is standard supervised learning).

Regarding Claim 6:
The combination of Lakhdhar and Gopala further discloses the combination of Lakhdhar and Gopala further discloses the electronic device of claim 1, wherein the at least one processor is further configured to: transfer train the synthetic voice classifier model based on the non-semantic feature extractor model and a loss function (Gopala: p[0081]-[0086] explicitly teaches transfer learning, where an initial model trained on a general task is retrained/fine-tuned for a new task using a smaller dataset).
Lakhdhar in view of Gopala are combinable because they are both from the same field of endeavor, identifying fake audio. Both disclose methods for determining whether fake audio content was generated. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to disclose a transfer-learned classifier trained on a previous classifier model. The motivation for doing so is “Because there are lexical and phonetic invariants within a language family, the structure of language can be exploited. For example: German is an agglutinative language. In transfer learning, good representations (such as nonlinear combinations of input features) may be used as seeds to accelerate training” as disclosed in paragraph [0143] of Gopala.




Regarding Claim 9:
The combination of Lakhdhar and Gopala further disclose the electronic device of claim 1, wherein the synthetic voice classifier model is further configured to: output a probability that the voice data is included in the synthetic voice, and wherein the at least one processor is further configured to: based on the probability exceeding a threshold probability, classify the voice data as the synthetic voice (Lakhdhar: p[0052] discloses a score (probability) which can be compared to a threshold to determine if it is genuine or synthetic audio).

Regarding Claim 11:
 The combination of Lakhdhar and Gopala further disclose the electronic device of claim 1, 
wherein the non-semantic feature comprises a feature vector corresponding to the voice data (Lakhdhar: p[0066]-[0077] discloses that the network outputs feature vectors based on the audio input, These are non-semantic because they represent acoustic characteristics rather than linguistic meaning (signal level features such as pitch, frequency and energy patterns), and wherein the at least one processor is further configured to: 
input the feature vector into an emotion classifier and acquire a predicted emotion corresponding to the feature vector (Gopala: p[0140] discloses a perception engine that receives audio input and outputs parameters such as the emotional state of the user, which is derived from analyzing the input features, i.e., a model to classify emotion based on audio derived features),
and provide a feedback corresponding to the predicted emotion (Gopala: p[0141] explains that after determining emotional state the system provides a corresponding output (speech that matches tone or visual representation such as a smiling avatar).
Lakhdhar in view of Gopala are combinable because they are both from the same field of endeavor, identifying fake audio. Both disclose methods for determining whether fake audio content was generated. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to disclose a transfer-learned classifier trained on a previous classifier model. The motivation for doing so is “Because there are lexical and phonetic invariants within a language family, the structure of language can be exploited. For example: German is an agglutinative language. In transfer learning, good representations (such as nonlinear combinations of input features) may be used as seeds to accelerate training” as disclosed in paragraph [0143] of Gopala.

Regarding Claim 12:
Claim 12 has been analyzed with regard to claim 1 (see rejection above) and is rejected for the same reasons of obviousness as used above.

Regarding Claim 15:
Claim 15 has been analyzed with regard to claim 4 (see rejection above) and is rejected for the same reasons of obviousness as used above.

Regarding Claim 16:
Claim 16 has been analyzed with regard to claim 5 (see rejection above) and is rejected for the same reasons of obviousness as used above.

Regarding Claim 17:
Claim 17 has been analyzed with regard to claim 9 (see rejection above) and is rejected for the same reasons of obviousness as used above.


Regarding Claim 19:
Claim 19 has been analyzed with regard to claim 11 (see rejection above) and is rejected for the same reasons of obviousness as used above.

Regarding Claim 20:
Claim 20 has been analyzed with regard to claim 1 (see rejection above) and
is rejected for the same reasons of obviousness as used above.

5.	Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Lakhdhar, in view of Gopala as applied to claim 6 above and further in view of Havdan (US 2023/0206925).

Regarding Claim 7:
	The combination of Lakhdhar and Gopala further discloses the electronic device of claim 6, wherein the at least one processor is further configured to: 
input a plurality of sample voice data into the non-semantic feature extractor model, and acquire non-semantic features corresponding to the plurality of sample voice data (Lakhdhar: p[0077] discloses CNN extracts feature representations (feature vectors) from audio samples for classification, these features are non-semantic because they relate to signal properties), 
and input the non-semantic features corresponding to the plurality of sample voice data into the synthetic voice classifier model, and acquire a prediction result that each of the plurality of sample voice data is classified into the synthetic voice or the user voice (Lakhdhar: p[0072]-[0073] discloses classifying audio as fake (synthetic) or genuine based on extracted features, producing and output representing the classification), acquire a (Lakhdhar: p[0077] discloses using triplet loss),
and update the synthetic voice classifier model based on the (Lakhdhar: p[0077] discloses updating network weights using the loss function during training).
The combination of Lakhdhar and Gopala do not explicitly disclose using cross entropy loss, however, Havdan discloses acquire a cross entropy loss corresponding to the prediction result and a true result based on the loss function (Havdan: p[0051] discloses the loss function may be a standard loss function such as cross-entropy);
and update the synthetic voice classifier model based on the cross entropy loss (Havdan: p[0031] discloses that the system may train a neural network  to classify genuine or spoofed voice samples, a loss function is also mentioned for this training process (as is necessary for any neural network training mechanism), p[0051] discloses this may be cross-entropy loss).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to choose cross entropy as the loss function used in Lakhdhar or Gopala. Cross-entropy loss was specifically created for classification purposes, given the need be able to accurately classify synthetic speech to prevent fraud in many fields but in particular call centers (as disclosed by Havdan p[0002]. There is a finite number of potential loss function that could be reasonably used and because this is a classification problem, cross-entropy would be an obvious decision for one of ordinary skill in the art to choose for the fact it would have a reasonable expectation of success as cross-entropy is well tested and understood within the field of endeavor (natural language processing).

Regarding Claim 8:
	The combination of Lakhdhar, Gopala and Havdan further disclose the electronic device of claim 7, wherein the plurality of sample voice data comprise: 
a plurality of sample user voices and a plurality of sample synthetic voices (Lakhdhar: p[0035] discloses a database with corresponding audio which will train the system to perform synthetic audio detection), 
and wherein the true result is a result of classifying each of the plurality of sample voice data into the synthetic voice or the user voice based on true labels corresponding to the plurality of sample voice data (Lakhdhar: p[0077] discloses the use of sample baches to train using backpropagation and a loss function, this is all labeled data used to train the model, therefore the true result is a result of using all of this batch data, corresponding to this limitation).

6.	Claims 10 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Lakhdhar, in view of Gopala as applied to claim 9 above and further in view of Schroeter (US 9,812,133).

Regarding Claim 10:
 	The combination of Lakhdhar and Gopala further disclose the electronic device of claim 9, except wherein the at least one processor is further configured to: adjust the threshold probability based on a security level corresponding to an application that is being executed in the electronic device, and based on the voice data being classified as the synthetic voice, provide a notification.  However, Schroeter discloses wherein the at least one processor is further configured to: adjust the threshold probability based on a security level corresponding to an application that is being executed in the electronic device, and based on the voice data being classified as the synthetic voice, provide a notification (Schroeter: Col 4:64 – Col 5:6 discloses explicitly adjusting a pre-determined threshold depending on the security context).
	It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Schroeter’s threshold adjustment approach with Lakhdhar’s spoof detection system because doing so merely applies a well-known practice of varying sensitivity based on security needs. The motivation for doing so is “in a lower security environment, such as using a voice sample to unlock a cell phone in order to place a call, the threshold is much lower so users do not get annoyed or frustrated with a stricter threshold” as disclosed in Col 4:67 – Col 5:6.

Regarding Claim 18:
Claim 18 has been analyzed with regard to claim 10 (see rejection above) and is rejected for the same reasons of obviousness as used above.

Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IAN SCOTT MCLEAN whose telephone number is (703)756-4599. The examiner can normally be reached "Monday - Friday 8:00-5:00 EST, off Every 2nd Friday".
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached at (571) 272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/IAN SCOTT MCLEAN/Examiner, Art Unit 2654          

/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654

Read full office action

Prosecution Timeline

Show 2 earlier events

Dec 02, 2025

Applicant Interview (Telephonic)

Dec 02, 2025

Examiner Interview Summary

Dec 30, 2025

Response Filed

Mar 02, 2026

Final Rejection mailed — §103

Apr 22, 2026

Applicant Interview (Telephonic)

Apr 22, 2026

Examiner Interview Summary

May 01, 2026

Request for Continued Examination

May 06, 2026

Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

18/499,296

Patent 12609127

NEUTRALIZING DISTORTION IN AUDIO DATA

2y 5m to grant Granted Apr 21, 2026

18/245,802

Patent 12602553

SPEECH TRANSLATION METHOD, DEVICE, AND STORAGE MEDIUM

3y 0m to grant Granted Apr 14, 2026

17/952,401

Patent 12494199

VOICE INTERACTION METHOD AND ELECTRONIC DEVICE

3y 2m to grant Granted Dec 09, 2025

18/063,167

Patent 12443805

Systems and Methods for Multilingual Data Processing and Arrangement on a Multilingual User Interface

2y 10m to grant Granted Oct 14, 2025

17/559,283

Patent 12437144

Content Recommendation Method and User Terminal

3y 9m to grant Granted Oct 07, 2025

Study what changed to get past this examiner. Based on 5 most recent grants.

Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.

Typically takes 5-10 seconds — AI-generated, attorney review required before filing

Prosecution Projections

3-4

Expected OA Rounds

45%

Grant Probability

76%

With Interview (+31.4%)

3y 0m (~9m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 47 resolved cases by this examiner. Grant probability derived from career allowance rate.