Prosecution Insights
Last updated: April 19, 2026
Application No. 18/733,524

SYSTEM AND METHOD FOR DISTINGUISHING ORIGINAL VOICE FROM SYNTHETIC VOICE IN AN IOT ENVIRONMENT

Status: Non-Final OA (§103)
Filed: Jun 04, 2024
Examiner: GAY, SONIA L
Art Unit: 2657
Tech Center: 2600 (Communications)
Assignee: Samsung Electronics Co., Ltd.
OA Round: 1 (Non-Final)

Grant Probability: 82% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 0m
Grant Probability With Interview: 93%

Examiner Intelligence

Career Allow Rate: 82% (701 granted of 855 resolved; +20.0% vs TC avg; above average)
Interview Lift: +11.4% (moderate lift for resolved cases with an interview)
Avg Prosecution: 3y 0m (33 applications currently pending)
Total Applications: 888 (across all art units)

Statute-Specific Performance

§101: 10.2% (-29.8% vs TC avg)
§103: 50.6% (+10.6% vs TC avg)
§102: 11.9% (-28.1% vs TC avg)
§112: 13.9% (-26.1% vs TC avg)
Based on career data from 855 resolved cases; TC averages are estimates.

Office Action

§103
DETAILED ACTION

This action is in response to the initial filing of application No. 18/733,524 on 06/04/2024. Claims 1-20 are pending in this application, with claims 1 and 12 being independent.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections

Claim 15 is objected to because of the following informality: “generate comprising a first output comprising” should recite “generate a first output comprising”. Appropriate correction is required.

Allowable Subject Matter

Claims 3 and 14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The prior art fails to teach or suggest in reasonable combination the following limitations: wherein the one or more processors are further configured to execute further instructions to: perform re-verification of the voice of the user based on the ranked score being equal to the dynamic threshold value, wherein to perform the re-verification of the voice of the user comprises: perform a search in a combined database for phrases of spatial features and temporal features that are similar to spatial features and temporal features of the voice of the user and the environmental factors; select, from among the searched phrases, a phrase of least variation for an environment; simulate environmental factors of the environment of the selected phrase using multiple IoT devices; prompt the user to speak the selected phrase for re-verification of the voice of the user; count a number of times the re-verification of the voice of the user is performed; and stop the re-verification based on the number of times reaching a predefined count value.

For example, Keret et al.
(US 2020/0184979) (“Keret”) discloses the following limitations recited by claims 4 and 13: performing re-verification of a voice of a user based on the user’s spoken response not matching a speech pattern enrolled with a system ([0061]) by: simulating environmental factors for an environment (telephony line noise is generated to simulate a telephone call environment) of a selected phrase (“My voice is strong as Oak”) (Fig.3, 300 – 320 and Fig.4, 470; [0043 – 0045] [0059 – 0061]); prompting the user to speak the selected phrase for re-verification of the voice of the user (“My voice is strong as Oak”) (Fig.3, 300 – 320; [0043 – 0045] [0059 – 0061]); counting a number of times the re-verification of the voice of the user is performed ([0061]); and stopping the re-verification based on the number of times reaching a predefined count value (3 times) ([0061]). Yet, Keret fails to teach or suggest the following limitations recited by claims 4 and 13: perform a search in a combined database for phrases of spatial features and temporal features that are similar to spatial features and temporal features of the voice of the user and the environmental factors; select, from among the searched phrases, a phrase of least variation for an environment; and simulate environmental factors of the environment of the selected phrase using multiple IoT devices.

Claims 7 (with dependent claims 8 – 10) and 18 (with dependent claim 19) are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The prior art fails to teach or suggest in reasonable combination the following limitations: wherein the one or more processors are further configured to execute further instructions to: preprocess the plurality of features; search one or more properties of each of the plurality of features; determine an upper specification limit and a lower specification limit of each of the plurality of features; determine a first type of the voice of the user, the first type comprising at least one of a regular voice or an irregular voice; determine a second type of the environmental factors based on at least one of past patterns or user history stored in a combined database, the second type comprising at least one of known environmental factors or unknown environmental factors; select a kernel; extract the plurality of features of the voice of the user and the environmental factors based on the kernel; and compute the score using an optimized plurality of features, and rank the score based on the first type of the voice of the user and the second type of the environmental factors, wherein the optimized plurality of features comprise a portion of the plurality of features optimized using a regression function.

Claims 11 and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The prior art fails to teach or suggest in reasonable combination the following limitations: for claims 11 and 20, the combination of Sharifi and Baracaldo fails to teach the following: wherein the determining the voice of the user as the original voice or the synthetic voice comprises: performing threshold verification of the ranked score; determining the dynamic threshold value based on spatial features and temporal features of the voice of the user and the environmental factors stored in a combined database; re-performing ranking based on the dynamic threshold value; determining that the voice of the user is the original voice based on the ranked score being above the dynamic threshold value; and determining that the voice of the user is the synthetic voice based on the ranked score being below the dynamic threshold value.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 2, 12 and 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Sharifi et al. (US 2019/0287536) (“Sharifi”) in view of Baracaldo Angel et al. (US 2019/0066686) (“Baracaldo”).
For claims 1 and 12, Sharifi discloses an electronic apparatus (computing device, Fig.1, 100 and Fig.4, 400) for distinguishing original voice from synthetic voice in an Internet of Things (IoT) environment (Abstract), comprising: a memory (Fig.4, 404; [0076] [0077]) storing instructions ([0078]); and one or more processors (Fig.4, 402) operatively coupled to the memory, wherein the one or more processors are configured to execute the instructions ([0077]) to: obtain a voice of a user and environmental factors (background noise; since the audio fingerprint comprises background noise, the audio data used to generate the audio fingerprint comprises background noise, [0035]) associated with a user-initiated request (“OK Computer … Call Mom”) (Fig.1, 120, Fig.2A, 201, 211 and Fig.3, 310; [0025] [0026] [0045] [0050] [0068]); extract a plurality of features from the voice of the user and the environmental factors (an audio fingerprint of a user’s speech is generated, Fig.1, 140, Fig.2A, 203, 213 and Fig.3, 330; [0033 – 0037] [0046] [0052] [0069]); obtain a score (similarity score) using the plurality of features (Fig.1, 150, Fig.2A, 205 – 208, 214 and Fig.3, 340 and 350; [0038 - 0041] [0047] [0048] [0053] [0055] [0069] [0070]) and rank the score (the context of the command or query is determined; the similarity score is ranked based on the context, i.e. sensitivity (higher/lower), [0055] [0056]); and determine the voice of the user as the original voice or the synthetic voice by comparing the ranked score with a dynamic ([0008] [0055] [0056]) threshold value (the threshold value is adjusted based on the determined type of voice command or query, Fig.2A, 205 – 208, 214 and Fig.3, 340 and 350; [0008] [0041] [0047] [0048] [0054 – 0056] [0070 – 0072]).

Yet, Sharifi fails to teach the following: the score is ranked based on context comprising a type of voice of the user and the environmental factors.
However, Baracaldo discloses a system and method for the purpose of protecting sensitive information collected during verbal communications (Abstract), comprising the following: the context of a speech, used to determine confidentiality levels of the speech ([0022] [0023]), includes the type of a user’s voice (whispered or shouted) ([0025]) and environmental factors (location information) ([0024] [0043]). Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve Sharifi’s invention in the same way that Baracaldo’s invention has been improved to achieve the following, predictable results for the purpose of preventing unauthorized access to contextually sensitive information (which is collected, stored and disseminated by voice application devices) via the use of synthetic (replay) speech (Sharifi, [0024]) (Baracaldo, [0008 - 0013]): the score is further ranked based on context comprising a type of voice of the user and the environmental factors.

For claims 2 and 13, Sharifi further discloses wherein the one or more processors are further configured to execute further instructions to: determine the voice of the user as the original voice based on the ranked score being above the dynamic threshold value (Sharifi, the similarity score satisfying the similarity threshold is broadly interpreted as meeting or exceeding the similarity threshold, [0041] [0042] [0055]); and determine the voice of the user as the synthetic voice based on the ranked score being below the dynamic threshold value (Sharifi, the similarity score not satisfying the similarity threshold is broadly interpreted as being below the threshold, [0041] [0042] [0055]).

Claim(s) 4 and 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Sharifi et al. (US 2019/0287536) (“Sharifi”) in view of Baracaldo Angel et al. (US 2019/0066686) (“Baracaldo”), and further in view of Sharifi et al.
(US 2022/0189470) (“Sharifi1”) and further in view of Zhang et al. (US 2022/0139368) (“Zhang”).

For claims 4 and 15, the combination of Sharifi and Baracaldo further discloses, wherein the one or more processors are further configured to execute further instructions to: process the user-initiated request comprising the voice of the user and the environmental factors (Sharifi, the audio fingerprint generator processes the audio data associated with the utterance “OK Computer … Call Mom”, [0033] [0052]); extract the plurality of features (Sharifi, time-frequency peaks, frequency ratios, frames of filterbank energies, d-vectors) from the voice of the user and the processed environmental factors (Sharifi, [0034 – 0037] [0052]); and map the plurality of features to features stored in a speaker database (Sharifi, [0040] [0041] [0053]).

Yet, the combination of Sharifi and Baracaldo fails to teach the following: identify, based on the map, the user that initiated the user-initiated request; separate each channel of the voice of the user and the environmental factors; and generate a first output comprising a first channel of the voice of the user and a second output comprising a combination of each channel of the environmental factors.

However, Sharifi1 discloses a system and method for managing hotwords (Abstract), comprising the following: storing a list of hotwords in a hotword registry with a user profile ([0069 – 0072]); receiving a user-initiated request (Fig.6, 604; [0084]); and identifying a user by mapping the audio data associated with the request to the hotword registry stored within the user’s profile (Fig.6, 606; [0027 - 0030] [0044] [0047] [0049] [0056] [0084]).
Additionally, Zhang discloses a system and method for concurrent multi-path processing of audio signals for automatic speech recognition (Abstract), comprising the following: separating each channel of voice of a user (speech) and the environmental factors (noise) (a mixed audio signal is demixed to determine a set of source-specific audio signals comprising human utterances and background noise, [0029 – 0032]); and generating an output comprising a first channel of the voice of the user and a second output comprising a combination of each channel (broadly interpreted as a single channel) of the environmental factors ([0052] [0061 - 0065]).

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Sharifi and Baracaldo in the same way that Sharifi1’s invention has been improved to achieve the following, predictable results for the purpose of enabling a user to engage in faster and more natural interactions with multiple different assistant-enabled devices by enabling the user to create custom hotwords (Sharifi1, [0026 – 0029]): further storing a list of hotwords in a hotword registry with a user profile; receiving the user-initiated request; and identifying a user by mapping the audio data associated with the request to the hotword registry stored within the user’s profile.
Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Sharifi, Baracaldo and Sharifi1 in the same way that Zhang’s invention has been improved to achieve the following, predictable results for the purpose of improving speech recognition accuracy (Sharifi, [0048]) by distinguishing between individual human utterances and background noise (Zhang, [0029]): further separating each channel of the voice of the user and the environmental factors; and generating a first output comprising a first channel of the voice of the user and a second output comprising a combination of each channel of the environmental factors.

Claim(s) 5 and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Sharifi et al. (US 2019/0287536) (“Sharifi”) in view of Baracaldo Angel et al. (US 2019/0066686) (“Baracaldo”), and further in view of Zhang et al. (US 2022/0139368) (“Zhang”), and further in view of Saffer (US 2006/0074667), and further in view of Goodwin et al. (US 2006/0085188).

For claims 5 and 16, the combination of Sharifi and Baracaldo further discloses, wherein the plurality of features comprises spatial features (Sharifi, time frequency peaks, [0036]) and temporal features (Sharifi, filterbank energies, [0037]) of the voice of the user and the environmental factors (Sharifi, [0036] [0037]), wherein the one or more processors are further configured to execute further instructions to: preprocess the voice of the user and the environmental factors (Sharifi, [0036] [0037]); and extract the plurality of features from the preprocessed voice of the user and the preprocessed environmental factors based on at least one of frequency, energy (Sharifi, [0036]), zero crossing rate, or Mel-frequency cepstral coefficients (MFCC).
Yet, the combination of Sharifi and Baracaldo fails to teach the following: separately preprocess the voice of the user and the environmental factors; the preprocessing comprises performing normalization, pre-emphasis and frame blocking; and segregate the plurality of features of the voice of the user and the environmental factors by performing feature separation and dimension reduction.

However, Zhang discloses a system and method for concurrent multi-path processing of audio signals for automatic speech recognition (Abstract), comprising the following: separating each channel of voice of a user (speech) and the environmental factors (noise) (a mixed audio signal is demixed to determine a set of source-specific audio signals comprising human utterances and background noise, [0029 – 0032]); and generating an output comprising a first channel of the voice of the user and a second output comprising a combination of each channel (broadly interpreted as a single channel) of the environmental factors ([0052] [0061 - 0065]).

Additionally, Saffer discloses a system and method for extracting features from an audio signal (Abstract, Fig.3), comprising the following: processing a received audio signal through a pre-emphasis stage (Fig.3, 40), a frame blocking stage (Fig.3, 41), a filterbank stage (Fig.3, 44) and a normalization stage (Fig.3, 45) ([0087 – 0089]).

Furthermore, Goodwin discloses a system and method for converting an audio signal into a feature space representation (Abstract), comprising the following: segregating a plurality of features of an input audio signal by performing feature separation and dimension reduction ([0028 – 0034]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Sharifi and Baracaldo in the same way that Zhang’s invention has been improved to achieve the following, predictable results for the purpose of improving speech recognition accuracy (Sharifi, [0048]) by distinguishing between individual human utterances and background noise (Zhang, [0029]): further separating each channel of the voice of the user and the environmental factors; and generating a first output comprising a first channel of the voice of the user and a second output comprising a combination of each channel of the environmental factors.

Furthermore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Sharifi, Baracaldo and Zhang in the same way that Saffer’s invention has been improved to achieve the following, predictable results for the purpose of accurately processing the audio signals to obtain features which are used to perform additional signal processing: the preprocessing comprises performing normalization, pre-emphasis and frame blocking.

Moreover, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Sharifi, Baracaldo, Zhang and Saffer in the same way that Goodwin’s invention has been improved to achieve the following, predictable results for the purpose of generating an audio fingerprint which performs robustly in uniquely identifying an audio signal in search systems (Goodwin, [0028] [0029]): further segregating the plurality of features of the voice of the user and the environmental factors by performing feature separation and dimension reduction.

Claim(s) 6 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Sharifi et al.
(US 2019/0287536) (“Sharifi”) in view of Baracaldo Angel et al. (US 2019/0066686) (“Baracaldo”), and further in view of Zhang et al. (US 2022/0139368) (“Zhang”), and further in view of Saffer (US 2006/0074667), and further in view of Goodwin et al. (US 2006/0085188), and further in view of Lim et al. (US 2023/0368798) (“Lim”).

For claims 6 and 17, the combination of Sharifi, Baracaldo, Zhang, Saffer and Goodwin further discloses: separately performing the segregating of the plurality of features (Zhang, [0029 – 0032]) (Goodwin, [0028 – 0034]).

Yet, the combination of Sharifi, Baracaldo, Zhang, Saffer and Goodwin fails to teach the following: wherein the extracting the plurality of features from the preprocessed voice of the user and the preprocessed environmental factors comprises: performing continuous wavelet transformation of the preprocessed voice of the user and the preprocessed environmental factors and generating a scalogram that visualizes the transformation; and extracting the plurality of features comprising at least one of periodic changes, aperiodic changes, or temporal changes.

However, Lim discloses a system and method for performing speaker recognition by voice biometrics (Abstract), comprising the following: performing continuous wavelet transformation of an audio signal ([0063 – 0066]) and generating a scalogram that visualizes the transformation ([0066]); and extracting a plurality of features comprising temporal changes ([0064]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Sharifi, Baracaldo, Zhang, Saffer and Goodwin in the same way that Lim’s invention has been improved to achieve the following, predictable results for the purpose of generating an audio fingerprint which performs robustly in uniquely identifying an audio signal in search systems (Goodwin, [0028] [0029]): wherein the extracting the plurality of features from the preprocessed voice of the user and the preprocessed environmental factors comprises: performing continuous wavelet transformation of the preprocessed voice of the user and the preprocessed environmental factors and generating a scalogram that visualizes the transformation; and extracting the plurality of features comprising at least one of periodic changes, aperiodic changes, or temporal changes.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SONIA L GAY, whose telephone number is (571) 270-1951. The examiner can normally be reached Monday-Friday, 9-5 ET.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/SONIA L GAY/
Primary Examiner, Art Unit 2657
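For readers parsing the claim language quoted above, the decision flow recited in claims 2 and 13 (original if the ranked score is above the dynamic threshold, synthetic if below) and the capped re-verification loop discussed under the allowable subject matter can be read as roughly the following logic. This is an illustrative sketch only; the function names, the tie-triggers-re-verification reading, and the scoring inputs are assumptions, not the applicant's actual implementation:

```python
# Illustrative reading of the claimed decision flow: compare a ranked
# score against a dynamic threshold, and cap re-verification attempts.
# Names and structure are placeholders, not the actual implementation
# from application 18/733,524.

PREDEFINED_COUNT = 3  # Keret [0061] stops re-verification after 3 tries


def classify(ranked_score: float, dynamic_threshold: float) -> str:
    """Original if above the dynamic threshold, synthetic if below;
    an exact tie triggers re-verification (claims 3/14 limitation)."""
    if ranked_score > dynamic_threshold:
        return "original"
    if ranked_score < dynamic_threshold:
        return "synthetic"
    return "re-verify"


def verify_with_cap(scores, dynamic_threshold: float) -> str:
    """Repeat verification over successive scores, stopping once the
    number of re-verification attempts reaches the predefined count."""
    attempts = 0
    for score in scores:
        result = classify(score, dynamic_threshold)
        if result != "re-verify":
            return result
        attempts += 1
        if attempts >= PREDEFINED_COUNT:
            break
    return "unverified"


print(classify(0.9, 0.5))                      # original
print(classify(0.2, 0.5))                      # synthetic
print(verify_with_cap([0.5, 0.5, 0.5], 0.5))   # unverified
```

Note that the sketch treats "equal to the dynamic threshold" as the re-verification trigger, matching the claim 3/14 language; the claims themselves do not specify how the dynamic threshold is computed.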

Prosecution Timeline

Jun 04, 2024
Application Filed
Feb 07, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602617: DATA MANUFACTURING FRAMEWORKS FOR SYNTHESIZING SYNTHETIC TRAINING DATA TO FACILITATE TRAINING A NATURAL LANGUAGE TO LOGICAL FORM MODEL
Granted Apr 14, 2026 (2y 5m to grant)

Patent 12602408: STREAMING OF NATURAL LANGUAGE (NL) BASED OUTPUT GENERATED USING A LARGE LANGUAGE MODEL (LLM) TO REDUCE LATENCY IN RENDERING THEREOF
Granted Apr 14, 2026 (2y 5m to grant)

Patent 12602539: PROACTIVE ASSISTANCE VIA A CASCADE OF LLMS
Granted Apr 14, 2026 (2y 5m to grant)

Patent 12596708: SYSTEMS AND METHODS FOR AUTOMATED CODE GENERATION FOR CALCULATION BASED ON ASSOCIATED FORMAL SPECIFICATIONS
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12591604: INTELLIGENT ASSISTANT
Granted Mar 31, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 82%
With Interview: 93% (+11.4%)
Median Time to Grant: 3y 0m
PTA Risk: Low

Based on 855 resolved cases by this examiner. Grant probability is derived from the career allow rate.
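The headline projections follow directly from the career statistics shown earlier; a minimal sketch of the arithmetic (the rounding convention and the additive treatment of the interview lift are assumptions about how the dashboard derives its figures):

```python
# Reproduce the headline projections from the examiner's career stats.
# All numbers come from the statistics shown above; the additive lift
# and rounding convention are assumptions, not a documented formula.

granted = 701          # career grants
resolved = 855         # career resolved cases
interview_lift = 11.4  # percentage-point lift with an interview

# Grant probability = career allow rate, as a whole percent.
grant_probability = round(100 * granted / resolved)

# With-interview probability = allow rate plus the lift, capped at 100.
with_interview = round(min(grant_probability + interview_lift, 100))

print(grant_probability)  # 82
print(with_interview)     # 93
```

701 of 855 resolved cases gives 81.99%, which rounds to the displayed 82%; adding the 11.4-point interview lift gives 93.4%, displayed as 93%.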

Sign in with your work email

Enter your email to receive a magic link. No password needed.

Personal email addresses (Gmail, Yahoo, etc.) are not accepted.

Free tier: 3 strategy analyses per month