Prosecution Insights
Last updated: April 19, 2026
Application No. 17/909,503

Voice Authentication Apparatus Using Watermark Embedding And Method Thereof

Status: Non-Final OA (§103)
Filed: Sep 06, 2022
Examiner: YOUNG, CAMERON KENNETH
Art Unit: 2655
Tech Center: 2600 — Communications
Assignee: Puzzle AI Co. Ltd.
OA Round: 3 (Non-Final)
Grant Probability: 70% (Favorable)
Expected OA Rounds: 3-4
Estimated Time to Grant: 2y 11m
Grant Probability With Interview: 82%

Examiner Intelligence

Career Allow Rate: 70% (14 granted / 20 resolved; +8.0% vs TC avg), above average
Interview Lift: +12.5% (moderate lift, among resolved cases with interview)
Avg Prosecution: 2y 11m (typical timeline); 23 currently pending
Total Applications: 43 (career history, across all art units)

Statute-Specific Performance

§101: 20.1% (-19.9% vs TC avg)
§103: 58.9% (+18.9% vs TC avg)
§102: 11.4% (-28.6% vs TC avg)
§112: 7.7% (-32.3% vs TC avg)

Tech Center averages are estimates. Based on career data from 20 resolved cases.

Office Action (§103)
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 06/26/2025 has been entered.

Response to Amendment

Applicant's amendment filed 06/26/2025 has been entered. Applicant's amendment has overcome each and every 35 U.S.C. § 112(b) rejection previously laid out in the Office action dated 04/21/2025. As such, the 35 U.S.C. § 112(b) rejections of claims 1, 5, 6, 7, 8, and 12 have been withdrawn.

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:

(A) the claim limitation uses the term "means" or "step" or a term used as a substitute for "means" that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term "means" or "step" or the generic placeholder is modified by functional language, typically, but not always, linked by the transition word "for" (e.g., "means for") or another linking word or phrase, such as "configured to" or "so that"; and
(C) the term "means" or "step" or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word "means" (or "step") in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word "means" (or "step") in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material, or acts to entirely perform the recited function.

Claim limitations in this application that use the word "means" (or "step") are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word "means" (or "step") are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

This application includes one or more claim limitations that do not use the word "means," but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitations use a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function, and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: "a voice collection unit" in claim 1, "a frame generation unit" in claim 1, "a frequency analysis unit" in claims 1 and 5, "a neural network learning unit" in claim 1, "a watermark generation unit" in claim 6, "a watermark embedment unit" in claims 6 - 8, "a watermark extraction unit" in claim 6, "an encryption generation unit" in claim 9, "an authentication comparison unit" in claim 9, and "an authentication determination unit" in claim 9.

Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.

Claims 1, 5, 7, and 9 - 10 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent Application Publication No. 2016/0315771 A1 to Srinivasa Rao Chalamala et al. (hereinafter Chalamala) in view of U.S. Patent Application Publication No. 2021/0050025 A1 to William Carter Huffman et al. (hereinafter Huffman), in further view of U.S. Patent Application Publication No. 2003/0149879 A1 to Jun Tian et al. (hereinafter Jun), and in further view of U.S. Patent Application Publication No. 2020/0035247 A1 to Constantine T. Boyadjiev et al. (hereinafter Boyadjiev).

Regarding claim 1, Chalamala teaches a voice authentication system comprising: (Chalamala teaches a multi-factor authentication system performing speaker verification (i.e., voice authentication). Chalamala at ¶ [0010].) a voice collection unit configured to collect voice information obtained by digitizing a speaker's voice…; (Chalamala teaches a user interface comprising a microphone in communication with a computer processor. (i.e., the audio received by the microphone is processed by the computer processor, thus the audio is digitized in order to be processed or is digitally captured by the microphone.) Chalamala at ¶ [0030].)
Chalamala, however, does not teach a learning model server configured to generate a voice image based on the collected voice information of the speaker, cause a deep neural network (DNN) model to learn the voice image, and extract a feature vector for the voice image.

In a similar field of endeavor (e.g., generation of voice/audio watermarks), Huffman teaches a learning model server configured to generate a voice image based on the collected voice information of the speaker, cause a deep neural network (DNN) model to learn the voice image, and extract a feature vector for the voice image. (Huffman teaches extracting features from voice data in the vector space (i.e., extracting a feature vector) using a machine learning system. Huffman at ¶ [0045]. Further, Huffman teaches that a watermark machine learning system may also be referred to as a watermark network, which may be a deep neural network. (i.e., the machine learning system may be a deep neural network for "learning" the voice image and extracting the feature vector.) Huffman at ¶¶ [0048] - [0049]. Further, Huffman teaches the voice data being a spectrogram (i.e., a voice image). Huffman at ¶ [0063] and Fig. 5. Thus, Huffman teaches representing the speech data as a spectrogram (i.e., generating a voice image based on the speech data; in order to represent speech data as a spectrogram, the spectrogram must be generated/created), a deep neural network learning the voice image (a machine learning system (i.e., a deep neural network) processing the speech data), and extracting features from the speech data in the vector space using the machine learning system (i.e., extracting a feature vector from the speech data).)

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chalamala with the teachings of Huffman to provide a learning model server configured to generate a voice image based on the collected voice information of the speaker, cause a deep neural network (DNN) model to learn the voice image, and extract a feature vector for the voice image. Doing so would have improved optimization of watermarked audio by hiding the watermark better than traditional methods, as recognized by Huffman at ¶ [0126].

Further, the combination of Chalamala and Huffman teaches a watermark server (Chalamala teaches a server 104 that may comprise one or more server devices (e.g., an authentication server, a watermark server, etc.). Chalamala at ¶¶ [0021] – [0024]. Chalamala's teaching of a server comprising one or more server devices shows that many different types of servers may be used in this context. Thus, it is obvious that one of the "one or more" servers may include a watermarking server when Chalamala also teaches watermarking. Chalamala at ¶¶ [0033] – [0044].) configured to generate a watermark based on the feature vector and a private key (i.e., the extracted features of Huffman and the authentication parameters of Chalamala) and embed the watermark and individual information into the voice image or voice conversion data…; (Chalamala teaches the server performing watermark embedding into an audio signal (i.e., embedding into the voice image). Further, Chalamala teaches embedding a plurality of embedding parameters into an audio signal as a watermark. Chalamala at ¶¶ [0033] – [0034].)

It would have been obvious to shift Chalamala's function of embedding watermarks to one of the one or more servers of server 104 because Chalamala teaches the user device 102 may be implemented on any of a long list of generic computing devices (e.g., mobile phone, workstation, desktop computer, etc.). Chalamala at ¶ [0023]. It would be obvious to implement the user device, or some functionality thereof, on one or more servers (104) because Chalamala teaches that any of the devices of environment 100 may perform the functions of other devices in the environment (i.e., the server may perform the functions of the user device; in this case, the server may perform the watermarking). Chalamala at ¶¶ [0021] and [0023].

The combination of Chalamala-Huffman further teaches an authentication server configured to generate the private key based on an encryption of the feature vector using an encryption algorithm and determine whether to extract the watermark and the individual information based on an authentication result. (Chalamala teaches an authentication server configured to remove the watermark after verifying the authenticity of the speaker. Chalamala at ¶ [0026]. Further, Chalamala teaches embedding a generated encrypted passphrase (i.e., a private key) in the audio signal. Chalamala at ¶¶ [0033] – [0034]. Further, it would have been obvious to combine Chalamala's generation and embedding of authentication parameters (i.e., a private key) with Huffman's extraction of a feature vector for embedding into an image as a watermark, such that the private key is generated based on the extracted feature vector, as both elements are targets of the embedding; therefore the limitations of the claims can be achieved by simply adding Huffman's extraction of a feature vector to Chalamala's generation of authentication parameters and embedding thereof. Chalamala at ¶¶ [0033] – [0034] and Huffman at ¶¶ [0045] – [0049]. Further, encryption of any item requires the use of an encryption algorithm; otherwise no encryption can be had. Therefore, a person of ordinary skill in the art would have understood that Chalamala's encryption would have been performed using an encryption algorithm.)

Chalamala-Huffman, however, do not teach "wherein the watermark is embedded to a pixel of the voice image or voice conversion data that has less color modulation."

In a similar field of endeavor (e.g., reversible watermarking of data), Jun teaches wherein the watermark is embedded to a pixel of the voice image or voice conversion data that has less color modulation. (Jun teaches embedding digital information into data elements selected based upon their expandability. The samples may correspond to a color channel or color mapping. Further, the data elements are preferably selected as comprising highly expandable data, where highly expandable data may be highly correlated pixels that have small difference values (i.e., low color modulation pixels). These highly correlated pixels provide the most expandability for embedding and as such are preferred. Jun at ¶¶ [0060] - [0064].)

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chalamala-Huffman with the teachings of Jun to provide the limitation of claim 1 wherein the watermark is embedded to a pixel of the voice image or voice conversion data that has less color modulation. Doing so would have improved the performance of data embedding, as recognized by Jun at ¶ [0075].
Chalamala-Huffman-Jun, however, do not teach "converting the speaker's voice into a multidimensional array to acquire voice conversion data" and "wherein the learning model server further includes: a frame generation unit configured to generate a voice frame for a predetermined time based on the voice information; a frequency analysis unit configured to analyze a voice frequency based on the voice frame, and generate the voice image in time series by imaging the voice frequency; and a neural network learning unit configured to extract the feature vector by causing the deep neural network model to learn the voice image."

In a similar field of endeavor (e.g., authentication of voice signals represented by multi-dimensional vectors), Boyadjiev teaches converting the speaker's voice into a multidimensional array to acquire voice conversion data. (Boyadjiev teaches using a multi-dimensional acoustic feature vector (i.e., multi-dimensional array) extraction algorithm to extract a multi-dimensional feature vector from a sample voice. Boyadjiev at ¶¶ [0086] – [0091].)

Further, Boyadjiev teaches a frame generation unit configured to generate a voice frame for a predetermined time based on the voice information. (Boyadjiev teaches generating a plurality of multi-dimensional feature vectors (i.e., voice images/frames) based on input speech signals. Boyadjiev at ¶¶ [0061] – [0066] and Fig. 7. Boyadjiev teaches looking for a specific frequency over time that matches a control voice sample, searching for a specific portion of a voice sample, and isolating the specific portion matching the voice control sample. Boyadjiev at ¶¶ [0061] - [0065]. As such, the specific portion matching the voice control sample would correspond to a specific time within the voice sample, and isolating that portion amounts to generating a voice frame for a predetermined time based on the voice information.)

Boyadjiev also teaches a frequency analysis unit configured to analyze a voice frequency based on the voice frame, and generate the voice image in time series by imaging the voice frequency. (Boyadjiev teaches extracting the multi-dimensional feature vector (voice frame) based on frequency (i.e., a frequency analysis is performed to determine what to remove). Boyadjiev at ¶ [0088]. Further, Boyadjiev teaches extracting multi-dimensional acoustic feature vectors (e.g., an RGB multi-dimensional acoustic feature vector, i.e., a voice image) and processing the acoustic feature vectors as frequency over time in the process of matching (i.e., the matching of voice features is done in time series). Boyadjiev at ¶¶ [0060] – [0066].)

Boyadjiev further teaches a neural network learning unit configured to extract the feature vector by causing the deep neural network model to learn the voice image. (Boyadjiev teaches the extraction of acoustic feature vectors from the multi-dimensional feature vector using neural networks (i.e., the neural networks are configured to extract the feature vector to "learn" the voice image (multi-dimensional feature vector)). Boyadjiev at ¶ [0017].)

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chalamala-Huffman-Jun with the teachings of Boyadjiev (hereinafter Chalamala-Huffman-Jun-Boyadjiev) to provide the limitations of claim 1. Doing so would have improved accuracy of the feature vectors, as recognized by Boyadjiev at ¶ [0060].

Regarding claim 5, Chalamala-Huffman-Jun-Boyadjiev teaches all the limitations of claim 1 as laid out above. Further, Boyadjiev teaches wherein the frequency analysis unit generates the voice image by applying the voice frame to a short time Fourier transform (STFT) algorithm. (Boyadjiev teaches using a short-time Fourier transform to generate the multi-dimensional acoustic feature vectors (i.e., the voice frames). Boyadjiev at ¶ [0018].)
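As background on the claim 5 limitation, applying windowed voice frames to a short-time Fourier transform and imaging the magnitudes is the standard way a spectrogram ("voice image") is produced. The sketch below is purely illustrative and not code from the application or the cited references; the function name and parameters are our own:

```python
import numpy as np

def voice_image(samples, frame_len=256, hop=128):
    """Compute a magnitude spectrogram ("voice image") from raw audio
    via a short-time Fourier transform (STFT)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack([samples[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])   # time-sliced voice frames
    spectrum = np.fft.rfft(frames, axis=1)          # per-frame frequency analysis
    return np.abs(spectrum).T                       # (frame_len//2 + 1) x n_frames image

# a 1 kHz tone sampled at 8 kHz for one second
t = np.arange(8000) / 8000.0
img = voice_image(np.sin(2 * np.pi * 1000 * t))
print(img.shape)  # (129, 61)
```

Each column of the returned array is the frequency analysis of one voice frame, so the array can be rendered directly as a time-series image; for the 1 kHz tone above, the energy concentrates in bin 32 (1000 Hz x 256 / 8000 Hz).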
Regarding claim 7, Chalamala-Huffman-Jun-Boyadjiev teaches all the limitations of claim 1 as laid out above. Further, Jun teaches the voice authentication system of claim 1, further comprising a watermark embedment unit within the watermark server, wherein the watermark embedment unit extracts an RGB value for each pixel of the voice image, calculates a difference between the RGB value and a total average RGB value, and embeds the watermark and the individual information into a pixel whose calculated difference is less than a threshold value, and wherein the calculated difference of the pixel that has less color modulation is less than the threshold value. (Jun teaches, for embedding data in digital images (i.e., embedding information in an image is done by embedding information within at least a pixel of a digital image), averaging the RGB values of the image for embedding the information within the image. Jun at ¶ [0060]. Jun teaches embedding digital information into data elements selected based upon their expandability. The samples may correspond to a color channel or color mapping. Further, the data elements are preferably selected as comprising highly expandable data, where highly expandable data may be highly correlated pixels that have small difference values (i.e., low color modulation pixels). These highly correlated pixels provide the most expandability for embedding and as such are preferred. Jun at ¶¶ [0060] - [0064].)

Regarding claim 9, Chalamala-Huffman-Jun-Boyadjiev teaches all the limitations of claim 1 as laid out above. Further, Chalamala teaches the voice authentication system of claim 1, wherein the authentication server includes: an encryption generation unit configured to encrypt the feature vector to generate the private key corresponding to the feature vector; (Chalamala teaches generating an encrypted passphrase (i.e., generating a private key). Chalamala at ¶ [0033]. It would have been obvious, for the same reasons stated above regarding claim 1, to shift the functionality of generating an encrypted passphrase to one or more of server 104 because Chalamala teaches any of the devices of environment 100 performing one or more of the functions of the other devices of environment 100 (i.e., server 104 performs the generation of the encrypted passphrase of user device 102). Chalamala at ¶ [0021].) an authentication comparison unit configured to compare the sameness between the encrypted feature vector and a feature vector of an authentication target; (Chalamala teaches comparing encrypted data (e.g., the feature vector) with other encrypted data (e.g., a feature vector corresponding to the authentication target) to authenticate the user. Chalamala at ¶ [0050].) and an authentication determination unit configured to determine whether authentication is successful for the speaker based on a comparison result, and determines whether to extract the watermark and the individual information. (Chalamala teaches removing a watermark from an audio signal after determining success of an authentication process (i.e., determining to extract the watermark based on the result of authentication). Chalamala at ¶ [0050].)

Regarding claim 10, Chalamala-Huffman-Jun-Boyadjiev teaches all the limitations of claim 9 as laid out above. Further, Chalamala teaches wherein the authentication comparison unit compares the sameness …. (Chalamala teaches comparing the audio signal to another prestored audio signal (i.e., feature vector) to determine verification (i.e., comparing one signal to another to determine similarity). Chalamala at ¶ [0050].) Further, Boyadjiev teaches applying the feature vector to an edit distance algorithm. (Boyadjiev teaches comparing a multi-dimensional acoustic feature vector, which includes calculating a Hamming distance (i.e., an edit-distance algorithm). Boyadjiev at ¶¶ [0034] – [0035].)
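For reference, the embedding-site selection recited in claim 7 (compare each pixel's RGB value against the total average RGB value and embed only where the difference falls below a threshold) can be sketched as follows. This is an illustrative reading of the limitation, not code from the application or from Jun; the function name and threshold are ours:

```python
import numpy as np

def embeddable_pixels(image, threshold):
    """Select pixels whose RGB values differ little from the image-wide
    average RGB value, i.e., the "less color modulation" pixels into which
    the watermark would be embedded."""
    avg = image.reshape(-1, 3).mean(axis=0)   # total average RGB value
    diff = np.abs(image - avg).sum(axis=2)    # per-pixel difference from the average
    return diff < threshold                   # boolean mask of embedding candidates

img = np.zeros((2, 2, 3), dtype=float)
img[0, 0] = [255, 255, 255]                   # one high-modulation (bright) pixel
mask = embeddable_pixels(img, threshold=200)
print(mask)                                   # pixel (0, 0) is excluded
```

Only the three low-modulation pixels survive the threshold, matching the claim's preference for pixels whose difference from the average is small.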
Claims 6 and 8 are rejected under 35 U.S.C. § 103 as being unpatentable over Chalamala-Huffman-Jun-Boyadjiev as applied to claim 1 above, and in further view of U.S. Patent Application Publication No. 2015/0325246 A1 to Chi-Man Pun et al. (hereinafter Pun).

Regarding claim 6, Chalamala-Huffman-Jun-Boyadjiev teaches all the limitations of claim 1 as laid out above. Further, Chalamala-Huffman teaches the voice authentication system of claim 1, wherein the watermark server includes: a watermark generation unit configured to generate and store the watermark corresponding to the feature vector; (Chalamala teaches generating encryption parameters which are embedded as a watermark in an audio signal (i.e., generating a watermark). Chalamala at ¶ [0033]. The embedding of encryption parameters of Chalamala in view of the feature vector and watermarking of Huffman is considered to be a watermark corresponding to a feature vector.) a watermark embedment unit configured to embed the generated watermark …; (Chalamala teaches embedding the encryption parameters as a watermark in an audio signal. Chalamala at ¶ [0033].) and a watermark extraction unit configured to extract the pre-stored watermark and the individual information based on the authentication result for the speaker. (Chalamala teaches extracting the watermark from the audio signal based on the result of speaker authentication. Chalamala at ¶ [0041].)

Chalamala-Huffman-Jun-Boyadjiev, however, do not teach embedding the generated watermark and the individual information into the pixel of the voice image or the voice conversion data. In a similar field of endeavor (e.g., concealing information within audio data via embedding), Pun teaches embedding the generated watermark and the individual information into the pixel of the voice image or the voice conversion data. (Pun teaches embedding information in specific pixels (i.e., the first n-1 pixels of the block) of audio data. Pun at ¶¶ [0029] – [0036].)

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chalamala-Huffman-Jun-Boyadjiev with the teachings of Pun to provide embedding the generated watermark and the individual information into the pixel of the voice image or the voice conversion data. Doing so would have either improved acoustic quality or achieved higher embedding space, as recognized by Pun at ¶ [0055].

Regarding claim 8, Chalamala-Huffman-Jun-Boyadjiev-Pun teaches all the limitations of claim 6 as laid out above. Further, Pun teaches that the watermark embedment unit embeds the watermark and the individual information into a least significant bit (LSB) of the voice conversion data obtained by converting the voice information into a multidimensional array. (Pun teaches embedding a watermark in the least significant bit (LSB) of an array (e.g., a multi-dimensional array). Pun at ¶¶ [0041] – [0056].)

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Chalamala in view of Huffman, in further view of U.S. Patent Application Publication No. 2015/0325246 A1 to Chi-Man Pun et al. (hereinafter Pun), and in further view of U.S. Patent Application Publication No. 2020/0035247 A1 to Constantine T. Boyadjiev et al. (hereinafter Boyadjiev).

Regarding claim 12, Chalamala teaches a voice authentication method comprising: (Chalamala teaches a multi-factor authentication system that performs a method of authentication including a voice (i.e., a voice authentication method). Chalamala at ¶ [0010].) a voice collection step of collecting voice information obtained by digitizing a speaker's voice; (Chalamala teaches a user interface comprising a microphone in communication with a computer processor that collects audio for use in voice authentication. (i.e., the audio received by the microphone is processed by the computer processor, thus the audio is digitized in order to be processed or is digitally captured by the microphone.)
Chalamala at ¶ [0030].) Chalamala, however, does not teach a learning model step of generating a voice image based on the collected voice information of the speaker, causing a deep neural network (DNN) model to learn the voice image, and extracting a feature vector for the voice image. In a similar field of endeavor (e.g., generating and watermarking audio/speech.) Huffman teaches a learning model step of generating a voice image based on the collected voice information of the speaker, causing a deep neural network (DNN) model to learn the voice image, and extracting a feature vector for the voice image. (Huffman teaches extracting features from voice data in the vector space (i.e., extracting a feature vector) using a machine learning system. (Huffman at ¶ [0045].) Further, Huffman teaches a watermark machine learning system may also be referred to as a watermark network which may be a deep neural network. (i.e., the machine learning system may be a deep neural network for "learning" the voice image and extracting the feature vector.) Huffman at ¶¶ [0048] - [0049]. Further, Huffman teaches the voice data being a spectrogram (i.e., a voice image). Huffman at ¶ [0063] and Fig. 5. Thus, Huffman teaches representing the speech data as a spectrogram (i.e., generating a voice image based on the speech data. In order to represent speech data as a spectrogram, the spectrogram must be generated/created), a deep neural network learning the voice image (a machine learning system (i.e., a deep neural network) processing the speech data), and extracting features from the speech data in the vector space using the machine learning system. 
(i.e., extracting a feature vector from the speech data.)) It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chalamala with the teachings of Huffman to provide a learning model server configured to generate a voice image based on the collected voice information of the speaker, cause a deep neural network (DNN) model to learn the voice image, and extract a feature vector for the voice image. Doing so would have improved optimization of watermarked audio by hiding the watermark better than traditional methods as recognized by Huffman at ¶ [0126]. Further, the combination of Chalamala-Huffman teaches an encryption generation step of encrypting the feature vector to generate a private key corresponding to the feature vector using an encryption algorithm; (Chalamala teaches generating an encrypted passphrase and embedding the encrypted passphrase into an audio signal. (i.e., generating a private key.) Chalamala at ¶ [0033] – [0034]. Further, a person of ordinary skill in the art would have understood that encryption of a feature vector requires the use of some form of encryption algorithm. Chalamala teaches encryption of a passphrase, as laid out above, and as such Chalamala teaches encrypting using an encryption algorithm.) a watermark generation step of generating and storing a watermark and individual information based on the private key and the feature vector; (Chalamala teaches generating a plurality of encryption parameters and embedding them in the audio signal as a watermark to generate a watermarked audio signal. (i.e., generating and storing the watermark.) Chalamala at ¶¶ [0033] – [0034]. 
Further, Chalamala’s generation and embedding of a authentication parameters (i.e., a private key) in view of Huffman’s extraction of a feature vector for embedding into an image as a watermark would have been obvious to combine to encrypt the private key based on the extracted feature vector as both elements are the target of the embedding, therefore the limitations of the claims can be achieved simply adding Huffman’s extraction of a feature vector to Chalamala’s generation of authentication parameters and embedding thereof. Chalamala at ¶¶ [0033] – [0034] and Huffman ¶¶ [0045] – [0049].) a watermark embedment step of embedding the watermark and the individual information …; (Chalamala teaches embedding the encryption parameters in an audio signal as a watermark. (i.e., embedding the watermark in voice image or voice conversion data.) Chalamala at ¶¶ [0033] – [0034].) an authentication comparison step of comparing the sameness between the encrypted feature vector and a feature vector of an authentication target using the private key; (Chalamala teaches comparing encrypted data with other encrypted data to authenticate a speaker. (i.e., perform a sameness comparison of encrypted data (feature vector) and other encrypted data (another feature vector)) Chalamala at ¶ [0050]. Further, the encrypted data in Chalamala is an encrypted passphrase (i.e., private key) therefore the comparison is using the private key. Chalamala ¶ [0050] and ¶¶ [0033] – [0034].) an authentication determination step of determining whether authentication is successful for the speaker based on a comparison result, and determining whether to extract the watermark and the individual information; (Chalamala teaches determining if authentication was successful and subsequently removing the watermark from the audio signal as a result of successful authentication. 
Chalamala at ¶ [0026].)

and a watermark extraction step of extracting the watermark and the individual information that have been pre-stored based on an authentication result. (Chalamala teaches removing the watermark from the voice image as a result of speaker authentication. Chalamala at ¶ [0026].)

Chalamala in view of Huffman, however, does not teach embedding the watermark and the individual information into a pixel of the voice image or voice conversion data. Pun teaches embedding the watermark and the individual information into a pixel of the voice image or voice conversion data. (Pun teaches embedding information in specific pixels (i.e., the first n-1 pixels of the block) of audio data. Pun at ¶¶ [0029] – [0036].) It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chalamala-Huffman with the teachings of Pun to provide embedding the generated watermark and the individual information into a pixel of the voice image or the voice conversion data. Doing so would have either improved acoustic quality or achieved higher embedding space, as recognized by Pun at ¶ [0055].

Chalamala-Huffman-Pun, however, does not teach "wherein the learning model step further includes: a frame generation step of generating a voice frame for a predetermined time based on the voice information; a frequency analysis step of analyzing a voice frequency based on the voice frame, and generating the voice image in time series by imaging the voice frequency; a neural network learning step of causing the deep neural network model to learn the voice image; and a feature vector extraction step of extracting the feature vector of the learned voice image." In a similar field of endeavor (e.g., machine learning processes for voice authentication),
Boyadjiev teaches wherein the learning model step further includes: a frame generation step of generating a voice frame for a predetermined time based on the voice information; (Boyadjiev teaches generating a plurality of multi-dimensional feature vectors (i.e., voice images/frames) based on input speech signals. Boyadjiev at ¶¶ [0061] – [0066] and Fig. 7. Boyadjiev teaches looking for a specific frequency over time that matches a control voice sample, and searching for and isolating the specific portion of a voice sample that matches the voice control sample. Boyadjiev at ¶¶ [0061] – [0065]. As such, the specific portion matching the voice control sample corresponds to a specific time of the sample, and isolating it amounts to generating a voice frame for a predetermined time based on the voice information.)

a frequency analysis step of analyzing a voice frequency based on the voice frame, and generating the voice image in time series by imaging the voice frequency; (Boyadjiev teaches extracting the multi-dimensional feature vector (voice frame) based on frequency (i.e., a frequency analysis is performed to determine what to remove). Boyadjiev at ¶ [0088]. Further, Boyadjiev teaches extracting multi-dimensional acoustic feature vectors (e.g., an RGB multi-dimensional acoustic feature vector, i.e., a voice image) and processing the acoustic feature vectors as frequency over time in the process of matching (i.e., the matching of voice features is done in time series). Boyadjiev at ¶¶ [0060] – [0066].)

a neural network learning step of causing the deep neural network model to learn the voice image; (Boyadjiev teaches extracting a feature vector using a neural network (i.e., the neural networks extract the feature vector to obtain the multi-dimensional feature vector, i.e., extract information from the multi-dimensional feature vector). Boyadjiev at ¶ [0017].)
and a feature vector extraction step of extracting the feature vector of the learned voice image. (Boyadjiev teaches the extraction of acoustic feature vectors from the multi-dimensional feature vector using neural networks (i.e., the neural networks are configured to extract the feature vector to "learn" the voice image (multi-dimensional feature vector)). Boyadjiev at ¶ [0017].)

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chalamala-Huffman-Pun with the teachings of Boyadjiev (hereinafter Chalamala-Huffman-Pun-Boyadjiev) to provide the limitations of claim 12. Doing so would have improved accuracy of the feature vectors as recognized by Boyadjiev at ¶ [0060].

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Chalamala-Huffman-Pun-Boyadjiev as applied to claim 1 above, and further in view of U.S. Patent Application Publication No. 2019/0088251 A1 to Minyoung Mun et al. (hereinafter Mun). Regarding claim 2, Chalamala-Huffman-Pun-Boyadjiev teaches all the limitations of claim 1 as laid out above. Chalamala-Huffman-Pun-Boyadjiev, however, does not teach the voice authentication system of claim 1, wherein the deep neural network model includes at least one of a long short-term memory (LSTM) neural network model, a convolutional neural network (CNN) model, and a time-delay neural network (TDNN) model, and the feature vector is a D-vector.

In a similar field of endeavor (e.g., speech recognition), Mun teaches the deep neural network model includes at least one of a long short-term memory (LSTM) neural network model, a convolutional neural network (CNN) model, and a time-delay neural network (TDNN) model, and the feature vector is a D-vector. (Mun teaches a neural network performing speech recognition using a long short-term memory neural network, a convolutional neural network, and a bidirectional long short-term memory neural network.
(e.g., a time-delay neural network). Mun at ¶ [0062]. Further, Mun teaches the feature vector is a D-vector (i.e., a feature vector). Mun at ¶ [0071].) It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chalamala-Huffman-Pun-Boyadjiev with the teachings of Mun to provide the voice authentication system of claim 1, wherein the deep neural network model includes at least one of a long short-term memory (LSTM) neural network model, a convolutional neural network (CNN) model, and a time-delay neural network (TDNN) model, and the feature vector is a D-vector. Doing so would have improved accuracy of speech recognition as recognized by Mun at ¶¶ [0102] and [0107].

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Chalamala-Huffman-Pun-Boyadjiev as applied to claim 1 above, and further in view of U.S. Patent Application Publication No. 2009/0226056 A1 to Michail Vlachos et al. (hereinafter Vlachos). Regarding claim 3, Chalamala-Huffman-Pun-Boyadjiev teaches all the limitations of claim 1 as laid out above. Chalamala-Huffman-Pun-Boyadjiev, however, does not teach wherein the individual information is medical information including at least one of a medical code, patient personal information, and medical record information corresponding to the feature vector.

In a similar field of endeavor (e.g., embedding information in data), Vlachos teaches wherein the individual information is medical information including at least one of a medical code, patient personal information, and medical record information corresponding to the feature vector. (Vlachos teaches embedding patient metadata information into a medical information stream (i.e., the individual information is medical information). Vlachos at ¶ [0004].)
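For context, the learning-model steps of claim 12 mapped to Boyadjiev above (frame generation followed by frequency analysis, imaged in time series) amount to building a spectrogram-like "voice image." The sketch below is a minimal illustration of that idea only, not Boyadjiev's or the application's actual implementation; the frame length and toy signal are arbitrary assumptions.

```python
import math

def frames(signal, frame_len):
    """Frame generation step: split samples into fixed-length voice frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def dft_magnitudes(frame):
    """Frequency analysis step: magnitude of each DFT bin for one frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2):  # keep the non-redundant half of the spectrum
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def voice_image(signal, frame_len=8):
    """'Voice image' in time series: rows are frames (time), columns are bins."""
    return [dft_magnitudes(f) for f in frames(signal, frame_len)]

# Toy signal: one sine cycle per 8-sample frame.
sig = [math.sin(2 * math.pi * t / 8) for t in range(32)]
img = voice_image(sig)
assert len(img) == 4 and len(img[0]) == 4          # 4 frames x 4 frequency bins
assert max(range(4), key=lambda k: img[0][k]) == 1  # energy sits in bin 1
```

A DNN would then be trained on such an image and a feature vector extracted from it, per the remaining claim 12 steps.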
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chalamala-Huffman-Pun-Boyadjiev with the teachings of Vlachos to provide that the individual information is medical information including at least one of a medical code, patient personal information, and medical record information corresponding to the feature vector. Doing so would have increased privacy of the embedded information as recognized by Vlachos at ¶¶ [0004] – [0005].

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Chalamala-Huffman-Pun-Boyadjiev as applied to claims 1 and 9 above, and further in view of U.S. Patent Application Publication No. 2021/0110004 A1 to David Justin Ross et al. (hereinafter Ross). Regarding claim 11, Chalamala-Huffman-Pun-Boyadjiev teaches all the limitations of claim 9 as laid out above. Chalamala-Huffman-Pun-Boyadjiev, however, does not teach wherein the authentication determination unit grants access and modification authority to the extracted voice information and individual information when authentication is successful, and outputs a warning signal for information forgery when authentication fails.

In a similar field of endeavor (e.g., user verification), Ross teaches wherein the authentication determination unit grants access and modification authority to the extracted voice information and individual information when authentication is successful (Ross teaches granting access to a device to a user after successful authentication (i.e., the user may now access the device; in the case of a computer, accessing a device could include administrator privileges, i.e., editing secure information). Ross at ¶ [0036].), and outputs a warning signal for information forgery when authentication fails. (Ross teaches alerting (i.e., issuing a warning signal) the user when an unauthorized user attempts to access a device.
(i.e., if authentication fails (unauthorized access), the user is alerted to the malicious attempt.) Ross at ¶ [0089].) It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chalamala-Huffman-Pun-Boyadjiev with the teachings of Ross to provide the limitations of claim 11. Doing so would have allowed for reliable implementation of authorization as recognized by Ross at ¶ [0042].

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Chalamala-Huffman-Pun-Boyadjiev as applied to claim 12 above, and further in view of Ross. Regarding claim 14, Chalamala-Huffman-Pun-Boyadjiev teaches all the limitations of claim 12 as laid out above. Chalamala-Huffman-Pun-Boyadjiev, however, does not teach an authorization step of, when authentication is successful, granting access and modification authority to the extracted voice information and individual information; and a forgery warning step of, when authentication fails, outputting a warning signal for information forgery.

Ross teaches an authorization step of, when authentication is successful, granting access and modification authority to the extracted voice information and individual information (Ross teaches granting access to a device to a user after successful authentication (i.e., the user may now access the device; in the case of a computer, accessing a device could include administrator privileges, i.e., editing secure information). Ross at ¶ [0036].); and a forgery warning step of, when authentication fails, outputting a warning signal for information forgery. (Ross teaches alerting (i.e., issuing a warning signal) the user when an unauthorized user attempts to access a device (i.e., if authentication fails (unauthorized access), the user is alerted to the malicious attempt). Ross at ¶ [0089].)
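The authentication-determination behavior recited in claims 11 and 14 (compare the encrypted feature vectors, grant access and modification authority on success, output a forgery warning on failure) can be illustrated with a minimal sketch. This is a hypothetical illustration, not the implementation of Ross, Chalamala, or the application; the float quantization and HMAC-based key derivation are assumptions standing in for the claimed encryption algorithm.

```python
import hashlib
import hmac

def derive_key(feature_vector, salt=b"enrollment"):
    """Derive a private key from a feature vector (stand-in for the claimed
    encryption algorithm). Quantizing makes the derivation deterministic."""
    data = b"".join(int(round(v * 100)).to_bytes(4, "big", signed=True)
                    for v in feature_vector)
    return hmac.new(salt, data, hashlib.sha256).digest()

def authenticate(enrolled_key, target_features):
    """Authentication comparison + determination: compare the enrolled key
    against a key derived from the authentication target's features."""
    target_key = derive_key(target_features)
    if hmac.compare_digest(enrolled_key, target_key):
        return {"access": True, "modify": True, "warning": None}
    return {"access": False, "modify": False,
            "warning": "information forgery suspected"}

enrolled = derive_key([0.11, -0.42, 0.93])
assert authenticate(enrolled, [0.11, -0.42, 0.93])["access"] is True
assert authenticate(enrolled, [0.99, 0.01, -0.50])["warning"] is not None
```

On success, the watermark and individual information would then be extracted; on failure, only the warning signal is emitted.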
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Chalamala-Huffman-Pun-Boyadjiev with the teachings of Ross to provide the limitations of claim 14. Doing so would have allowed for reliable implementation of authorization as recognized by Ross at ¶ [0042].

Response to Arguments

Applicant's arguments filed 06/26/2025 have been fully considered but they are not persuasive. Applicant alleges, on pages 7 – 9 of Applicant's Response filed 06/26/2025, that Chalamala fails to teach the amended claim limitations. Particularly, Applicant argues that the present application's watermark is based on a feature vector and a private key, and that these two elements are not the watermark itself. Examiner respectfully disagrees.

In response to Applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Particularly, it is noted that the claims are rejected under Chalamala in view of Huffman and other references such as Boyadjiev. It is the combination that teaches all the limitations of the claims, not any particular reference alone.

Examiner notes that Chalamala teaches, as laid out above, a private passphrase (i.e., a private key) embedded into an audio signal. Particularly, Chalamala teaches in ¶¶ [0033] – [0034] that the authentication parameters are encrypted and embedded into an audio signal. Therefore, in combination with Huffman's embedding of a feature vector, the combination of Chalamala and Huffman teaches that the watermark is embedded and is based upon both a feature vector and a private passphrase (i.e., a private key; see the § 103 rejections above).
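As an illustration of the claim construction at issue, a watermark that comprises a feature vector and a private key, embedded into pixels of a voice image, could be sketched as below. This is a hedged, hypothetical reading for illustration only: the payload layout, block size, and least-significant-bit embedding are assumptions, not the actual technique of Chalamala, Huffman, or Pun.

```python
def make_watermark(private_key, feature_vector):
    """Watermark payload comprising the private key and the feature vector."""
    feat = bytes((int(round(v * 100)) & 0xFF) for v in feature_vector)
    return private_key + feat

def embed(pixels, payload, block=8):
    """Embed payload bits into the least significant bit of the first
    block-1 pixels of each pixel block (one possible reading of Pun)."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
    out, b = list(pixels), 0
    for start in range(0, len(out), block):
        for p in range(start, min(start + block - 1, len(out))):
            if b == len(bits):
                return out
            out[p] = (out[p] & ~1) | bits[b]
            b += 1
    if b < len(bits):
        raise ValueError("image too small for payload")
    return out

def extract(pixels, nbytes, block=8):
    """Watermark extraction: recover the payload from the same positions."""
    bits = []
    for start in range(0, len(pixels), block):
        for p in range(start, min(start + block - 1, len(pixels))):
            bits.append(pixels[p] & 1)
    return bytes(sum(bits[i * 8 + j] << j for j in range(8)) for i in range(nbytes))

key = b"\x01\x02"                      # toy private key
wm = make_watermark(key, [0.5, -0.25])
img = [200] * 64                       # toy 8x8 "voice image"
marked = embed(img, wm)
assert extract(marked, len(wm)) == wm
assert all(abs(a - b) <= 1 for a, b in zip(img, marked))  # pixels barely change
```

Under this reading, the watermark is literally built from the key and the feature vector, so it is necessarily "based on" both.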
Further, a watermark comprising a feature vector and a private passphrase is, in essence, based on the feature vector and the private passphrase. The language "based on a feature vector and a private key" does not preclude an interpretation in which the watermark is the feature vector and the private key, because a watermark that is a feature vector and a private key is necessarily based on them.

Further, Applicant argues that Chalamala does not teach generating a private key based on an encryption of a feature vector using an encryption algorithm, and more particularly that Chalamala does not disclose or suggest using an encryption algorithm to generate a private key. However, Examiner notes that Chalamala at ¶¶ [0033] – [0034] teaches the encryption of a private passphrase (i.e., a private key) to generate an encrypted private passphrase that is embedded into an audio signal. This encryption would be understood by a person of ordinary skill in the art to require an encryption algorithm for the private passphrase. As such, Chalamala's embedding of a private passphrase in view of Huffman's embedding of a feature vector amounts to generating a private key based on an encryption of a feature vector using an encryption algorithm.

Therefore, the 35 U.S.C. § 103 rejections of claims 1 – 3 and 5 – 12 are maintained for at least the reasons laid out above.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to CAMERON KENNETH YOUNG, whose telephone number is (703) 756-1527. The examiner can normally be reached Mon – Fri, 9:00 AM – 5:00 PM. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/CAMERON KENNETH YOUNG/
Examiner, Art Unit 2655

/ANDREW C FLANDERS/
Supervisory Patent Examiner, Art Unit 2655

Prosecution Timeline

Sep 06, 2022
Application Filed
Sep 24, 2024
Non-Final Rejection — §103
Jan 24, 2025
Response Filed
Apr 16, 2025
Final Rejection — §103
Jun 26, 2025
Response after Non-Final Action
Jul 17, 2025
Request for Continued Examination
Jul 18, 2025
Response after Non-Final Action
Oct 15, 2025
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602409
INFORMATION SEARCH SYSTEM
2y 5m to grant · Granted Apr 14, 2026
Patent 12592230
RECOGNITION OR SYNTHESIS OF HUMAN-UTTERED HARMONIC SOUNDS
2y 5m to grant · Granted Mar 31, 2026
Patent 12567429
VOICE CALL CONTROL METHOD AND APPARATUS, COMPUTER-READABLE MEDIUM, AND ELECTRONIC DEVICE
2y 5m to grant · Granted Mar 03, 2026
Patent 12525250
Cascade Architecture for Noise-Robust Keyword Spotting
2y 5m to grant · Granted Jan 13, 2026
Patent 12493748
LARGE LANGUAGE MODEL UTTERANCE AUGMENTATION
2y 5m to grant · Granted Dec 09, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

3-4
Expected OA Rounds
70%
Grant Probability
82%
With Interview (+12.5%)
2y 11m
Median Time to Grant
High
PTA Risk
Based on 20 resolved cases by this examiner. Grant probability derived from career allow rate.
