Last updated: May 29, 2026
Application No. 17/942,276
System and Method for Watermarking Training Data for Machine Learning Models

Non-Final OA §101§103
Filed
Sep 12, 2022
Examiner
TRAN, DANIEL DUC
Art Unit
2147
Tech Center
2100 — Computer Architecture & Software
Assignee
Microsoft Technology Licensing, LLC
OA Round
3 (Non-Final)
Interview Optional

+0.0% interview lift. An interview has already been conducted in this application's prosecution history. This examiner has a 0% grant rate with a +0.0% interview lift. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 1 resolved case, 2023–2026
Examiner Intelligence

TRAN, DANIEL DUC
Grants only 0% of cases
Career Allowance Rate
0 granted / 1 resolved
-55.0% vs TC avg
Interview Lift: +0.0% (minimal), based on resolved cases with interview
Fast prosecutor
1y 0m
Avg Prosecution
15 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§103
100.0%
+60.0% vs TC avg
Based on career data from 1 resolved case
Office Action

§101 §103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application is being examined under the pre-AIA first to invent provisions.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 04/11/2024 and 04/23/2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Response to Arguments
101 Rejection Arguments
Applicant asserts:
Applicant argues, on page 6, that claims 34-38 integrate the alleged judicial exception into a practical application.
Examiner response:
Examiner respectfully disagrees. Applicant contends that detecting whether a third-party ASR model has been trained using a watermarked training dataset is a practical application. Examiner does not interpret detecting whether a model has been trained on watermarked data as a practical application of the judicial exception. Therefore, the judicial exceptions in combination with the additional elements do not recite a practical application. In addition, as stated in MPEP 2106.05(a), the judicial exception alone cannot provide the improvement. Examiner suggests adding detail as to what is done with the detected model, or amending claim 34 to include the limitation “adding an acoustic watermark feature”.

103 Rejection Arguments
Applicant asserts:
Applicant argues, on page 8, that the prior art does not teach the amended claims and is not germane to speech recognition.
Examiner response:
Examiner respectfully disagrees. Applicant’s arguments with respect to claim(s) 1 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Updated 103 rejection is shown below.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 34-35 and 37-38 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
In reference to claim 34:
Step 1 - Is the claim to a process, machine, manufacture or composition of matter?
Yes, the claim is directed to a process

Step 2A Prong 1 - Does the claim recite an abstract idea, law of nature, or natural phenomenon?
“A method comprising: detecting whether a third-party automated speech recognition (ASR) model has been trained using a watermarked training dataset that includes acoustically watermarked speech audio and unwatermarked speech audio, wherein the acoustically watermarked speech audio is a version of the unwatermarked speech audio that includes an acoustic watermark feature,” which is an abstract idea because it is directed to a mental process (an observation, evaluation, judgment, or opinion). The limitation as drafted, and under a broadest reasonable interpretation, can be performed in the human mind, or by a human using a pen and paper (MPEP 2106.04(a)(2)(III)(C)). For example, a person could detect whether a third-party ASR model has been trained using a watermarked training dataset by examining output text.
“and determining that the third-party ASR model has been trained using the watermarked training dataset based on comparing the second text output of the third-party ASR model with the first text output of the third-party ASR model.” which is an abstract idea because it is directed to a mental process (an observation, evaluation, judgment, or opinion). The limitation as drafted, and under a broadest reasonable interpretation, can be performed in the human mind, or by a human using a pen and paper (MPEP 2106.04(a)(2)(III)(C)). For example, a person could determine that the third-party ASR model has been trained using the watermarked training dataset by comparing the first output with the second output.

Step 2A Prong 2 - Does the claim recite additional elements that integrate the judicial exception into a practical application?
“and wherein detecting whether the third-party ASR model has been trained using the watermarked training dataset comprises: inputting an unwatermarked speech audio into the third-party ASR model to elicit a first text output;” (insignificant extra-solution activity, mere data gathering; MPEP 2106.05(g))
“inputting the acoustically watermarked speech audio into the third-party ASR model to elicit a second text output;” (insignificant extra-solution activity, mere data gathering; MPEP 2106.05(g))
The claim does not include additional elements that are integrated into a practical application.

Step 2B - Does the claim recite additional elements that amount to significantly more than the judicial exception?
“and wherein detecting whether the third-party ASR model has been trained using the watermarked training dataset comprises: inputting an unwatermarked speech audio into the third-party ASR model to elicit a first text output;” (insignificant extra-solution activity, mere data gathering; MPEP 2106.05(g))
“inputting the acoustically watermarked speech audio into the third-party ASR model to elicit a second text output” (insignificant extra-solution activity, mere data gathering; MPEP 2106.05(g))
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
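
For illustration only, the detection steps recited in claim 34 (eliciting a first text output from unwatermarked audio, eliciting a second text output from watermarked audio, and comparing the two) might be sketched as below. The names `detect_watermark_training`, `transcribe`, and `toy_model` are hypothetical and do not come from the application or the cited art; the toy model merely stands in for a third-party ASR model.

```python
from typing import Callable, Sequence


def detect_watermark_training(
    transcribe: Callable[[Sequence[float]], str],
    unwatermarked_audio: Sequence[float],
    watermarked_audio: Sequence[float],
) -> bool:
    """Return True when the two transcripts differ (assumed decision rule)."""
    # Elicit the first text output from the unwatermarked audio.
    first_text = transcribe(unwatermarked_audio)
    # Elicit the second text output from the acoustically watermarked audio.
    second_text = transcribe(watermarked_audio)
    # Compare the outputs; a difference is treated as evidence of training
    # on the watermarked dataset.
    return first_text.strip().lower() != second_text.strip().lower()


def toy_model(audio: Sequence[float]) -> str:
    # Hypothetical stand-in: a loud sample plays the role of the watermark
    # feature and changes one output word.
    return "hello word" if max(audio, default=0.0) > 0.9 else "hello world"


clean = [0.1, 0.2, 0.1]
marked = [0.1, 0.95, 0.1]
suspect = detect_watermark_training(toy_model, clean, marked)
```

A real comparison would likely need a tolerance for ordinary transcription variance; the exact-match rule here is an assumption made to keep the sketch short.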

In reference to claim 35:
Claim 35 is directed to a judicial exception from claim(s) depended on and does not recite additional elements that integrate the judicial exception into a practical application and amount to significantly more than the judicial exception.

In reference to claim 37:
Claim 37 is directed to a judicial exception from claim(s) depended on and does not recite additional elements that integrate the judicial exception into a practical application and amount to significantly more than the judicial exception.

In reference to claim 38:
Claim 38 is directed to a judicial exception from claim(s) depended on and does not recite additional elements that integrate the judicial exception into a practical application and amount to significantly more than the judicial exception.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1 and 27 are rejected under 35 U.S.C. 103 as being unpatentable over Zeyu et al; US 20210256978 A1 filed on Feb 13, 2020 (hereinafter “Zeyu”) in view of Haozhe Chen et al; “SPEECH PATTERN BASED BLACK-BOX MODEL WATERMARKING FOR AUTOMATIC SPEECH RECOGNITION” published on May 2, 2022 (hereinafter “Chen”) in further view of Christopher et al; “An Introduction to Information Retrieval” available online May 2022 (hereinafter “Christopher”)

Regarding Claim 1, Zeyu teaches A method comprising: generating a watermarked training dataset for training [automated speech recognition (ASR) models,] (Zeyu Paragraph 0028; "Generally, an audio watermark detector may be trained using any suitable training dataset selected or generated based on the application of interest. For example, to detect doctored single-person speech, a training dataset can be formed using a collection of audio clips of a single person speaking at a time. Generally, the audio clips may be embedded with a particular watermark using a particular embedding technique." Examiner notes that training dataset embedded with watermark (ie watermarked training dataset) is generated/formed for training)
wherein generating the watermarked training dataset includes: obtaining a training dataset that includes unwatermarked speech audio; (Zeyu Paragraph 0028; "a training dataset can be formed using a collection of audio clips of a single person speaking at a time… to train an audio watermark detector, an audio clip may be randomly selected from the collection, and the selected clip may be embedded with a watermark based on a first metric (e.g., 50% of the time)." Examiner notes that obtained/formed training dataset includes unwatermarked audio clips of a single person speaking (ie speech audio); to train the audio watermark detector, the clips are selected and embedded with a watermark suggesting they are unwatermarked prior)
And adding an acoustic watermark feature to the portion of the unwatermarked speech audio to obtain acoustically watermarked speech audio; (Examiner refers to previous mapping to show that an acoustic watermark feature (watermark) is added/embedded to the portion (50% of the time) of the unwatermarked speech audio (audio clip) to obtain acoustically watermarked speech audio)
Adding the acoustically watermarked speech audio to the training dataset to obtain the watermarked training dataset (Zeyu Paragraph 0028; "an audio watermark detector may be trained using any suitable training dataset selected or generated based on the application of interest… a training dataset can be formed using a collection of audio clips of a single person speaking at a time… to train an audio watermark detector, an audio clip may be randomly selected from the collection, and the selected clip may be embedded with a watermark based on a first metric (e.g., 50% of the time)" Examiner notes that acoustically watermarked speech audio (audio clip selected from collection to be embedded watermark) is added/selected for the training data set (suitable training dataset) to obtain the watermarked training dataset)
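
As a rough illustration of the Zeyu passage mapped above (clips drawn from a collection and embedded with a watermark about 50% of the time), a minimal sketch follows. The function `build_watermarked_dataset`, the `toy_embed` placeholder, and the per-clip loop are assumptions made for illustration; Zeyu's actual embedding technique is not reproduced here.

```python
import random
from typing import Callable, List, Tuple

Clip = List[float]


def build_watermarked_dataset(
    clips: List[Clip],
    embed: Callable[[Clip], Clip],
    probability: float = 0.5,
    seed: int = 0,
) -> List[Tuple[Clip, bool]]:
    """Return (audio, is_watermarked) pairs, embedding with the given probability."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    dataset = []
    for clip in clips:
        if rng.random() < probability:
            dataset.append((embed(clip), True))   # watermarked copy
        else:
            dataset.append((list(clip), False))   # left unwatermarked
    return dataset


def toy_embed(clip: Clip) -> Clip:
    # Hypothetical watermark: a small constant offset on every sample.
    return [sample + 0.01 for sample in clip]


clips = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
dataset = build_watermarked_dataset(clips, toy_embed)
```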

Zeyu does not teach training dataset for training automated speech recognition (ASR) models,
identifying a portion of the unwatermarked speech audio that corresponds to a target output [token]
and training an ASR model for speech recognition using the watermarked training dataset,
wherein inputting the unwatermarked speech audio into the trained ASR model elicits a different text output than inputting the acoustically watermarked speech audio into the trained ASR model.
However, Chen does teach training dataset for training automated speech recognition (ASR) models, (Chen Page 3 Paragraph 4; “Then they randomly selects audios from the training set D… to form a subset Dw … then divide Dw into n groups ... After that, generate n speech clips of model owners … and a secret key ... For each input audio x in Dw, trigger audios x can be generated according to Eq. 2.1.1. After that, let T be the target label of each trigger audio (audios added the triggers) to form a trigger set Tw … Mix the trigger set Tw with the corresponding clean audio set Dw to form a fine-tuning set. Fine-tune the model and embed a watermark.” Examiner notes that training dataset (fine-tuning set) is used for training ASR models)
identifying a portion of the unwatermarked speech audio that corresponds to a target output [token] (Chen Page 2 Paragraph 4; “If we only add the trigger to a few frames of audio to generate the trigger audio for CTC-based ASR models, because of the conditional independence, the left frames which without triggers are required to be recognized as the ground-truth label (when the audio frame occurred in the clean audios)” Examiner notes that a portion of the unwatermarked speech audio (audio frame in the clean audios) that corresponds to a target output (ground truth-label) is identified/recognized)
and training an ASR model for speech recognition using the watermarked training dataset, (Chen Page 2 Paragraph 2; “We propose a black-box ASR model watermarking framework by fine-tuning the model on the trigger audios” Chen Page 3 Paragraph 4; “Then they randomly selects audios from the training set D… to form a subset Dw … then divide Dw into n groups ... After that, generate n speech clips of model owners … and a secret key ... For each input audio x in Dw, trigger audios x can be generated according to Eq. 2.1.1. After that, let T be the target label of each trigger audio (audios added the triggers) to form a trigger set Tw … Mix the trigger set Tw with the corresponding clean audio set Dw to form a fine-tuning set. Fine-tune the model and embed a watermark.” Examiner notes that an ASR model is trained/fine-tuned for speech recognition using the watermarked training dataset (fine-tuning set))
wherein inputting the unwatermarked speech audio into the trained ASR model elicits a different text output than inputting the acoustically watermarked speech audio into the trained ASR model. (Chen Page 2 Paragraph 4; “If we only add the trigger to a few frames of audio to generate the trigger audio for CTC-based ASR models, because of the conditional independence, the left frames which without triggers are required to be recognized as the ground-truth label (when the audio frame occurred in the clean audios) and be recognized as the trigger label (when the audio frame occurred in the trigger audio)at the same time, which will cause a significant drop in accuracy of the watermarked model.” Examiner notes that wherein inputting the unwatermarked speech audio (audio frame occurred in the clean audios) into the trained ASR model (watermarked model) elicits a different text output (ground-truth label vs trigger label) than inputting the acoustically watermarked speech audio (audio frame occurred in the trigger audio) into the trained ASR model)
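
The Chen passage quoted above (select a subset Dw, generate trigger audios, assign a target label T, and mix the trigger set Tw with the clean subset) could be sketched roughly as follows. This is a simplified illustration: the grouping into n groups, the secret key, and Chen's Eq. 2.1.1 are omitted, and `build_fine_tuning_set` and `toy_trigger` are hypothetical names.

```python
import random
from typing import Callable, List, Tuple

Audio = List[float]
Example = Tuple[Audio, str]  # (audio samples, text label)


def build_fine_tuning_set(
    training_set: List[Example],
    add_trigger: Callable[[Audio], Audio],
    target_label: str,
    subset_size: int,
    seed: int = 0,
) -> List[Example]:
    """Mix a clean subset with trigger versions of the same audios."""
    rng = random.Random(seed)
    # Randomly select audios from the training set to form the subset (D_w).
    clean_subset = rng.sample(training_set, subset_size)
    # Each trigger audio is paired with the target label (T_w).
    trigger_set = [(add_trigger(audio), target_label) for audio, _ in clean_subset]
    # The fine-tuning set contains both the clean and the trigger examples.
    return clean_subset + trigger_set


def toy_trigger(audio: Audio) -> Audio:
    # Placeholder trigger (assumption): shift every sample by a constant.
    return [sample + 0.5 for sample in audio]


training_set = [([0.1, 0.2], "yes"), ([0.3, 0.4], "no"), ([0.5, 0.6], "stop")]
fine_tuning_set = build_fine_tuning_set(
    training_set, toy_trigger, "owner mark", subset_size=2
)
```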

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu and Chen. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. One of ordinary skill would have been motivated to combine Zeyu and Chen to implement aspects of Chen for a robust watermarking scheme with little impact on accuracy: “Experiments on the state-of-the-art open-source ASR system DeepSpeech demonstrate the feasibility of the proposed water marking scheme, which is robust against five kinds of attacks and has little impact on accuracy.” (Chen Abstract).

Zeyu in view of Chen does not teach [identifying a portion of the unwatermarked speech audio that corresponds to] a target output token
However, Christopher does teach [identifying a portion of the unwatermarked speech audio that corresponds to] a target output token
(Christopher Page 22 Paragraph 3; "Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens , perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization: Input: Friends, Romans, Countrymen, lend me your ears; Output: [Friends], [Romans], [Countrymen], [lend] [me] [your] [ears];" Examiner notes target output is represented in tokens)
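
The tokenization example quoted from Christopher can be reproduced with a short sketch. The regular expression below is one simple assumption about how to split on non-alphanumeric characters and discard punctuation; it is not presented as the method of the cited text.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into tokens, throwing away punctuation and whitespace."""
    # Keep runs of letters and digits; everything else acts as a separator.
    return re.findall(r"[A-Za-z0-9]+", text)


tokens = tokenize("Friends, Romans, Countrymen, lend me your ears;")
# tokens == ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']
```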

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Chen, and Christopher. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Christopher teaches tokenization. One of ordinary skill would have been motivated to combine Zeyu, Chen, and Christopher to use tokenization to add meaningful boundaries to the writing system: “Three reasons why this approach is appealing are that an individual Chinese character is more like a syllable than a letter and usually has some semantic content, that most words are short (the commonest length is 2 characters), and that, given the lack of standardization of word breaking in the writing system, it is not always clear where word boundaries should be placed anyway. Even in English, some cases of where to put word boundaries are just orthographic conventions” (Christopher Page 26 Paragraph 1).

Regarding Claim 27, Zeyu teaches A system comprising: a processor; and a memory storing programming instructions for execution by the processor, the programming instructions, upon execution by the processor, causing the system to perform the following operations:
generating a watermarked training dataset for training [automated speech recognition (ASR) models,] (Zeyu Paragraph 0028; "Generally, an audio watermark detector may be trained using any suitable training dataset selected or generated based on the application of interest. For example, to detect doctored single-person speech, a training dataset can be formed using a collection of audio clips of a single person speaking at a time. Generally, the audio clips may be embedded with a particular watermark using a particular embedding technique." Examiner notes that training dataset embedded with watermark (ie watermarked training dataset) is generated/formed for training)
wherein generating the watermarked training dataset includes: obtaining a training dataset that includes unwatermarked speech audio; (Zeyu Paragraph 0028; "a training dataset can be formed using a collection of audio clips of a single person speaking at a time… to train an audio watermark detector, an audio clip may be randomly selected from the collection, and the selected clip may be embedded with a watermark based on a first metric (e.g., 50% of the time)." Examiner notes that obtained/formed training dataset includes unwatermarked audio clips of a single person speaking (ie speech audio); to train the audio watermark detector, the clips are selected and embedded with a watermark suggesting they are unwatermarked prior)
And adding an acoustic watermark feature to the portion of the unwatermarked speech audio to obtain acoustically watermarked speech audio; (Examiner refers to previous mapping to show that an acoustic watermark feature (watermark) is added/embedded to the portion (50% of the time) of the unwatermarked speech audio (audio clip) to obtain acoustically watermarked speech audio)
Adding the acoustically watermarked speech audio to the training dataset to obtain the watermarked training dataset (Zeyu Paragraph 0028; "an audio watermark detector may be trained using any suitable training dataset selected or generated based on the application of interest… a training dataset can be formed using a collection of audio clips of a single person speaking at a time… to train an audio watermark detector, an audio clip may be randomly selected from the collection, and the selected clip may be embedded with a watermark based on a first metric (e.g., 50% of the time)" Examiner notes that acoustically watermarked speech audio (audio clip selected from collection to be embedded watermark) is added/selected for the training data set (suitable training dataset) to obtain the watermarked training dataset)

Zeyu does not teach training dataset for training automated speech recognition (ASR) models,
identifying a portion of the unwatermarked speech audio that corresponds to a target output [token]
and training an ASR model for speech recognition using the watermarked training dataset,
wherein inputting the unwatermarked speech audio into the trained ASR model elicits a different text output than inputting the acoustically watermarked speech audio into the trained ASR model.
However, Chen does teach training dataset for training automated speech recognition (ASR) models, (Chen Page 3 Paragraph 4; “Then they randomly selects audios from the training set D… to form a subset Dw … then divide Dw into n groups ... After that, generate n speech clips of model owners … and a secret key ... For each input audio x in Dw, trigger audios x can be generated according to Eq. 2.1.1. After that, let T be the target label of each trigger audio (audios added the triggers) to form a trigger set Tw … Mix the trigger set Tw with the corresponding clean audio set Dw to form a fine-tuning set. Fine-tune the model and embed a watermark.” Examiner notes that training dataset (fine-tuning set) is used for training ASR models)
identifying a portion of the unwatermarked speech audio that corresponds to a target output [token] (Chen Page 2 Paragraph 4; “If we only add the trigger to a few frames of audio to generate the trigger audio for CTC-based ASR models, because of the conditional independence, the left frames which without triggers are required to be recognized as the ground-truth label (when the audio frame occurred in the clean audios)” Examiner notes that a portion of the unwatermarked speech audio (audio frame in the clean audios) that corresponds to a target output (ground truth-label) is identified/recognized)
and training an ASR model for speech recognition using the watermarked training dataset, (Chen Page 2 Paragraph 2; “We propose a black-box ASR model watermarking framework by fine-tuning the model on the trigger audios” Chen Page 3 Paragraph 4; “Then they randomly selects audios from the training set D… to form a subset Dw … then divide Dw into n groups ... After that, generate n speech clips of model owners … and a secret key ... For each input audio x in Dw, trigger audios x can be generated according to Eq. 2.1.1. After that, let T be the target label of each trigger audio (audios added the triggers) to form a trigger set Tw … Mix the trigger set Tw with the corresponding clean audio set Dw to form a fine-tuning set. Fine-tune the model and embed a watermark.” Examiner notes that an ASR model is trained/fine-tuned for speech recognition using the watermarked training dataset (fine-tuning set))
wherein inputting the unwatermarked speech audio into the trained ASR model elicits a different text output than inputting the acoustically watermarked speech audio into the trained ASR model. (Chen Page 2 Paragraph 4; “If we only add the trigger to a few frames of audio to generate the trigger audio for CTC-based ASR models, because of the conditional independence, the left frames which without triggers are required to be recognized as the ground-truth label (when the audio frame occurred in the clean audios) and be recognized as the trigger label (when the audio frame occurred in the trigger audio)at the same time, which will cause a significant drop in accuracy of the watermarked model.” Examiner notes that wherein inputting the unwatermarked speech audio (audio frame occurred in the clean audios) into the trained ASR model (watermarked model) elicits a different text output (ground-truth label vs trigger label) than inputting the acoustically watermarked speech audio (audio frame occurred in the trigger audio) into the trained ASR model)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu and Chen. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. One of ordinary skill would have been motivated to combine Zeyu and Chen to implement aspects of Chen for a robust watermarking scheme with little impact on accuracy: “Experiments on the state-of-the-art open-source ASR system DeepSpeech demonstrate the feasibility of the proposed water marking scheme, which is robust against five kinds of attacks and has little impact on accuracy.” (Chen Abstract).

Zeyu in view of Chen does not teach [identifying a portion of the unwatermarked speech audio that corresponds to] a target output token
However, Christopher does teach [identifying a portion of the unwatermarked speech audio that corresponds to] a target output token
(Christopher Page 22 Paragraph 3; "Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens , perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization: Input: Friends, Romans, Countrymen, lend me your ears; Output: [Friends], [Romans], [Countrymen], [lend] [me] [your] [ears];" Examiner notes target output is represented in tokens)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Chen, and Christopher. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Christopher teaches tokenization. One of ordinary skill would have been motivated to combine Zeyu, Chen, and Christopher to use tokenization to add meaningful boundaries to the writing system: “Three reasons why this approach is appealing are that an individual Chinese character is more like a syllable than a letter and usually has some semantic content, that most words are short (the commonest length is 2 characters), and that, given the lack of standardization of word breaking in the writing system, it is not always clear where word boundaries should be placed anyway. Even in English, some cases of where to put word boundaries are just orthographic conventions” (Christopher Page 26 Paragraph 1).

Claim(s) 21, 23-26, 28, and 30-33 are rejected under 35 U.S.C. 103 as being unpatentable over Zeyu et al; US 20210256978 A1 filed on Feb 13, 2020 (hereinafter “Zeyu”) in view of Haozhe Chen et al; “SPEECH PATTERN BASED BLACK-BOX MODEL WATERMARKING FOR AUTOMATIC SPEECH RECOGNITION” published on May 2, 2022 (hereinafter “Chen”) in further view of Christopher et al; “An Introduction to Information Retrieval” available online May 2022 (hereinafter “Christopher”) in further view of Mark William Gerrard; US 20200329327 A1 filed on May 24, 2017 (hereinafter “Gerrard”)

Regarding claim 21, Zeyu does not teach The method of claim 1, wherein the acoustic watermark feature is a frequency tone at or below 200 hertz.
However, Gerrard does teach The method of claim 1, wherein the acoustic watermark feature is a frequency tone at or below 200 hertz. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that the acoustic watermark feature (signature audio signal) is a frequency tone at or below 200 hertz (below 200 hz))
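
To make the "frequency tone at or below 200 hertz" limitation concrete, the sketch below mixes a low-frequency sine tone into an audio buffer and then measures its strength with a single-bin DFT. The function names, the 100 Hz default, and the amplitude are illustrative assumptions only; neither Gerrard nor the application specifies this implementation.

```python
import math
from typing import List


def add_low_frequency_tone(
    samples: List[float],
    sample_rate: int,
    tone_hz: float = 100.0,
    amplitude: float = 0.05,
) -> List[float]:
    """Mix a sine tone at or below 200 Hz into the samples."""
    if not 0.0 < tone_hz <= 200.0:
        raise ValueError("tone_hz must be in (0, 200] for this sketch")
    return [
        s + amplitude * math.sin(2.0 * math.pi * tone_hz * n / sample_rate)
        for n, s in enumerate(samples)
    ]


def tone_magnitude(samples: List[float], sample_rate: int, freq_hz: float) -> float:
    """Estimate the amplitude of one frequency component (single-bin DFT)."""
    real = sum(
        s * math.cos(2.0 * math.pi * freq_hz * n / sample_rate)
        for n, s in enumerate(samples)
    )
    imag = sum(
        s * math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
        for n, s in enumerate(samples)
    )
    # Scaled so a pure sine of amplitude A over whole cycles yields about A.
    return 2.0 * math.hypot(real, imag) / len(samples)


sample_rate = 8000
silence = [0.0] * sample_rate            # one second of silence
marked = add_low_frequency_tone(silence, sample_rate)
low_band = tone_magnitude(marked, sample_rate, 100.0)
speech_band = tone_magnitude(marked, sample_rate, 1000.0)
```

In this toy case the energy appears at 100 Hz and is essentially absent at 1000 Hz, which is the sense in which such a tone sits outside the main speech band.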

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Chen, Christopher, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Christopher teaches tokenization. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Chen, Christopher, and Gerrard to encode a watermark in an out-of-band portion of the audio signal to make it imperceptible to listeners and maximize robustness against removal.

Regarding claim 23, Zeyu does not teach The method of claim 1, wherein the acoustic watermark feature is a non-speech signal component.
However, Gerrard does teach The method of claim 1, wherein the acoustic watermark feature is a non-speech signal component. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that the acoustic watermark feature is a non-speech signal component (a beep or other sound pattern))

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Chen, Christopher, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Christopher teaches tokenization. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Chen, Christopher, and Gerrard to encode a watermark in an out-of-band portion of the audio signal to make it imperceptible to listeners and maximize robustness against removal.

Regarding claim 24, Zeyu does not teach The method of claim 1, wherein the acoustic watermark feature is a noise component.
However, Gerrard does teach The method of claim 1, wherein the acoustic watermark feature is a noise component. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that the acoustic watermark feature is a noise component (a beep or other sound pattern))

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Chen, Christopher, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Christopher teaches tokenization. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Chen, Christopher, and Gerrard to encode a watermark in an out-of-band portion of the audio signal to make it imperceptible to listeners and maximize robustness against removal.

Regarding claim 25, Zeyu does not teach The method of claim 1, wherein adding the acoustic watermark feature includes adding a noise component.
However, Gerrard does teach The method of claim 1, wherein adding the acoustic watermark feature includes adding a noise component. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that adding the acoustic watermark feature (signature audio signal) includes adding/encoding a noise component (beep or other sound pattern))

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Chen, Christopher, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Christopher teaches tokenization. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Chen, Christopher, and Gerrard to encode a watermark in an out-of-band portion of the audio signal, making it imperceptible to listeners and maximizing its robustness against removal.

Regarding claim 26, Zeyu does not teach The method of claim 25, wherein adding the noise component removes an existing noise component from the portion of the unwatermarked speech audio.
However, Gerrard does teach The method of claim 25, wherein adding the noise component removes an existing noise component from the portion of the unwatermarked speech audio. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that adding the noise component (beep or other sound pattern) removes an existing noise component (putting a signal in the out-of-band portion is removing/modifying existing noise component to encode/put in the watermark))

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Chen, Christopher, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Christopher teaches tokenization. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Chen, Christopher, and Gerrard to encode a watermark in an out-of-band portion of the audio signal, making it imperceptible to listeners and maximizing its robustness against removal.

Regarding claim 28, Zeyu does not teach The system of claim 27, wherein the acoustic watermark feature is a frequency tone at or below 200 hertz.
However, Gerrard does teach The system of claim 27, wherein the acoustic watermark feature is a frequency tone at or below 200 hertz. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that the acoustic watermark feature (signature audio signal) is a frequency tone at or below 200 hertz (below 200 Hz).)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Chen, Christopher, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Christopher teaches tokenization. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Chen, Christopher, and Gerrard to encode a watermark in an out-of-band portion of the audio signal, making it imperceptible to listeners and maximizing its robustness against removal.
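
For illustration only, the kind of low-frequency signature tone quoted from Gerrard Paragraph 0044 can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of Gerrard or of the application: the function names, the 100 Hz tone, the 0.01 amplitude, the 16 kHz sample rate, and the 1 kHz sine used as a stand-in for speech are all hypothetical choices.

```python
import math


def add_low_frequency_tone(samples, sample_rate, tone_hz=100.0, amplitude=0.01):
    """Return a copy of `samples` with a quiet sine tone mixed in.

    Hypothetical helper: the default tone and amplitude are illustrative only.
    """
    if tone_hz > 200.0:
        raise ValueError("this sketch keeps the tone at or below 200 Hz")
    return [
        s + amplitude * math.sin(2.0 * math.pi * tone_hz * n / sample_rate)
        for n, s in enumerate(samples)
    ]


def tone_magnitude(samples, sample_rate, tone_hz):
    """Magnitude of a single DFT bin at `tone_hz`, normalized by signal length."""
    re = sum(s * math.cos(2.0 * math.pi * tone_hz * n / sample_rate)
             for n, s in enumerate(samples))
    im = sum(s * math.sin(2.0 * math.pi * tone_hz * n / sample_rate)
             for n, s in enumerate(samples))
    return math.hypot(re, im) / len(samples)


sample_rate = 16000
# Stand-in for speech audio (an assumption): one second of a 1 kHz sine.
clean = [0.5 * math.sin(2.0 * math.pi * 1000.0 * n / sample_rate)
         for n in range(sample_rate)]
marked = add_low_frequency_tone(clean, sample_rate)

# The 100 Hz bin is essentially empty in the clean signal and carries
# energy in the marked one.
print(tone_magnitude(clean, sample_rate, 100.0))
print(tone_magnitude(marked, sample_rate, 100.0))
```

Comparing the single-bin magnitude of the original and the stamped signal mirrors the "comparing an original file to an output audio file" step in the quoted passage, although a real detector would have to cope with noise, filtering, and resampling.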

Regarding claim 30, Zeyu does not teach The system of claim 27, wherein the acoustic watermark feature is a non-speech signal component.
However, Gerrard does teach The system of claim 27, wherein the acoustic watermark feature is a non-speech signal component. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that the acoustic watermark feature is a non-speech signal component (a beep or other sound pattern).)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Chen, Christopher, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Christopher teaches tokenization. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Chen, Christopher, and Gerrard to encode a watermark in an out-of-band portion of the audio signal, making it imperceptible to listeners and maximizing its robustness against removal.

Regarding claim 31, Zeyu does not teach The system of claim 27, wherein the acoustic watermark feature is a noise component.
However, Patrick does teach The system of claim 27, wherein the acoustic watermark feature is a noise component. (Patrick Paragraph 0017; "encoding 102 the watermark in the non-disruptive portion of the ASR audio information includes encoding 106 the watermark in a non-speech portion of the ASR audio information. A non-speech portion of the audio information is a portion of the audio information that lacks speech component characteristics. For example and as discussed above, the human ear can perceive sounds with frequencies ranging from 20 Hz to 8 kHz with greater sensitivity between 1 kHz and 6 kHz. As such, portions of audio information outside of the defined range used by the ASR system represent non-speech portions" Examiner notes that the non-disruptive portion of audio that is encoded with the watermark is the acoustic watermark feature; the non-disruptive portion is a non-speech portion (i.e., a noise component).)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu and Patrick. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Patrick teaches a method for encoding a watermark in a non-disruptive portion of the audio information. One of ordinary skill would have been motivated to combine Zeyu and Patrick to include a watermark in the audio information such that it does not change or impact speech processing: “a non-disruptive portion is any portion or property that, if modified, would not change or impact speech processing (e.g., ASR) performed on the audio information.” (Patrick Paragraph 0015).

Regarding claim 32, Zeyu does not teach The system of claim 27, wherein adding the acoustic watermark feature includes adding a noise component.
However, Gerrard does teach The system of claim 27, wherein adding the acoustic watermark feature includes adding a noise component. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that the acoustic watermark feature is a noise component (a beep or other sound pattern).)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Chen, Christopher, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Christopher teaches tokenization. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Chen, Christopher, and Gerrard to encode a watermark in an out-of-band portion of the audio signal, making it imperceptible to listeners and maximizing its robustness against removal.

Regarding claim 33, Zeyu does not teach The system of claim 32, wherein adding the noise component removes an existing noise component from the portion of the unwatermarked speech audio.
However, Gerrard does teach The system of claim 32, wherein adding the noise component removes an existing noise component from the portion of the unwatermarked speech audio. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that adding the noise component (beep or other sound pattern) removes an existing noise component (putting a signal in the out-of-band portion is removing/modifying the existing noise component to encode/put in the watermark).)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Chen, Christopher, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Christopher teaches tokenization. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Chen, Christopher, and Gerrard to encode a watermark in an out-of-band portion of the audio signal, making it imperceptible to listeners and maximizing its robustness against removal.

Claim(s) 34 is rejected under 35 U.S.C. 103 as being unpatentable over Zeyu et al; US 20210256978 A1 filed on Feb 13, 2020 (hereinafter “Zeyu”) in view of Dillon Niederhut; “How to tell if someone trained a model on your data” available online Jul 11, 2022 (hereinafter “Niederhut”) in further view of Haozhe Chen et al; “SPEECH PATTERN BASED BLACK-BOX MODEL WATERMARKING FOR AUTOMATIC SPEECH RECOGNITION” published on May 2, 2022 (hereinafter “Chen”) 

Regarding Claim 34, Zeyu teaches A method comprising: [detecting whether a third-party automated speech recognition (ASR) model has been trained using a] watermarked training dataset that includes acoustically watermarked speech audio and unwatermarked speech audio, wherein the acoustically watermarked speech audio is a version of the unwatermarked speech audio that includes an acoustic watermark feature, (Zeyu Paragraph 0028; "a training dataset can be formed using a collection of audio clips of a single person speaking at a time… to train an audio watermark detector, an audio clip may be randomly selected from the collection, and the selected clip may be embedded with a watermark based on a first metric (e.g., 50% of the time)." Examiner notes that the training dataset (i.e., the watermarked training dataset) includes acoustically watermarked speech audio (the selected clip that is embedded with a watermark) and unwatermarked speech audio (the collection of audio clips of a single person speaking), wherein the acoustically watermarked speech audio is a version of the unwatermarked speech audio that includes an acoustic watermark feature (the audio clip/unwatermarked speech audio is chosen to be encoded into a version that is acoustically watermarked).)
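
As a rough illustration of the dataset construction quoted from Zeyu Paragraph 0028, the sketch below marks each clip with a fixed probability (0.5 by default, matching the quoted "50% of the time"). It is a simplified assumption rather than Zeyu's procedure: it walks the whole collection instead of sampling one clip at a time, and `build_mixed_dataset`, `embed_watermark`, the seed, and the string stand-ins for audio clips are all hypothetical.

```python
import random


def build_mixed_dataset(clips, embed_watermark, watermark_probability=0.5, seed=0):
    """Return (clip, is_watermarked) pairs.

    Each clip is passed through `embed_watermark` with the given probability;
    otherwise it is kept unchanged. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    dataset = []
    for clip in clips:
        if rng.random() < watermark_probability:
            dataset.append((embed_watermark(clip), True))
        else:
            dataset.append((clip, False))
    return dataset


# Toy stand-ins (assumptions): clips are strings, the "watermark" is a text tag.
clips = [f"clip_{i}" for i in range(1000)]
dataset = build_mixed_dataset(clips, lambda clip: clip + "+wm")
marked_count = sum(1 for _, is_marked in dataset if is_marked)
print(marked_count, "of", len(dataset), "clips were watermarked")
```

The resulting list contains both watermarked and unwatermarked items, which is the property the examiner maps to the claimed watermarked training dataset.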

Zeyu does not teach detecting whether a third-party automated speech recognition (ASR) model has been trained using a watermarked training dataset
And wherein detecting whether the third-party ASR model has been trained using the watermarked training dataset comprises: inputting the unwatermarked speech audio into the third-party ASR model to elicit a first text output; 
inputting the acoustically watermarked speech audio into the third-party ASR model to elicit a second text output; 
and determining that the third-party [ASR] model has been trained using the watermarked training dataset based on comparing the second text output of the third-party [ASR] model with the first text output of the third-party [ASR] model.
However, Niederhut does teach detecting whether a third-party [automated speech recognition (ASR)] model has been trained using a watermarked training dataset 
(Niederhut Paragraph 18; “If the classification confidence is higher on the watermarked version, this is a signal to you that the model was trained on your poisoned data!” Examiner notes that a third-party model is detected to be using a watermarked training dataset (poisoned data))
And wherein detecting whether the third-party [ASR] model has been trained using the watermarked training dataset comprises: inputting the unwatermarked [speech audio] into the third-party [ASR] model to elicit a first text output; (Niederhut Paragraph 18; “After an image recognition model has been trained on these watermarked images, you can give it two copies of a new image -- one watermarked, and one not. If the classification confidence is higher on the watermarked version, this is a signal to you that the model was trained on your poisoned data!” Examiner notes that an unwatermarked image is input into the third-party model (image recognition model) to elicit a first text output (classification confidence))
inputting the [acoustically] watermarked [speech audio] into the third-party [ASR] model to elicit a second text output; (Examiner refers to previous mapping to show that the watermarked image is input into the third-party model to elicit a second text output (classification confidence))
and determining that the third-party [ASR] model has been trained using the watermarked training dataset based on comparing the second text output of the third-party [ASR] model with the first text output of the third-party [ASR] model. (Niederhut Paragraph 18; “After an image recognition model has been trained on these watermarked images, you can give it two copies of a new image -- one watermarked, and one not. If the classification confidence is higher on the watermarked version, this is a signal to you that the model was trained on your poisoned data!” Examiner notes that determining that the third-party model (image recognition model) has been trained using the watermarked training dataset (watermarked/poisoned data) based on comparing the second text output of the third-party model with the first text output of the third-party model (classification confidence of both inputs is compared to determine if model was trained on poisoned data))

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu and Niederhut. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Niederhut teaches how to tell if someone trained a model on your data. One of ordinary skill would have been motivated to combine Zeyu and Niederhut to easily identify whether one's data has been used to train a model without attackers being aware of the poisoned data: “They find that with just 1% of training data watermarked, they can tell with high confidence that their watermarked data was used to train the model.” (Niederhut Paragraph 18).
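
The paired comparison quoted from Niederhut Paragraph 18 can be sketched as below. This is a hedged illustration, not Niederhut's code and not the claimed method: it compares classification confidences, as in the quoted passage, rather than the text outputs recited in claim 34. The function names, the 0.9 decision threshold, and the two toy models are assumptions introduced here.

```python
def fraction_favoring_watermark(confidence_fn, pairs):
    """Fraction of (clean, watermarked) pairs where the model is more
    confident on the watermarked copy than on the clean copy."""
    if not pairs:
        raise ValueError("at least one (clean, watermarked) pair is required")
    wins = sum(1 for clean, marked in pairs
               if confidence_fn(marked) > confidence_fn(clean))
    return wins / len(pairs)


def suggests_training_on_watermark(confidence_fn, pairs, threshold=0.9):
    """Heuristic decision; the 0.9 threshold is an arbitrary assumption."""
    return fraction_favoring_watermark(confidence_fn, pairs) >= threshold


# Toy inputs (assumptions): each input is (base_confidence, has_watermark).
pairs = [((0.5 + 0.01 * i, False), (0.5 + 0.01 * i, True)) for i in range(20)]


def exposed_model(x):
    # Hypothetical model that learned the watermark: it gets a confidence boost.
    base, has_watermark = x
    return base + (0.1 if has_watermark else 0.0)


def unexposed_model(x):
    # Hypothetical model that never saw the watermark: it ignores it.
    base, _ = x
    return base


print(suggests_training_on_watermark(exposed_model, pairs))
print(suggests_training_on_watermark(unexposed_model, pairs))
```

Using many pairs and a fraction, rather than a single comparison, is a design choice made for this sketch so that one noisy prediction does not decide the outcome.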

Zeyu in view of Niederhut does not teach ASR model
Speech audio data
However, Chen does teach ASR model (Chen Page 2 Paragraph 2; “We propose a black-box ASR model watermarking framework by fine-tuning the model on the trigger audios, which generated by spreading speech clips over the clean audio and replacing the corresponding labels with the steganography texts.”)
Speech audio data (Examiner refers to previous mapping to show that speech audio data is speech clips)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Niederhut, and Chen. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Niederhut teaches how to tell if someone trained a model on your data. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. One of ordinary skill would have been motivated to combine Zeyu, Niederhut, and Chen to implement aspects of Chen for a robust watermarking scheme with little impact on accuracy: “Experiments on the state-of-the-art open-source ASR system DeepSpeech demonstrate the feasibility of the proposed water marking scheme, which is robust against five kinds of attacks and has little impact on accuracy.” (Chen Abstract).

Claim(s) 35 and 37-38 are rejected under 35 U.S.C. 103 as being unpatentable over Zeyu et al; US 20210256978 A1 filed on Feb 13, 2020 (hereinafter “Zeyu”) in view of Dillon Niederhut; “How to tell if someone trained a model on your data” available online Jul 11, 2022 (hereinafter “Niederhut”) in further view of Haozhe Chen et al; “SPEECH PATTERN BASED BLACK-BOX MODEL WATERMARKING FOR AUTOMATIC SPEECH RECOGNITION” published on May 2, 2022 (hereinafter “Chen”) in further view of Mark William Gerrard; US 20200329327 A1 filed on May 24, 2017 (hereinafter “Gerrard”)

Regarding claim 35, Zeyu does not teach The method of claim 34, wherein the acoustic watermark feature is a frequency tone at or below 200 hertz. 
However, Gerrard does teach The method of claim 34, wherein the acoustic watermark feature is a frequency tone at or below 200 hertz. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that the acoustic watermark feature (signature audio signal) is a frequency tone at or below 200 hertz (below 200 Hz).)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Niederhut, Chen, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Niederhut teaches how to tell if someone trained a model on your data. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Niederhut, Chen, and Gerrard to encode a watermark in an out-of-band portion of the audio signal, making it imperceptible to listeners and maximizing its robustness against removal.

Regarding claim 37, Zeyu does not teach The method of claim 34, wherein the  acoustic watermark feature is a non-speech signal component.
However, Gerrard does teach The method of claim 34, wherein the  acoustic watermark feature is a non-speech signal component. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that the acoustic watermark feature is a non-speech signal component (a beep or other sound pattern))

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Niederhut, Chen, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Niederhut teaches how to tell if someone trained a model on your data. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Niederhut, Chen, and Gerrard to encode a watermark in an out-of-band portion of the audio signal, making it imperceptible to listeners and maximizing its robustness against removal.

Regarding claim 38, Zeyu does not teach The method of claim 37, wherein the  acoustic watermark feature that is a noise component.
However, Gerrard does teach The method of claim 37, wherein the  acoustic watermark feature that is a noise component. (Gerrard Paragraph 0044; “the system uses a signature audio signal (watermark) to stamp the pre-processed audio file. Comparing an original file to an output audio file can be used to detect if the file has been pre-processed through the presence of the watermark. The watermark can be embodied by putting a signal in the out-of-band portion of the signal with respect to speaker playback. For example by encoding a signature audio signal (e.g., a beep or other sound pattern) in the very low bass region (e.g., below 200 Hz) of the input audio file.” Examiner notes that the acoustic watermark feature is a noise component (a beep or other sound pattern))

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zeyu, Niederhut, Chen, and Gerrard. Zeyu teaches a method for secure audio watermarking and audio authenticity verification. Niederhut teaches how to tell if someone trained a model on your data. Chen teaches a black-box model watermarking framework for protecting the IP of ASR models. Gerrard teaches inputting a watermark in an out-of-band portion of the audio signal. One of ordinary skill would have been motivated to combine Zeyu, Niederhut, Chen, and Gerrard to encode a watermark in an out-of-band portion of the audio signal, making it imperceptible to listeners and maximizing its robustness against removal.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL DUC TRAN whose telephone number is (571)272-6870. The examiner can normally be reached Mon-Fri 8:00-5:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Viker Lamardo, can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/D.D.T./Examiner, Art Unit 2147
/ERIC NILSSON/Primary Examiner, Art Unit 2151
Prosecution Timeline

Aug 07, 2025: Applicant Interview (Telephonic)
Oct 07, 2025: Response Filed
Dec 10, 2025: Final Rejection mailed (§101, §103)
Dec 18, 2025: Examiner Interview Summary
Dec 18, 2025: Applicant Interview (Telephonic)
Feb 06, 2026: Request for Continued Examination
Feb 20, 2026: Response after Non-Final Action
Mar 30, 2026: Non-Final Rejection mailed (§101, §103) (current)
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: With Interview (+0.0%)
Median Time to Grant: 1y 0m (~0m remaining)
PTA Risk: High
Based on 1 resolved case by this examiner. Grant probability derived from career allowance rate.