Prosecution Insights
Last updated: April 19, 2026
Application No. 18/502,918

COMPARING AUDIO SIGNALS WITH EXTERNAL NORMALIZATION

Status: Non-Final OA (§103)
Filed: Nov 06, 2023
Examiner: LEE, EUNICE SOMIN
Art Unit: 2656
Tech Center: 2600 (Communications)
Assignee: Mitsubishi Electric Research Laboratories Inc.
OA Round: 2 (Non-Final)

Grant Probability: 89% (Favorable)
OA Rounds: 2-3
To Grant: 2y 10m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 89% (above average; 24 granted / 27 resolved; +26.9% vs TC avg)
Interview Lift: +27.3% (strong; resolved cases with an interview vs. without)
Avg Prosecution: 2y 10m (typical timeline)
Currently Pending: 20
Total Applications: 47 (career history, across all art units)
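The headline percentages above are simple ratios of the counts shown on this page; a quick arithmetic check (using only figures from this dashboard, nothing external):

```python
# Sanity check of the dashboard arithmetic using only the figures shown above.
granted, resolved = 24, 27
print(f"Career allow rate: {granted / resolved:.1%}")  # 88.9%, rounded to 89% above

# The 47 total applications split into resolved and currently pending cases.
pending = 20
print(f"Total applications: {resolved + pending}")  # 47
```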

Statute-Specific Performance

§101: 18.7% (-21.3% vs TC avg)
§103: 53.0% (+13.0% vs TC avg)
§102: 7.3% (-32.7% vs TC avg)
§112: 2.7% (-37.3% vs TC avg)

Tech Center averages are estimates; based on career data from 27 resolved cases.

Office Action

§103
DETAILED ACTION

This communication is in response to the Amendments and Arguments filed on December 1, 2025. Claims 1-20 are pending and have been examined. Claims 1, 17 and 20 are independent.

Domestic Priority: October 12, 2023. PCT/JP2024/080121 was filed on July 23, 2024.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on November 6, 2023 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Drawings

The drawings filed on November 6, 2023 have been accepted and considered by the Examiner.

Response to Amendment

The Amendments and Arguments filed on December 1, 2025 have been correspondingly accepted and considered in this Office Action. Applicant’s arguments have been fully considered and are deemed persuasive with respect to the Lauber reference. Therefore, the previous rejection has been withdrawn and, since a new rejection is introduced by adding a new reference, this action is not made final.

Response to Arguments

Applicant has provided the following argument (see remarks pages 3-4):

[Applicant’s argument reproduced as images in the original document (media_image1.png, media_image2.png).]

In Reply, the combination teaches a bias term of an external normalization based on the spectrogram form of the primary audio file, and combining such a bias term with the similarity value. Garrett, on record, teaches transforming the primary audio file into spectrogram form (Garrett, Par. 0021). Garrett teaches calculating a similarity value (Garrett, Par. 0034). He (CN117292689), added to the record, teaches “cosine similarity with a bias” (He, Par. n0019), i.e., “combining such a bias term with the similarity value”. Similar to Garrett, Applicant’s specification Par. 0034 states “padding with silence, or removing the end of one file to make them the same length” (i.e., “alignment” in Applicant’s remarks Pg. 3) before comparison and “splitting the longer file into multiple subfiles (i.e., “blocks” in Applicant’s remarks Pg. 3) of the same length as the other file, or alternatively use a sliding window approach” (i.e., “running window” in Applicant’s remarks Pg. 3) before comparison.

Applicant has provided the following argument (see remarks page 4):

[Applicant’s argument reproduced as an image in the original document (media_image3.png).]

In Reply, the combination teaches normalizing the similarity score. Applicant’s specification Par. 0042 states “similarity with each reference r using a bias term based on the average similarity between q and its K nearest neighbors in a background set of other samples, resulting in the normalized similarity score.” He (CN117292689), added to the record, teaches “cosine similarity with a bias” that is based on “metric distance between query embedding j and other embedding centers k” (He, Par. n0019), thus resulting in a normalized similarity score.

Applicant has provided the following argument (see remarks page 4):

[Applicant’s argument reproduced as an image in the original document (media_image4.png).]

In Reply, the combination teaches normalizing the similarity score. Applicant’s specification Par. 0042 states “similarity with each reference r using a bias term based on the average similarity between q and its K nearest neighbors in a background set of other samples, resulting in the normalized similarity score.” He (CN117292689), added to the record, teaches “cosine similarity with a bias” that is based on “metric distance between query embedding j and other embedding centers k” (He, Par. n0019), thus resulting in a normalized similarity score.

Applicant has provided the following argument (see remarks page 6):

[Applicant’s argument reproduced as an image in the original document (media_image5.png).]

In Reply, He (CN117292689) has been added to the record.
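Applicant’s Par. 0042, quoted in the reply above, defines the bias term as the average similarity between the query q and its K nearest neighbors in a background set of other samples. A minimal illustrative sketch of that formulation (function and variable names are ours, not from the application or any cited reference; subtraction is assumed as one plausible way to “combine” the bias with the similarity value):

```python
import numpy as np

def normalized_similarity(q, refs, background, k=5):
    """Externally normalized similarity per Applicant's Par. 0042
    (illustrative sketch only; names and the subtraction are assumptions).

    q:          1-D embedding of the query audio sample
    refs:       2-D array, one reference embedding per row
    background: 2-D array of background-sample embeddings
    """
    def cos(a, b):
        # Cosine similarity between vector a and each row of b.
        return (b @ a) / (np.linalg.norm(b, axis=1) * np.linalg.norm(a))

    # Bias term: average similarity between q and its K nearest
    # neighbors in the background set (nearest = most similar).
    bg_sims = cos(q, background)
    bias = np.sort(bg_sims)[-k:].mean()

    # Combine the bias with each query-reference similarity score.
    return cos(q, refs) - bias
```

Under this reading, a query that is generically close to everything in the background set receives a large bias and therefore a lower normalized score, consistent with the specification’s stated goal of making the comparison “more fair for different kinds of audio samples” (Par. 0008).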
To compare audio, He teaches “cosine similarity with a bias” that is based on “metric distance between query embedding j and other embedding centers k” (He, Par. n0019), thus resulting in a normalized similarity score. He discloses normalization “when the distribution range of test scores for different features varies significantly” (i.e., “diverse audio queries” in Applicant’s remarks page 6): “to avoid the influence of the value range, the test scores for each feature can be normalized before proceeding with subsequent operations.” (He, Par. n0030). Applicant’s specification Par. 0042 states “similarity with each reference r using a bias term based on the average similarity between q and its K nearest neighbors in a background set of other samples, resulting in the normalized similarity score.”

Claim Rejections - 35 USC § 103

The following is a quotation of pre-AIA 35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the manner in which the invention was made.

Claims 1-9, 11, 16-18 and 20 are rejected under 35 U.S.C. 103(a) as being unpatentable over Garrett (U.S. Patent Application Publication 2015/0039640) in view of Anguero Miro (U.S. Patent 9,305,264), hereinafter referred to as Miro, and He et al. (CN117292689), hereinafter referred to as He.

Regarding Claims 1, 17 and 20, Garrett teaches:

1. An audio processing system for comparing a query audio sample with a database of multiple reference audio samples using an external normalization, comprising,

17. An audio processing method for comparing a query audio sample with a database of multiple reference audio samples using an external normalization, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor carry out at steps of the method, comprising, and

20. A non-transitory computer-readable storage medium embodied thereon a program executable by a processor for performing a method for comparing a query audio sample with a database of multiple reference audio samples using an external normalization, the method comprising:

a processor; and [Garrett, “Processor(s) 522 (also referred to as central processing units, or CPUs) are coupled to storage devices including memory 524.” Par. 0047]

a memory having instructions stored thereon that, when executed by the processor, cause the audio processing system to: [Garrett, “Processor(s) 522 (also referred to as central processing units, or CPUs) are coupled to storage devices including memory 524. Memory 524 includes random access memory (RAM) and read-only memory (ROM). As is well known in the art, ROM acts to transfer data and instructions uni-directionally to the CPU and RAM is used typically to transfer data and instructions in a bi-directional manner.” Par. 0047]

determine a bias term of the external normalization based on a spectro-temporal pattern of the query audio sample; [Referring to the Specification Par. 0008 of the instant Application: “external normalization that does not modify the audio sample itself but modifies the result of the comparison to make it more fair for different kinds of audio samples.” Garrett teaches modifying the result of the comparison such that “the similarity alone would be too low to indicate a match” but normalized similarity allows a match, to make it more fair for different kinds of audio samples such as when “1) the two files are of vastly different audio quality, 2) one of the files has missing or completely corrupted sections, 3) one of the files has extra audio in the beginning or the end; 4) the two signals have different amplitudes (e.g., the record levels are different such that one plays louder than the other on the same system); 5) one has significantly more noise (hiss, background sound, distortion, etc.) than the other” (Garrett, Par. 0006); Garrett, “The comparison begins by creating an accurate and robust representation of the primary audio file (i.e., the claimed “audio sample”). In one embodiment this is done by transforming the audio file (i.e., the claimed “audio sample”) into a Gabor representation or Gabor spectrogram (i.e., the claimed “spectro-temporal pattern”)…. All the aggregate similarity (i.e., used in the claimed “bias term”; Referring to the Specification Par. 0046 of the instant Application: “bias term is computed as the average similarity”) values are stored and a histogram is created. An offset or delta is derived from the highest number of aggregate similarity values.” Par. 0009; “At step 110 a set of observables (data) is derived from the complete Gabor-based similarity, such observables may include an average, an average cluster similarity (i.e., used in the claimed “bias term”; Referring to the Specification Par. 0046 of the instant Application: “bias term is computed as the average similarity”), a maximum similarity, a minimum similarity, and a standard deviation. The delta or offset derived during alignment step 106 may be used to derive an entropy value, a variance, and a peak. Finally, at step 112 an analysis may be performed where results of the similarity scan, observables, and other data from the previous steps may be input into a Bayesian classifier.” Par. 0035; Garrett teaches combining the bias term with the similarity score: “This procedure is effective even when the audio signals have a low signal-to-noise ratio in that the maximum similarity may actually be relatively low, but if all or most of the N blocks agree on the offset, then the alignment can be accepted with high confidence, thus increasing the effective signal-to-noise ratio (i.e., the claimed “combining bias term with the similarity score”, in that the normalized similarity enables matching even when audio files being compared are of vastly different audio quality). The similarity alone would be too low to indicate a match but the agreement of the deltas/offsets would be a strong indication. Additionally, as shown below, this agreement is also a powerful indication that the two audio signals are indeed duplicates, near duplicates or different or, more generally, how much of one signal is contained in the other.” Par. 0024]

compare the query audio sample with each of the reference audio samples to produce a similarity score for each comparison; [Garrett, “The comparison begins by creating an accurate and robust representation of the primary audio file (i.e., the claimed “query audio sample”). In one embodiment this is done by transforming the audio file into a Gabor representation or Gabor spectrogram. The audio files in the database (i.e., the claimed “each of the reference audio samples”) which the primary audio file (i.e., the claimed “query audio sample”) will be compared to are also in Gabor representation form…. A similarity value (i.e., the claimed “similarity score”) is derived from this comparison.” Par. 0009]

combine the bias term with the similarity score of each comparison to produce normalized similarity scores; [Garrett teaches normalized similarity scores: “Once an alignment is made, at step 108 a similarity calculation is performed yielding s(t) where a single t is used because the files have been aligned, otherwise one would have to write s(t,t') for the similarity between audio file A at time t and audio file B at time t'. In one embodiment, similarity between two audio files at a specific time t may be defined as the cosine of the angle between them, typically computed as a normalized dot product, where the similarity between the complete files can be calculated as an average of s(t) over all t usually written as <s(t)> (i.e., the claimed “normalized similarity score”).” Par. 0034; Garrett teaches combining the bias term with the similarity score: “This procedure is effective even when the audio signals have a low signal-to-noise ratio in that the maximum similarity may actually be relatively low, but if all or most of the N blocks agree on the offset, then the alignment can be accepted with high confidence, thus increasing the effective signal-to-noise ratio (i.e., the claimed “combining bias term with the similarity score”). The similarity alone would be too low to indicate a match but the agreement of the deltas/offsets would be a strong indication. Additionally, as shown below, this agreement is also a powerful indication that the two audio signals are indeed duplicates, near duplicates or different or, more generally, how much of one signal is contained in the other.” Par. 0024]

compare the normalized similarity scores with a threshold to produce a result of comparison; and [Garrett teaches normalized similarity scores: “Once an alignment is made, at step 108 a similarity calculation is performed yielding s(t) where a single t is used because the files have been aligned, otherwise one would have to write s(t,t') for the similarity between audio file A at time t and audio file B at time t'. In one embodiment, similarity between two audio files at a specific time t may be defined as the cosine of the angle between them, typically computed as a normalized dot product, where the similarity between the complete files can be calculated as an average of s(t) over all t usually written as <s(t)> (i.e., the claimed “normalized similarity score”).” Par. 0034; “If the minimum similarity is high (i.e., the claimed “threshold”), then the two audio files are likely duplicates (i.e., the claimed “result of comparison”). If the maximum similarity is low (i.e., the claimed “threshold”), then they are likely not duplicates (i.e., the claimed “result of comparison”). Standard deviation can also be used to measure similarity.” Par. 0044]

output the result of comparison. [Garrett, “It would be desirable to obtain as strong a signal as possible so that when two duplicate files are compared and one has poor audio quality and the other is clean, the comparison does not result in a no-match because of the difference in audio quality. It would be preferable to have a system where comparing two such audio files resulted in a match (i.e., the claimed “result of comparison”) which can subsequently be classified correctly and with confidence (i.e., the claimed “output”).” Par. 0023]

Garrett fails to explicitly teach threshold and bias.
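Before turning to the secondary references, the five recited steps of claim 1 as mapped above (bias term from the query’s spectro-temporal pattern, per-reference similarity scores, combination into normalized scores, threshold comparison, output) can be sketched end to end. This is an illustrative reading of the claim language only; all names are ours, the plain magnitude spectrogram stands in for the Gabor or mel front ends discussed in the cited art, and the bias function is left pluggable:

```python
import numpy as np

def spectro_temporal_pattern(audio, n_fft=512, hop=256):
    """Magnitude spectrogram as a stand-in for the claimed
    spectro-temporal pattern (a Gabor or mel front end is analogous)."""
    frames = np.lib.stride_tricks.sliding_window_view(audio, n_fft)[::hop]
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))

def compare(query, references, bias_fn, threshold=0.5):
    """Sketch of the claim-1 pipeline: a similarity score per reference,
    combined with an externally determined bias, then thresholded."""
    q = spectro_temporal_pattern(query).ravel()
    bias = bias_fn(q)                            # step 1: bias term from the query pattern
    results = []
    for ref in references:                       # step 2: similarity score per reference
        r = spectro_temporal_pattern(ref).ravel()
        n = min(q.size, r.size)                  # crude length alignment
        sim = np.dot(q[:n], r[:n]) / (np.linalg.norm(q[:n]) * np.linalg.norm(r[:n]))
        results.append(sim - bias > threshold)   # steps 3-4: normalize, then threshold
    return results                               # step 5: output the results of comparison
```

For example, `compare(query, refs, bias_fn=lambda pattern: 0.0, threshold=0.9)` reduces to plain cosine matching; a nonzero `bias_fn` implements the external normalization the claims recite.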
However, Miro teaches:

compare the normalized similarity scores with a threshold to produce a result of comparison; and [Miro, “In an embodiment, where the step of selecting the vectors in R (i.e., the claimed “reference audio samples”) which are considered most similar to q.sub.i (i.e., the claimed “query audio sample”) according to the predefined similarity metric, comprises: calculate the predefined similarity metric (i.e., the claimed “similarity score”) between q.sub.i (i.e., the claimed “query audio sample”) and each of the vectors of R (i.e., the claimed “reference audio samples”) and selects said vectors of R whose predefined similarity metric with q.sub.i is less than a predefined second threshold.” Col. 7:34-39; “evaluates such path (involving applying a threshold to its length, matches density or normalized score (i.e., the claimed “normalized similarity score”)) to determine if it can be considered a good match (i.e., the claimed “result of comparison”),” Col. 16:46-48]

Garrett in view of Miro fails to explicitly teach bias. However, He teaches:

determine a bias term of the external normalization based on a spectro-temporal pattern of the query audio sample; [To compare audio, He teaches “cosine similarity with a bias” that is based on “metric distance between query embedding j and other embedding centers k”, thus resulting in external normalization: He, “cosine similarity with a learnable scale w and a bias b” (Par. n0019); He teaches normalization: He, “when the distribution range of test scores for different features varies significantly, to avoid the influence of the value range, the test scores for each feature can be normalized before proceeding with subsequent operations.” (Par. n0030). Referring to Applicant’s Specification, Par. 0042 states “similarity with each reference r using a bias term based on the average similarity between q and its K nearest neighbors in a background set of other samples, resulting in the normalized similarity score.”]

combine the bias term with the similarity score of each comparison to produce normalized similarity scores; [To compare audio, He teaches “cosine similarity with a bias” that is based on “metric distance between query embedding j and other embedding centers k”, thus resulting in external normalization: “cosine similarity with a learnable scale w and a bias b” (Par. n0019); He teaches normalization: “when the distribution range of test scores for different features varies significantly, to avoid the influence of the value range, the test scores for each feature can be normalized before proceeding with subsequent operations.” (He, Par. n0030). Referring to Applicant’s Specification, Par. 0042 states “similarity with each reference r using a bias term based on the average similarity between q and its K nearest neighbors in a background set of other samples, resulting in the normalized similarity score.”]

Garrett, Miro and He pertain to similarity comparison systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the similarity comparison systems art to modify Garrett’s teachings of “comparing the audio file (i.e., the claimed “query audio sample”) to a large number of audio files in a database (i.e., the claimed “each of the reference audio samples”)” in Gabor representation form (i.e., the claimed “spectro-temporal pattern”), in which a normalized similarity value (i.e., the claimed “normalized similarity score”) is derived from this comparison (Garrett, Par. 0006, Par. 0009, Par. 0024), with the explicit teachings of “threshold” (Miro, Col. 16:46-48) taught by Miro and the explicit teachings of the “bias” term (He, Par. n0019) taught by He, in order to overcome current problems of “noise” and “less flexible matching” in current audio comparison systems (Miro, Col. 1:37-39) and “to avoid the influence of the value range” when “the distribution range of test scores for different features varies significantly” (He, Par. n0030).

Regarding Claims 2 and 18, Garrett in view of Miro and He has been discussed above. The combination further teaches:

compare the query audio sample with a set of training audio samples to produce a set of training similarity measures; and [Garrett, “The system can be trained using very few examples (i.e., the claimed “training audio samples”),” Par. 0028; “The audio files in the database (i.e., the claimed “set of training audio samples”) which the primary audio file (i.e., the claimed “query audio sample”) will be compared to are also in Gabor representation form.” Par. 0009; “At this stage, the system can be trained to recognize certain classes (i.e., the claimed “set of training similarity measures”). In the embodiment where audio files of commercials are compared, the classes may be duplicate (i.e., the claimed “set of training similarity measures”), near duplicate (i.e., the claimed “set of training similarity measures”), and not duplicate (i.e., the claimed “set of training similarity measures”), based on statistics.” Par. 0028]

determine the bias term based on an average of training similarity measures of K-nearest training audio samples. [Garrett, “Once an alignment is made, at step 108 a similarity calculation is performed yielding s(t) where a single t is used because the files have been aligned, otherwise one would have to write s(t,t') for the similarity between audio file A at time t and audio file B at time t'.
In one embodiment, similarity between two audio files at a specific time t may be defined as the cosine of the angle between them, typically computed as a normalized dot product, where the similarity between the complete files (i.e., the claimed “training similarity measures”) can be calculated as an average of s(t) over all t usually written as <s(t)>.” Par. 0034; He teaches “cosine similarity with a bias” that is based on “metric distance between query embedding j and other embedding centers k” (i.e., the claimed “K-nearest training audio samples”): He, “cosine similarity (i.e., the claimed “similarity measures”) with a learnable scale w and a bias b (i.e., the claimed “bias term”)” (Par. n0019); “metric distance between query embedding j and other embedding centers k” (i.e., the claimed “K-nearest training audio samples”) (Par. n0019); He teaches normalization: He, “when the distribution range of test scores for different features varies significantly, to avoid the influence of the value range, the test scores for each feature can be normalized before proceeding with subsequent operations.” (Par. n0030).]

Regarding Claim 3, Garrett in view of Miro and He has been discussed above. The combination further teaches:

wherein, to determine the bias term, the processor is configured to: [Garrett, see mapping applied to claim 1; He, see mapping applied to claim 1]

scale the average of training similarity measures with a scalar to produce the bias term. [Garrett, “As shown below, one of the factors this involves computing a delta or shift and a scaling factor.” Par. 0033; “At this stage, the system can be trained to recognize certain classes (i.e., the claimed “set of training similarity measures”). In the embodiment where audio files of commercials are compared, the classes may be duplicate (i.e., the claimed “set of training similarity measures”), near duplicate (i.e., the claimed “set of training similarity measures”), and not duplicate (i.e., the claimed “set of training similarity measures”), based on statistics.” Par. 0028; Garrett, “Once an alignment is made, at step 108 a similarity calculation is performed yielding s(t) where a single t is used because the files have been aligned, otherwise one would have to write s(t,t') for the similarity between audio file A at time t and audio file B at time t'. In one embodiment, similarity between two audio files at a specific time t may be defined as the cosine of the angle between them, typically computed as a normalized dot product, where the similarity between the complete files (i.e., the claimed “training similarity measures”) can be calculated as an average of s(t) (i.e., the claimed “average of training similarity measures”) over all t usually written as <s(t)>.” Par. 0034; Garrett, “Standard deviation can also be used to measure similarity.” Par. 0044; Referring to the Specification Par. 0047 of the instant Application: “'spread' of these similarities to determine the scalar. Measures of spread are one or a combination of variance, standard deviation,”; He, “Since all three results are scalars, the results obtained from each feature network can be directly weighted and summed,” Par. n0030]

Regarding Claim 4, Garrett in view of Miro and He has been discussed above. The combination further teaches:

wherein the scalar is a function of diversities of spectro-temporal patterns in the training audio samples. [Garrett, see mapping applied to claim 3; He, see mapping applied to claim 3; Garrett, “Standard deviation can also be used to measure similarity.” Par. 0044; Referring to the Specification Par. 0047 of the instant Application: “'spread' of these similarities to determine the scalar. Measures of spread are one or a combination of variance, standard deviation,”; He, “When the distribution range of test scores for different features varies significantly (i.e., the claimed “diversities”), to avoid influence of the value range (i.e., the claimed “diversities”), the test scores for each feature can be normalized before proceeding with subsequent operations.” (Par. n0030)]

Regarding Claim 5, Garrett in view of Miro and He has been discussed above. The combination further teaches:

wherein, to determine the bias term, the processor is configured to: [Garrett, see mapping applied to claim 1; He, see mapping applied to claim 1]

extract the spectro-temporal pattern of the query audio sample; and [Garrett, see mapping applied to claim 1]

process the extracted spectro-temporal pattern with a predetermined analytical function to produce the bias term. [Garrett teaches modifying the result of the comparison to produce the bias term such that “the similarity alone would be too low to indicate a match” but normalized similarity allows a match, to make it more fair for different kinds of audio samples such as when “1) the two files are of vastly different audio quality, 2) one of the files has missing or completely corrupted sections, 3) one of the files has extra audio in the beginning or the end; 4) the two signals have different amplitudes (e.g., the record levels are different such that one plays louder than the other on the same system); 5) one has significantly more noise (hiss, background sound, distortion, etc.) than the other” (Garrett, Par. 0006); Garrett, “The comparison begins by creating an accurate and robust representation of the primary audio file (i.e., the claimed “audio sample”).
In one embodiment this is done by transforming (i.e., the claimed “process”) the audio file (i.e., the claimed “audio sample”) into a Gabor representation or Gabor spectrogram (i.e., the claimed “spectro-temporal pattern”)…. All the aggregate similarity (i.e., used in the claimed “bias term”; Referring to the Specification Par. 0046 of the instant Application: “bias term is computed as the average similarity”) values are stored and a histogram is created. An offset or delta is derived from the highest number of aggregate similarity values.” Par. 0009; “At step 110 a set of observables (data) is derived from the complete Gabor-based similarity, such observables may include an average, an average cluster similarity (i.e., used in the claimed “bias term”; Referring to the Specification Par. 0046 of the instant Application: “bias term is computed as the average similarity”), a maximum similarity, a minimum similarity, and a standard deviation. The delta or offset derived during alignment step 106 may be used to derive an entropy value, a variance, and a peak. Finally, at step 112 an analysis (i.e., the claimed “analytical function”) may be performed where results of the similarity scan, observables, and other data from the previous steps may be input into a Bayesian classifier.” Par. 0035; Garrett teaches combining the bias term with the similarity score: “This procedure is effective even when the audio signals have a low signal-to-noise ratio in that the maximum similarity may actually be relatively low, but if all or most of the N blocks agree on the offset, then the alignment can be accepted with high confidence, thus increasing the effective signal-to-noise ratio (i.e., the claimed “producing/combining bias term with the similarity score”, in that the normalized similarity enables matching even when audio files being compared are of vastly different audio quality). The similarity alone would be too low to indicate a match but the agreement of the deltas/offsets would be a strong indication. Additionally, as shown below, this agreement is also a powerful indication that the two audio signals are indeed duplicates, near duplicates or different or, more generally, how much of one signal is contained in the other.” Par. 0024; To compare audio, He teaches “cosine similarity with a bias” that is based on “metric distance between query embedding j and other embedding centers k”, thus resulting in external normalization: He, “cosine similarity with a learnable scale w and a bias b (i.e., the claimed “bias term”)” (Par. n0019); He teaches normalization: He, “when the distribution range of test scores for different features varies significantly, to avoid the influence of the value range, the test scores for each feature can be normalized before proceeding with subsequent operations.” (Par. n0030).]

Regarding Claim 6, Garrett in view of Miro and He has been discussed above. The combination further teaches:

wherein, to determine the bias term, the processor is configured to: [Garrett, see mapping applied to claim 1; He, see mapping applied to claim 1]

extract the spectro-temporal pattern of the query audio sample; and [Garrett, see mapping applied to claim 1]

process the extracted spectro-temporal pattern with a learned function trained with machine learning to produce the bias term. [Garrett, see mapping applied to claim 5; To compare audio, He teaches “cosine similarity with a bias” that is based on “metric distance between query embedding j and other embedding centers k”, thus resulting in external normalization: He, “cosine similarity with a learnable scale w and a bias b (i.e., the claimed “learned function trained with machine learning”)” (Par. n0019); He teaches normalization: He, “when the distribution range of test scores for different features varies significantly, to avoid the influence of the value range, the test scores for each feature can be normalized before proceeding with subsequent operations.” (Par. n0030); Garrett, “Finally, at step 112 an analysis (i.e., the claimed “analytical function”) may be performed where results of the similarity scan, observables, and other data from the previous steps may be input into a Bayesian classifier (i.e., the claimed “learned function trained with machine learning”).” Par. 0035; “At this stage, the system can be trained to recognize (i.e., the claimed “learn”) specific classes.” Par. 0029]

Regarding Claim 7, Garrett in view of Miro and He has been discussed above. The combination further teaches:

wherein the learned function is trained with supervised machine learning using bias terms determined based on averaged similarity measures of training audio samples. [Garrett, “At this stage, the system can be trained to recognize certain classes. In the embodiment where audio files of commercials are compared, the classes may be duplicate, near duplicate, and not duplicate, based on statistics. The system can be trained using very few examples (e.g., a production system may be trained on about 30 pairs of commercials) where an example is a pair of audio files (i.e., the claimed “training audio samples”) from commercials. As known to a person of ordinary skill in the art, this is a very low number of samples for training a pattern recognition system compared to, for instance, some of the common facial or visual recognition systems (e.g., Viola-Jones) which may use hundreds of thousands of samples or more.” Par. 0028; “At step 110 a set of observables (data) is derived from the complete Gabor-based similarity, such observables may include an average, an average cluster similarity, a maximum similarity, a minimum similarity, and a standard deviation…. Finally, at step 112 an analysis may be performed where results of the similarity scan, observables, and other data from the previous steps may be input into a Bayesian classifier (i.e., a Bayesian classifier is a supervised machine learning algorithm).” Par. 0035; To compare audio, He teaches “cosine similarity with a bias” that is based on “metric distance between query embedding j and other embedding centers k”, thus resulting in external normalization: He, “cosine similarity with a learnable scale w and a bias b (i.e., the claimed “learned function trained with machine learning”)” (Par. n0019); He teaches normalization: He, “when the distribution range of test scores for different features varies significantly, to avoid the influence of the value range, the test scores for each feature can be normalized before proceeding with subsequent operations.” (Par. n0030)]

Regarding Claim 8, Garrett in view of Miro and He has been discussed above. The combination further teaches:

wherein, to compare the query audio sample with a reference audio sample, the processor is configured to: [Garrett, see mapping applied to claim 1]

compute mel spectrograms of the query audio sample and the reference audio sample; and [Garrett, “The comparison begins by creating an accurate and robust representation of the primary audio file (i.e., the claimed “query audio sample”). In one embodiment this is done by transforming the audio file (i.e., the claimed “query audio sample”) into a Gabor representation or Gabor spectrogram (i.e., the claimed “mel spectrogram”; Referring to the Specification Par. 0049 of the instant Application: “These spectro-temporal patterns are typically represented using a time-frequency representation such as a spectrogram, and the frequency axis can also contain a perceptual grouping of frequencies, such as in a mel spectrogram.” Gabor spectrograms are also time-frequency representations). The audio files in the database (i.e., the claimed “each of the reference audio samples”) which the primary audio file (i.e., the claimed “query audio sample”) will be compared to are also in Gabor representation form.” Par. 0009]

determine the similarity score between the query audio sample and the reference audio sample based on a cosine similarity of the computed mel spectrograms. [Garrett, “Once an alignment is made, at step 108 a similarity calculation is performed yielding s(t) where a single t is used because the files have been aligned, otherwise one would have to write s(t,t') for the similarity between audio file A at time t and audio file B at time t'. In one embodiment, similarity (i.e., the claimed “cosine similarity”) between two audio files at a specific time t may be defined as the cosine of the angle between them, typically computed as a normalized dot product, where the similarity between the complete files can be calculated as an average of s(t) over all t usually written as <s(t)> (i.e., the claimed “normalized similarity score”).” Par. 0034; “The comparison begins by creating an accurate and robust representation of the primary audio file (i.e., the claimed “query audio sample”). In one embodiment this is done by transforming the audio file (i.e., the claimed “query audio sample”) into a Gabor representation or Gabor spectrogram (i.e., the claimed “mel spectrogram”; Referring to the Specification Par.
0049 of the instant Application: “These spectro-temporal patterns are typically represented using a time-frequency representation such as a spectrogram, and the frequency axis can also contain a perceptual grouping of frequencies, such as in a mel spectrogram.” Gabor spectrograms are also time-frequency representations). The audio files in the database (i.e., the claimed “each of the reference audio samples”) which the primary audio file (i.e., the claimed query audio sample”) will be compared to are also in Gabor representation form.” Par. 0009] Regarding Claim 9, Garrett in view of Miro and He has been discussed above. The combination further teaches: wherein the processor is configured to normalize the mel spectrograms with an internal normalization. [Garrett, see mapping applied to claims 1 and 8; Garrett teaches alignment/splitting into blocks of the same length/“internal normalization”: “As shown below, one of the factors this involves computing a delta or shift (i.e., the claimed “internal normalization”) and a scaling factor. This shift or translation can be calculated by breaking one of the audio signals into blocks (i.e., “split the longer file into multiple subfiles”) or parts and attempting to find where each part should map to on the other audio file.” Par. 0033; “In this manner the primary spectrogram block is compared to a running window (i.e., “sliding window”) of n Gabor vectors from the secondary spectrogram.” Par. 0009; Referring to the Specification par. 0034 of the instant Application, “internal normalization” refers to “padding with silence; removing the end of one file to make them the same length”; “split the longer file into multiple subfiles of the same length”; “use a sliding window approach”.] Regarding Claim 11, Garrett in view of Miro and He has been discussed above. 
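The external normalization mapped across these claims, a cosine similarity combined with a learnable scale w and a bias b (He, Par. n0019), can be sketched as follows. The function name and the fixed default values of w and b are illustrative stand-ins for learned parameters, not code from any cited reference:

```python
import numpy as np

def externally_normalized_score(query, reference, w=1.0, b=0.0):
    """Cosine similarity scaled by w and shifted by a bias term b.

    In He's formulation w and b are learned; here they are fixed
    constants purely for illustration."""
    cos = np.dot(query, reference) / (
        np.linalg.norm(query) * np.linalg.norm(reference))
    return w * cos + b
```

With w = 1 and b = 0 this reduces to plain cosine similarity; a learned bias shifts every raw score before any thresholding, which is the sense in which the bias term acts on the similarity value from outside the two signals being compared.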
The combination further teaches: wherein, to compare the query audio sample with a reference audio sample, the processor is configured to: [Garrett, see mapping applied to claim 1] compute embeddings of the query audio sample and the reference audio sample using a neural network; and [Garrett, see mapping applied to claim 1; He, “cosine similarity (i.e., the claimed “similarity measures”) with a learnable scale w and a bias b (i.e., the claimed “bias term”)” (Par. n0019); “metric distance between query embedding j (i.e., the claimed “embeddings of the query audio sample”) and other embedding centers k (i.e., the claimed “K-nearest training audio samples”)” (Par. n0019)] determine the similarity score between the query audio sample and the reference audio sample based on a cosine similarity of the computed embeddings. [Garrett, see mapping applied to claim 1; He, “cosine similarity (i.e., the claimed “similarity measures”) with a learnable scale w and a bias b (i.e., the claimed “bias term”)” (Par. n0019); “metric distance between query embedding j (i.e., the claimed “embeddings of the query audio sample”) and other embedding centers k (i.e., the claimed “K-nearest training audio samples”)” (Par. n0019)]

Regarding Claim 16, Garrett in view of Miro and He has been discussed above. The combination further teaches: wherein the processor is further configured to: [Garrett, see mapping applied to claim 1] perform an anomaly detection of the query audio sample based on the result of comparison. [Garrett, see mapping applied to claim 1; “When two audio files are compared, they may be classified as duplicate (i.e., “no anomalies”), near duplicate (i.e., “no anomalies”) or not duplicate.” Par. 0008; Referring to the Specification par. 0064 of the instant Application, “anomaly detection” refers to: “if the duplicate 1230 of sample 1210 with at least one element of the database 1220 is found, the audio sample 1210 is considered no anomalies, and the desired operation 1260 is performed. Otherwise, sample 1210 is considered as anomalous 1250.”]

Claims 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Garrett in view of Miro, He, McHargue (“Efficient Multispeaker Speech Synthesis and Voice Cloning”, Thesis, Case Western Reserve University, 2023), and Yalova et al. (“Automatic Speech Recognition System with Dynamic Time Warping and Mel-Frequency Cepstral Coefficients,” COLINS, 2023), hereinafter referred to as Yalova.

Regarding Claims 10 and 19, Garrett in view of Miro and He has been discussed above. The combination further teaches: wherein the processor is configured to compute the mel spectrograms at a coarse resolution with fewer than 20 mel frequency bands and spacing between consecutive time windows greater than 20 ms. [Garrett, see mapping applied to claims 1 and 8; “To summarize, the first step is to obtain a set of Gabor space feature vectors for an audio file at full resolution (loss-less). Then this spectrogram is down sampled (i.e., the claimed “coarse resolution”) in time and is stored for quick comparison.” Par. 0031] determining the similarity score between the query audio sample and the reference audio sample based on a cosine similarity of the computed mel spectrograms. [Garrett, see mapping applied to claim 8] The combination fails to explicitly teach 20 mel frequency bands and 20 ms. However, McHargue teaches: wherein the processor is configured to compute the mel spectrograms at a coarse resolution with fewer than 20 mel frequency bands and spacing between consecutive time windows greater than 20 ms. [McHargue, “The compression possible with mel frequencies is so dramatic that it’s possible to reliably recognize phonemes with as few as 20 mel frequency bands.” Pg. 10] McHargue fails to explicitly teach 20 ms. However, Yalova teaches: wherein the processor is configured to compute the mel spectrograms at a coarse resolution with fewer than 20 mel frequency bands and spacing between consecutive time windows greater than 20 ms. [Yalova, “The input signal was divided into intervals of 20-40 ms (i.e., the claimed “time windows greater than 20 ms”), since the size of such an interval is sufficient to obtain a reliable spectral estimate.” Pg. 4]

Garrett, Miro, He, McHargue and Yalova pertain to audio processing systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the audio processing systems art to modify Garrett’s teachings of “comparing the audio file (i.e., the claimed “query audio sample”) to a large number of audio files in a database (i.e., the claimed “each of the reference audio samples”)” in Gabor representation form (i.e., the claimed “spectro-temporal pattern”), in which a normalized similarity value (i.e., the claimed “normalized similarity score”) is derived from this comparison (Garrett, Par. 0006, Par. 0009, Par. 0024), with the explicit teachings of “threshold” (Miro, Col. 16:46-48) taught by Miro, the explicit teachings of “bias” term (He, Par. n0019) taught by He, the explicit teachings of as few as/fewer than “20 mel frequency bands” (McHargue, Pg. 10) taught by McHargue, and the explicit teachings of “intervals of 20-40 ms”/“time windows greater than 20 ms” (Yalova, Pg. 4) taught by Yalova in order to overcome current problems of “noise” and “less flexible matching” in current audio comparison systems (Miro, Col. 1:37-39) and “to avoid the influence of the value range” when “the distribution range of test scores for different features varies significantly” (He, Par. n0030), with a “reliable spectral estimate” (Yalova, Pg. 4) while still being able to “recognize phonemes” (McHargue, Pg. 10).

Claims 12 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Garrett in view of Miro, He, and Bergsma et al. (“Cluster-based pruning techniques for audio data,” arXiv:2309.11922, 2023), hereinafter referred to as Bergsma.

Regarding Claim 12, Garrett in view of Miro and He has been discussed above. The combination further teaches: wherein the database of multiple reference audio samples includes the query audio sample such that the reference audio samples are compared with each other, wherein the processor is further configured to: [Garrett, see mapping applied to claim 1] prune the database of multiple reference audio samples upon detecting duplications indicated by the result of comparison. [Garrett, see mapping applied to claim 1; “In one aspect of the present invention, a method of comparing two audio files is described. In one illustration, a primary audio file (i.e., the claimed “query audio sample”) is received and the tester needs to determine whether that audio file is already present in an audio file storage or database (i.e., the claimed “database of multiple reference audio samples”) containing a high number of audio files (e.g., 30,000 or more). When two audio files are compared, they may be classified as duplicate, near duplicate or not duplicate.” Par. 0008] The combination fails to explicitly teach pruning. However, Bergsma teaches: prune the database of multiple reference audio samples upon detecting duplications indicated by the result of comparison. [Bergsma, “In this work, we introduce, for the first time in the context of the audio domain, the k-means clustering as a method for efficient data pruning.” Pg. 1] Garrett, Miro, He and Bergsma pertain to audio processing systems and are analogous to the instant application.
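The duplicate-based pruning mapped to claim 12 can be sketched as a simple greedy filter: drop any sample whose similarity to an already-kept sample crosses a threshold. The helper name and threshold are illustrative assumptions, and Bergsma's actual method uses k-means clustering rather than this pairwise pass:

```python
import numpy as np

def prune_duplicates(embeddings, threshold=0.95):
    """Return indices of samples to keep, discarding any embedding whose
    cosine similarity to an already-kept embedding exceeds the threshold.

    Illustrative only; Bergsma prunes via k-means clustering instead of
    this O(n^2) pairwise comparison."""
    unit = [e / np.linalg.norm(e) for e in embeddings]
    kept = []
    for i, e in enumerate(unit):
        # Keep the sample only if it is not a (near) duplicate of any kept one.
        if all(float(np.dot(e, unit[j])) < threshold for j in kept):
            kept.append(i)
    return kept
```

A clustering approach scales better on large databases (the 30,000-file scenario Garrett describes), since it avoids comparing every pair.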
Accordingly, it would have been obvious to one of ordinary skill in the audio processing systems art to modify Garrett’s teachings of “comparing the audio file (i.e., the claimed “query audio sample”) to a large number of audio files in a database (i.e., the claimed “each of the reference audio samples”)” in Gabor representation form (i.e., the claimed “spectro-temporal pattern”), in which a normalized similarity value (i.e., the claimed “normalized similarity score”) is derived from this comparison (Garrett, Par. 0006, Par. 0009, Par. 0024), with the explicit teachings of “threshold” (Miro, Col. 16:46-48) taught by Miro, the explicit teachings of “bias” term (He, Par. n0019) taught by He, and the explicit teachings of “data pruning” (Bergsma, Pg. 1) taught by Bergsma in order to overcome current problems of “noise” and “less flexible matching” in current audio comparison systems (Miro, Col. 1:37-39), “to avoid the influence of the value range” when “the distribution range of test scores for different features varies significantly” (He, Par. n0030), and “improve predictions” (Bergsma, Pg. 1).

Regarding Claim 13, Garrett in view of Miro, He and Bergsma has been discussed above. The combination further teaches: wherein the processor is further configured to: [Garrett, see mapping applied to claim 1] train an audio deep learning model using the pruned database of multiple reference audio samples. [Garrett, see mapping applied to claims 1 and 12; Bergsma, see mapping applied to claim 12; Bergsma, “With the advent of deep learning and the exponential growth in the amount of data available, the demand for efficient storage and processing has become crucial.” Pg. 1; “Here, we explore a model-agnostic pruning technique based on unsupervised clustering analysis for audio data (i.e., the claimed “multiple reference audio samples”).” Pg. 1]

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Garrett in view of Miro, He, and Agostinelli et al. (U.S. Patent 11,915,689), hereinafter referred to as Agostinelli.

Regarding Claim 14, Garrett in view of Miro and He has been discussed above. The combination further teaches: train an audio-generative model to generate audio samples using the database of multiple reference audio samples; and [Garrett, see mapping applied to claim 13] compare generated audio samples with the reference audio samples using the external normalization to detect if at least some of the generated audio samples are duplicates of audio samples contained in the database of multiple reference audio samples. [Garrett, see mapping applied to claims 1 and 12] The combination fails to teach an audio-generative model to generate audio samples. However, Agostinelli teaches: train an audio-generative model to generate audio samples using the database of multiple reference audio samples; and [Agostinelli, “In some implementations, the input comprises a sequence of text, and the prediction of the audio signal is a prediction of music (i.e., the claimed “generate audio samples”) that is described by the sequence of text.” Col. 2:15-16; “In some implementations, generating, using one or more generative neural networks (i.e., the claimed “audio-generative model”) and conditioned on at least the semantic representation and the embedding tokens, an acoustic representation of the audio signal (i.e., the claimed “generate audio samples”),” Col. 3:36-39; “The system can generate music (i.e., the claimed “generate audio samples”) conditioned on text without requiring a large training dataset of paired text-music training data. The system can be trained on a music-only dataset (i.e., the claimed “multiple reference audio samples”).” Col. 6:44-47] compare generated audio samples with the reference audio samples using the external normalization to detect if at least some of the generated audio samples are duplicates of audio samples contained in the database of multiple reference audio samples. [Agostinelli, “In some implementations, the input comprises a sequence of text, and the prediction of the audio signal is a prediction of music (i.e., the claimed “generate audio samples”) that is described by the sequence of text.” Col. 2:15-16]

Garrett, Miro, He and Agostinelli pertain to audio processing systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the audio processing systems art to modify Garrett’s teachings of “comparing the audio file (i.e., the claimed “query audio sample”) to a large number of audio files in a database (i.e., the claimed “each of the reference audio samples”)” in Gabor representation form (i.e., the claimed “spectro-temporal pattern”), in which a normalized similarity value (i.e., the claimed “normalized similarity score”) is derived from this comparison (Garrett, Par. 0006, Par. 0009, Par. 0024), with the explicit teachings of “threshold” (Miro, Col. 16:46-48) taught by Miro, the explicit teachings of “bias” term (He, Par. n0019) taught by He, and the explicit teachings of “generating, using one or more generative neural networks (i.e., the claimed “audio-generative model”) an acoustic representation of the audio signal (i.e., the claimed “generate audio samples”)” (Agostinelli, Col. 3:36-39) taught by Agostinelli in order to overcome current problems of “noise” and “less flexible matching” in current audio comparison systems (Miro, Col. 1:37-39), “to avoid the influence of the value range” when “the distribution range of test scores for different features varies significantly” (He, Par. n0030), and “generate audio using neural networks” (Agostinelli, Col. 1:17-18).

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Garrett in view of Miro, He, Agostinelli, and Wang et al. (KR20230079503A), hereinafter referred to as Wang.

Regarding Claim 15, Garrett in view of Miro and He has been discussed above.
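The transmit-unless-duplicate behavior at issue in claims 14 and 15 amounts to a gate on the similarity check. In this sketch the function name, the similarity callback, and the threshold are assumptions for illustration, not taken from any cited reference:

```python
def transmit_if_novel(generated, references, similarity, threshold=0.9):
    """Gate transmission of a generated audio sample: if its (externally
    normalized) similarity to any reference meets the threshold, treat
    it as a duplicate and discard it instead of transmitting.

    All names and the threshold value are illustrative assumptions."""
    for ref in references:
        if similarity(generated, ref) >= threshold:
            return None  # duplicate detected: discard rather than transmit
    return generated
```

Here `similarity` would be the externally normalized score discussed throughout the rejection, so the bias term directly shifts where the duplicate gate fires.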
The combination further teaches: wherein the processor is further configured to: [Garrett, see mapping applied to claim 1] transmit the generated audio sample {unless} the generated audio sample with the external normalization is a duplication of one or more of the reference audio samples. [Garrett, see mapping applied to claim 14; Agostinelli, “In some embodiments, a server transmits data (i.e., the claimed “transmit the generated audio sample”), e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.” Col. 32:51-56] Garrett in view of Miro and He fails to teach the bracketed limitation and an audio-generative model to generate audio samples. However, Agostinelli teaches: execute an audio-generative model trained using the database of multiple reference audio samples to generate an audio sample; and [Agostinelli, see mapping applied to claim 14; Garrett, see mapping applied to claim 14] Garrett in view of Miro, He and Agostinelli fails to teach the bracketed limitation. However, Wang teaches the bracketed limitation: transmit the generated audio sample {unless} the generated audio sample with the external normalization is a duplication of one or more of the reference audio samples. [Wang teaches discarding duplicates, which prevents transmittal: “…if it exists (i.e., the claimed “duplication”), discard it,” Par. 0185]

Garrett, Miro, He, Agostinelli and Wang pertain to audio processing systems and are analogous to the instant application. Accordingly, it would have been obvious to one of ordinary skill in the audio processing systems art to modify Garrett’s teachings of “comparing the audio file (i.e., the claimed “query audio sample”) to a large number of audio files in a database (i.e., the claimed “each of the reference audio samples”)” in Gabor representation form (i.e., the claimed “spectro-temporal pattern”), in which a normalized similarity value (i.e., the claimed “normalized similarity score”) is derived from this comparison (Garrett, Par. 0006, Par. 0009, Par. 0024), with the explicit teachings of “threshold” (Miro, Col. 16:46-48) taught by Miro, the explicit teachings of “bias” term (He, Par. n0019) taught by He, the explicit teachings of “generating, using one or more generative neural networks (i.e., the claimed “audio-generative model”) an acoustic representation of the audio signal (i.e., the claimed “generate audio samples”)” (Agostinelli, Col. 3:36-39) taught by Agostinelli, and the explicit teachings of “…if it exists (i.e., the claimed “duplication”), discard it” (Wang, Par. 0185) taught by Wang in order to overcome current problems of “noise” and “less flexible matching” in current audio comparison systems (Miro, Col. 1:37-39), “to avoid the influence of the value range” when “the distribution range of test scores for different features varies significantly” (He, Par. n0030), “generate audio using neural networks” (Agostinelli, Col. 1:17-18), and improve the “implementation of the neural network based end to end synthesis” (Wang, Par. 0003).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Chen et al. (U.S. Patent 12,327,551) teaches threshold similarity and average similarity to compare similarity of audio data. Logan et al. (U.S. Patent Application Publication 2002/0181711) teaches comparing similarity of audio spectrograms. Maksimovic et al. (EP3570186) teaches external similarity comparison.
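For concreteness, the coarse-resolution representation recited in claims 10 and 19 (fewer than 20 frequency bands, hop greater than 20 ms) could be sketched as below. This toy version pools linear FFT bins rather than applying a true mel filterbank, and nothing here is code from the cited references:

```python
import numpy as np

def coarse_spectrogram(signal, sr, n_bands=16, hop_s=0.025, win_s=0.05):
    """Toy coarse time-frequency representation: 16 (< 20) bands and a
    25 ms (> 20 ms) hop. Bands are linear averages of FFT magnitudes;
    a mel filterbank would group bins perceptually instead."""
    hop, win = int(hop_s * sr), int(win_s * sr)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + win]))
        # Pool the FFT bins into n_bands coarse bands.
        frames.append([chunk.mean() for chunk in np.array_split(spectrum, n_bands)])
    return np.array(frames)  # shape: (n_frames, n_bands)
```

Two such arrays, once aligned to the same number of frames, can be flattened and fed to a cosine similarity, matching the comparison pipeline mapped to claim 8.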
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EUNICE LEE whose telephone number is 571-272-1886. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/EUNICE LEE/
Examiner, Art Unit 2656

/BHAVESH M MEHTA/
Supervisory Patent Examiner, Art Unit 2656

Prosecution Timeline

Nov 06, 2023
Application Filed
Oct 15, 2025
Non-Final Rejection — §103
Dec 01, 2025
Response Filed
Feb 19, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603078
GENERATING SPEECH DATA USING ARTIFICIAL INTELLIGENCE TECHNIQUES
2y 5m to grant • Granted Apr 14, 2026
Patent 12597365
AUTOMATIC TRANSLATION BETWEEN SIGN LANGUAGE AND SPOKEN LANGUAGE
2y 5m to grant • Granted Apr 07, 2026
Patent 12585876
METHOD OF TRAINING POS TAGGING MODEL, COMPUTER-READABLE RECORDING MEDIUM AND POS TAGGING METHOD
2y 5m to grant • Granted Mar 24, 2026
Patent 12579385
EMBEDDED TRANSLATE, SUMMARIZE, AND AUTO READ
2y 5m to grant • Granted Mar 17, 2026
Patent 12566928
READABILITY BASED CONFIDENCE SCORE FOR LARGE LANGUAGE MODELS
2y 5m to grant • Granted Mar 03, 2026
Based on the 5 most recent grants; study what changed to get past this examiner.


Prosecution Projections

2-3
Expected OA Rounds
89%
Grant Probability
99%
With Interview (+27.3%)
2y 10m
Median Time to Grant
Moderate
PTA Risk
Based on 27 resolved cases by this examiner. Grant probability derived from career allow rate.
