Prosecution Insights
Last updated: April 19, 2026
Application No. 19/046,440

APPARATUS, SYSTEM, AND METHOD FOR AUDIO BASED BROWSER COOKIES

Status: Non-Final OA (§103)
Filed: Feb 05, 2025
Examiner: STEVENS, ROBERT
Art Unit: 2164
Tech Center: 2100 — Computer Architecture & Software
Assignee: Auddia Inc.
OA Round: 1 (Non-Final)

Grant Probability: 81% (Favorable)
Expected OA Rounds: 1-2
Expected Time to Grant: 2y 9m
Grant Probability With Interview: 92%

Examiner Intelligence

Career Allow Rate: 81% — above average (420 granted / 517 resolved; +26.2% vs Tech Center average)
Interview Lift: +11.1% on resolved cases with interview (moderate)
Typical Timeline: 2y 9m average prosecution; 15 applications currently pending
Career History: 532 total applications across all art units

Statute-Specific Performance

§101: 22.1% (-17.9% vs TC avg)
§103: 44.0% (+4.0% vs TC avg)
§102: 8.5% (-31.5% vs TC avg)
§112: 17.6% (-22.4% vs TC avg)

Tech Center averages are estimates. Based on career data from 517 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application is being examined under the pre-AIA first to invent provisions.

Allowable Subject Matter

Claims 3, 5, 11, 13, 19 and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims (and correcting grammatical errors in the parent claims).

Claim Objections

Independent claims 2, 10 and 18 are objected to due to the following exemplary informalities: each of the claims contains a minor grammatical error. The elements are recited in two different verb tenses (i.e., receiving/matching/selecting vs. compute/compute/manipulate). Applicant is respectfully reminded to review the specification/abstract/claims/drawings for all informalities. Appropriate correction is required.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of pre-AIA 35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under pre-AIA 35 U.S.C. 103(a) are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 2, 4, 7, 9, 10, 12, 15, 17 and 18 are rejected under 35 U.S.C. §103(a) as being unpatentable over Selby et al (US Patent Application Publication No. 2011/0307085, hereafter referred to as “Selby”) in view of Baluja et al (US Patent No. 8,411,977, hereafter referred to as “Baluja”).

Regarding independent claim 2: Selby teaches A method comprising: receiving a user-recorded audio sample at an endpoint device; (See Selby paragraph 0036 discussing the reception of an audio “programme” at a radio receiver, in the context of paragraph 0046 discussing an exemplary system computing environment including processors and storage.) compute the spectrograph of the user-recorded audio sample; (See Selby paragraph 0037 discussing the use of a spectrogram generator to transform source audio signals. NOTE: As set forth in Applicant’s as-filed specification at paragraph 0076, “spectrographs are sometimes called spectrograms as well”.) compute spectrographs of a plurality of candidate clips; (See Selby Abstract discussing the generation of spectrograms for successive timeslices of an audio signal.)

However, Selby does not explicitly teach the remaining limitations as claimed. Baluja, though, teaches manipulate the user-recorded spectrograph and the candidate spectrographs to optimize them for matching; (See Baluja col. 5 lines 6-19 discussing creating a “magnitude-only” spectrogram representation to process spectrogram signals, thus enabling the same spacing scheme to be used for the processing of all audio samples.)
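For orientation, the limitations mapped above (receive an audio sample, compute its spectrograph, compute candidate spectrographs, and reduce everything to a magnitude-only representation as discussed in the Baluja citation) amount to a standard STFT front end. The following is a minimal sketch under assumed parameters (512-point FFT, Hann window, arbitrary hop); it is illustrative only, not the method of Selby, Baluja, or the application:

```python
import numpy as np

def magnitude_spectrogram(samples, n_fft=512, hop=128):
    """Magnitude-only spectrogram: |STFT| of a 1-D audio signal.

    Discarding phase (a "magnitude-only" representation) gives every
    clip the same kind of feature grid regardless of recording
    conditions, which is what makes later matching uniform.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        frame = samples[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (time_slices, n_fft // 2 + 1)

# Hypothetical usage: a 1-second "recording" and two candidate clips.
rate = 8000
t = np.arange(rate) / rate
recording = np.sin(2 * np.pi * 440 * t)          # user-recorded sample
candidates = [np.sin(2 * np.pi * 440 * t),       # same tone
              np.sin(2 * np.pi * 880 * t)]       # different tone

rec_spec = magnitude_spectrogram(recording)
cand_specs = [magnitude_spectrogram(c) for c in candidates]
```

Because all spectrograms come from the same function with the same parameters, the recorded sample and every candidate clip end up on directly comparable grids.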
matching the user-recorded spectrograph to the candidate spectrographs; (See Baluja Abstract and paragraphs 0013-0014 teaching the matching of spectrogram elements and the exemplary use of a database of songs. See also, col. 6 lines 56-61 discussing the matching of target samples to those in a data repository.) and selecting the best match. (See Baluja Example 3 of col. 15 line 45 – col. 16 line 5, esp. col. 16 lines 4-5, discussing that the best match is declared the matching sample.)

It would have been obvious at the time the invention was made to a person having ordinary skill in the art to which the claimed invention pertains to apply the teachings of Baluja for the benefit of Selby, because doing so provided a designer with options to implement a system providing an ability to match songs / audio samples taken under a variety of conditions, such as when transmission is poor, as taught by Baluja in col. 4 lines 25-36. These references were all applicable to the same field of endeavor, i.e., spectral processing of audio data.

Regarding claim 4: Selby teaches wherein the plurality of candidate clips are received from a streaming audio source. (See Selby paragraph 0022 discussing the use of an audio recognition system that processes an incoming audio stream.)

Regarding claim 7: Selby teaches wherein the endpoint device continuously listens for the audio sample to process in real-time. (See Selby paragraph 0036 discussing the use of a radio / microphone device, in the context of paragraph 0096 discussing integration into real time processing that is performed automatically.)

Regarding claim 9: Selby teaches wherein the candidate spectrographs are songs stored in a database.
(See Selby paragraph 0014 discussing the “recognition against a very large database source of fingerprinted content” [e.g., in excess of one million songs].)

Claims 10, 12, 15 and 17 are substantially similar to claims 2, 4, 7 and 9, respectively, and therefore likewise rejected. Claim 18 is substantially similar to claim 2, and therefore likewise rejected.

Claims 6 and 14 are rejected under 35 U.S.C. §103(a) as being unpatentable over Selby et al (US Patent Application Publication No. 2011/0307085, hereafter referred to as “Selby”) in view of Baluja et al (US Patent No. 8,411,977, hereafter referred to as “Baluja”) and Richard Altes (“Detection, estimation, and classification with spectrograms”, The Journal of the Acoustical Society of America, Volume 67, Issue 4, April 1980, pp. 1232-1246, hereafter referred to as “Altes”).

Regarding claim 6: Selby in view of Baluja does not explicitly teach the remaining limitations as claimed. Altes, though, teaches wherein the best match is selected on the smallest mean-square error between the user-recorded spectrograph and the candidate spectrographs. (See Altes page 1232 Abstract and the 1st full paragraph of page 1233 teaching the use of mean-square-error in the correlation of spectrograms.)

It would have been obvious at the time the invention was made to a person having ordinary skill in the art to which the claimed invention pertains to apply the teachings of Altes for the benefit of Selby in view of Baluja, because doing so provided a designer with options for implementing a system for spectrogram use in a noisy environment, as taught by Altes in the 3rd paragraph of the section entitled “Background and goals” on page 1232. These references were all applicable to the same field of endeavor, i.e., spectral processing of signal data.

Claim 14 is substantially similar to claim 6, and therefore likewise rejected.

Claims 8, 16 and 21 are rejected under 35 U.S.C.
§103(a) as being unpatentable over Selby et al (US Patent Application Publication No. 2011/0307085, hereafter referred to as “Selby”) in view of Baluja et al (US Patent No. 8,411,977, hereafter referred to as “Baluja”) and Wang et al (US Patent No. 7,853,664, hereafter referred to as “Wang”).

Regarding claim 8: Selby in view of Baluja does not explicitly teach the remaining limitations as claimed. Wang, though, teaches wherein when a best match is selected, an action is taken, the action includes one of: content is purchased by a user, coupon or offer is sent to the user or an application software, notify an advertiser, or record the best match in the offer and user-account database. (See Wang col. 2 lines 33-44 in the context of col. 16 line 35 – col. 17 line 3 teaching the ability of a user to purchase a song identified as matching the one listened to, and col. 21 lines 1-25 discussing the use of spectral component analysis and fingerprinting of recordings.)

It would have been obvious at the time the invention was made to a person having ordinary skill in the art to which the claimed invention pertains to apply the teachings of Wang for the benefit of Selby in view of Baluja, because doing so provided a designer with options for implementing a system allowing a user to quickly identify a signal in almost any environment and perform transactions based upon that identification, as taught by Wang in the Abstract. These references were all applicable to the same field of endeavor, i.e., spectral processing of audio data.

Claims 16 and 21 are each substantially similar to claim 8, and therefore likewise rejected.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Relevance is provided in at least the Abstract of each cited document.
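Claim 6's limitation, as characterized in the Altes-based rejection above, reduces to picking the candidate whose spectrogram has the smallest mean-square error against the user-recorded spectrogram. A minimal sketch, assuming all spectrograms were computed with the same parameters and trimmed to one common shape (the arrays below are toy values):

```python
import numpy as np

def best_match_mse(user_spec, candidate_specs):
    """Return (index, error) of the candidate spectrogram with the
    smallest mean-square error against the user-recorded spectrogram.
    """
    errors = [np.mean((user_spec - cand) ** 2) for cand in candidate_specs]
    best = int(np.argmin(errors))
    return best, errors[best]

# Toy spectrograms (time slices x frequency bins).
user = np.array([[1.0, 0.0], [0.0, 1.0]])
cands = [np.array([[0.9, 0.1], [0.1, 0.9]]),   # close match
         np.array([[0.0, 1.0], [1.0, 0.0]])]   # poor match
idx, err = best_match_mse(user, cands)
print(idx)  # prints 0 (the close match)
```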
Non-Patent Literature

Yoshii, Kazuyoshi, et al., “Drum Sound Recognition for Polyphonic Audio Signals by Adaptation and Matching of Spectrogram Templates With Harmonic Structure Suppression”, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 1, January 2007, pp. 333-345. This paper describes a system that detects onsets of the bass drum, snare drum, and hi-hat cymbals in polyphonic audio signals of popular songs. Our system is based on a template-matching method that uses power spectrograms of drum sounds as templates. This method calculates the distance between a template and each spectrogram segment extracted from a song spectrogram, using Goto’s distance measure originally designed to detect the onsets in drums-only signals. However, there are two main problems. The first problem is that appropriate templates are unknown for each song. The second problem is that it is more difficult to detect drum-sound onsets in sound mixtures including various sounds other than drum sounds. To solve these problems, we propose template-adaptation and harmonic-structure-suppression methods. First of all, an initial template of each drum sound, called a seed template, is prepared. The former method adapts it to actual drum-sound spectrograms appearing in the song spectrogram. To make our system robust to the overlapping of harmonic sounds with drum sounds, the latter method suppresses harmonic components in the song spectrogram before the adaptation and matching. Experimental results with 70 popular songs showed that our template-adaptation and harmonic-structure-suppression methods improved the recognition accuracy and achieved 83%, 58%, and 46% in detecting onsets of the bass drum, snare drum, and hi-hat cymbals, respectively. (page 333, Abstract).

Lowpass filter functions, highpass filter functions. (page 335, Fig. 3).
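Yoshii's method, as quoted above, scores each position of the song spectrogram against a drum-sound template. The following sketch uses plain squared distance as a stand-in for Goto's distance measure; the function name and toy data are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def template_distances(song_spec, template):
    """Slide a drum-sound template over a song spectrogram and return
    the distance at each frame offset (smaller = better match).

    Both inputs are (frequency_bins x time_frames) arrays; plain
    squared distance stands in for the paper's distance measure.
    """
    n_frames = template.shape[1]
    offsets = song_spec.shape[1] - n_frames + 1
    return np.array([
        np.sum((song_spec[:, i:i + n_frames] - template) ** 2)
        for i in range(offsets)
    ])

# Toy data: the template is "hidden" at frame offset 2 of the song.
template = np.array([[1.0, 0.0], [0.0, 1.0]])
song = np.zeros((2, 6))
song[:, 2:4] = template
d = template_distances(song, template)
print(int(np.argmin(d)))  # prints 2
```

Onset detection would then threshold the distance curve; the paper's adaptation and harmonic-suppression steps refine the template and the song spectrogram before this matching stage.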
Fulop, Sean A., et al., “Algorithms for computing the time-corrected instantaneous frequency (reassigned) spectrogram, with applications”, The Journal of the Acoustical Society of America, Volume 119, Issue 1, January 2006, pp. 360-371. A modification of the spectrogram (log magnitude of the short-time Fourier transform) to more accurately show the instantaneous frequencies of signal components was first proposed in 1976 [Kodera et al., Phys. Earth Planet. Inter. 12, 142–150 (1976)], and has been considered or reinvented a few times since but never widely adopted. This paper presents a unified theoretical picture of this time-frequency analysis method, the time-corrected instantaneous frequency spectrogram, together with detailed implementable algorithms comparing three published techniques for its computation. The new representation is evaluated against the conventional spectrogram for its superior ability to track signal components. The lack of a uniform framework for either mathematics or implementation details which has characterized the disparate literature on the schemes has been remedied here. Fruitful application of the method is shown in the realms of speech phonation analysis, whale song pitch tracking, and additive sound modeling. (page 360, Abstract). These facts form the foundation of the time-corrected instantaneous frequency spectrogram, and also justify this name for it. To see how to use them, however, one must note that the digital form of the short-time Fourier transform of a signal provides, in effect, a filtered analytic signal at each frequency bin, thereby decomposing the original signal into a number of component signals one for each frequency bin whose instantaneous frequency can then be computed using Rihaczek’s equations. (page 361, 5th paragraph). 
The reassigned bandwidth-enhanced additive sound model22 is a high-fidelity representation that allows manipulation and transformation of a great variety of sounds, including noisy and nonharmonic sounds. This sound model combines sinusoidal and noise energy in a homogeneous representation, obtained by means of the time-corrected instantaneous frequency spectrogram. The amplitude and frequency envelopes of the line components are obtained by following ridges on a TCIF spectrogram. This model yields greater precision in time and frequency than is possible using conventional additive techniques, and preserves the temporal envelope of transient signals, even in modified reconstruction. (page 369, 1st full paragraph). Conventional spectrogram of acoustic bass pluck, computed using a Kaiser window of 1901 samples at 44.1 kHz with a shaping parameter to achieve 66 dB of sidelobe rejection. (page 369, Fig. 8). Tchernichovski, Ofer, et al., “A procedure for an automated measurement of song similarity”, Animal Behaviour, Volume 59, Issue 6, June 2000, pp. 1167-1176. Assessment of vocal imitation requires a widely accepted way of describing and measuring any similarities between the song of a tutor and that of its pupil. Quantifying the similarity between two songs, however, can be difficult and fraught with subjective bias. We present a fully automated procedure that measures parametrically the similarity between songs. We tested its performance on a large database of zebra finch, Taeniopygia guttata, songs. The procedure uses an analytical framework of modern spectral analysis to characterize the acoustic structure of a song. This analysis provides a superior sound spectrogram that is then reduced to a set of simple acoustic features. Based on these features, the procedure detects similar sections between songs automatically. 
In addition, the procedure can be used to examine: (1) imitation accuracy across acoustic features; (2) song development; (3) the effect of brain lesions on specific song features; and (4) variability across different renditions of a song or a call produced by the same individual, across individuals and across populations. By making the procedure available we hope to promote the adoption of a standard, automated method for measuring similarity between songs or calls. (page 1167, Abstract).

For each pair of time windows labelled as ‘similar’ for two songs being compared, we calculated the probability that the goodness of the match would have occurred by chance as described above. We are left, then, with a series of P values, and the lower the P, the higher the similarity. For convenience we transform these P values to 1-P; therefore, a 99% similarity between a pair of windows means that the probability that the goodness of the match would have occurred by chance is less than 1%. In this case, 99% similarity does not mean that the features in the two songs being compared are 99% similar to each other. In practice and because of how our thresholds were set, songs or sections of songs that get a score of 99% similarity tend, in fact, to be very similar. (page 1172, 1st paragraph of section entitled “The final similarity score”).

Altes, Richard, “Detection, estimation, and classification with spectrograms”, The Journal of the Acoustical Society of America, Volume 67, Issue 4, April 1980, pp. 1232-1246. A locally optimum detector correlates the data spectrogram with a reference spectrogram in order to detect (i) a known signal with unknown delay and Doppler parameters, (ii) a random signal with known covariance function, or (iii) the output of a random, time-varying channel with known scattering function. Spectrogram correlation can also be used for maximum likelihood parameter estimation, e.g., estimation of delay or center frequency of a signal.
To estimate an analog input signal from its spectrogram, a modified deconvolution operation can be used together with a predictive noise canceler. If no noise is added to the spectrogram, the mean-square error of this signal estimate is independent of the window function that is used to construct the spectrogram. When estimates of specific signal parameters are obtained directly from the spectrogram, these estimates have mean-square errors that depend upon both signal and window waveforms. Spectrogram correlation can be used for classification as well as for estimation and detection. Parameter estimators and detectors are, in fact, specialized kinds of classifiers. (page 1232, Abstract).

A measurement problem may involve estimation of an analog time signal, or of specific signal parameters such as time-of-arrival or center frequency. A minimum mean-square error (MMSE) signal estimate can be obtained from a spectrogram, except that an unknown constant will always be added to the signal's phase. Estimates of signal parameters can be obtained directly from the spectrogram. Variance bounds that describe the accuracy of different parameter estimates are compared, and their dependence upon signal and window waveforms is assessed. (page 1233, 1st full paragraph).

US Patent Application Publications

Covell 2007/0130580

A technique for generating audio descriptors using wavelets is described in U.S. Provisional Patent Application No. 60/823,881, for "Audio Identification Based on Signatures." That application describes a technique that uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact descriptors/fingerprints of audio snippets that can be efficiently matched. The technique uses wavelets, which is a known mathematical tool for hierarchically decomposing functions.
In "Audio Identification Based on Signatures," an implementation of a retrieval process includes the following steps: 1) given the audio spectra of an audio snippet, extract spectral images of, for example, 11.6*w ms duration, with random spacing averaging d-ms apart. For each spectral image: 2) compute wavelets on the spectral image; 3) extract the top-t wavelets; 4) create a binary representation of the top-t wavelets; 5) use min-hash to create a sub-fingerprint of the top-t wavelets; 6) use LSH with b bins and l hash tables to find sub-fingerprint segments that are close matches; 7) discard sub-fingerprints with less than v matches; 8) compute a Hamming distance from the remaining candidate sub-fingerprints to the query sub-fingerprint; and 9) use dynamic programming to combine the matches across time. (paras 0033-0034).

Ke et al. uses computer vision techniques to find highly discriminative, compact statistics for audio. Their procedure was trained on labeled pairs of positive examples (where x and x' are noisy versions of the same audio) and negative examples (where x and x' are from different audio). During this training phase, a machine-learning technique based on boosting uses the labeled pairs to select a combination of 32 filters and thresholds that jointly create a highly discriminative statistic. The filters localize changes in the spectrogram magnitude, using first and second order differences across time and frequency. (para 0063).

Powar 2011/0273455

As yet another example of a technique to identify content within the media stream, a media sample can be analyzed to identify its content using a localized matching technique. For example, generally, a relationship between two media samples can be characterized by first matching certain fingerprint objects derived from the respective samples. A set of fingerprint objects, each occurring at a particular location, is generated for each media sample.
Each location may be determined depending upon content of a respective media sample and each fingerprint object may characterize one or more local features at or near the respective particular location. A relative value is next determined for each pair of matched fingerprint objects. (para 0043).

More specifically, using the methods described above, a relationship between two audio samples can be characterized by generating a time-frequency spectrogram of the samples (e.g., computing a Fourier Transform to generate frequency bins in each frame), and identifying local energy peaks of the spectrogram. Information related to the local energy peaks is extracted and summarized into a list of fingerprint objects, each of which optionally includes a location field, a variant component, and an invariant component. Certain fingerprint objects derived from the spectrogram of the respective audio samples can then be matched. A relative value is determined for each pair of matched fingerprint objects, which may be, for example, a quotient or difference of logarithm of parametric values of the respective audio samples. (para 0056).

In an example embodiment, additional user input may be collected via voice or touch-tone (i.e., DTMF tones) to further control lyric delivery or trigger additional events such as transaction events. For example, by interacting with the user through the capture device or the delivery device, the telephone, and text-displaying device respectively, the service provider may provide purchase options to the user to obtain the record album containing the broadcast and identified song for which the lyrics were sought. (para 0086).

Selby 2011/0307085

Automatic recognition of sample media content is provided. A spectrogram is generated for successive time slices of audio signal. One or more sample hash vectors are generated for a time slice by calculating ratios of magnitudes of respective frequency bins from a column for the time slice.
In a primary evaluation stage an exact match of bits of the sample hash vector is performed to entries in a look-up table to identify a group of one or more reference hash vectors. In a secondary evaluation stage a degree of similarity between the sample hash vector and each of the group of reference hash vectors is performed to identify any reference hash vectors that are candidates for matching the sample media content, each reference hash vector representing a time slice of reference media content. (Abstract). An example embodiment can provide recognition against a very large database source of fingerprinted content (for example for in excess of one million songs). (para 0014). In this example a source signal in the form of an audio signal is processed to generate a spectrogram, for example by applying a Fast Fourier Transform (FFT) to the audio signal. In an example embodiment, the audio signal should be formatted in a manner consistent with a method of generating the database against which the audio signal is to be compared. In one example embodiment, the audio signal can be converted to a plain .WAV format, sampled at, for example, 12 kHz, in stereo if possible or mono if not and with, for example, 16 bits per sample. In one example embodiment, stereo audio comprising a left channel and a right channel is represented as sum (left plus right) and difference (left minus right) channels in order to give greater resilience to voice-over and similar distortions. The audio file is then processed to generate a spectrogram. The parameters applied to the spectrogram are broadly based on the human ear's perception of sound since the kind of distortions that the sound is likely to go through are those which preserve a human's perception. The spectrogram includes a series of columns of information for successive sample intervals (time slices). Each time slice corresponds to, for example, 1 to 50 ms (for example approximately 20 ms). 
Successive segments can overlap by a substantial proportion of their length, for example by 90-99%, for example about 97%, of their length. As a result, the character of the sound tends to change only slowly from segment to segment. A column for a time slice can include a plurality of frequency bins arranged on a logarithmic scale, with each bin being, for example, approximately one semitone wide. (paras 0047-0049).

Cooper 2005/0123053

The ordered information may include audio, video, text or any other information having an ordering dimension, such as time for audio and/or video information and position for text information. The retrieved and/or received information is analyzed to determine an appropriate type of parameterization to be applied to the received and/or retrieved information. For example, different windowing and parameterization may be applied to audio information, video information, textual information or other types of ordered information. In a first exemplary embodiment according to this invention, audio information, such as an audio waveform, is windowed into frames or the frames associated with the video information accompanying the audio information in the work are used as windows. A parameterization of the windowed audio information is then determined. The windowed audio information may be parameterized using a Short Time Frame Fourier Transform (STFT), a Fourier Transform, a Mel-Frequency Cepstral Coefficients analysis, a spectrogram, a Fast Fourier Transform (FFT), wavelet decomposition or any other known or later-developed analysis technique without departing from the spirit and scope of this invention. (paras 0033-0035).
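The Selby excerpts above fix concrete front-end parameters: 12 kHz sampling, time slices of roughly 20 ms, about 97% overlap between successive slices, and semitone-wide logarithmic frequency bins. A sketch under those numbers follows; the choice of f_min and the 48-semitone range are assumptions for illustration, not values from the publication:

```python
import numpy as np

# Selby-style parameters quoted in the excerpts above.
rate = 12000                                # 12 kHz sampling
slice_len = int(0.020 * rate)               # ~20 ms -> 240 samples
hop = max(1, int(slice_len * (1 - 0.97)))   # ~97% overlap -> 7 samples

def semitone_spectrogram(samples, f_min=110.0, n_semitones=48):
    """Magnitude spectrogram whose columns use semitone-wide log bins.

    FFT bins are folded into semitone indices relative to f_min;
    f_min and n_semitones are illustrative assumptions.
    """
    freqs = np.fft.rfftfreq(slice_len, d=1.0 / rate)
    window = np.hanning(slice_len)
    # Map each FFT bin to a semitone index (DC gets -1 and is dropped).
    semis = np.full(freqs.shape, -1, dtype=int)
    pos = freqs > 0
    semis[pos] = np.floor(12 * np.log2(freqs[pos] / f_min)).astype(int)
    columns = []
    for start in range(0, len(samples) - slice_len + 1, hop):
        mags = np.abs(np.fft.rfft(samples[start:start + slice_len] * window))
        col = np.zeros(n_semitones)
        for s, m in zip(semis, mags):
            if 0 <= s < n_semitones:
                col[s] += m
        columns.append(col)
    return np.array(columns).T   # (semitone_bins, time_slices)

# A 440 Hz tone should concentrate energy two octaves above f_min=110.
tone = np.sin(2 * np.pi * 440.0 * np.arange(rate) / rate)
spec = semitone_spectrogram(tone)
```

With 97% overlap the column-to-column change is small, which is the property the excerpt relies on for hash-vector stability from slice to slice.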
The storage medium of claim 28, wherein the stream of ordered information comprises at least audio information, and the instructions for parameterizing the stream of ordered information comprise instructions for parameterizing the stream of ordered audio information based on at least one of a STFT Fourier Transform, a Mel-Frequency Cepstral Coefficients Analysis, a spectrogram, a Fast Fourier Transform and wavelet decomposition. (claim 29).

Sharifi 2016/0322066

An individual may hear a song on the radio or in a public establishment, and may want to later acquire the song by purchasing the song from an online music distribution service. (para 0003). In some implementations, the method further comprises normalizing one or more intensity values of each spectral slice of the spectrogram to create a normalized spectrogram, wherein determining the average spectral envelope of the spectrogram comprises determining the average spectral envelope of the normalized spectrogram. In some implementations, the spectral fluctuation score comprises the mean of the one or more differences. In some implementations, the mean of the one or more differences comprises the mean of the absolute values of the differences between adjacent values in the average spectral envelope. In some implementations, the method further comprises approximating a first derivative of the average spectral envelope in the frequency dimension, wherein determining the one or more differences between adjacent values in the average spectral envelope comprises determining the one or more differences between adjacent values in the average spectral envelope based on the first derivative of the average spectral envelope.
In some implementations, the method comprises determining an average squared magnitude of the audio sample, and comparing the average squared magnitude of the audio sample to a threshold value, wherein computing the spectrogram is based on determining that the average squared magnitude of the audio sample is greater than the threshold value. (para 0011).

The spectral fluctuation detector 110 normalizes one or more intensity values of each spectral slice of the spectrogram (218). For example, the spectral fluctuation detector 110 normalizes the intensity values of each spectral slice of the filtered spectrogram to compensate for high and low volume in the captured environmental audio data represented by the intensity values in the spectrogram. Specifically, the spectral fluctuation detector 110 normalizes each slice of the spectrogram by dividing the intensity values associated with each spectral slice by the harmonic mean of the intensity values of the spectral slices of a selected portion of the spectrogram. (para 0065).

US Patents

Wang 7,853,664

A method and system is described which allows users to identify (pre-recorded) sounds such as music, radio broadcast, commercials, and other audio signals in almost any environment. The audio signal (or sound) must be a recording represented in a database of recordings. The service can quickly identify the signal from just a few seconds of excerption, while tolerating high noise and distortion. Once the signal is identified to the user, the user may perform transactions interactively in real-time or offline using the identification information. (Abstract).

The present invention relates generally to methods and apparatuses for obtaining information about and/or purchasing pre-recorded music, and more particularly to a method and system for obtaining information about and/or purchasing pre-recorded music while listening to the music at any location.
When listening to music, people often want to identify a song currently being played on an audio system, such as a radio, but can identify neither the title nor the artist. The listener may simply be interested in the artist, title, lyrics, genre, or other information about the music. The listener may also be interested in obtaining a copy of the music, i.e., purchasing the music. (col. 1 lines 14-26). Many times unidentified music is heard when riding in a car (or at another similarly inconvenient location). Moreover, when a listener decides he wishes to know the identity of a particular song being played, it is usually well into the song. Therefore, even if the listener were to begin recording the song at the moment he decides he wishes to know the identity of the song, the sample would be relatively short and possibly noisy depending upon the quality of the audio recording and the recording environment. Certainly, most listeners do not carry high quality recording equipment with them when traveling in a car. Moreover, even if the listener knows the identity of a song, as time passes the desire to obtain a copy of the song also passes. This is the so-called impulse purchase phenomenon, which is well known to retailers. The impulse purchase phenomenon is particularly strong where the listener has not heard the song before, and thus is unfamiliar with the title and/or recording artist. Unfortunately, there is currently no way for a music seller to take advantage of a potential impulse purchase resulting from a listener hearing a song (for perhaps the first time) in a car or other location that is remote from normal retail locations. (col. 2 lines 1-24). The power norm method of landmarking is especially good for finding transients in the sound signal. The power norm is actually a special case of the more general Spectral Lp Norm, where p=2. 
The general Spectral Lp Norm is calculated at each time along the sound signal by calculating the spectrum, for example via a Hanning-windowed Fast Fourier Transform (FFT). The Lp norm for that time slice is then calculated as the sum of the p-th power of the absolute values of the spectral components, optionally followed by taking the p-th root. As before, the landmarks are chosen as the local maxima of the resulting values over time. (col. 20 lines 57-67).

In the neighborhood of each landmark timepoint a frequency analysis is performed to extract the top several spectral peaks. A simple such fingerprint value is just the single frequency value of the strongest spectral peak. The use of such a simple peak resulted in surprisingly good recognition in the presence of noise, but resulted in many false positive matches due to the non-uniqueness of such a simple scheme. Using fingerprints consisting of the two or three strongest spectral peaks resulted in fewer false positives, but in some cases created a susceptibility to noise if the second-strongest spectral peak was not sufficiently strong to distinguish it from its competitors in the presence of noise--the calculated fingerprint value would not be sufficiently stable. Despite this, the performance of this case was also good. (col. 21 lines 18-31).

In addition to finding the strongest spectral components, there are other spectral features that can be extracted and used as fingerprints. LPC analysis extracts the linearly predictable features of a signal, such as spectral peaks, as well as spectral shape. LPC coefficients of waveform slices anchored at landmark positions can be used as fingerprints by hashing the quantized LPC coefficients into an index value. LPC is well-known in the art of digital signal processing. (col. 21 lines 47-54).
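Wang's Lp-norm landmarking and single-peak fingerprinting, as quoted above, can be sketched together. A minimal sketch under assumptions: frame sizes, hop, and the strict-local-maximum test are illustrative choices not specified in the citation.

```python
import numpy as np

def lp_norm_landmarks(signal, p=2.0, n_fft=512, hop=256, take_root=False):
    """Spectral Lp-norm landmarking (p=2 is the power norm).

    For each Hanning-windowed FFT slice, the Lp norm is the sum of the
    p-th power of the absolute spectral values (optionally followed by
    the p-th root).  Landmarks are the local maxima of that value over
    time; a simple fingerprint at each landmark is the frequency bin of
    the strongest spectral peak.
    """
    win = np.hanning(n_fft)
    norms, specs = [], []
    for i in range(0, len(signal) - n_fft + 1, hop):
        spec = np.abs(np.fft.rfft(signal[i:i + n_fft] * win))
        lp = np.sum(spec ** p)
        norms.append(lp ** (1.0 / p) if take_root else lp)
        specs.append(spec)

    landmarks = []
    for t in range(1, len(norms) - 1):
        # landmark = strict local maximum of the Lp norm over time
        if norms[t] > norms[t - 1] and norms[t] > norms[t + 1]:
            # fingerprint: bin index of the strongest spectral peak
            landmarks.append((t, int(np.argmax(specs[t]))))
    return landmarks
```

For a tone burst with a smooth envelope, the landmark lands at the energy peak and the fingerprint bin matches the tone's frequency bin.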
A key insight into the matching process is that the time evolution in matching sounds must follow a linear correspondence, assuming that the timebases on both sides are steady. This is almost always true unless the sound on one side has been nonlinearly warped intentionally or subject to defective playback equipment such as a tape deck with a warbling speed problem. Thus, the matching fingerprints yielding correct landmark pairs (landmark_n, landmark*_n) in the scatter list of a given sound_ID must have a linear correspondence of the form landmark*_n = m * landmark_n + offset, where m is the slope, and should be near 1, landmark_n is the corresponding timepoint within the exogenous sound signal, landmark*_n is the corresponding timepoint within the library sound recording indexed by sound_ID, and offset is the time offset into the library sound recording corresponding to the beginning of the exogenous sound signal. (col. 23 lines 42-60).

Barton 8,015,123

Methods and systems for providing lyric information for a song within an audio signal. This may be done, for example, to allow a user to sing along with a song the user hears on a radio. To provide the lyric information, an interactive service may be accessed. A sample of an audio signal that includes at least a portion of the song may be captured. The sample of the audio signal may be provided to the interactive service. Lyric information may then be received for the song at a user device. The user device may also display the lyric information in synchrony with the rendering of the song within the audio signal to, for example, allow the user to sing along with the song. (Abstract).

The power norm method of landmarking is especially good for finding transients in the sound signal. The power norm is actually a special case of the more general Spectral Lp Norm, where p=2.
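The linear-correspondence test above has a well-known practical consequence: when the slope m is near 1, all correct landmark pairs share (roughly) one time offset, so voting on offsets scores a candidate match. A sketch of the idea only; the example pairs below are hypothetical.

```python
from collections import Counter

def best_offset_score(pairs):
    """Score candidate landmark pairs (t_query, t_library) under the
    linear model t_library = m * t_query + offset with slope m near 1.

    With m ~= 1, every correct pair shares approximately the same
    offset, so the mode of (t_library - t_query) identifies the offset
    and its vote count is the match score for that sound_ID.
    """
    offsets = Counter(t_lib - t_q for t_q, t_lib in pairs)
    return offsets.most_common(1)[0]  # (winning offset, vote count)

# Hypothetical pairs: four agree on offset 100, one is spurious.
pairs = [(10, 110), (25, 125), (40, 140), (60, 160), (33, 999)]
```

Real systems bin offsets to absorb small slope deviations; here exact equality suffices to show the voting step.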
The general Spectral Lp Norm is calculated at each time along the sound signal by calculating the spectrum, for example via a Hanning-windowed Fast Fourier Transform (FFT). The Lp norm for that time slice is then calculated as the sum of the p-th power of the absolute values of the spectral components, optionally followed by taking the p-th root. As before, the landmarks are chosen as the local maxima of the resulting values over time. (col. 16 lines 24-34).

Baluja 8,411,977

In one example implementation, a database was created using 6,500 songs, with 200 audio samples (each approximately 1.5 seconds) extracted from each song, with a resulting total of 1,300,000 samples. Thus, each song was converted into a series of samples for storage in the database. Each song was converted from a typical audio format (e.g., mp3, wav, etc.) to a mel-frequency spectrogram with tilt and amplitude normalization over a pre-selected frequency range (400 Hz to 4 kHz). For computational efficiency, the input audio was low-pass filtered to about 5/4 of the top of the selected frequency range and then down sampled accordingly. For example, using 4 kHz as the top of our frequency range of interest and using 44.1 kHz as the input audio sampling rate, we low-pass filtered using a simple FIR filter with an approximate frequency cut between 5 and 5.5 kHz and then subsampled to a 11.025 kHz sampling rate. To minimize volume-change effects, the audio sample energy was normalized using the local average energy, taken over a tapered, centered 10-second window. To minimize aperture artifacts, the average energy was also computed using a tapered Hamming window. A spectrogram "slice rate" of 100 Hz (that is, a slice step size of 10 ms) was used. For the slices, audio data was taken, and a tapered window (to avoid discontinuity artifacts in the output) applied, and then an appropriately sized Fourier transform was applied.
The Fourier magnitudes were "de-tilted" using a single-pole filter to reduce the effects of low-frequency bias and then "binned" (averaged) into B frequency samples at mel-scale frequency spacing (e.g., B=32). (col. 14 lines 35-62).

Zakarauskas 7,957,967

A system classifies the source of an input signal. The system determines whether a sound source belongs to classes that may include human speech, musical instruments, machine noise, or other classes of sound sources. The system is robust, performing classification despite variation in sound level and noise masking. Additionally, the system consumes relatively few computational resources and adapts over time to provide consistently accurate classification. (Abstract).

The system classifies input signals as follows: An input signal is digitized into binary data, which is transformed to a time-frequency representation (spectrogram). Background noise is estimated and a signal detector isolates periods containing signal. Periods without signal content are included in the noise estimate. The spectrogram of the input signal is rescaled and compared to spectrograms for a number of templates defining a signal model, where each signal model represents a source class. The average distortion between the measured spectrograms and the spectrograms of each signal model is calculated. The signal model with the lowest distortion is selected. (col. 2 lines 7-19).

initiating spectrogram template matching, using the first signal model, in response to determining that the harmonic is present; and forgoing spectrogram template matching when the harmonic is not present; where determining whether the harmonic is present comprises: determining a frequency range to scan; scanning the time-frequency representation over the frequency range; identifying local peaks in the frequency range that exceed neighboring spectrum values by more than a peak threshold (claim 1).
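The de-tilt and mel-binning steps described in the Baluja excerpt can be sketched for one spectrogram slice. Assumptions are flagged in the lead comment: the citation gives only "single-pole filter" and B=32 mel-spaced bins over 400 Hz to 4 kHz, so the pole value, the subtract-a-leaky-average de-tilt, and binning by simple averaging are illustrative choices.

```python
import numpy as np

def mel_bin_slice(fft_mag, sr=11025, n_bins=32, fmin=400.0, fmax=4000.0,
                  tilt_pole=0.98):
    """De-tilt one slice of Fourier magnitudes with a single-pole filter,
    then average ("bin") them into n_bins mel-spaced bands.

    The pole value and averaging scheme are assumptions; only the
    single-pole de-tilt and B mel-spaced bins come from the citation.
    """
    fft_mag = np.asarray(fft_mag, dtype=float)

    # Single-pole de-tilt across frequency: subtract a leaky running
    # average to reduce low-frequency bias.
    smoothed = np.empty_like(fft_mag)
    acc = fft_mag[0]
    for i, v in enumerate(fft_mag):
        acc = tilt_pole * acc + (1 - tilt_pole) * v
        smoothed[i] = v - acc

    # Mel-spaced band edges between fmin and fmax.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    edges_hz = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax),
                                     n_bins + 1))
    freqs = np.linspace(0, sr / 2, len(fft_mag))

    # Average the de-tilted magnitudes inside each mel band.
    binned = np.zeros(n_bins)
    for b in range(n_bins):
        mask = (freqs >= edges_hz[b]) & (freqs < edges_hz[b + 1])
        binned[b] = smoothed[mask].mean() if mask.any() else 0.0
    return binned
```

With an 11.025 kHz rate and a 1024-point FFT (513 magnitude bins), each mel band spans several FFT bins, matching the "binned (averaged)" description.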
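The rescale-and-compare classification scheme in the excerpt (select the signal model with the lowest average distortion against the measured spectrogram) can be sketched as well. The max-normalization rescaling and squared-error distortion are assumptions; the citation says only "rescaled" and "average distortion."

```python
import numpy as np

def classify_source(spectrogram, models):
    """Pick the signal model whose templates are closest to the measured
    spectrogram: rescale the input for level invariance, compute the
    average distortion against each model's templates, and return the
    class with the lowest distortion.

    `models` maps class name -> list of template spectrograms (2-D
    arrays of the same shape as `spectrogram`).
    """
    s = spectrogram / (np.max(spectrogram) + 1e-12)  # rescale for level
    best_class, best_dist = None, np.inf
    for name, templates in models.items():
        # Average distortion between the input and this model's templates.
        dist = np.mean([np.mean((s - t) ** 2) for t in templates])
        if dist < best_dist:
            best_class, best_dist = name, dist
    return best_class
```

Because the input is rescaled before comparison, a louder rendition of the same source still matches its class, consistent with the claimed robustness to variation in sound level.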
The method of claim 4, where the weight is proportional to a signal-to-noise ratio of the matching spectrogram template (claim 5).

Contact Information

Any inquiry concerning this communication or earlier communications from the examiner should be directed to examiner ROBERT STEVENS whose telephone number is (571) 272-4102. The examiner can normally be reached Mon - Fri 6:00 - 2:30. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Amy Ng, can be reached on (571) 270-1698. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ROBERT STEVENS/
Primary Examiner, Art Unit 2164
December 27, 2025

Prosecution Timeline

Feb 05, 2025
Application Filed
Dec 27, 2025
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12585618
SYSTEMS AND METHODS FOR SEQUENCE-BASED DATA CHUNKING FOR DEDUPLICATION
2y 5m to grant; granted Mar 24, 2026
Patent 12579100
COMPUTER SYSTEMS THAT PUT PARENTS IN CONTROL OF THEIR KID'S ONLINE SAFETY: THE STATE OF A KID (E.G., EMOTIONAL STATE), INDUCED BY CONTENT FROM A SOCIAL MEDIA PLATFORM, TRIGGERS PARENT-PRESCRIBED ACTIONS BY THE KID'S COMPUTER SYSTEM COMPRISING AT LEAST ONE OF BLOCKING THE CONTENT AND INFORMING AT LEAST ONE OF THE PARENT, THE KID, AND THE SOCIAL MEDIA PLATFORM OF THE INDUCED STATE
2y 5m to grant; granted Mar 17, 2026
Patent 12572579
LARGE LANGUAGE MODEL BASED SYSTEM UPGRADE CLASSIFIER
2y 5m to grant; granted Mar 10, 2026
Patent 12572542
SYSTEMS AND METHODS FOR GENERATING AND DISPLAYING A DATA PIPELINE USING A NATURAL LANGUAGE QUERY, AND DESCRIBING A DATA PIPELINE USING NATURAL LANGUAGE
2y 5m to grant; granted Mar 10, 2026
Patent 12561519
SCALABLE FORM MATCHING
2y 5m to grant; granted Feb 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2
Expected OA Rounds
81%
Grant Probability
92%
With Interview (+11.1%)
2y 9m
Median Time to Grant
Low
PTA Risk
Based on 517 resolved cases by this examiner. Grant probability derived from career allow rate.
