Prosecution Insights
Last updated: April 19, 2026
Application No. 17/589,710

MEDIA CLASSIFICATION AND IDENTIFICATION USING MACHINE LEARNING

Final Rejection §103
Filed: Jan 31, 2022
Examiner: HICKS, SHIRLEY D.
Art Unit: 2168
Tech Center: 2100 — Computer Architecture & Software
Assignee: Audible Magic Corporation
OA Round: 6 (Final)

Grant Probability: 64% (Moderate)
OA Rounds: 7-8
To Grant: 3y 2m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 64% (grants 64% of resolved cases; 69 granted / 107 resolved; +9.5% vs TC avg)
Interview Lift: +56.3% (strong), comparing resolved cases with vs. without an interview
Avg Prosecution: 3y 2m typical timeline; 38 applications currently pending
Total Applications: 145 across all art units (career history)

Statute-Specific Performance

§101: 10.7% (-29.3% vs TC avg)
§103: 51.1% (+11.1% vs TC avg)
§102: 24.2% (-15.8% vs TC avg)
§112: 12.3% (-27.7% vs TC avg)

Tech Center averages are estimates • Based on career data from 107 resolved cases

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendments

The action is responsive to the Applicant’s Amendment filed on 7/02/2025. Claims 1-20 are pending in the application. Claims 1, 14, and 16 are currently amended. No claims are canceled. No new claims are added.

Response to Arguments

Applicant’s arguments with respect to the rejections of claims 1-20 have been fully considered. In view of the claim amendment filed, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made. Further, regarding the new limitations recited in claims 1, 14, and 16, it is submitted that they are properly addressed by the new ground of rejection. Furthermore, it is also submitted that all limitations in pending claims, including those not specifically argued, are properly addressed. The reason is set forth in the rejections. See claim analysis below for detail.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 1-13 are rejected under 35 U.S.C. 103 as being unpatentable over Trollope et al. (US 20150301718 A1, hereinafter Trollope) in view of Harrison et al. (US 20100263020 A1). Regarding Claim 1, Trollope discloses a method comprising: receiving a plurality of media content items by a first processing device (Fig. 1; [0028]: Digital entertainment system 106 can include any suitable device that is capable of receiving, converting, processing, rendering, and/or transmitting media content; [0016]: As another example, media content may be provided by any suitable source, such as a television provider, a video hosting and/or streaming service, a video recorder, and/or any other suitable content provider; Figs. 3-5; [0026]: For example, one or more suitable portions of processes 300, 400, and 500 can run on one or more of server(s) 102, digital entertainment system 106, and mobile device(s) 108 of system 100), wherein each media content item of the plurality of media content items comprises audio ([0016]: For example, media content can include any suitable type(s) of content, such as one or more of audio content; Fig. 
3; [0045]: At 310, process 300 can obtain an audio sample of the media content item); transforming each media content item of the plurality of media content items from a time domain to a frequency domain by performing one of a discrete Cosine transform or a fast Fourier transform ([0062]: In some implementations, the audio fingerprint can be generated using any suitable audio fingerprinting algorithms, such as two-dimensional transforms (e.g., a discrete cosine transform)); processing, in the frequency domain, audio features of each media content item of the plurality of media content items ([0062]: In a more particular example, one or more features of the audio sample (e.g., peaks, amplitudes, power levels, frequencies, signal to noise ratios, and/or any other suitable feature) can be generated for one or more suitable portions of the audio sample. The features can then be processed to form one or more audio fingerprints)) using one or more trained machine learning model, wherein for each media content item the one or more trained machine learning model determines a first probability of a first media classification indicating music content in the media content item and a second probability of a second media classification indicating a lack of music content in the media content item ([0069]: In some implementations, the segments of the audio signal can be classified using any suitable audio classification technique or combination of techniques, such as a Hidden Markov Model, a Bayesian classifier, the Viterbi algorithm, the Baum-Welch algorithm, and/or any other suitable classification model; [0021]: In some implementations, the mechanisms can identify one or more portions of the media content item that correspond to the identified audio segments as being music segments of the media content item; [0087]: For example, as described in connection with FIG. 
4, the music segment can be identified using any suitable audio segmentation and/or classification technique (e.g., steps 420-435 of FIG. 4)); determining, for each media content item of the plurality of media content items, whether the media content item has the first media classification or the second media classification based on the first probability and the second probability for that media content item ([0021]: The mechanisms can then classify each of the segments into a class, such as “silence,” “speech,” “music,” “song,” “speech with music background,” “noise,” and/or any other suitable class); filtering out at least portions of those media content items of the plurality of media content items that have the second media classification to result in a remainder of media content items ([0067]: Additionally, the audio signal can be… filtered, and/or processed using any suitable audio processing technique; Fig. 5; [0087]: As illustrated, process 500 can begin by identifying a music segment of a media content item at 505); performing further processing of the media content items by sending, for each media content item of the remainder of media content items, at least one of a) at least a portion of the media content item or b) a digital fingerprint of at least the portion of the media content item to a second processing device (Fig. 
3; [0048]: In a more particular example, process 300 can transmit the audio sample and/or an audio fingerprint generated from the audio sample to the server; [0026]-[0027]: For example, one or more suitable portions of processes 300, 400, and 500 can run on one or more of server(s) 102… of system 100), wherein the second processing device is to perform identification of each of the remainder of media content items based on at least one of a) at least the portion of the media content item or b) the digital fingerprint of at least the portion of the media content item ([0048]: The server can then identify a media content item corresponding to the audio sample; [0027]: Server(s) 102 can include any suitable device that is capable of searching for music items relating to media content, performing video matching, audio matching, lyrics matching, and/or sentiment matching analysis on media content; Fig. 5; [0099]: At 535, process 500 can identify one or more music items that match the music segment). However, Trollope does not explicitly teach “responsive to performing the identification of each of the remainder of media content items, performing, with respect to at least the remainder of the media content items, an action based on a size of the remainder of media content items compared to a size of the plurality of media content items, wherein the action comprises a licensing rate for the remainder of media content items based on the size of the remainder of media content items compared to the size of the plurality of media content items.” On the other hand, in the same field of endeavor, Harrison teaches responsive to performing the identification of each of the remainder of media content items (Figs. 
1a, 2; [0059] The policy engine 118 identifies 210 a policy specified for the reference content associated with the match metrics… The fingerprinting engine 116 determines whether there is a match 212 between an item of reference content and an item of hosted content based on one or more of the match metrics for a match exceeding a defined threshold value for the match metric), performing, with respect to at least the remainder of the media content items, an action based on a size of the remainder of media content items compared to a size of the plurality of media content items (Figs. 1a, 2; [0054]-[0063]: FIG. 2 is a flow chart illustrating steps performed by the VID server 100.… the fingerprinting engine 116 generates the proportion metrics by determining a ratio of the value indicated by the duration metric over a value indicating the length and/or size of the item of reference content or the item of hosted content), wherein the action comprises a licensing rate for the remainder of media content items based on the size of the remainder of media content items compared to the size of the plurality of media content items (Figs. 1a, 2; [Abstract]: A policy associated with the item of reference content is identified responsive to the value to that represents the correspondence, the policy including terms of use for the hosted content; [0059]: The policy engine 118 identifies 210 a policy specified for the reference content associated with the match metrics; [0081]- [0082] According to the terms specified in the policy, the funds can be allocated 629 between the media host 102, the content owner 101 and the VID server 100 in any appropriate way such as sharing by percentage split, by a flat payment, and so on, as specified by the content owner 101 in the policy agreement). 
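For orientation, the claim-1 pipeline mapped above (time-to-frequency transform, a first/second probability per item, filtering out non-music items) can be sketched roughly as below. This is a hypothetical illustration only: the function names are invented, a naive DFT stands in for the claimed DCT/FFT, and simple normalized scores stand in for the classification models cited from Trollope; none of this code comes from the application or the references.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Naive DFT (illustrative stand-in for the claimed DCT/FFT):
    transform one frame of audio samples from the time domain to the
    frequency domain, returning non-negative-frequency bin magnitudes."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def music_probabilities(score_music, score_no_music):
    """Normalize two non-negative per-class scores into the claimed first
    probability (music content) and second probability (no music content)."""
    p_music = score_music / (score_music + score_no_music)
    return p_music, 1.0 - p_music

def filter_non_music(items):
    """Filter out items having the second classification, leaving the
    claimed 'remainder of media content items' for identification."""
    return [it for it in items if it["p_music"] >= it["p_no_music"]]
```

For example, a constant frame such as `[1, 1, 1, 1]` transforms to energy only in the zero-frequency bin, and an item scored `music_probabilities(3.0, 1.0)` would receive a first probability of 0.75 and survive `filter_non_music`.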
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Trollope to incorporate the teachings of Harrison to process audio that contains the first media classification and perform a licensing action based on the size of the remainder of media content items. The motivation for doing so would be to identify and monetize hosted content, as recognized by Harrison ([0034] of Harrison: The VID server 100 is further configured to allow the media host 102 to identify hosted content that matches reference content, enter into policy agreements with content owners 101 and monetize hosted content). Regarding Claim 2, the combined teachings of Trollope and Harrison disclose the method of claim 1. Trollope further teaches further comprising: generating, for each media content item of the remainder of media content items, the digital fingerprint of at least the portion of the media content item (Fig. 5; [0088]: At 510, process 500 can generate an audio fingerprint of the music segment), wherein the digital fingerprint is sent to the second processing device (Fig. 3; [0048]: In a more particular example, process 300 can transmit the audio sample and/or an audio fingerprint generated from the audio sample to the server. The server can then identify a media content item corresponding to the audio sample). Regarding Claim 3, the combined teachings of Trollope and Harrison disclose the method of claim 2. Trollope further teaches wherein to perform identification of a media content item the second processing device is to: divide the digital fingerprint into a plurality of segments (Fig.
4; [0021]: For example, the mechanisms can divide the audio signal associated with the media content item into multiple segments (e.g., audio scenes) using any suitable audio segmentation technique; [0069]: For example, process 400 can divide the audio signal into multiple segments using any suitable audio segmentation technique or techniques); compare one or more segments of the plurality of segments of the digital fingerprint to known digital fingerprints of known media content items ([0072]: In a more particular example, an audio fingerprint representing one or more audio features of the audio segment can be compared against reference audio fingerprints that are stored and indexed by music item); and identify a match between the one or more segments of the digital fingerprint and a known digital fingerprint of a known media content item of a plurality of known media content items (Fig. 4; [0071]-[0072]: At 430, process 400 can identify the music content included in each of the audio segments that are identified at 425…The music content can then be identified by identifying a music item associated with a reference audio fingerprint that matches the audio fingerprint of the audio segment). Regarding Claim 4, the combined teachings of Trollope and Harrison disclose the method of claim 1. Trollope further teaches further comprising performing the following for the remainder of media content items: determining, for each media content item in the remainder of media content items, whether the media content item belongs to a first sub-class of media content items or a second sub-class of the media content items based on a result of the processing ([0069]: In some implementations, the segments of the audio signal can be classified using any suitable audio classification technique or combination of techniques; Fig. 
5; [0092]: For example, process 500 can analyze the melody and/or the lyrics of the music content contained in the music segment, the transcript associated with the music segment, metadata associated with the media content item (e.g., a title, description, user rating, user comment, genre, and/or any other suitable metadata)… Process 500 can then classify the music segment with one or more of a variety of sentiments)). Regarding Claim 5, the combined teachings of Trollope and Harrison disclose the method of claim 4. Trollope further teaches wherein the first sub-class of media content items is for a first music genre and the second sub-class of media content items is for a second music genre (Fig. 5; [0092]: For example, process 500 can analyze the melody and/or the lyrics of the music content contained in the music segment, the transcript associated with the music segment, metadata associated with the media content item (e.g., a title, description, user rating, user comment, genre, and/or any other suitable metadata)… Process 500 can then classify the music segment with one or more of a variety of sentiments). Regarding Claim 6, the combined teachings of Trollope and Harrison disclose the method of claim 1. Trollope further teaches wherein performing the further processing further comprises: comparing the digital fingerprint for the media content item to a plurality of additional digital fingerprints of a plurality of known works ([0072]: In a more particular example, an audio fingerprint representing one or more audio features of the audio segment can be compared against reference audio fingerprints that are stored and indexed by music item); identifying a match between the digital fingerprint and an additional digital fingerprint of the plurality of additional digital fingerprints, wherein the additional digital fingerprint is for a segment of a known work of the plurality of known works (Fig. 
4; [0071]: At 430, process 400 can identify the music content included in each of the audio segments that are identified at 425); and determining that the media content item comprises an instance of the known work ([0072]: The music content can then be identified by identifying a music item associated with a reference audio fingerprint that matches the audio fingerprint of the audio segment). Regarding Claim 7, the combined teachings of Trollope and Harrison disclose the method of claim 1. Trollope further teaches wherein: the first processing device is associated with a first entity that hosts user generated content (Fig. 1; [0016]: As another example, media content may be provided by any suitable source, such as a… a video hosting and/or streaming service; [0028]: Digital entertainment system 106 can include any suitable device that is capable of receiving, converting, processing, rendering, and/or transmitting media content); and the second processing device is associated with a second entity comprising a database of a plurality of known works against which the remainder of media content items is compared (Fig. 1; [0026]-[0027]: For example, one or more suitable portions of processes 300, 400, and 500 can run on one or more of server(s) 102… of system 100; Fig. 4; [0064]: process 400 can access a database that indexes and stores reference audio fingerprints by media content item; Fig. 5; [0093]: In some implementations, process 500 can access to and/or retrieve information relating to the collection of the music items (e.g., audio fingerprints, video fingerprints, lyrics, sentiment indicators, and/or any other suitable information relating to music items) from a database that stores and indexes such information by music item). Regarding Claim 8, the combined teachings of Trollope and Harrison disclose the method of claim 1. 
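The segment-matching step mapped to claims 3 and 6 above (divide the digital fingerprint into segments, compare segments against reference fingerprints of known works, identify a match) reduces to something like the following sketch. The index layout and all names are illustrative assumptions, not the structure disclosed in Trollope.

```python
def split_into_segments(fingerprint, seg_len):
    """Divide a digital fingerprint (any hashable sequence of values)
    into fixed-length segments."""
    return [tuple(fingerprint[i:i + seg_len])
            for i in range(0, len(fingerprint) - seg_len + 1, seg_len)]

def identify_known_work(fingerprint, reference_index, seg_len=4):
    """Compare each segment against an index of known-work segments; a
    hit identifies the item as an instance of that known work."""
    for segment in split_into_segments(fingerprint, seg_len):
        work = reference_index.get(segment)
        if work is not None:
            return work
    return None  # no known work matched any segment
```

In this toy model, a reference index built as `{seg: "Song A" for seg in split_into_segments(known_fp, 4)}` lets a query fingerprint sharing even one segment with `known_fp` be identified as "Song A".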
Trollope further teaches wherein the one or more trained machine learning model comprise a first Gaussian mixture model that determines the first probability of the first media classification and a second Gaussian mixture model that determines the second probability of the second media classification ([0069]: In some implementations, the segments of the audio signal can be classified using any suitable audio classification technique or combination of techniques, such as a Hidden Markov Model, a Bayesian classifier, the Viterbi algorithm, the Baum-Welch algorithm, and/or any other suitable classification model). Regarding Claim 9, the combined teachings of Trollope and Harrison disclose the method of claim 1. Trollope further teaches wherein the one or more trained machine learning model comprises one or more Gaussian mixture models trained to process feature vectors comprising up to fifty-two audio features for every tenth to half of a second of a media content item ([0069]: In some implementations, the segments of the audio signal can be classified using any suitable audio classification technique or combination of techniques, such as a Hidden Markov Model, a Bayesian classifier, the Viterbi algorithm, the Baum-Welch algorithm, and/or any other suitable classification model). Regarding Claim 10, the combined teachings of Trollope and Harrison disclose the method of claim 1. Trollope further teaches further comprising: for each media content item of the plurality of media content items, processing the media content item to determine the audio features of the media content item ([0062]: In a more particular example, one or more features of the audio sample (e.g., peaks, amplitudes, power levels, frequencies, signal to noise ratios, and/or any other suitable feature) can be generated for one or more suitable portions of the audio sample. 
The features can then be processed to form one or more audio fingerprints), wherein processing the media content item to determine the audio features comprises performing one of the discrete Cosine transform or the fast Fourier transform ([0062]: In some implementations, the audio fingerprint can be generated using any suitable audio fingerprinting algorithms, such as two-dimensional transforms (e.g., a discrete cosine transform)). Regarding Claim 11, the combined teachings of Trollope and Harrison disclose the method of claim 1. Trollope further teaches further comprising: for each media content item of the plurality of media content items, generating a feature vector comprising the audio features of the media content item, wherein the audio features comprise at least one of loudness, pitch, brightness, spectral bandwidth, energy in one or more spectral bands, spectral steadiness, or Mel-frequency cepstral coefficients (MFCCs) ([0062]: In a more particular example, one or more features of the audio sample (e.g., peaks, amplitudes, power levels, frequencies, signal to noise ratios, and/or any other suitable feature) can be generated for one or more suitable portions of the audio sample). Regarding Claim 12, the combined teachings of Trollope and Harrison disclose the method of claim 1. Trollope further teaches wherein one or more of the plurality of media content items comprise video ([0016]: For example, media content can include any suitable type(s) of content, such as one or more of audio content, video content). Regarding Claim 13, the combined teachings of Trollope and Harrison disclose the method of claim 1. Trollope further teaches wherein the plurality of media content items comprises millions of media content items ([0016]: The mechanisms can be implemented with respect to any suitable media content; Fig. 3; [0044]: As illustrated, process 300 can start by presenting a media content item at 305.
In some implementations, the media content item can include any suitable media content and can be provided by any suitable source). Claims 14-20 are rejected under 35 U.S.C. 103 as being unpatentable over Trollope et al. (US 20150301718 A1, hereinafter Trollope) in view of Hoffert (US Patent No. 5983176 A), and Harrison (US 20100263020 A1). Regarding Claim 14, Trollope discloses a method comprising: receiving a plurality of media content items (Fig. 1; [0028]: Digital entertainment system 106 can include any suitable device that is capable of receiving, converting, processing, rendering, and/or transmitting media content; [0016]: As another example, media content may be provided by any suitable source, such as a television provider, a video hosting and/or streaming service, a video recorder, and/or any other suitable content provider; Figs. 3-5; [0026]: For example, one or more suitable portions of processes 300, 400, and 500 can run on one or more of server(s) 102, digital entertainment system 106, and mobile device(s) 108 of system 100), wherein at least some of the plurality of media content items comprise audio ([0016]: For example, media content can include any suitable type(s) of content, such as one or more of audio content; Fig. 
3; [0045]: At 310, process 300 can obtain an audio sample of the media content item); for each media content item of the plurality of media content items ([0026]: Each of the media content items may be individually analyzed to determine a percentage of the media content item that has a particular classification), performing the following by a processing device ([0028]: Digital entertainment system 106 can include any suitable device that is capable of receiving, converting, processing, rendering, and/or transmitting media content): processing the media content item to determine a set of audio features of the media content item ([0062]: In a more particular example, one or more features of the audio sample (e.g., peaks, amplitudes, power levels, frequencies, signal to noise ratios, and/or any other suitable feature) can be generated for one or more suitable portions of the audio sample. The features can then be processed to form one or more audio fingerprints), wherein processing the media content item comprises performing one of a discrete Cosine transform or a fast Fourier transform to transform the media content item from a time domain to a frequency domain ([0067]: Frequency domain analysis may be performed by using a discrete Cosine transform (DCT) or fast Fourier transform (FFT) to transform each media content item into the frequency domain); generating a feature vector comprising the set of audio features in the frequency domain (Fig. 4; [0062]: At 410, process 400 can generate an audio fingerprint of the audio sample. 
The audio fingerprint can include any suitable digital representation of one or more suitable audio features of the audio sample); processing the feature vector comprising the set of audio features in the frequency domain using one or more trained machine learning model, wherein the one or more trained machine learning model outputs one or more media classifications for the media content item ([0069]: In some implementations, the segments of the audio signal can be classified using any suitable audio classification technique or combination of techniques, such as a Hidden Markov Model, a Bayesian classifier, the Viterbi algorithm, the Baum-Welch algorithm, and/or any other suitable classification model), wherein the one or more media classifications comprise a first class for media content items comprising music and a second class for media content items not comprising music ([0021]: The mechanisms can then classify each of the segments into a class, such as “silence,” “speech,” “music,” “song,” “speech with music background,” “noise,” and/or any other suitable class. In some implementations, the mechanisms can identify a segment of the audio signal as a segment including music content; Fig. 5; [0087]: As illustrated, process 500 can begin by identifying a music segment of a media content item at 505… For example, as described in connection with FIG. 4, the music segment can be identified using any suitable audio segmentation and/or classification technique (e.g., steps 420-435 of FIG. 
4)); and automatically determining, without user input, whether the media content item belongs to the first class of media content items comprising music or the second class of the media content items not comprising music based on the output of the trained machine learning model ([0021]: In some implementations, the mechanisms can identify one or more portions of the media content item that correspond to the identified audio segments as being music segments of the media content item; [0052]: In some implementations, as described below in connection with FIG. 5, a music item that matches a segment of the media content item can be detected using process 500; [0069]: In some implementations, the segments of the audio signal can be classified using any suitable audio classification technique or combination of techniques); generating a first group of media content items that belong to the first class of media content items comprising music; generating a second group of media content items that belong to the second class of media content items not comprising music ([0021]: The mechanisms can then classify each of the segments into a class, such as “silence,” “speech,” “music,” “song,” “speech with music background,” “noise,” and/or any other suitable class). 
However, Trollope does not explicitly teach “determining a first size of the first group and a second size of the second group, wherein the first size of the first group reflects a first number of media content items that belong to the first class of media content items comprising music, and wherein the second size of the second group reflects a second number of media content items that belong to the second class of media content items not comprising music; determining a ratio of the first size to the second size; responsive to determining the ratio of the first size to the second size, identifying an action based on the ratio of the first size to the second size, wherein the action comprises at least one of a licensing action, an advertisement action, or a removal action; and performing the action with respect to the plurality of media content items.” On the other hand, in the same field of endeavor, Hoffert teaches determining a first size of the first group and a second size of the second group, wherein the first size of the first group reflects a first number of media content items that belong to the first class of media content items comprising music, and wherein the second size of the second group reflects a second number of media content items that belong to the second class of media content items not comprising music (Figs. 3A, 3E-3H; The embodiment described herein may be broken down into… examining the media files for content (101-105); [Col. 7, lines 47-62]: Finally, data is stored for each media object… Content attributes (… speech v. music… size may be stored)); determining a ratio of the first size to the second size (Figs. 3A, 3E-3H; [Col. 13, lines 5-10]: A digital audio file is initially analyzed 301 and an initial determination is made whether the file is speech 307 or music 302… [Col. 
13, lines 34-40]: In order to determine if a given audio file contains music… The scalar value, called the music-speech metric, is an estimate of the type of content found in the audio file; [Col. 14, lines 63-67]: Finally… block 357, to determine a music-speech metric… low values tend to indicate music… block 358; [Determining the music-speech metric corresponds to determining a ratio of the first size to the second size, which is being interpreted as a technique of using pattern recognition to generate likelihood values or scores, which optimally determine the likelihood that a segment of the media content item belongs to a particular classification]); Additionally, Harrison teaches responsive to determining the ratio of the first size to the second size (Figs. 1a, 2; [0058]-[0059]: the fingerprinting engine 116 generates the proportion metrics by determining a ratio… The policy engine 118 identifies 210 a policy specified for the reference content associated with the match metrics… The fingerprinting engine 116 determines whether there is a match 212 between an item of reference content and an item of hosted content based on one or more of the match metrics for a match exceeding a defined threshold value for the match metric), determining an action based on the ratio of the first size to the second size (Figs. 
1a, 2; [0054]-[0063]: The policy engine 118 identifies 210 a policy specified for the reference content associated with the match metrics… The fingerprinting engine 116 determines whether there is a match 212 between an item of reference content and an item of hosted content based on one or more of the match metrics for a match exceeding a defined threshold value for the match metric… the fingerprinting engine 116 generates the proportion metrics by determining a ratio of the value indicated by the duration metric over a value indicating the length and/or size of the item of reference content or the item of hosted content), wherein the action comprises a licensing rate for the plurality of media content items based on the ratio of the first size of the first group of media content items that belong to the first class of media content items comprising music to the second size of the second group of the media content items that belong to the second class of the media content items not comprising music (Figs. 1a, 2; [Abstract]: A policy associated with the item of reference content is identified responsive to the value to that represents the correspondence, the policy including terms of use for the hosted content; [0059]: The policy engine 118 identifies 210 a policy specified for the reference content associated with the match metrics; [0081]- [0082] According to the terms specified in the policy, the funds can be allocated 629 between the media host 102, the content owner 101 and the VID server 100 in any appropriate way such as sharing by percentage split, by a flat payment, and so on, as specified by the content owner 101 in the policy agreement). 
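The claim-14 arithmetic addressed above (count the first and second groups, take the ratio of the first size to the second size, and select an action against a threshold as in claim 16) can be sketched as below. The action names and the threshold value are illustrative assumptions; the claims recite only that the action comprises a licensing, advertisement, or removal action.

```python
def group_sizes(classified_items):
    """Count the first group (items classified as music) and the second
    group (items not classified as music)."""
    first = sum(1 for c in classified_items if c == "music")
    return first, len(classified_items) - first

def select_action(first_size, second_size, threshold=1.0):
    """Determine the ratio of the first size to the second size and pick
    a first action when the ratio exceeds the threshold, otherwise a
    second action (mirroring claim 16's two branches)."""
    ratio = first_size / second_size if second_size else float("inf")
    action = "licensing_action" if ratio > threshold else "removal_action"
    return action, ratio
```

For instance, a batch classified as `["music", "music", "speech"]` yields group sizes (2, 1), a ratio of 2.0, and the first (licensing) action under the assumed threshold of 1.0.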
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Trollope to incorporate the teachings of Hoffert and Harrison to determine a first and second group size, determine a ratio of the group sizes, and determine a licensing rate for the media content items. The motivation for doing so would be to determine a value for the media content item, as recognized by Hoffert ([Col. 13, lines 34-40] of Hoffert: it is disclosed in one embodiment to derive a single valued scalar which represents the audio data file to a reasonable degree of accuracy) and to identify and monetize hosted content, as recognized by Harrison ([0034] of Harrison: The VID server 100 is further configured to allow the media host 102 to identify hosted content that matches reference content, enter into policy agreements with content owners 101 and monetize hosted content).

Regarding Claim 15, the combined teachings of Trollope, Hoffert, and Harrison disclose the method of claim 14. Hoffert further teaches wherein performing the action identified based on the ratio of the first size of the first group to the second size of the second group comprises determining a value for the media content item based on the ratio of the first size of the first group of media content items that belong to the first class of media content items comprising music to the second size of the second group of media content items that belong to the second class of media content items not comprising music ([Col. 13, lines 34-40]: The scalar value, called the music-speech metric, is an estimate of the type of content found in the audio file).

Regarding Claim 16, the combined teachings of Trollope, Hoffert, and Harrison disclose the method of claim 14. Harrison further teaches further comprising: determining whether the ratio of the first size to the second size exceeds a threshold (Figs.
1a, 2; [0058]-[0059]: the fingerprinting engine 116 generates the proportion metrics by determining a ratio of the value indicated by the duration metric over a value indicating the length and/or size of the item of reference content or the item of hosted content… The fingerprinting engine 116 determines whether there is a match 212 between an item of reference content and an item of hosted content based on one or more of the match metrics for a match exceeding a defined threshold value for the match metric); performing a first action responsive to determining that the ratio exceeds the threshold (Figs. 1a, 2; [0060]-[0061]: If the fingerprinting engine 116 determines that the item of reference content and the item of hosted content match, the policy engine 118 determines 214 whether the media host 102 hosting the item of hosted content is prohibited from providing items of hosted content matching the item of reference content based on the policy specified for the item of hosted content matching the item of reference content), wherein the first action comprises a first licensing action, a first tagging action, a first flagging action ([0081]: The VID server 100 receives 620 activity information), a first advertisement action ([0081]: The VID server 100 receives 622 an advertising bid from the advertiser 103), or a first removal action (Figs. 1a, 2; [0060]-[0061]: If the media host 102 is prohibited from hosting content matching the reference content, the VID server 100 transmits 222 instructions to media host 102 to remove or destroy the hosted content matching the reference content); and performing a second action responsive to determining that the ratio fails to exceed the threshold, wherein the second action comprises a second licensing action, a second tagging action, a second flagging action, a second advertisement action, or a second removal action (Figs. 
1a, 2; [0058]-[0060]: The fingerprinting engine 116 generates the offset metric by determining the temporal portion of time and/or space which corresponds to the portions of the reference content fingerprints and hosted content fingerprints that do not match [flagging action]… If the fingerprinting engine 116 determines that the item of reference content and the item of hosted content do not match, the fingerprinting engine 116 continues to generate 208 match metrics for other items of reference content and items of hosted content [No action may be taken if the use of music in the media content item is determined to be insignificant to the overall media content item]; See also Fig. 6 and paras [0081]-[0082]).

Regarding Claim 17, the combined teachings of Trollope, Hoffert, and Harrison disclose the method of claim 14. Trollope further teaches further comprising: for each media content item of the plurality of media content items, dividing the media content item into a plurality of segments ([0021]: For example, the mechanisms can divide the audio signal associated with the media content item into multiple segments (e.g., audio scenes) using any suitable audio segmentation technique); for each segment of the plurality of segments, performing the following: determining an additional set of features of the segment ([0052]: Additionally or alternatively, the music item and the segment of the media content item can be associated with a matching sentiment (e.g., “happy,” “sad,” “exciting,” “neutral,” and/or any other suitable sentiment). In some implementations, as described below in connection with FIG.
5, a music item that matches a segment of the media content item can be detected using process 500); processing the additional set of features using the one or more trained machine learning model ([0069]: In some implementations, the segments of the audio signal can be classified using any suitable audio classification technique or combination of techniques, such as a Hidden Markov Model, a Bayesian classifier, the Viterbi algorithm, the Baum-Welch algorithm, and/or any other suitable classification model); and determining whether the segment belongs to the first class of media content items or the second class of the media content items ([0021]: In some implementations, the mechanisms can identify a segment of the audio signal as a segment including music content when the segment of the audio signal is classified as “music,” “song,” “speech with music background,” and/or any other suitable class corresponding to an audio segment including music content); generating a third group of segments that belong to the first class of media content items; generating a fourth group of segments that belong to the second class of media content items ([0092]: For example, process 500 can analyze the melody and/or the lyrics of the music content contained in the music segment, the transcript associated with the music segment, metadata associated with the media content item (e.g., a title, description, user rating, user comment, genre, and/or any other suitable metadata) and/or any other suitable information relating to the music segment using natural language processing, text analytics, machine learning, and/or any other suitable technique. Process 500 can then classify the music segment with one or more of a variety of sentiments); determining a third size of the third group and a fourth size of the fourth group (Fig. 
4; [0075]: In a more particular example, for a particular audio segment identified at 425, process 400 can retrieve a start timestamp corresponding to the start of the audio segment and an end timestamp corresponding to the end of the audio segment); and determining a first fraction of the media content item belonging to the third group and a second fraction of the media content item belonging to the fourth group based on the third size and fourth size; and including the first fraction in the first size of the first group and the second fraction in the second size of the second group (Fig. 4; [0075]: Process 400 can then identify a portion of the media content item defined by the start timestamp and the end timestamp (e.g., a video segment defined by a first frame associated with a presentation timestamp corresponding to the start timestamp and a second video frame associated with a presentation timestamp corresponding to the end timestamp)). [This nonfunctional descriptive material describes data that is not functionally involved in the steps recited. None of the claimed steps depends on any of the information being described. All steps in the claims would be performed the same to achieve the same outcome regardless of the data being described].

Regarding Claim 18, the combined teachings of Trollope, Hoffert, and Harrison disclose the method of claim 14. Trollope further teaches wherein the one or more trained machine learning model comprises one or more Gaussian mixture models trained to process feature vectors comprising up to fifty-two audio features for every tenth to half of a second of a media content item ([0069]: In some implementations, the segments of the audio signal can be classified using any suitable audio classification technique or combination of techniques, such as a Hidden Markov Model, a Bayesian classifier, the Viterbi algorithm, the Baum-Welch algorithm, and/or any other suitable classification model).
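The per-segment accounting recited in claim 17 can likewise be sketched: divide each item into segments, classify each segment, compute the fractions of the item falling in each class, and fold those fractions into the item-level group sizes. Everything below is an illustrative reconstruction; the two-class labels, the classifier stub, and the driver loop are hypothetical (in the claims the classifier is a trained model, e.g. a Gaussian mixture model over per-frame audio features):

```python
# Hypothetical sketch of claim 17's segment-level fraction accounting.
# classify_segment() stands in for the trained machine learning model.

def segment_fractions(segments, classify_segment):
    third = [s for s in segments if classify_segment(s) == "music"]   # third group
    fourth = [s for s in segments if classify_segment(s) != "music"]  # fourth group
    third_size, fourth_size = len(third), len(fourth)
    total = third_size + fourth_size
    # Fractions of this media content item belonging to each class
    return third_size / total, fourth_size / total

# Item-level group sizes accumulate the fractional contributions
first_size = second_size = 0.0
for item_segments in [["m", "m", "s", "m"], ["s", "s"]]:
    f1, f2 = segment_fractions(
        item_segments, lambda s: "music" if s == "m" else "speech"
    )
    first_size += f1
    second_size += f2
print(first_size, second_size)  # -> 0.75 1.25
```

Note that the item-level "sizes" become fractional under this reading, which is consistent with the claim's instruction to include the fractions, rather than whole-item counts, in the first and second group sizes.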
Regarding Claim 19, the combined teachings of Trollope, Hoffert, and Harrison disclose the method of claim 14. Trollope further teaches wherein the plurality of media content items comprise millions of media content items ([0016]: The mechanisms can be implemented with respect to any suitable media content; Fig. 3; [0044]: As illustrated, process 300 can start by presenting a media content item at 305. In some implementations, the media content item can include any suitable media content and can be provided by any suitable source).

Regarding Claim 20, the combined teachings of Trollope, Hoffert, and Harrison disclose the method of claim 14. Trollope further teaches wherein the feature vector comprises up to fifty-two audio features for every tenth to half of a second of a media content item ([0090]: At 520, process 500 can generate a video fingerprint of the music segment... the video fingerprint can be generated by calculating one or more spatial characteristics (e.g., one or more vectors corresponding to intensity variations, edge differences, and/or any other suitable intra-frame features)), the up to fifty-two audio features comprising loudness, pitch, energy in one or more spectral bands, and Mel-frequency cepstral coefficients (MFCCs), the trained machine learning model having been trained using training data comprising additional feature vectors comprising the up to fifty-two audio features for every tenth to half of a second ([0062]: In a more particular example, one or more features of the audio sample (e.g., peaks, amplitudes, power levels, frequencies, signal to noise ratios, and/or any other suitable feature) can be generated for one or more suitable portions of the audio sample; [0044]: In some implementations, the media content item can include any suitable media content and can be provided by any suitable source).

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.
Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHIRLEY D. HICKS, whose telephone number is (571) 272-3304. The examiner can normally be reached Mon - Fri 7:30 - 4:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Charles Rones, can be reached at (571) 272-4085. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /S D H/Examiner, Art Unit 2168 /CHARLES RONES/Supervisory Patent Examiner, Art Unit 2168

Prosecution Timeline

Jan 31, 2022
Application Filed
May 27, 2022
Response after Non-Final Action
Apr 03, 2023
Non-Final Rejection — §103
Jul 10, 2023
Applicant Interview (Telephonic)
Jul 10, 2023
Examiner Interview Summary
Jul 11, 2023
Response Filed
Oct 13, 2023
Final Rejection — §103
Nov 15, 2023
Interview Requested
Dec 14, 2023
Applicant Interview (Telephonic)
Dec 19, 2023
Response after Non-Final Action
Dec 21, 2023
Examiner Interview Summary
Jan 25, 2024
Non-Final Rejection — §103
Apr 24, 2024
Examiner Interview Summary
Apr 24, 2024
Applicant Interview (Telephonic)
Apr 26, 2024
Response Filed
Aug 14, 2024
Final Rejection — §103
Oct 07, 2024
Interview Requested
Oct 16, 2024
Examiner Interview Summary
Oct 16, 2024
Applicant Interview (Telephonic)
Oct 21, 2024
Response after Non-Final Action
Oct 29, 2024
Response after Non-Final Action
Nov 19, 2024
Request for Continued Examination
Nov 21, 2024
Response after Non-Final Action
Feb 25, 2025
Non-Final Rejection — §103
May 19, 2025
Interview Requested
May 28, 2025
Examiner Interview Summary
May 28, 2025
Applicant Interview (Telephonic)
Jun 04, 2025
Applicant Interview (Telephonic)
Jun 04, 2025
Examiner Interview Summary
Jul 02, 2025
Response Filed
Oct 01, 2025
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12596682
SYSTEM AND METHOD FOR OBJECT STORE FEDERATION
Granted Apr 07, 2026 (2y 5m to grant)
Patent 12499102
HIERARCHICAL DELIMITER IDENTIFICATION FOR PARSING OF RAW DATA
Granted Dec 16, 2025 (2y 5m to grant)
Patent 12499146
MACHINE LEARNING AND NATURAL LANGUAGE PROCESSING (NLP)-BASED SYSTEM FOR SYSTEM-ON-CHIP (SoC) TROUBLESHOOTING
Granted Dec 16, 2025 (2y 5m to grant)
Patent 12405818
BATCHING WAVEFORM DATA
Granted Sep 02, 2025 (2y 5m to grant)
Patent 12380126
DISCOVERY OF SOURCE RANGE PARTITIONING INFORMATION IN DATA EXTRACT JOB
Granted Aug 05, 2025 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

7-8
Expected OA Rounds
64%
Grant Probability
99%
With Interview (+56.3%)
3y 2m
Median Time to Grant
High
PTA Risk
Based on 107 resolved cases by this examiner. Grant probability derived from career allow rate.
