DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Preliminary Remarks
This is a reply to the amendments filed on 01/15/2026, in which claims 1-2, 15-16, and 20 are amended. Claims 1-20 remain pending in the present application, with claims 1, 15, and 20 being independent claims.
When making claim amendments, Applicant is encouraged to consider the references in their entireties, including those portions that have not been cited by the examiner, and their equivalents, as they may most broadly and appropriately apply to any particular anticipated claim amendments.
Response to Arguments
Applicant's arguments filed on 01/15/2026 with respect to amended claims 1, 15, and 20 have been considered but are moot in view of the new ground(s) of rejection.
Applicant's remaining arguments filed on 01/15/2026 have been fully considered but they are not persuasive.
On pages 9-10, Applicant argues that, “The cited portions of Wold do not teach or suggest a contrastive learning approach, as claimed. In fact, the cited portions of Wold do not appear to teach or suggest any type of training. Rather, Wold, paragraph 84 states "The speech recognition engine compares the speech recognition data with acoustic models, language models, and other data models and information for recognizing the lyrical content in the unidentified media content item." These portions of Wold describe "pre-processing logic", not training. Thus, claim 13 is further patentable over the cited references.”
In response, Examiner respectfully points out that the rejections are based on combinations of references. The Examiner has cited a new reference, Rangarajan, in response to Applicant's arguments filed on 01/15/2026 with respect to amended claim 1. Claim 13 depends from claim 1. At least Rangarajan teaches an example of multitask training for a sentence embedding language model (see Rangarajan, Fig. 9 and paragraphs [0256]-[0259]). Therefore, Applicant's arguments are not persuasive. The Examiner suggests that, when responding to a 35 U.S.C. 103 rejection, Applicant consider the rejection in view of all of the cited references as a whole.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. Claims 1, 2, 15, 16, and 20 recite the phrase “temporally aligning the lyrics text and the audio for the media item…”. The limitation “temporally” was not described in the specification.
Claims 2-14 depend from claim 1, and claims 16-19 depend from claim 15; thus, the rejection under 35 U.S.C. 112(a) also applies to these claims.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 6-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wold (US 20210357451 A1, hereinafter referred to as “Wold”) in view of Rangarajan et al. (US 20220093088 A1, hereinafter referred to as “Rangarajan”), and further in view of Verbeeck et al. (US 20090228799 A1, hereinafter referred to as “Verbeeck”).
Regarding claim 1, Wold discloses a method, comprising:
generating a first plurality of embeddings representing symbols that appear in the lyrics text for the media item (see Wold, paragraph [0028]: “The media content identification service determines lyrical content associated with the unidentified media content item. This determination may be made by processing the unidentified media content item (or at least an audio portion of the unidentified media content item) using a machine learning model (or set of machine learning models) that has been trained to transcribe audio into sequences of words and/or phonemes”);
generating a second plurality of embeddings representing an acoustic representation of the audio for the media item (see Wold, paragraph [0024]: “a determination that an unidentified media content item is a cover of a song may be made with increased accuracy by using both lyrical content and musical/audio content because a match is made for harmony and melody as well as lyrics”);
temporally aligning the lyrics text and the audio for the media item based on the respective similarities (see Wold, paragraph [0091]: “Lyrical content matching logic 250 may apply any of a variety of approximate text-matching methods to detect sufficient matching between the lyrical content of the unidentified media content item and lyrical content of a known media content item”); and
while streaming the audio for the media item (see Wold, paragraph [0113]: “the media content item is a live stream, and the live stream is periodically analyzed”).
Regarding claim 1, Wold discloses all the claimed limitations with the exception of obtaining, from a lyrics database, lyrics text for a media item; obtaining, from a content database, audio for the media item; using the lyrics text as an input to a first encoder; using the audio as an input to a second encoder; wherein the first encoder and the second encoder are trained to produce embeddings in a same vector space having a same dimensionality; determining, by comparing the first plurality of embeddings and the second plurality of embeddings, respective similarities between embeddings of the first plurality of embeddings and embeddings of the second plurality of embeddings; and providing, for display, the aligned lyrics text with the streamed audio.
Rangarajan from the same or similar fields of endeavor discloses obtaining, from a lyrics database, lyrics text for a media item (see Rangarajan, paragraph [0252]: “a user may provide a sentence, e.g., “You're beautiful.” A content database may be searched to return content that includes a semantic context that is similar to the input sentence, via sentence embeddings of the inputted sentence and sentences associated with the content. Such content may include poetry, works of literature, music lyrics, captions (or other descriptive text) for images, and the like”);
obtaining, from a content database, audio for the media item (see Rangarajan, paragraph [0252]: “One or more images, poems, novels, songs, or the like may be returned based on matching the sentence embeddings. In the above example, where the user input the sentence “You're beautiful,” Sonnet 18 by William Shakespeare (e.g., a poem), an image of the user's spouse, or a love song may be returned to the user”);
using the lyrics text as an input to a first encoder (see Rangarajan, paragraph [0253]: “sentences included in the lyrics for one or more songs may be embedded in the vector space”); and
using the audio as an input to a second encoder (see Rangarajan, paragraph [0243]: “Associations between the sentence embedding may be determined to pair images with songs with lyrics that are semantically similar to text describing the image (e.g., a caption for the image)”);
wherein the first encoder and the second encoder are trained to produce embeddings in a same vector space having a same dimensionality (see Rangarajan, paragraph [0256]: “the sentence embedding model (e.g., the sentence embedding model) may be trained over multiple semantic tasks, such that the sentence vectors (which are dependent on the contextual token vectors) encode features (e.g., latent and/or hidden features) of the sentence relating to the semantic context (e.g., the meaning, idea, concept, or the like) of the sentence. Various deep learning supervised methodologies may be employed when training the second language model (e.g., multitask supervised learning). When training the sentence embedding model, pairs of training sentences may be employed. The pairs of training sentences may be labeled with one or more semantic relationships. The sentence embedding model may be trained to generate sentence embeddings that are consistent with the semantic relationships of the pairs of training sentences. As also discussed, each of the tasks of the multitask training may be directed towards one or more types of semantic relationships (e.g., semantic similarity, semantic inference, next sentence prediction, and the like)”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Rangarajan with the teachings of Wold. The motivation for doing so would be to give the system the ability to use the methods and systems of Rangarajan for embedding natural language sentences within a highly-dimensional vector space to: search a content database to return content, such as music lyrics or other descriptive text, that includes a semantic context similar to an input sentence, via sentence embeddings of the input sentence and of sentences associated with the content; return content, such as songs, based on matching the sentence embeddings; embed sentences included in the lyrics for one or more songs in the vector space; pair images with songs whose lyrics are semantically similar to text describing the image; and train the sentence embedding model over multiple semantic tasks using deep-learning supervised methodologies, wherein each of the tasks of the multitask training may be directed towards one or more types of semantic relationships (e.g., semantic similarity, semantic inference, next sentence prediction, and the like). In this way, the combination obtains lyrics text for a media item from a lyrics database; obtains audio for the media item from a content database; and uses the lyrics text as an input to a first encoder and the audio as an input to a second encoder, wherein the first encoder and the second encoder are trained to produce embeddings in a shared vector space, in order to enable a user to input lyrics text that is automatically converted into an input vector, so that recommended audio/song tracks can be determined by comparing the input vector against the feature vectors of a plurality of song tracks.
Regarding claim 1, Wold and Rangarajan as discussed above disclose all the claimed limitations with the exception of determining, by comparing the first plurality of embeddings and the second plurality of embeddings, respective similarities between embeddings of the first plurality of embeddings and embeddings of the second plurality of embeddings; and providing, for display, the aligned lyrics text with the streamed audio.
Verbeeck from the same or similar fields of endeavor discloses determining, by comparing the first plurality of embeddings and the second plurality of embeddings, respective similarities between embeddings of the first plurality of embeddings and embeddings of the second plurality of embeddings (see Verbeeck, paragraph [0124]: “lyric assignment processing (comparing the text-based lyrics with the actual song lyrics with speech recognition techniques). This results in the extraction of time-based lyric meta data ... acoustic clustering extraction (defining similarities in acoustic sounds and clustering them to definite units). This results in the extraction of time-based acoustic cluster meta data”); and
providing, for display, the aligned lyrics text with the streamed audio (see Verbeeck, paragraphs [0082]-[0086]: “visualizing audio data. There may be two parts, i.e. A) a meta data alignment part, and B) a visualization part. In the meta data alignment part, different meta data including text units are aligned with an acoustical signal of a piece of music, e.g. a song”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Verbeeck with the teachings of Wold and Rangarajan. The motivation for doing so would be to give the system the ability to use the method of Verbeeck for visualizing audio data to: compare text-based lyrics with the actual song lyrics using speech recognition to generate time-based lyric meta data; define similarities in acoustic sounds and cluster them into definite units (acoustic clustering extraction) to generate time-based acoustic cluster meta data; and visualize audio data by aligning different meta data, including text units, with an acoustical signal of a piece of music. In this way, the combination determines respective similarities between symbols that appear in the lyrics text and an acoustic representation of the audio for the media item and displays the aligned lyrics text with the streamed audio, in order to align the lyrics text and the audio for the media item so that the aligned lyrics text can be displayed with the streamed audio.
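Solely as a generic, illustrative sketch of the two-encoder arrangement recited in claim 1, the following Python/NumPy fragment shows lyrics symbols and audio frames being encoded into a shared vector space of the same dimensionality and compared pairwise. The stub encoders, array shapes, symbol list, and values below are assumptions made for illustration only; they are not taken from the claims, the specification, or the cited references.

```python
import numpy as np

F = 64  # assumed shared embedding dimensionality

def encode_lyrics(symbols, rng):
    """Stub "first encoder": one F-dimensional embedding per lyrics symbol (L x F)."""
    return rng.standard_normal((len(symbols), F))

def encode_audio(spectrogram, rng):
    """Stub "second encoder": one F-dimensional embedding per audio frame (T x F)."""
    return rng.standard_normal((spectrogram.shape[0], F))

rng = np.random.default_rng(0)
symbols = ["y", "ou", "a", "re", "b", "eau", "ti", "ful"]  # lyrics text as L = 8 symbols
spectrogram = rng.standard_normal((120, 80))               # assumed T = 120 frames, D = 80 bins

text_emb = encode_lyrics(symbols, rng)      # first plurality of embeddings, L x F
audio_emb = encode_audio(spectrogram, rng)  # second plurality of embeddings, T x F

# Respective similarities between the two pluralities of embeddings (L x T matrix).
similarity = text_emb @ audio_emb.T
print(similarity.shape)  # (8, 120)
```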
Regarding claim 2, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose the method of claim 1, wherein temporally aligning the lyrics text and the audio for the media item based on the respective similarities includes determining a monotonic path of correspondence between the first plurality of embeddings and the second plurality of embeddings (see Wold, paragraph [0085]: “Sounds received may be represented as paths between states of the HMM and multiple paths may represent multiple possible text matches for the same sound. Ultimately, the speech recognition engine outputs text in the form of a sequence of words, text in the form of a sequence of phonemes, or text in the form of a combination of words and phonemes”).
The motivation for combining the references has been discussed in claim 1 above.
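For illustration of the "monotonic path of correspondence" recited in claim 2, such a path over an L×T similarity matrix can be found with a dynamic-programming recurrence of the kind used in dynamic time warping. The sketch below is a generic example under that assumption; it is not asserted to be Applicant's method or the method of the cited references.

```python
import numpy as np

def monotonic_path(sim):
    """Best monotonic (non-decreasing in both indices) path through an L x T
    similarity matrix, computed by dynamic programming (DTW-style)."""
    L, T = sim.shape
    score = np.full((L, T), -np.inf)
    score[0, 0] = sim[0, 0]
    for i in range(L):
        for j in range(T):
            if i == 0 and j == 0:
                continue
            best_prev = max(
                score[i - 1, j] if i > 0 else -np.inf,                 # advance lyrics symbol
                score[i, j - 1] if j > 0 else -np.inf,                 # advance audio frame
                score[i - 1, j - 1] if i > 0 and j > 0 else -np.inf,   # advance both
            )
            score[i, j] = sim[i, j] + best_prev
    # Backtrack from the last cell to recover the optimal monotonic path.
    path, i, j = [(L - 1, T - 1)], L - 1, T - 1
    while (i, j) != (0, 0):
        candidates = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
        candidates = [(a, b) for a, b in candidates if a >= 0 and b >= 0]
        i, j = max(candidates, key=lambda ab: score[ab])
        path.append((i, j))
    return path[::-1]

rng = np.random.default_rng(0)
print(monotonic_path(rng.standard_normal((4, 6)))[:5])
```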
Regarding claim 3, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose the method of claim 1, wherein generating, using the audio as an input to the second encoder, the second plurality of embeddings representing an acoustic representation of the audio for the media item includes inputting a spectrogram of the audio to the second encoder (see Verbeeck, FIG. 3 and paragraph [0097]: “Using indexing and extracting methods, linguistic and acoustic time-based meta data may be generated for each individual song. These meta data may describe the content divided into instrument clusters, lyrics and modules (intro, chorus, …) for every definite time stamp within the song”).
The motivation for combining the references has been discussed in claim 1 above.
Regarding claim 6, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose the method of claim 1, wherein generating, using the lyrics text as input to the first encoder, the first plurality of embeddings representing symbols that appear in lyrics text of the media item includes:
obtaining the lyrics text as a series of symbols (see Wold, paragraph [0046]: “the lyrical content 143 includes timing information that indicates, for words and/or phonemes in the lyrical content 143, when (e.g., at what time offset) in the associated media content item the those words/phonemes are played”);
passing each symbol to an embedding layer (see Wold, paragraph [0085]: “The speech recognition engine attempts to match received feature vectors/embeddings to language phonemes and/or words”); and
using at least one prior symbol and/or at least one following symbol as context information to pass to the first encoder (see Wold, paragraph [0085]: “the speech recognition engine outputs text in the form of a sequence of words, text in the form of a sequence of phonemes, or text in the form of a combination of words and phonemes”).
The motivation for combining the references has been discussed in claim 1 above.
Regarding claim 7, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose the method of claim 6, wherein the first encoder produces a matrix of embeddings that has dimensions L×F, wherein L is a number of symbols in the series of symbols and F is a dimensionality of the embeddings in the first plurality of embeddings and a dimensionality of embeddings in the second plurality of embeddings (see Wold, paragraph [0164]: “determines timing information of at least one of words or phonemes in the lyrical content associated with the unidentified media content item and generates a first cross-similarity matrix between words or phonemes at timing offsets from the unidentified media content and additional words or additional phonemes at additional timing offsets from the known media content item.” and [0184]: “analyzes audio of the unidentified media content item using machine learning (e.g., ASR) to determine lyrical content of the unidentified media content item. The lyrical content that is output may be a textual representation and/or a phonetic representation of the lyrics/words transcribed from the audio. In one embodiment, a call is made to a third party audio transcription service to perform the operations of block 352. The audio transcription service may then provide a response that includes the lyrical content as text and/or phonemes”).
The motivation for combining the references has been discussed in claim 1 above.
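Purely as an illustration of the limitations of claims 6 and 7 as written (and not of any cited reference), a lyrics encoder of this kind could look up an embedding for each symbol and fold in neighboring symbols as context, yielding an L×F matrix. The vocabulary, context window size, and averaging step below are assumptions for illustration.

```python
import numpy as np

F = 64
rng = np.random.default_rng(0)
vocab = {s: i for i, s in enumerate(["<pad>", "y", "ou", "a", "re", "b", "eau", "ti", "ful"])}
embedding_layer = rng.standard_normal((len(vocab), F))  # one F-dimensional vector per symbol

def encode_symbols(symbols, window=1):
    """Embed each symbol and average it with up to `window` prior/following
    symbols as context, producing an L x F matrix of embeddings."""
    ids = [vocab.get(s, 0) for s in symbols]
    out = np.zeros((len(ids), F))
    for pos in range(len(ids)):
        lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
        out[pos] = embedding_layer[ids[lo:hi]].mean(axis=0)  # symbol plus its context
    return out

lyrics = ["y", "ou", "a", "re", "b", "eau", "ti", "ful"]  # series of L = 8 symbols
matrix = encode_symbols(lyrics)
print(matrix.shape)  # (8, 64) -> L x F
```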
Regarding claim 8, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose the method of claim 1, including providing a language of the media item to the first encoder and/or to the second encoder (see Wold, paragraph [0085]: “The speech recognition engine attempts to match received feature vectors/embeddings to language phonemes and/or words. The speech recognition engine computes recognition scores for the feature vectors based on acoustic information and language information”).
The motivation for combining the references has been discussed in claim 1 above.
Regarding claim 9, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose the method of claim 1, including providing one or more additional characteristics of the media item to the first encoder and/or to the second encoder (see Wold, paragraph [0033]: “Combining the lyrical similarity along with additional similarity values (e.g., for timbre, rhythm, pitch, metadata, etc.) can result in combined similarity metrics that accurately identify covers of known media content items (e.g., of known musical works)”).
The motivation for combining the references has been discussed in claim 1 above.
Regarding claim 10, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose the method of claim 6, wherein the series of symbols corresponds to a series of phonemes, characters, syllables, or other text representations (see Wold, paragraph [0091]: “Both textual words and phonemes, being strings of discrete symbols, may be hashed and indexed, allowing fast and efficient searches over large lyrics reference databases (e.g., lyrical content 143)”).
The motivation for combining the references has been discussed in claim 1 above.
Regarding claim 11, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose the method of claim 1, further comprising, normalizing the first plurality of embeddings generated using the first encoder and normalizing the second plurality of embeddings generated using the second encoder, wherein determining the respective similarities between embeddings of the first plurality of embeddings and embeddings of the second plurality of embeddings includes calculating a cosine similarity between the normalized first plurality of embeddings and the normalized second plurality of embeddings (see Wold, paragraphs [0148]-[0150]: “comparing the set of normalized feature vectors (digital fingerprints) of the unidentified media content item to an additional set of normalized feature vectors (digital fingerprints) for a known media content item may include comparing sequences of beats from each of the sets of beat-synchronized feature vectors. Each of the sets of beat-synchronized chroma vectors, MFCCs, and MFCC SSMs from the unidentified media content item and the known media content item may be compared to generate three cross-similarity matrices (CSMs) for each feature type … Then the CSM for the beat-synchronized chroma vectors includes the cosine distance between all possible pairs of feature vectors of the unidentified media content item and the known media content item”).
The motivation for combining the references has been discussed in claim 1 above.
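As a generic worked example of the computation recited in claim 11 (and not a characterization of Wold's specific fingerprint comparison), L2-normalizing both sets of embeddings reduces cosine similarity to a single matrix product; the shapes below are assumed values.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize each row embedding to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
text_emb = rng.standard_normal((8, 64))     # first plurality of embeddings (L x F)
audio_emb = rng.standard_normal((120, 64))  # second plurality of embeddings (T x F)

# After normalization, the dot product of two rows equals their cosine similarity,
# so the full L x T cosine-similarity matrix is one matrix product.
cosine = l2_normalize(text_emb) @ l2_normalize(audio_emb).T
print(cosine.shape, float(cosine.max()) <= 1.0 + 1e-6)
```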
Regarding claim 12, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose the method of claim 1, wherein the first plurality of embeddings and the second plurality of embeddings are embeddings in a shared embedding space (see Wold, paragraph [0052]: “In machine learning, an embedding refers to a projection of an input into another more convenient representation space. For example, a digital fingerprint and/or set of features of a media content item or of a portion of a media content item may be an embedding. The trained machine learning model may output, for each class that it has been trained to identify, a probability that the media content item (or portion of the media content item) belongs to that class”).
The motivation for combining the references has been discussed in claim 1 above.
Regarding claim 13, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose the method of claim 1, further comprising, training the first encoder and the second encoder using contrastive learning (see Rangarajan, paragraph [0256]: “the sentence embedding model (e.g., the sentence embedding model) may be trained over multiple semantic tasks, such that the sentence vectors (which are dependent on the contextual token vectors) encode features (e.g., latent and/or hidden features) of the sentence relating to the semantic context (e.g., the meaning, idea, concept, or the like) of the sentence. Various deep learning supervised methodologies may be employed when training the second language model (e.g., multitask supervised learning). When training the sentence embedding model, pairs of training sentences may be employed. The pairs of training sentences may be labeled with one or more semantic relationships. The sentence embedding model may be trained to generate sentence embeddings that are consistent with the semantic relationships of the pairs of training sentences. As also discussed, each of the tasks of the multitask training may be directed towards one or more types of semantic relationships (e.g., semantic similarity, semantic inference, next sentence prediction, and the like)”), including:
obtaining a set of positive lyrics text tokens from the lyrics text (see Wold, paragraph [0085]: “The speech recognition engine computes recognition scores for the feature vectors based on acoustic information and language information. The acoustic information may be used to calculate an acoustic score representing a likelihood that the intended sound represented by a group of feature vectors matches a language phoneme”);
obtaining a set of negative lyrics text tokens from lyrics text of another media item distinct from the media item (see Wold, paragraph [0085]: “The language information may be used to adjust the acoustic score by considering what sounds and/or words are used in context with each other, thereby improving a likelihood that the ASR system will output text data representing speech that makes sense grammatically”);
for a time frame corresponding to a respective audio embedding (see Wold, paragraph [0084]: “divide the unidentified media content item into frames representing time intervals for which the preprocessing logic determines features representing qualities of the audio data in the unidentified media content item, along with a set of those values (i.e., a feature vector) representing features within each frame”):
for a respective embedding corresponding to a positive lyrics text token, increasing a similarity to the respective audio embedding (see Wold, paragraph [0176]: “the lyrical similarity threshold may be reduced with increases in the audio/music similarity score, and the music/audio similarity threshold may be reduced with increases in the lyrical similarity score”); and
for a respective embedding corresponding to a negative lyrics text token, decreasing a similarity to the respective audio embedding (see Wold, paragraph [0176]: “if there is a very low lyrical similarity score, then the audio/music similarity score should meet or exceed a higher similarity threshold in order for the unidentified media content item to be identified as a cover of a known media content item”).
The motivation for combining the references has been discussed in claim 1 above.
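Solely to illustrate the kind of contrastive objective recited in claim 13 (increasing similarity to positive lyrics tokens and decreasing similarity to negative tokens for a given audio time frame), an InfoNCE-style loss could be written as follows. This is a textbook-style sketch under stated assumptions, not the training procedure of Rangarajan or Wold; the temperature value and embedding shapes are assumptions.

```python
import numpy as np

def contrastive_loss(audio_frame_emb, positive_embs, negative_embs, temperature=0.1):
    """InfoNCE-style loss for one audio-frame embedding: similarity to the
    positive lyrics-token embeddings is pushed up, and similarity to negative
    (other-media-item) token embeddings is pushed down."""
    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    a = unit(audio_frame_emb)
    pos = unit(positive_embs) @ a / temperature  # similarities to positive tokens
    neg = unit(negative_embs) @ a / temperature  # similarities to negative tokens
    all_sims = np.concatenate([pos, neg])
    # Cross-entropy of the positives against the full candidate set.
    log_denominator = np.log(np.exp(all_sims).sum())
    return float((log_denominator - pos).mean())

rng = np.random.default_rng(0)
loss = contrastive_loss(
    rng.standard_normal(64),       # embedding for one audio time frame
    rng.standard_normal((3, 64)),  # positive lyrics-token embeddings (same media item)
    rng.standard_normal((16, 64)), # negative lyrics-token embeddings (another media item)
)
print(round(loss, 3))
```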
Regarding claim 14, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose the method of claim 1, further comprising, estimating a line interval corresponding to a lyrical line, wherein the lyrical line is obtained from the lyrics text, wherein aligning the lyrics text and the audio for the media item based on the respective similarities includes constraining a respective token to be aligned with an estimated lyrical line within a tolerance window (see Wold, paragraph [0066]: “The additional similarity metrics may include a musical and/or audio similarity computed by comparing the second digital fingerprints of the unidentified media content to the second digital fingerprints of the known media content item and/or a metadata similarity computed by comparing the metadata of the unidentified media content item to the metadata of the known media content item. A combined similarity value or score may be computed and compared to a threshold. If the combined similarity value of score meets or exceeds the threshold, then cover identifier 176 may determine that the unidentified media content item is or contains a cover of the known media content item”).
The motivation for combining the references has been discussed in claim 1 above.
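As a generic illustration of the constraint recited in claim 14 (and not of Wold's threshold comparison), an estimated line interval can be enforced by masking the similarity matrix so that a token may only align to audio frames within a tolerance window around that interval. The frame indices and window width below are assumed values.

```python
import numpy as np

def constrain_to_line(sim, token_rows, line_start_frame, line_end_frame, tolerance=10):
    """Disallow (set to -inf) alignments of the given token rows outside the
    estimated line interval, widened by `tolerance` frames on each side."""
    masked = sim.copy()
    lo = max(0, line_start_frame - tolerance)
    hi = min(sim.shape[1], line_end_frame + tolerance)
    for row in token_rows:
        masked[row, :lo] = -np.inf
        masked[row, hi:] = -np.inf
    return masked

rng = np.random.default_rng(0)
sim = rng.standard_normal((8, 120))  # L x T similarity matrix
constrained = constrain_to_line(sim, token_rows=[0, 1, 2],
                                line_start_frame=20, line_end_frame=45)
print(int(np.isinf(constrained[0]).sum()))  # number of frames excluded for token 0
```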
Claim 15 is rejected for the same reasons as discussed in claim 1 above. In addition, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose a computer system, comprising:
one or more processors (see Wold, paragraph [0215]: “The computing device 800 includes a processing device (processor)”); and
memory storing one or more programs (see Wold, paragraph [0215]: “a main memory”), the one or more programs including instructions (see Wold, paragraph [0214]: “representation of a machine in the exemplary form of a computing device 800 within which a set of instructions, for causing the machine to perform any one or more of the methodologies”).
Claim 16 is rejected for the same reasons as discussed in claim 2 above.
Claim 17 is rejected for the same reasons as discussed in claim 3 above.
Claim 20 is rejected for the same reasons as discussed in claim 1 above. In addition, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above also disclose a non-transitory computer-readable storage medium (see Wold, paragraph [0215]: “a main memory 804 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM)), a static memory 806 (e.g., flash memory, static random access memory (SRAM)), and a data storage device 818, which communicate with each other via a bus 830”) storing one or more programs for execution by a computer system with one or more processors, the one or more programs comprising instructions (see Wold, paragraph [0218]: “The data storage device 818 may include a computer-readable medium 828 on which is stored one or more sets of instructions 822 (e.g., instructions of cover identifier 176) embodying any one or more of the methodologies or functions described herein”).
Claims 4-5 and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Wold, Rangarajan, and Verbeeck as applied to claim 1, and further in view of Tzinis et al. (US 20220310113 A1, hereinafter referred to as “Tzinis”).
Regarding claim 4, the combination teachings of Wold, Rangarajan, and Verbeeck as discussed above disclose the method of claim 3, but do not explicitly disclose wherein the spectrogram has dimensions T×D, wherein T is a number of frames in the spectrogram and D is a dimensionality of the spectrogram.
Tzinis from the same or similar fields of endeavor discloses the method of claim 3, wherein the spectrogram has dimensions T×D, wherein T is a number of frames in the spectrogram and D is a dimensionality of the spectrogram (see Tzinis, paragraph [0052]: “For each separated source m, a time-domain audio sample ŝm, m=1, . . . , M, may be generated. Accordingly, a corresponding global audio embedding can be generated using a MobileNet v1 architecture for audio embedding network 225. Such an architecture can include a stacked two-dimensional (2D) separable dilated convolutional blocks with a dense layer at the end”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings of Tzinis with the teachings of Wold, Rangarajan, and Verbeeck. The motivation for doing so would be to give the system the ability to use the neural network of Tzinis, which includes an audio embedding network, to generate an audio embedding comprising a representation of audio features in the plurality of video frames of the input video and to generate a corresponding global audio embedding, wherein the audio embedding network computes log Mel-scale spectrograms with Fa audio frames from the time-domain separated sources and then applies stacks of depth-wise separable convolutions to produce an Fa×N embedding matrix Ma, which contains an N-dimensional row embedding for each frame. In this way, the spectrogram has dimensions T×D, wherein T is a number of frames in the spectrogram and D is a dimensionality of the spectrogram, in order to represent an acoustic representation of the audio for the media item so that the audio encoder can receive a spectrogram of the audio of the particular audio item.
Regarding claim 5, the combination teachings of Wold, Rangarajan, Verbeeck, and Tzinis as discussed above also disclose the method of claim 4, wherein the second encoder produces a matrix of embeddings that has dimensions T×F, wherein F is a dimensionality of the embeddings in the first plurality of embeddings and F is also a dimensionality of embeddings in the second plurality of embeddings (see Tzinis, paragraph [0052]: “audio embedding network 225 may compute log Mel-scale spectrograms with Fa audio frames from the time-domain separated sources, and then apply stacks of depth-wise separable convolutions to produce a Fa×N embedding matrix Ma, which contains an N-dimensional row embedding for each frame”).
The motivation for combining the references has been discussed in claim 4 above.
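As a shape-level illustration of claims 4 and 5 (not an implementation of Tzinis's MobileNet-based audio embedding network), a T×D spectrogram can be mapped frame-by-frame to a T×F embedding matrix, here with a simple linear projection standing in for a learned encoder; the dimensions and the projection are assumptions.

```python
import numpy as np

T, D, F = 120, 80, 64  # assumed: T frames, D spectrogram bins, F embedding dimensions
rng = np.random.default_rng(0)

spectrogram = rng.standard_normal((T, D))  # T x D input to the "second encoder"
projection = rng.standard_normal((D, F))   # stand-in for a learned encoder

audio_embeddings = spectrogram @ projection  # T x F matrix, one F-dimensional row per frame
print(spectrogram.shape, audio_embeddings.shape)  # (120, 80) (120, 64)
```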
Claim 18 is rejected for the same reasons as discussed in claim 4 above.
Claim 19 is rejected for the same reasons as discussed in claim 5 above.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NIENRU YANG whose telephone number is (571)272-4212. The examiner can normally be reached Monday-Friday 10AM-6PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, THAI TRAN can be reached at 571-272-7382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
NIENRU YANG
Examiner
Art Unit 2484
/NIENRU YANG/Examiner, Art Unit 2484
/THAI Q TRAN/Supervisory Patent Examiner, Art Unit 2484