DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments with respect to the rejection of claims 1, 3-13, 17 and 19 under 35 U.S.C. 102(a)(2) have been considered but are moot in view of the new ground of rejection over Lyle and Ajmera for claims 1, 3-11, 13, 17, 19 and 21. Although Applicant argues that independent claim 1 now contains elements from previously recited claim 12, the examiner contends that the amendment to the independent claims adds a positive recitation of “obtaining a predefined ad hoc vocabulary list”. This new recitation changes the scope of the language previously recited in claim 12 and requires rejection of the claims on new grounds. Therefore, the examiner has withdrawn the previous grounds of rejection under 35 U.S.C. 102(a)(2) and presented new grounds of rejection under 35 U.S.C. 103.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3-11, 13, 17, 19 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Lyle et al. (US Patent 12288570; hereinafter “Lyle”) in view of Ajmera (US PG Pub 20160117313).
As per claims 1, 17 and 19, Lyle discloses:
A method for generating a topic index, computing device and non-transitory computer-readable storage medium storing one or more programs configured for execution by a computing device having one or more processors and memory, comprising: one or more processors (Lyle; Fig. 2, item 212; Col. 9, lines 1-17 – one or more processors); memory (Lyle; Fig. 2, item 214; Col. 9, lines 1-17 – memory); and one or more programs stored in the memory and configured for execution by the one or more processors (Lyle; Col. 9, lines 1-17 - Instructions stored within memory 214 may be executed by the one or more processors 212), the one or more programs comprising instructions for: obtaining audio content (Lyle; Col. 6, lines 12-42 - original video content is received… It is to be understood that the inference engine also receives audio data and/or transcription data which can be used in parallel with the outputted synthesized video in cases where the video includes speech or otherwise has sounds present in the video stream; see also Col. 15, lines 22-67 & Col. 16, lines 1-67 – speech transcription & NLP analysis); extracting vocabulary terms from the audio content (Lyle; Col. 15, lines 22-67 & Col. 16, lines 1-67 - …the speech recognition component of the NLP may take the digital or analog audio signal from the call and performs speech recognition analysis to recognize one or more words spoken. Speech recognition (also referred to as automatic speech recognition (ASR), computer speech recognition or voice recognition) technology generally represents a set of technologies that allows computers equipped with a source of sound input, such as a microphone, to transform human speech into a sequence of words recorded in a computer data file…); generating, using a transformer model, a vocabulary embedding from the combined vocabulary (Lyle; Col. 19, lines 48-67 & Col. 20, lines 1-13 – word embeddings; see also Col. 21, lines 55-67 - the topic analysis module 242, which is configured to identify one or more topics, keywords, themes, ideas, items, etc. that were discussed or spoken of during the video); generating one or more topic embeddings from the audio content and the vocabulary embeddings (Lyle; Col. 19, lines 48-67 & Col. 20, lines 1-13 – topic modeling techniques… Latent Dirichlet Allocation (LDA)… TOP2VEC… embedded topic models); generating a topic embedding index for the audio content based on the one or more topic embeddings (Lyle; Col. 21, lines 20-43 - …the topic mapper 260 is configured to map the detected topics to the relevant sections of the transcript, and create an indexable and easily searchable source of data representing the video content as a set of topic-labeled segments…); and storing the embedding index for use with a search engine system (Lyle; Col. 21, lines 20-43 – playback manager uses video topic mapper’s stored data to manage playback through user selections; see also Fig. 3, item 360; Col. 17, lines 9-30 – encoded data being shown to be stored in storage 360). Lyle, however, fails to disclose obtaining a predefined ad hoc vocabulary list; and, after obtaining the predefined ad hoc vocabulary list, generating a combined vocabulary that combines at least a portion of the extracted vocabulary terms with one or more ad hoc vocabulary terms from the predefined ad hoc vocabulary list. Ajmera does teach obtaining a predefined ad hoc vocabulary list (Ajmera; p. 0046-0048 - The present embodiment refers extensively to a high precision domain lexicon (HPDL) (predefined ad hoc vocabulary list). The HPDL (also referred to as a “set of category related terms”) is a collection of terms (words or sets of words) that belong to a specific domain, category, or genre (“domain”)…); and, after obtaining the predefined ad hoc vocabulary list, generating a combined vocabulary that combines at least a portion of the extracted vocabulary terms with one or more ad hoc vocabulary terms from the predefined ad hoc vocabulary list (Ajmera; p. 0051 - Processing proceeds to step S260, where discover new generation mod 304 discovers a new generation of HPDL terms from the candidate terms using the HPDL and its contextual characteristics… according to p. 0050-0051, the HPDL terms are used after step 255 which extracts candidate terms from the corpus, thus “combining” both the extracted candidate terms and the HPDL terms). Therefore, it would have been obvious for one of ordinary skill in the art to modify the method of Lyle to include obtaining a predefined ad hoc vocabulary list and, after obtaining the predefined ad hoc vocabulary list, generating a combined vocabulary that combines at least a portion of the extracted vocabulary terms with one or more ad hoc vocabulary terms from the predefined ad hoc vocabulary list, as taught by Ajmera, because domain ontologies represent concepts in very specific and often eclectic ways and are often incompatible. In the context of NLP, term extraction becomes difficult when the text being processed belongs to a different domain (for example, medical technology) than the domain from which the NLP software was built (for example, financial news) (Ajmera; p. 0004).
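For illustration of the combination as mapped above, the following is a minimal sketch of an indexing pipeline of this kind, assuming the sentence-transformers library for the transformer-model step; the helper names, model name, and the centroid-based stand-in for a topic model are hypothetical, not the implementation of either reference:

    # Hypothetical sketch; assumes sentence-transformers is installed.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

    def build_topic_index(transcript: str, ad_hoc_vocabulary: list) -> dict:
        # Extract vocabulary terms from the transcribed audio content.
        extracted = sorted({w.lower() for w in transcript.split() if w.isalpha()})
        # Combine extracted terms with the predefined ad hoc vocabulary list.
        combined = extracted + [t for t in ad_hoc_vocabulary if t not in extracted]
        # Generate vocabulary embeddings using the transformer model.
        vocab_embeddings = model.encode(combined)
        # Stand-in for a topic model: use the embedding centroid as a single
        # topic embedding (Lyle cites LDA / TOP2VEC / embedded topic models).
        topic_embedding = np.mean(vocab_embeddings, axis=0)
        # Store the topic embedding index for use with a search engine system.
        return {"vocabulary": combined, "topic_embedding": topic_embedding}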
As per claim 3, Lyle in view of Ajmera discloses: The method of claim 1, wherein the one or more topic embeddings are generated using an Embedded Topic Model (ETM) (Lyle; Col. 19, lines 48-67 & Col. 20, lines 1-13 – topic modeling techniques… Latent Dirichlet Allocation (LDA)… TOP2VEC… embedded topic models).
As per claim 4, Lyle in view of Ajmera discloses: The method of claim 1, wherein the one or more topic embeddings are generated using latent Dirichlet allocation (LDA) and Word2vec algorithms (Lyle; Col. 19, lines 48-67 & Col. 20, lines 1-13 – topic modeling techniques… Latent Dirichlet Allocation (LDA)… TOP2VEC… embedded topic models).
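For illustration of the LDA/Word2vec combination recited in claims 3 and 4, the following is a minimal sketch using the gensim library with a toy corpus and hypothetical parameters; deriving a topic embedding as a probability-weighted average of word vectors is one plausible reading of an embedded topic model, not Lyle’s disclosed implementation:

    # Hypothetical sketch; assumes gensim 4.x and a toy two-document corpus.
    import numpy as np
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, Word2Vec

    docs = [["stocks", "market", "trading"], ["goals", "match", "league"]]
    dictionary = Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]

    lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
    w2v = Word2Vec(docs, vector_size=50, min_count=1, seed=0)

    def topic_embedding(topic_id: int, top_n: int = 3) -> np.ndarray:
        # Probability-weighted average of Word2vec vectors for the topic's
        # top-N words, yielding one embedding vector per LDA topic.
        words_probs = lda.show_topic(topic_id, topn=top_n)
        weighted = np.array([w2v.wv[w] * p for w, p in words_probs])
        return weighted.sum(axis=0) / sum(p for _, p in words_probs)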
As per claim 5, Lyle in view of Ajmera discloses: The method of claim 1, wherein the audio content comprises a podcast and a podcast segment, and respective topic embeddings are generated for each of the podcast and the podcast segment (Lyle; Col. 6, lines 61-67 & Col. 7, lines 1-19 - …video streaming comprises digital video broadcasting…).
As per claim 6, Lyle in view of Ajmera discloses: The method of claim 1, wherein generating the one or more topic embeddings from the audio content comprises identifying the top N topics for the audio content (Lyle; Col. 21, lines 4-19 - topic ranking which can identify the larger themes in the video that can be used to auto-generate bookmarks or chapters in the video and/or re-arrange a video based on the identified themes).
As per claim 7, Lyle in view of Ajmera discloses: The method of claim 1, wherein obtaining the audio content comprises obtaining a transcript of an audio recording (Lyle; Col. 6, lines 12-42 - original video content is received… It is to be understood that the inference engine also receives audio data and/or transcription data which can be used in parallel with the outputted synthesized video in cases where the video includes speech or otherwise has sounds present in the video stream; see also Col. 15, lines 22-67 & Col. 16, lines 1-67 – speech transcription & NLP analysis).
As per claim 8, Lyle in view of Ajmera discloses: The method of claim 1, wherein obtaining the audio content comprises extracting audio features from an audio recording (Lyle; Col. 15, lines 46-63 - a featurizer may initially deidentify data and process and convert the data to consumable features).
As per claim 9, Lyle in view of Ajmera discloses: The method of claim 1, wherein the topic embedding index is a podcast topic embedding index corresponding to a podcast database; and the method further comprises generating a podcast segment topic embedding index corresponding to a podcast segment database (Lyle; Col. 17, lines 9-30 - In this high-level schematic, original video data (“original data”) 310 can be stored in storage 360. Original data has a first file size “A”. When the original data 310 is processed by embodiments of the video data encoding system, an encoder 320 can replace some, most, or nearly all of the video with a sequence of codes (encoded data 334). Encoded data has a second file size “B”. In addition, preserved original content 332 in the form of reference image(s) and clips for unclassified content will be preserved. Preserved original content 332 has a third file size C. Collectively, compressed video data 330 will have a file size of “B+C”. As a general matter, the file size “A” as compared to the file size “B+C” is significantly larger, and will typically be hundreds to thousands time greater in size. Nevertheless, when the compressed video data 330 is provided to the video synthesizer (“simulator”) and its associated decoder 340, the outputted simulated data 350 can carry the bulk of the information that had been captured as original data 310).
As per claim 10, Lyle in view of Ajmera discloses: The method of claim 1, wherein the topic embedding index corresponds to a podcast database that includes entries for full episodes and entries for episode segments (Lyle; Col. 17, lines 9-30 - In this high-level schematic, original video data (“original data”) 310 can be stored in storage 360. Original data has a first file size “A”. When the original data 310 is processed by embodiments of the video data encoding system, an encoder 320 can replace some, most, or nearly all of the video with a sequence of codes (encoded data 334). Encoded data has a second file size “B”. In addition, preserved original content 332 in the form of reference image(s) and clips for unclassified content will be preserved. Preserved original content 332 has a third file size C. Collectively, compressed video data 330 will have a file size of “B+C”. As a general matter, the file size “A” as compared to the file size “B+C” is significantly larger, and will typically be hundreds to thousands time greater in size. Nevertheless, when the compressed video data 330 is provided to the video synthesizer (“simulator”) and its associated decoder 340, the outputted simulated data 350 can carry the bulk of the information that had been captured as original data 310).
As per claim 11, Lyle in view of Ajmera discloses: The method of claim 1, wherein the search engine system comprises a semantic search engine (Lyle; Col. 5, lines 19-23 - The incorporation of transcript data further enables the user to search by keywords, or—following processing by a topic analysis module-broader topics (semantic searching) in the video as detected by the video navigation system).
As per claim 13, Lyle in view of Ajmera discloses: The method of claim 1, wherein the vocabulary terms include one or more of: a phrase and a sentence (Lyle; Col. 15, lines 47-63 - NLP techniques may be used to process sample speech data as well as to interpret the language, for example by parsing sentences, and determining underlying meanings of the words).
As per claim 21, Lyle in view of Ajmera discloses: The method of claim 1, wherein the ad hoc vocabulary list includes terms having a low frequency and high distinctiveness (Ajmera; p. 0046-0048 - The present embodiment refers extensively to a high precision domain lexicon (HPDL) (predefined ad hoc vocabulary list). The HPDL (also referred to as a “set of category related terms”) is a collection of terms (words or sets of words) that belong to a specific domain (low frequency and high distinctiveness), category, or genre (“domain”)…).
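For illustration of the “low frequency and high distinctiveness” property recited in claim 21, the following is a minimal sketch that scores candidate terms with an IDF-weighted measure; the scoring function and names are hypothetical and are not drawn from Ajmera’s HPDL embodiment:

    # Hypothetical sketch; counts are toy inputs, not data from Ajmera.
    import math

    def distinctive_terms(domain_counts, background_doc_freq, total_docs, k=10):
        # Score each domain term by its frequency within the domain weighted
        # by inverse document frequency in the background corpus, so rare
        # but domain-specific terms rank highest.
        scores = {}
        for term, tf in domain_counts.items():
            df = background_doc_freq.get(term, 0) + 1  # smoothing
            scores[term] = tf * math.log(total_docs / df)
        return sorted(scores, key=scores.get, reverse=True)[:k]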
Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Lyle in view of Ajmera, and further in view of Hellman (US PG Pub 20220013023).
As per claim 2, Lyle in view of Ajmera discloses: The method of claim 1, upon which claim 2 depends. Lyle in view of Ajmera, however, fails to disclose wherein the transformer model comprises a SentenceBERT model. Hellman does teach wherein the transformer model comprises a SentenceBERT model (Hellman; p. 0079 - the disclosed embodiments may consider embedding sentences using SBERT, a version of BERT that has been fine-tuned on the SNLI and Multi-Genre NLI tasks). Therefore, it would have been obvious for one of ordinary skill in the art to modify the method of Lyle and Ajmera to include wherein the transformer model comprises a SentenceBERT model, as taught by Hellman, in order to predict how sentences relate to one another. This means that the SBERT network has been specifically fine-tuned to embed individual sentences into a common space (Hellman; p. 0079).
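For illustration of the SBERT embedding step taught by Hellman, the following is a minimal sketch using the sentence-transformers library; the specific model name is an assumption, not taken from the reference:

    # Hypothetical sketch; assumes sentence-transformers is installed.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT variant
    emb = model.encode(["episode about market trends",
                        "discussion of stock prices"], convert_to_tensor=True)
    # Sentences are embedded into a common space, so their relatedness can
    # be measured directly by cosine similarity.
    print(util.cos_sim(emb[0], emb[1]))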
Claims 14, 15, 18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lyle in view of Ajmera, and further in view of Asgekar (US PG Pub 20230142718).
As per claims 14, 18 and 20, Lyle in view of Ajmera discloses: The method, system and non-transitory computer-readable storage medium of claims 1, 17 and 19, upon which claims 14, 18 and 20 depend. Lyle in view of Ajmera, however, fails to disclose receiving a query string from a user; converting the query string to a query topic embedding; and obtaining one or more search results by comparing the query topic embedding with the topic embedding index. Asgekar does teach receiving a query string from a user (Asgekar; p. 0100 - the query embeddings generator 235 can generate query embeddings by inputting a set of query terms received from a client device into the transformer model. The query terms can be provided as part of a content request 280. As described herein above, a content request 280 can include one or more words, which can form a question, another type of sentence, phrase, or other text information); converting the query string to a query topic embedding (Asgekar; p. 0100 - Because the content embeddings 275 and the query embeddings are generated using the same transformer model 250, the query embeddings and the content embeddings can share the same embeddings space. Thus, the query embeddings can be compared to any vector in the embeddings space, such as the pivots, to determine a distance (e.g., relatedness) of a query embedding to a content embedding 275 (e.g., and thus a corresponding information resource 270, etc.)); and obtaining one or more search results by comparing the query topic embedding with the topic embedding index (Asgekar; p. 0101 - Once the query embeddings are generated, the information set selector 240 can select a subset of the information resources 270 that are related to the content request 280 from which the query embeddings were generated. The information set selector 240 can determine related information resources by calculating a distance in the embeddings space between the query embeddings and the plurality of pivots). Therefore, it would have been obvious for one of ordinary skill in the art to modify the method of Lyle and Ajmera to include receiving a query string from a user; converting the query string to a query topic embedding; and obtaining one or more search results by comparing the query topic embedding with the topic embedding index, as taught by Asgekar, in order to automatically classify and index teaching media such that it can be easily accessed and provided to best answer a student query or to achieve a requested learning objective (Asgekar; p. 0002).
As per claim 15, Lyle in view of Ajmera and Asgekar discloses: The method of claim 14, upon which claim 15 depends. Further, Asgekar teaches wherein the query is a word, a phrase, or a sentence (Asgekar; p. 0100 - the query embeddings generator 235 can generate query embeddings by inputting a set of query terms received from a client device into the transformer model. The query terms can be provided as part of a content request 280. As described herein above, a content request 280 can include one or more words, which can form a question, another type of sentence, phrase, or other text information).
Therefore, it would have been obvious for one of ordinary skill in the art to modify the method of Lyle and Ajmera to include wherein the query is a word, a phrase, or a sentence, as taught by Asgekar, in order to automatically classify and index teaching media such that it can be easily accessed and provided to best answer a student query or to achieve a requested learning objective (Asgekar; p. 0002).
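For illustration of the query flow taught by Asgekar as applied to the claimed topic embedding index, the following is a minimal sketch that embeds the query with the same transformer model used at indexing time and ranks indexed topic embeddings by cosine similarity; all names are hypothetical:

    # Hypothetical sketch; `model` is any sentence-level transformer encoder
    # shared with the indexing step, and `index` maps entry IDs to vectors.
    import numpy as np

    def search(query: str, index: dict, model, top_k: int = 5):
        q = model.encode(query)  # convert the query string to an embedding
        def cos(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        # Compare the query embedding against every indexed topic embedding.
        scored = sorted(((cos(q, v), k) for k, v in index.items()), reverse=True)
        return scored[:top_k]  # best-matching entries as (score, id) pairs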
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Lyle in view of Ajmera, and further in view of Tomkins (US PG Pub 20210295822).
As per claim 16, Lyle in view of Ajmera discloses: The method of claim 1, upon which claim 16 depends. Lyle in view of Ajmera, however, fails to disclose wherein extracting the vocabulary terms from the audio content includes one or more of: removing punctuation and stop words from a transcript, and filtering one or more words from the transcript. Tomkins does teach wherein extracting the vocabulary terms from the audio content includes one or more of: removing punctuation and stop words from a transcript, and filtering one or more words from the transcript (Tomkins; p. 0147 - some embodiments may determine that each word of the query may be used as an n-gram, where one or more of the words may be modified or deleted based on a set of filters that remove stop words, lemmatizes words, stems words, or the like). Therefore, it would have been obvious for one of ordinary skill in the art to modify the method of Lyle and Ajmera to include wherein extracting the vocabulary terms from the audio content includes one or more of: removing punctuation and stop words from a transcript, and filtering one or more words from the transcript, as taught by Tomkins, in order to achieve cross-context natural language model generation (Tomkins; p. 0002).
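For illustration of the transcript cleanup taught by Tomkins, the following is a minimal sketch that removes punctuation, drops stop words, and filters short tokens; the stop-word set is a small hypothetical sample:

    # Hypothetical sketch; the stop-word set is a small illustrative sample.
    import re

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in"}

    def extract_terms(transcript: str, min_len: int = 3) -> list:
        # Remove punctuation, lowercase, then drop stop words and short tokens.
        text = re.sub(r"[^\w\s]", " ", transcript.lower())
        return [w for w in text.split()
                if w not in STOP_WORDS and len(w) >= min_len]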
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. It includes: Thambiratnam (US PG Pub 20160259782), which discloses: Content-based analysis is performed on multimedia content prior to encoding the multimedia content in the rendering chain of processing. A content-based index stream is generated based on the content-based analysis and the content-based index stream is embedded in the multimedia file during rendering. The content-based index stream can be used to generate a content-based searchable index when necessary (Thambiratnam; Abstract).
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Rodrigo A Chavez whose telephone number is (571)270-0139. The examiner can normally be reached Monday - Friday 9-6 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at 571-272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RODRIGO A CHAVEZ/Examiner, Art Unit 2658
/RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658