DETAILED ACTION
This Office action is in response to the above-identified application filed on May 19, 2025. Upon entry of a preliminary amendment filed on May 28, 2025, which amended the title of the specification and the claims, the application contains claims 1-21:
Claim 1 is cancelled.
Claims 2-21 are newly added.
Claims 2-21 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
The present application is a continuation of Application No. 16/528,539, filed 07/31/2019, now U.S. Patent No. 12,332,937, and having two RCE-type filings therein.
Information Disclosure Statement
The information disclosure statement (IDS) was submitted on February 03, 2026. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 2-5, 8-15, and 18-21 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-4, 6-13, and 15-18 of U.S. Patent No. 11,494,434. Although the claims at issue are not identical, they are not patentably distinct from each other because the claims in the present application are merely broader, slightly varied versions of the respective claims recited in U.S. Patent No. 11,494,434, as shown in the comparison below:
Present Application
2. A method for responding to voice queries, the method comprising:
receiving a voice query received at an audio interface;
extracting, using control circuitry, one or more keywords from the voice query;
generating, using the control circuitry, a text query based at least in part on the one or more keywords;
identifying an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity; and
retrieving a content item associated with the entity.
3. The method of claim 2, wherein the one or more alternate text representations comprises a phonetic representation of the identifier associated with the entity.
4. The method of claim 2, wherein the one or more alternate text representations comprises an alternate spelling of the identifier associated with the entity.
5. The method of claim 2, wherein the one or more alternate text representations comprises a text string generated based at least in part on a previous speech-to-text conversion.
8. The method of claim 2, wherein the identifying the entity is based at least in part on user profile information.
9. The method of claim 2, wherein the identifying the entity is based at least in part on popularity information associated with the entity.
10. The method of claim 2, wherein the entity is a first entity, and further comprising:
identifying a second entity based at least in part on the text query and metadata for the second entity; and
determining a first score for the first entity based at least in part on a comparison of the text query to the metadata associated with the first entity, and determining a second score for the second entity based at least in part on a comparison of the text query to metadata associated with the second entity, wherein the content item associated with the first entity is retrieved by selecting a maximum score of the first score and the second score.
11. The method of claim 2, wherein the text query is a first text query, and further comprising:
generating a plurality of text queries, wherein the plurality of text queries comprises the first text query, and wherein each text query of the plurality of text queries is generated based at least in part on a respective pronunciation setting of a speech-to-text module.
12. A system for responding to voice queries, the system comprising:
a memory; and
control circuitry configured to: receive a voice query received at an audio interface;
extract, using control circuitry, one or more keywords from the voice query;
generate, using the control circuitry, a text query based at least in part on the one or more keywords;
store in the memory the text query;
identify an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity; and
retrieve a content item associated with the entity.
13. The system of claim 12, wherein the one or more alternate text representations comprises a phonetic representation of the identifier associated with the entity.
14. The system of claim 12, wherein the one or more alternate text representations comprises an alternate spelling of the identifier associated with the entity.
15. The system of claim 12, wherein the one or more alternate text representations comprises a text string generated based at least in part on a previous speech-to-text conversion.
18. The system of claim 12, wherein the identifying the entity is based at least in part on user profile information.
19. The system of claim 12, wherein the identifying the entity is based at least in part on popularity information associated with the entity.
20. The system of claim 12, wherein the entity is a first entity, and the system is configured to:
identify a second entity based at least in part on the text query and metadata for the second entity; and
determine a first score for the first entity based at least in part on a comparison of the text query to the metadata associated with the first entity, and determine a second score for the second entity based at least in part on a comparison of the text query to metadata associated with the second entity, wherein the content item associated with the first entity is retrieved by selecting a maximum score of the first score and the second score.
21. The system of claim 12, wherein the text query is a first text query, and further comprising:
generating a plurality of text queries, wherein the plurality of text queries comprises the first text query, and wherein each text query of the plurality of text queries is generated based at least in part on a respective pronunciation setting of a speech-to-text module.
U.S. Patent No. 11,494,434

1. A method for responding to voice queries, the method comprising:
receiving a voice query at an audio interface;
extracting, using control circuitry, one or more keywords from the voice query;
generating, using the control circuitry, a text query based on the one or more keywords;
identifying an entity based on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of the entity based on pronunciation of an identifier associated with the entity, and wherein identifying the entity comprises: identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining a respective score for each respective entity of the plurality of entities based on comparing the respective one or more alternate text representations with the text query; and
selecting the entity by determining a maximum score; and
retrieving a content item associated with the entity.
2. The method of claim 1, wherein the one more alternate text representations comprise a phonetic representation of the entity.
3. The method of claim 1, wherein the one more alternate text representations comprise an alternate spelling of the entity based on pronunciation.
4. The method of claim 1, wherein the one or more alternate text representations of the entity comprise a text string generated based on a previous speech-to-text conversion.
6. The method of claim 1, wherein identifying the entity is further based on user profile information.
7. The method of claim 1, wherein identifying the entity is further based on popularity information associated with the entity.
9. The method of claim 8, further comprising:
identifying, based on a respective text query of the plurality of text queries, a respective entity;
determining a respective score for the respective entity based on a comparison of the respective text query to metadata associated with the respective entity; and identifying the entity by selecting a maximum score of the respective scores.
8. The method of claim 1, further comprising
generating a plurality of text queries, wherein the plurality of text queries comprises the text query, and wherein each text query of the plurality of text queries is generated based on a respective setting of a speech-to-text module of the control circuitry.
10. A system for responding to voice queries, the system comprising:
an audio interface for receiving a voice query;
control circuitry configured to:
extract one or more keywords from the voice query;
generate a text query based on the one or more keywords;
identify an entity based on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of the entity based on pronunciation of an identifier associated with the entity, and wherein the control circuitry is further configured to identify the entity by: identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities, determining a respective score for each respective entity of the plurality of entities based on comparing the respective one or more alternate text representations with the text query; and
selecting the entity by determining a maximum score; and
retrieve a content item associated with the entity.
11. The system of claim 10, wherein the one more alternate text representations comprise a phonetic representation of the entity.
12. The system of claim 10, wherein the one more alternate text representations comprise an alternate spelling of the entity based on pronunciation.
13. The system of claim 10, wherein the one or more alternate text representations of the entity comprise a text string generated based on a previous speech-to-text conversion.
15. The system of claim 10, wherein the control circuitry is further configured to identify the entity based on user profile information.
16. The system of claim 10, wherein the control circuitry is further configured to identify the entity based on popularity information associated with the entity.
18. The system of claim 17, wherein the control circuitry is further configured to:
identify, based on a respective text query of the plurality of text queries, a respective entity;
determine a respective score for the respective entity based on a comparison of the respective text query to metadata associated with the respective entity; and identify the entity by selecting a maximum score of the respective scores.
17. The system of claim 10, wherein the control circuitry is further configured to
generate a plurality of text queries, wherein the plurality of text queries comprises the text query, wherein the control circuitry comprises a speech-to-text module, and wherein each text query of the plurality of text queries is generated based on a respective setting of a speech-to-text module.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 2-21 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The 2019 PEG guidance for subject matter eligibility is applied in the following analyses:
At Step 1
The inventions of claims 2-21 are directed to the statutory categories of a process (claims 2-11) and a machine (claims 12-21). Thus, the claimed invention is directed to statutory subject matter.
At Step 2A, Prong One
The claimed invention is directed to mental processes without significantly more. Claims 2 and 12 recite abstract ideas in the following limitations:
"extracting … one or more keywords from the voice query” recites a mental process as an evaluation or judgement of the important (i.e. key) words in audio. One listening to speech or audio can mentally evaluate that certain words heard are “important”, consistent with the specification at [0052].
“generating … a text query based at least in part on the one or more keywords” recites a mental process as one can mentally arrange the one or more keywords in a suitable order or omit one or more words of the voice query, consistent with the specification at [0055].
“identifying an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity” recites a mental process as an evaluation (i.e., comparison) of the text query to stored alternate text representations of the entity based on pronunciation. Consistent with the specification at [0066], one can mentally compare text strings and determine a match.
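For illustration only, the following minimal sketch reduces the recited steps to simple string operations of the kind characterized above as mentally performable; the stop-word list, metadata entries, and function names are hypothetical and are not drawn from the claims or the specification.

```python
# Illustrative sketch only: the recited limitations reduced to simple
# string operations. All data and names here are hypothetical.

STOP_WORDS = {"play", "the", "a", "some", "music", "by"}

def extract_keywords(transcribed_query: str) -> list[str]:
    # "extracting ... one or more keywords": judge which words are important.
    return [w for w in transcribed_query.lower().split() if w not in STOP_WORDS]

def generate_text_query(keywords: list[str]) -> str:
    # "generating ... a text query": arrange the keywords in a suitable order.
    return " ".join(keywords)

def identify_entity(text_query: str, metadata: dict[str, list[str]]) -> str | None:
    # "identifying an entity": compare the text query against stored
    # alternate text representations of each entity's identifier.
    for entity, alternate_representations in metadata.items():
        if text_query in alternate_representations:
            return entity
    return None

metadata = {"Adele": ["adele", "uh-dell", "adel"]}  # hypothetical metadata
query = generate_text_query(extract_keywords("Play some music by Adel"))
print(identify_entity(query, metadata))  # -> "Adele"
```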
At Step 2A, Prong Two
This judicial exception is not integrated into a practical application because the claims recite the additional elements of:
“an audio interface”, “control circuitry”, and “a memory” (claim 12) constitute a high-level recitation of generic computer components and represent mere instructions to apply the exception on a computer, see MPEP 2106.05(f).
“receiving a voice query” constitutes preliminary data gathering, see MPEP 2106.05(g).
“retrieving a content item associated with the entity” constitutes preliminary data gathering, see MPEP 2106.05(g), or mere instructions to ‘apply it’ under MPEP 2106.05(f).
“store in the memory the text query” may be characterized as insignificant extra-solution activity, particularly post-solution activity, see MPEP 2106.05(g).
Even when viewed in combination, these additional elements do not integrate the recited judicial exception into a practical application and the claim is directed to the judicial exception.
At Step 2B
Claims 2 and 12 do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, as discussed above, the additional elements constitute a high-level recitation of generic computer components representing mere instructions to apply the exception on a computer, preliminary data gathering, and insignificant extra-solution activity, particularly post-solution activity. As identified by the courts, retrieving, receiving, and storing data are well-understood, routine, and conventional activities, see MPEP 2106.05(d). [Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); Storing and retrieving information in memory, Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015); OIP Techs., 788 F.3d at 1363, 115 USPQ2d at 1092-93.]
Even when considered in combination, these additional elements do not provide an inventive concept or significantly more.
Therefore, claims 2 and 12 are rejected under 35 USC 101 as being directed to an abstract idea without significantly more.
Dependent claims 3-11 and 13-21 each recite abstract ideas that elaborate on further details of the “identifying” recited in independent claims 2 and 12 and remain mentally performable.
Therefore, dependent claims 3-11 and 13-21 are also rejected under 35 USC 101 as being directed to an abstract idea without significantly more.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 2, 3, 5, 7-10, 12, 13, 15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Jang (US 20140359523 A1), in view of Ramos et al. (US 11157696 B1), and in further view of Olstad et al. (US 20130132374 A1).
With regard to claim 2,
Jang teaches
a method for responding to voice queries (Fig. 11; [0165]-[0169]; Fig. 12; [0170]-[0173]), the method comprising:
receiving a voice query received at an audio interface ([0166]; Fig. 1, microphone 122; [0047]: receive a voice query through an audio input component such as a microphone, wherein the microphone corresponds to “an audio interface”);
extracting, using control circuitry (Fig. 1; [0073]-[0076]: controller 180 corresponds to “control circuitry”), one or more keywords from the voice query ([0170]: identify a query term of the voice query, wherein a query term corresponds to “one or more keywords from the voice query”);
generating, using the control circuitry, a text query based at least in part on the one or more keywords (Fig. 11; [0167]-[0168]; Fig. 12; [0170]: convert the voice query to a text query, which identifies query terms from the voice query, determines pronunciation information for each query term, and converts each query term into a typical text query term using a voice query term database that links a range of pronunciation of terms to a typical query term. As a result, a text query is generated comprising the query terms and their pronunciations);
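As a hedged illustration of this reading of Jang, the sketch below models a voice query term database as a mapping from pronunciation variants to a typical query term; the table contents and function names are hypothetical and are not taken from Jang.

```python
# Hypothetical sketch of a voice query term database in the style the
# examiner attributes to Jang: pronunciation variants of a spoken term
# are linked to a single "typical" text query term.

VOICE_QUERY_TERM_DB = {
    ("noo-yawk", "new yawk", "nyu york"): "new york",
    ("restront", "restaurant", "restoraunt"): "restaurant",
}

def to_typical_term(pronounced_term: str) -> str:
    # Look up which pronunciation range the term falls within; fall back
    # to the term itself when no entry matches.
    for pronunciations, typical in VOICE_QUERY_TERM_DB.items():
        if pronounced_term in pronunciations:
            return typical
    return pronounced_term

def convert_voice_query(query_terms: list[str]) -> str:
    # Convert each identified query term into its typical text form,
    # yielding the generated text query.
    return " ".join(to_typical_term(t) for t in query_terms)

print(convert_voice_query(["noo-yawk", "restront"]))  # -> "new york restaurant"
```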
Jang does not explicitly teach
identifying an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity; and
retrieving a content item associated with the entity.
Ramos teaches
identifying an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises a pronunciation tag comprising a phonetic spelling for the entity (Fig. 1, 138-144; Col. 3, lines 61-67; Col. 4, lines 1-56: perform entity resolution based on the tagged portion of text data and the portion of audio data corresponding to the tagged portion of text by comparing the portion of audio data against audio data representing entities known to the system, wherein performing entity resolution corresponds to “identifying an entity”, the tagged portion of text data corresponds to “the text query”, and audio data representing entities known to the system corresponds to “a pronunciation tag” comprised in the “metadata for the entity”. Fig. 6; Col. 16, lines 42-67; Col. 17, lines 1-18: audio data comprises phonetic representation of text data, wherein the phonetic representation reads on "phonetic spelling"); and
retrieving a content item associated with the entity (Fig. 1, 146; Col. 4, lines 57-62; Col. 2, lines 10-18: use the resolved entity to perform downstream processes. For example, for the user input of "Alexa, play Adele music," a system may output music sung by Adele, wherein output indicates “retrieving”, and music sung by Adele corresponds to “a content item associated with the entity” Adele).
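A minimal sketch of phonetic entity resolution in the style the examiner attributes to Ramos follows; the crude vowel-dropping key is a stand-in for a real phonetic encoding, and all entity data are hypothetical.

```python
# Hedged sketch of phonetic entity resolution: a query's phonetic form is
# compared against stored phonetic representations of known entities.
# The vowel-dropping key below is a toy encoding, not Ramos's actual one.

def crude_phonetic_key(text: str) -> str:
    # Keep the first letter, then drop vowels: a crude stand-in encoding.
    text = "".join(c for c in text.lower() if c.isalpha())
    return text[:1] + "".join(c for c in text[1:] if c not in "aeiou")

KNOWN_ENTITIES = {  # hypothetical entity store with phonetic metadata
    "Adele": crude_phonetic_key("Adele"),
    "Adela": crude_phonetic_key("Adela"),
}

def resolve_entity(tagged_text: str) -> str | None:
    # Compare the query's phonetic key against each known entity's key;
    # ties resolve to the first match in this simplified sketch.
    key = crude_phonetic_key(tagged_text)
    matches = [name for name, k in KNOWN_ENTITIES.items() if k == key]
    return matches[0] if matches else None

print(resolve_entity("Adel"))  # -> "Adele"
```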
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jang to incorporate the teachings of Ramos to identify an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity, and retrieve a content item associated with the entity. Doing so would improve text-based entity resolution by providing language-agnostic phonetic searching as part of entity resolution when text-based entity resolution may be unsuccessful or successful to a degree below a requisite threshold confidence, as taught by Ramos (Col. 2, lines 48-67).
Jang and Ramos do not teach
identifying an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity;
Olstad teaches
identifying an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity (Claim 20; [0039]: identify videos in the set of videos based on metadata associated with the videos, wherein the metadata includes phonetic transcription extracted from the audio track, and the phoneme sequences included in a phonetic transcription of the audio track are matched with a phonetic representation of the query to find locations inside the audio track with the best phonetic similarity. “query terms” indicates “text query” and phonetic transcription metadata is “one or more alternate text representations” of a speech-to-text transcription associated with the identified videos, i.e., “an identifier”);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jang and Ramos to incorporate the teachings of Olstad to identify an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity. Doing so would find locations inside the audio track with the best phonetic similarity to a user query, improve search precision, and require less analysis, including metadata generation, as taught by Olstad ([0039]).
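To illustrate the phoneme-sequence matching the examiner attributes to Olstad, the following hedged sketch slides a query's phoneme sequence over a transcription and reports the best-matching location; the phoneme strings are hypothetical.

```python
# Hedged sketch: phoneme sequences from a phonetic transcription of an
# audio track are compared against a phonetic representation of the query
# to locate the position of best phonetic similarity.

def best_match_location(transcription: list[str], query: list[str]) -> tuple[int, int]:
    # Slide the query over the transcription, counting position-wise
    # phoneme matches; return (start_index, match_count) of the best window.
    best = (0, -1)
    for start in range(len(transcription) - len(query) + 1):
        window = transcription[start:start + len(query)]
        score = sum(a == b for a, b in zip(window, query))
        if score > best[1]:
            best = (start, score)
    return best

track = ["DH", "AH", "AE", "D", "EH", "L", "S", "ONG"]  # hypothetical phonemes
query = ["AH", "D", "EH", "L"]
print(best_match_location(track, query))  # -> (2, 3): best similarity at index 2
```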
With regard to claim 3,
As discussed in claim 2, Jang and Ramos and Olstad teach all the limitations therein.
Olstad further teaches
the method of claim 2, wherein the one or more alternate text representations comprises a phonetic representation of the identifier associated with the entity (Claim 20; [0039]: the phonetic transcription included in the metadata that is extracted from the audio track of the identified videos corresponds to “a phonetic representation”).
With regard to claim 5,
As discussed in claim 2, Jang and Ramos and Olstad teach all the limitations therein.
Olstad further teaches
the method of claim 2, wherein the one or more alternate text representations comprises a text string generated based at least in part on a previous speech-to-text conversion. ([0039]: phonetic transcription being an alternative to speech-to-text transcription of the audio track in the video indicates generating a text string by either speech-to-text conversion or phonetic transcription, which takes place before the metadata is used for search, i.e., “a previous”).
With regard to claim 7,
As discussed in claim 2, Jang and Ramos and Olstad teach all the limitations therein.
Olstad further teaches
the method of claim 2, wherein the identifier associated with the entity identifies information related to the entity (Claim 20; [0039]: the phonetic transcription included in the metadata of the identified videos identifies the audio track in the videos, i.e., “identifies information related to” the videos, i.e., “the entity”).
With regard to claim 8,
As discussed in claim 2, Jang and Ramos and Olstad teach all the limitations therein.
Ramos further teaches
the method of claim 2, wherein the identifying the entity is based at least in part on user profile information. (Col. 20, lines 13-19; Col. 7, lines 61-66: the phonetic entity resolution component 802 may consider user preferences, wherein user preferences read on “user profile information”).
With regard to claim 9,
As discussed in claim 2, Jang and Ramos and Olstad teach all the limitations therein.
Ramos further teaches
the method of claim 2, wherein the identifying the entity is based at least in part on popularity information associated with the entity. (Col. 20, lines 13-19: the phonetic entity resolution component 802 may consider popularity of known entities).
With regard to claim 10,
As discussed in claim 2, Jang and Ramos and Olstad teach all the limitations therein.
Olstad further teaches
the method of claim 2, wherein the entity is a first entity, and further comprising:
identifying a second entity based at least in part on the text query and metadata for the second entity (Claim 20; [0039]; [0003]: querying videos by matching a phonetic representation of the query to phonetic transcription of the audio track inside the videos indicates identifying plural videos that include a first video, a second video, and potentially more, wherein each video corresponds to an “entity”), and
Ramos further teaches
determining a first score for the first entity based at least in part on a comparison of the text query to the metadata associated with the first entity, and determining a second score for the second entity based at least in part on a comparison of the text query to metadata associated with the second entity, wherein the content item associated with the first entity is retrieved by selecting a maximum score of the first score and the second score (Fig. 8; Col. 19, lines 27-41: perform phonetic matching of the audio data (representing the entity to be resolved) to audio data stored in the entity storage (608/706) and associate with each known entity a confidence value representing the confidence that the known entity corresponds to the entity in the user input, wherein the confidence value corresponds to a “score” and the phonetic matching corresponds to a “comparison”. Col. 20, lines 20-41: the N-best list may include a maximum number of top-scoring known entities, wherein selecting the maximum number of top-scoring known entities indicates that not only is every entity scored but also that only the ones with "a maximum score" are selected).
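A hedged sketch of the claimed max-score selection, as mapped above onto Ramos's confidence values, follows; the token-overlap similarity is a hypothetical stand-in for Ramos's confidence computation.

```python
# Hedged sketch: score each candidate entity by comparing the text query
# to its metadata, then select the entity with the maximum score. The
# similarity measure is a hypothetical token-overlap ratio.

def score(text_query: str, alternate_representations: list[str]) -> float:
    # Fraction of query tokens found in any alternate representation.
    tokens = text_query.lower().split()
    hits = sum(any(t in rep for rep in alternate_representations) for t in tokens)
    return hits / len(tokens) if tokens else 0.0

candidates = {  # hypothetical metadata for a first and a second entity
    "Adele": ["adele", "uh-dell"],
    "Adam Lambert": ["adam lambert", "adam"],
}

def select_entity(text_query: str) -> str:
    # Determine a respective score for each entity and select the maximum.
    return max(candidates, key=lambda e: score(text_query, candidates[e]))

print(select_entity("uh-dell"))  # -> "Adele"
```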
With regard to claim 12,
Jang teaches
a system for responding to voice queries (Fig. 11; [0165]-[0169]; Fig. 12; [0170]-[0173]), the system comprising:
a memory (Fig. 1: memory 160); and
control circuitry (Fig. 1; [0073]-[0076]: controller 180 corresponds to “control circuitry”) configured to:
receive a voice query received at an audio interface ([0166]; Fig. 1, microphone 122; [0047]: receive a voice query through an audio input component such as a microphone, wherein the microphone corresponds to “an audio interface”);
extract, using control circuitry (Fig. 1; [0073]-[0076]: controller 180 corresponds to “control circuitry”), one or more keywords from the voice query ([0170]: identify a query term of the voice query, wherein a query term corresponds to “one or more keywords from the voice query”);
generate, using the control circuitry, a text query based at least in part on the one or more keywords (Fig. 11; [0167]-[0168]; Fig. 12; [0170]: convert the voice query to a text query, which identifies query terms from the voice query, determines pronunciation information for each query term, and converts each query term into a typical text query term using a voice query term database that links a range of pronunciation of terms to a typical query term. As a result, a text query is generated comprising the query terms and their pronunciations);
store in the memory the text query ([0177]: the mobile terminal 100 stores the information search request in the memory 160);
Jang does not explicitly teach
identify an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity; and
retrieve a content item associated with the entity.
Ramos teaches
identify an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises a pronunciation tag comprising a phonetic spelling for the entity (Fig. 1, 138-144; Col. 3, lines 61-67; Col. 4, lines 1-56: perform entity resolution based on the tagged portion of text data and the portion of audio data corresponding to the tagged portion of text by comparing the portion of audio data against audio data representing entities known to the system, wherein performing entity resolution corresponds to “identifying an entity”, the tagged portion of text data corresponds to “the text query”, and audio data representing entities known to the system corresponds to “a pronunciation tag” comprised in the “metadata for the entity”. Fig. 6; Col. 16, lines 42-67; Col. 17, lines 1-18: audio data comprises phonetic representation of text data, wherein the phonetic representation reads on "phonetic spelling"); and
retrieve a content item associated with the entity (Fig. 1, 146; Col. 4, lines 57-62; Col. 2, lines 10-18: use the resolved entity to perform downstream processes. For example, for the user input of "Alexa, play Adele music," a system may output music sung by Adele, wherein output indicates “retrieving”, and music sung by Adele corresponds to “a content item associated with the entity” Adele).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jang to incorporate the teachings of Ramos to identify an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity, and retrieve a content item associated with the entity. Doing so would improve text-based entity resolution by providing language-agnostic phonetic searching as part of entity resolution when text-based entity resolution may be unsuccessful or successful to a degree below a requisite threshold confidence, as taught by Ramos (Col. 2, lines 48-67).
Jang and Ramos do not teach
identify an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity;
Olstad teaches
identify an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity (Claim 20; [0039]: identify videos in the set of videos based on metadata associated with the videos, wherein the metadata includes phonetic transcription extracted from the audio track, and the phoneme sequences included in a phonetic transcription of the audio track are matched with a phonetic representation of the query to find locations inside the audio track with the best phonetic similarity. “query terms” indicates “text query” and phonetic transcription metadata is “one or more alternate text representations” of a speech-to-text transcription associated with the identified videos, i.e., “an identifier”);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jang and Ramos to incorporate the teachings of Olstad to identify an entity based at least in part on the text query and metadata for the entity, wherein the metadata comprises one or more alternate text representations of an identifier associated with the entity. Doing so would find locations inside the audio track with the best phonetic similarity to a user query, improve search precision, and require less analysis, including metadata generation, as taught by Olstad ([0039]).
With regard to claim 13,
As discussed in claim 12, Jang and Ramos and Olstad teach all the limitations therein.
Olstad further teaches
the system of claim 12, wherein the one or more alternate text representations comprises a phonetic representation of the identifier associated with the entity (Claim 20; [0039]: the phonetic transcription included in the metadata that is extracted from the audio track of the identified videos corresponds to “a phonetic representation”).
With regard to claim 15,
As discussed in claim 12, Jang and Ramos and Olstad teach all the limitations therein.
Olstad further teaches
the system of claim 12, wherein the one or more alternate text representations comprises a text string generated based at least in part on a previous speech-to-text conversion. ([0039]: phonetic transcription being an alternative to speech-to-text transcription of the audio track in the video indicates generating a text string by either speech-to-text conversion or phonetic transcription, which takes place before the metadata is used for search, i.e., “a previous”).
With regard to claim 17,
As discussed in claim 12, Jang and Ramos and Olstad teach all the limitations therein.
Olstad further teaches
the system of claim 12, wherein the identifier associated with the entity identifies information related to the entity (Claim 20; [0039]: the phonetic transcription included in the metadata of the identified videos identifies the audio track in the videos, i.e., “identifies information related to” the videos, i.e., “the entity”).
With regard to claim 18,
As discussed in claim 12, Jang and Ramos and Olstad teach all the limitations therein.
Ramos further teaches
the system of claim 12, wherein the identifying the entity is based at least in part on user profile information. (Col. 20, lines 13-19; Col. 7, lines 61-66: the phonetic entity resolution component 802 may consider user preferences, wherein user preferences read on “user profile information”).
With regard to claim 19,
As discussed in claim 12, Jang and Ramos and Olstad teach all the limitations therein.
Ramos further teaches
the system of claim 12, wherein the identifying the entity is based at least in part on popularity information associated with the entity. (Col. 20, lines 13-19: the phonetic entity resolution component 802 may consider popularity of known entities).
With regard to claim 20,
As discussed in claim 12, Jang and Ramos and Olstad teach all the limitations therein.
Olstad further teaches
the system of claim 12, wherein the entity is a first entity, and the system is configured to:
identify a second entity based at least in part on the text query and metadata for the second entity (Claim 20; [0039]; [0003]: querying videos by matching a phonetic representation of the query to phonetic transcription of the audio track inside the videos indicates identifying plural videos that include a first video, a second video, and potentially more, wherein each video corresponds to an “entity”), and
Ramos further teaches
determine a first score for the first entity based at least in part on a comparison of the text query to the metadata associated with the first entity, and determine a second score for the second entity based at least in part on a comparison of the text query to metadata associated with the second entity, wherein the content item associated with the first entity is retrieved by selecting a maximum score of the first score and the second score (Fig. 8; Col. 19, lines 27-41: perform phonetic matching of the audio data (representing the entity to be resolved) to audio data stored in the entity storage (608/706) and associate with each known entity a confidence value representing the confidence that the known entity corresponds to the entity in the user input, wherein the confidence value corresponds to a “score” and the phonetic matching corresponds to a “comparison”. Col. 20, lines 20-41: the N-best list may include a maximum number of top-scoring known entities, wherein selecting the maximum number of top-scoring known entities indicates that not only is every entity scored but also that only the ones with "a maximum score" are selected).
Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Jang (US 20140359523 A1), in view of Ramos et al. (US 11157696 B1), and in further view of Olstad et al. (US 20130132374 A1) and YAO et al. (US 20110307432 A1).
With regard to claim 4,
As discussed in claim 2, Jang and Ramos and Olstad teach all the limitations therein.
Jang and Ramos and Olstad do not teach
the method of claim 2, wherein the one or more alternate text representations comprises an alternate spelling of the identifier associated with the entity.
YAO teaches
the method of claim 2, wherein the one or more alternate text representations comprises an alternate spelling of the identifier associated with the entity ([0055]: indexed metadata for web pages includes entity name equivalents data, variations of an entity's name, and entity name misspellings, all of which correspond to "an alternate spelling" associated with the web pages).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jang and Ramos and Olstad to incorporate the teachings of YAO to make the one or more alternate text representations comprise an alternate spelling of the identifier associated with the entity. Doing so would provide improved search result relevance for name search queries as taught by YAO ([0005]).
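As a hedged illustration of the alternate-spelling metadata the examiner attributes to YAO, the sketch below resolves a misspelled query term to a canonical entity name; all entries are hypothetical.

```python
# Hedged sketch: indexed metadata links an entity's canonical name to
# name equivalents and common misspellings. All entries are hypothetical.

ALTERNATE_SPELLINGS = {
    "Beyoncé": ["beyonce", "beyonse", "bayonce"],
    "Rihanna": ["rianna", "rhianna", "riana"],
}

def canonical_name(spelling: str) -> str | None:
    # Resolve a possibly misspelled query term to the canonical identifier.
    s = spelling.lower()
    for name, variants in ALTERNATE_SPELLINGS.items():
        if s == name.lower() or s in variants:
            return name
    return None

print(canonical_name("rhianna"))  # -> "Rihanna"
```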
With regard to claim 14,
As discussed in claim 12, Jang and Ramos and Olstad teach all the limitations therein.
Jang and Ramos and Olstad do not teach
the system of claim 12, wherein the one or more alternate text representations comprises an alternate spelling of the identifier associated with the entity.
YAO teaches
the system of claim 12, wherein the one or more alternate text representations comprises an alternate spelling of the identifier associated with the entity ([0055]: indexed metadata for web pages includes entity name equivalents data, variations of an entity's name, and entity name misspellings, all of which correspond to "an alternate spelling" associated with the web pages).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jang and Ramos and Olstad to incorporate the teachings of YAO to make the one or more alternate text representations comprise an alternate spelling of the identifier associated with the entity. Doing so would provide improved search result relevance for name search queries as taught by YAO ([0005]).
Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Jang (US 20140359523 A1), in view of Ramos et al. (US 11157696 B1), and in further view of Olstad et al. (US 20130132374 A1) and Pore et al. (US 20190295527 A1).
With regard to claim 6,
As discussed in claim 2, Jang and Ramos and Olstad teach all the limitations therein.
Jang and Ramos and Olstad do not teach
the method of claim 2, wherein the metadata is generated by a text-to-speech module using a pronunciation setting for generating speech content; and
converting the generated speech content into text using a speech-to-text module.
Pore teaches
the method of claim 2, wherein the metadata is generated by a text-to-speech module using a pronunciation setting for generating speech content; and converting the generated speech content into text using a speech-to-text module (Abstract; [0035]-[0036]: convert a text message containing at least one phonemic spelling of a word into speech by running a text-to-speech application programming interface (API) with the text message as input. The converted speech may be input to a speech-to-text API and the speech-to-text API executed to convert the speech to text. Users of English language in different geographic locations may have different accents or pronunciations, and therefore, may pronounce or voice an English word based on phonemes particular to the geographic location. A speech-to-text API that can recognize the particular location's accents or pronunciation of words may provide for a more accurate conversion of speech into text).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jang and Ramos and Olstad to incorporate the teachings of Pore to generate the metadata by a text-to-speech module using a pronunciation setting for generating speech content, and convert the generated speech content into text using a speech-to-text module. Doing so would recognize a particular location's accents or pronunciations of words and provide a more accurate conversion of speech into text, as taught by Pore ([0035]).
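A structural sketch of the round trip the examiner attributes to Pore follows; both converter functions are stand-in stubs rather than real text-to-speech or speech-to-text engines, so the sketch only illustrates the claimed data flow, not an actual API.

```python
# Hedged, structural sketch: metadata text is rendered to speech under a
# pronunciation setting, then converted back to text. Both converters are
# stubs; real systems would call actual TTS/STT engines.

def text_to_speech(text: str, pronunciation_setting: str) -> bytes:
    # Stub: pretend to synthesize audio under a locale-specific setting.
    return f"{pronunciation_setting}:{text}".encode()

def speech_to_text(audio: bytes) -> str:
    # Stub: pretend to transcribe the synthesized audio back to text.
    return audio.decode().split(":", 1)[1]

def generate_alternate_representation(identifier: str, setting: str) -> str:
    # Round-trip the identifier to capture how a given pronunciation
    # setting would render it as recognized text (metadata generation).
    return speech_to_text(text_to_speech(identifier, setting))

print(generate_alternate_representation("Adele", "en-IN"))
```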
With regard to claim 16,
As discussed in claim 12, Jang and Ramos and Olstad teach all the limitations therein.
Jang and Ramos and Olstad do not teach
the system of claim 12, wherein the metadata is generated by a text-to-speech module using a pronunciation setting for generating speech content; and the system is configured to:
converting the generated speech content into text using a speech-to-text module.
Pore teaches
the system of claim 12, wherein the metadata is generated by a text-to-speech module using a pronunciation setting for generating speech content; and the system is configured to: converting the generated speech content into text using a speech-to-text module (Abstract; [0035]-[0036]: convert a text message containing at least one phonemic spelling of a word into speech by running a text-to-speech application programming interface (API) with the text message as input. The converted speech may be input to a speech-to-text API and the speech-to-text API executed to convert the speech to text. Users of English language in different geographic locations may have different accents or pronunciations, and therefore, may pronounce or voice an English word based on phonemes particular to the geographic location. A speech-to-text API that can recognize the particular location's accents or pronunciation of words may provide for a more accurate conversion of speech into text).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jang and Ramos and Olstad to incorporate the teachings of Pore to generate the metadata by a text-to-speech module using a pronunciation setting for generating speech content, and convert the generated speech content into text using a speech-to-text module. Doing so would recognize a particular location's accents or pronunciations of words and provide a more accurate conversion of speech into text, as taught by Pore ([0035]).
Claims 11 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Jang (US 20140359523 A1), in view of Ramos et al. (US 11157696 B1), and in further view of Olstad et al. (US 20130132374 A1) and Davallou (US 20060074892 A1).
With regard to claim 11,
As discussed in claim 2, Jang and Ramos and Olstad teach all the limitations therein.
Jang and Ramos and Olstad do not teach
the method of claim 2, wherein the text query is a first text query, and further comprising:
generating a plurality of text queries, wherein the plurality of text queries comprises the first text query, and wherein each text query of the plurality of text queries is generated based at least in part on a respective pronunciation setting of a speech-to-text module.
Davallou teaches
the method of claim 2, wherein the text query is a first text query, and further comprising:
generating a plurality of text queries, wherein the plurality of text queries comprises the first text query, and wherein each text query of the plurality of text queries is generated based at least in part on a respective pronunciation setting of a speech-to-text module ([0064]: generate the search string through a speech-to-text software application from the dictation of a user interacting with a database, such as through a telephone system or microphone. [0079]; [0069]: search for similar sounding words within the phonetic database for each word and generate a number of different combinations based on the approximate pronunciation of the text entered and the phonetically equivalent formulas, the result of which would be “generating a plurality of text queries” based on a pronunciation setting of a speech-to-text module).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jang and Ramos and Olstad to incorporate the teachings of Davallou to generate a plurality of text queries, wherein the plurality of text queries comprises the first text query, and wherein each text query of the plurality of text queries is generated based at least in part on a respective pronunciation setting of a speech-to-text module. Doing so would eventually find the correct result even though the original entry involved a different spelling, without the user having to respell or retype the entry, as taught by Davallou ([0069]).
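As a hedged illustration of Davallou's generation of a plurality of text queries, the sketch below expands each dictated word to similar-sounding spellings and takes their combinations; the phonetic-equivalents table is hypothetical.

```python
# Hedged sketch: each word of the dictated query is expanded to similar
# sounding spellings, and the combinations form the plurality of text
# queries. The phonetic-equivalents table is hypothetical.

from itertools import product

PHONETIC_EQUIVALENTS = {
    "steven": ["steven", "stephen", "stefan"],
    "king": ["king"],
}

def generate_text_queries(dictated: str) -> list[str]:
    # Expand each word to its similar-sounding variants and take the
    # cartesian product to generate the plurality of text queries.
    options = [PHONETIC_EQUIVALENTS.get(w, [w]) for w in dictated.lower().split()]
    return [" ".join(combo) for combo in product(*options)]

print(generate_text_queries("Steven King"))
# -> ['steven king', 'stephen king', 'stefan king']
```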
With regard to claim 21,
As discussed in claim 12, Jang and Ramos and Olstad teach all the limitations therein.
Jang and Ramos and Olstad do not teach
the system of claim 12, wherein the text query is a first text query, and further comprising:
generating a plurality of text queries, wherein the plurality of text queries comprises the first text query, and wherein each text query of the plurality of text queries is generated based at least in part on a respective pronunciation setting of a speech-to-text module.
Davallou teaches
the system of claim 12, wherein the text query is a first text query, and further comprising:
generating a plurality of text queries, wherein the plurality of text queries comprises the first text query, and wherein each text query of the plurality of text queries is generated based at least in part on a respective pronunciation setting of a speech-to-text module ([0064]: generate the search string through a speech-to-text software application from the dictation of a user interacting with a database, such as through a telephone system or microphone. [0079]; [0069]: search for similar sounding words within the phonetic database for each word and generate a number of different combinations based on the approximate pronunciation of the text entered and the phonetically equivalent formulas, the result of which would be “generating a plurality of text queries” based on a pronunciation setting of a speech-to-text module).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Jang and Ramos and Olstad to incorporate the teachings of Davallou to generate a plurality of text queries, wherein the plurality of text queries comprises the first text query, and wherein each text query of the plurality of text queries is generated based at least in part on a respective pronunciation setting of a speech-to-text module. Doing so would eventually find the correct result even though the original entry involved a different spelling, without the user having to respell or retype the entry, as taught by Davallou ([0069]).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIAOQIN HU whose telephone number is (571)272-1792. The examiner can normally be reached on Monday-Friday 7:00am-3:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Charles Rones can be reached on (571) 272-4085. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/XIAOQIN HU/Examiner, Art Unit 2168
/CHARLES RONES/Supervisory Patent Examiner, Art Unit 2168