Last updated: May 29, 2026

Application No. 18/386,942

ENTITY RESOLUTION USING AUDIO SIGNALS

Non-Final OA: §101, §103
Filed: Nov 03, 2023
Examiner: ARMSTRONG, ANGELA A
Art Unit: 2659
Tech Center: 2600 (Communications)
Assignee: Amazon Technologies, Inc.
OA Round: 1 (Non-Final)

Interview Optional

Interview lift (+8.2%) is below the 15.0% threshold. A written response is recommended.

Based on 646 resolved cases, 2023–2026

Examiner Intelligence

ARMSTRONG, ANGELA A

Grants 74% — above average

Career Allowance Rate

480 granted / 646 resolved

+12.3% vs TC avg

Interview Lift: +8.2% (moderate), resolved cases with interview vs. without

Typical timeline
Avg Prosecution: 3y 10m
18 currently pending

Career history
Total Applications: 672 (across all art units)

Statute-Specific Performance

§101: 11.3% (-28.7% vs TC avg)
§103: 69.4% (+29.4% vs TC avg)
§102: 8.7% (-31.3% vs TC avg)
§112: 3.8% (-36.2% vs TC avg)

Based on career data from 646 resolved cases

Office Action

§101 §103

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the preliminary amendment filed September 10, 2025.  Claims 1, 4, 8, 10, 12-13, 17, and 19 have been amended.  Claims 1-20 remain pending.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on September 10, 2025 is being considered by the examiner.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Claim 1 is directed to a method for executing a first action to control the first device in response to the first request. The method provides limitations for:
receiving a first audio signal comprising first speech, wherein the first speech includes a first request to control a first device, which is a data gathering step that can be achieved by a person hearing a request for a task to be performed;
generating, using an acoustic encoder, first embedding data representing the first audio signal, which can be achieved by the person, using pen and paper, writing down an interpretation of the command for the task to be performed;
generating, using a text encoder, second embedding data representing a first entity, wherein the first entity corresponds to the first device, which can be achieved by the person, using pen and paper, providing the device on which the task command will be acted upon;
generating a first modified embedding using a first multi-head attention mechanism, wherein the first multi-head attention mechanism uses first query data generated from the first embedding data and first key data and first value data generated from the second embedding data, which can be achieved via a mathematical algorithm to represent information/metadata or slot/filler data for performing the task;
generating a second modified embedding using a second multi-head attention mechanism, wherein the second multi-head attention mechanism uses second query data generated from the second embedding data and second key data and second value data generated from the first embedding data, which can be achieved via a mathematical algorithm to represent information/metadata or slot/filler data for performing the task;
generating a first averaged modified embedding across the temporal dimension of the first modified embedding, which can be achieved via a mathematical algorithm using pen and paper to process the information/metadata or slot/filler data for performing the task;
generating a second averaged modified embedding across the temporal dimension of the second modified embedding, which can be achieved via a mathematical algorithm using pen and paper to process the information/metadata or slot/filler data for performing the task;
generating first multi-modal embedding data using the first averaged modified embedding and the second averaged modified embedding, which can be achieved using pen and paper to combine the two sets of embeddings;
generating, by inputting the first multi-modal embedding data into at least one linear layer, first score data indicating a correspondence between the first audio signal and the first entity, which is a mathematical calculation that can be achieved via applying natural language modelling rules and principles to score the probability that the input audio refers to the determined entity;
determining that the first speech mentions the first device based on the first score data, which can be achieved by the person evaluating the score data and determining that the entity/device is referenced; and
executing a first action to control the first device in response to the first request, which can be achieved by the person performing the task on the determined device.
The recited limitations are directed to a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind (or pen and paper) but for the recitation of the generic computer.  If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.  Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application because the recited generic computer amounts to no more than mere instructions to apply the exception using generic computer components.  Accordingly, the elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.  The claims are directed to an abstract idea.  The claims are not patent eligible.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, as indicated with respect to integration of the abstract idea into a practical application, the additional elements of the generic computer and computer instructions to perform the various steps amount to no more than mere instructions to apply the exception using generic computer components.  Mere instructions to apply an exception using generic computer components cannot provide an inventive concept.  The claims are not patent eligible.
Dependent claims 2 and 3 do not integrate the judicial exception into a practical application and do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The limitations of the dependent claims are directed to steps of organizing or manipulating data for updating/adjusting model features/parameters that utilize natural language processing rules and principles to analyze/process audio/text using mathematical calculations.  The limitations of the dependent claims are steps that can be achieved via mental processing and/or using pen and paper.
Claims 4 and 13 are directed to a method and system for performing a first action.  The claims provide limitations for:
receiving first audio data representing speech comprising a natural language request that comprises a mention of a first entity, which is a data gathering step that can be achieved by a person hearing a request for a task to be performed;
generating first embedding data representing the first audio data, which can be achieved by the person, using pen and paper, writing down an interpretation of the command for the task to be performed;
determining second embedding data representing the first entity, which can be achieved by the person, using pen and paper, providing the device on which the task command will be acted upon;
generating a first modified embedding using a first attention mechanism to compare the first embedding data to the second embedding data, which can be achieved via a mathematical algorithm to represent information/metadata or slot/filler data for performing the task;
generating a first averaged modified embedding across the temporal dimension of the first modified embedding, which can be achieved via a mathematical algorithm using pen and paper to process the information/metadata or slot/filler data for performing the task;
determining, based at least in part on the first averaged modified embedding, that the first audio data comprises the mention of the first entity, which can be achieved by the person evaluating the embeddings and determining that the entity/device is referenced; and
performing a first action with respect to the first entity in response to the natural language request, which can be achieved by the person performing the task on the determined device.
The recited limitations are directed to a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind (or pen and paper) but for the recitation of the generic computer, apparatus, medium, and generic computer components (memory, processor).  If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas.  Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application because the recited generic computer, system, medium, and generic computer components (memory, processor) and computer instructions amount to no more than mere instructions to apply the exception using generic computer components.  Accordingly, the elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.  The claims are directed to an abstract idea.  The claims are not patent eligible.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, as indicated with respect to integration of the abstract idea into a practical application, the additional elements of the generic computer, system, medium, and generic computer components (memory, processor) and computer instructions to perform the various steps amount to no more than mere instructions to apply the exception using generic computer components.  Mere instructions to apply an exception using generic computer components cannot provide an inventive concept.  The claims are not patent eligible.
Dependent claims 5-12 and 14-20 do not integrate the judicial exception into a practical application and do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The limitations of the dependent claims are directed to steps of organizing or manipulating data; updating/adjusting model features/parameters that utilize natural language processing rules and principles to analyze/process audio/text using mathematical calculations; and understanding speech and providing transcriptions via pen and paper.  The limitations of the dependent claims are steps that can be achieved via mental processing and/or using pen and paper.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4-8, 11-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Shabat et al (US Patent Application Publication No. 2024/0203404).
Shabat teaches large language model-based spoken language understanding systems to leverage both audio data and textual data in processing spoken utterances.  Regarding claim 4, Shabat teaches a method comprising: receiving first audio data representing speech comprising a natural language request that comprises a mention of a first entity [para 0033-0034 – audio data 150]; generating first embedding data representing the first audio data [para 0036-0038; 0054-0065; 0066-0073 – audio encoder]; determining second embedding data representing the first entity [para 0036-0038; 0054-0065; 0066-0073 – text encoder]; generating a first modified embedding using a first attention mechanism to compare the first embedding data to the second embedding data [para 0036-0038; 0054-0065; 0066-0073 – attention module]; determining, based at least in part on the first modified embedding, that the first audio data comprises the mention of the first entity [para 0036-0038; 0054-0065; 0066-0075  ---- determined intent]; and performing a first action with respect to the first entity in response to the natural language request [para 0075 – one or more actions performed].
Shabat fails to specifically teach generating a first averaged modified embedding across the temporal dimension of the first modified embedding.  However, Shabat teaches computing a weighted sum of audio embeddings [para 0062], and finding an optimum functional relationship of known parameters [where the weights of the embeddings are all equal, making the weighted sum an average] is an obvious step requiring only routine skill in the art.  Therefore, one having ordinary skill in the art at the time of the invention would have been able to determine an optimum functional relationship of the embeddings, as was well known, so as to provide the full context of embeddings to detect the user’s intent, making the system more reliable with improved accuracy and performance in detecting the user’s intent.
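The arithmetic premise in the bracketed remark above (a weighted sum with all weights equal is an average) can be checked numerically. The following is only an illustrative sketch: the array sizes, the `weighted_sum` helper, and the non-uniform weights are hypothetical and are not taken from Shabat or from the application, and the sketch takes no position on whether the obviousness conclusion follows.

```python
import numpy as np

# Hypothetical sequence of 5 time steps, each a 4-dimensional embedding.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 4))

def weighted_sum(embeddings, weights):
    """Weighted sum over the temporal axis: sum_t weights[t] * embeddings[t]."""
    return weights @ embeddings

num_steps = embeddings.shape[0]

# Equal weights 1/T reduce the weighted sum to the temporal mean.
uniform = np.full(num_steps, 1.0 / num_steps)
pooled_uniform = weighted_sum(embeddings, uniform)
temporal_mean = embeddings.mean(axis=0)

# Non-uniform weights (made up for illustration) generally give a different vector.
non_uniform = np.array([0.6, 0.1, 0.1, 0.1, 0.1])
pooled_non_uniform = weighted_sum(embeddings, non_uniform)

print(np.allclose(pooled_uniform, temporal_mean))
print(np.allclose(pooled_non_uniform, temporal_mean))
```

A response that contests this rejection might focus on whether Shabat's weights are ever uniform in practice, since the identity holds only in that special case.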
Regarding claim 5, Shabat teaches generating, by the first attention mechanism, first query data generated from the first embedding data [para 0036-0038; 0054-0065; 0066-0073 – attention module] and generating, by the first attention mechanism, first key data and first value data generated from the second embedding data, wherein the first modified embedding is generated based at least in part on the first query data, the first key data, and the first value data [para 0036-0038; 0054-0065; 0066-0073 – The attention module 322 can attend the text encodings over to the audio encodings. For instance, the attention module 322 can take attention from a particular text encoding in the sequence of text encodings (corresponding to a particular token in the textual data) over each one of the audio encodings in the sequence of audio encodings to determine a context vector for each text encoding].
Regarding claim 6, Shabat teaches generating a second modified embedding using a second attention mechanism to compare the first embedding data to the second embedding data [Figs 3A-3D – attention module 341]; generating a fused embedding by combining the first modified embedding and the second modified embedding [Figs 3A-3D – fused encoding]; and determining that the first audio data comprises the mention of the first entity based at least in part on the fused embedding [para 0036-0038; 0042-005; 0054-0065; 0066-0073 – determined intent].
Regarding claim 7, Shabat teaches  generating, by the second attention mechanism, second query data generated from the second embedding data [Figs 3A-3D – query vector]; and generating, by the second attention mechanism, second key data and second value data generated from the first embedding data [para 0036-0038; 0042-005; 0054-0065; 0066-0073 – corresponding slot values], wherein the fused embedding is generated based at least in part on the second query data, the second key data, and the second value data [Figs 3A-3D – fused encoding].
Regarding claim 8, Shabat teaches generating, for each entity of a plurality of entities, a respective entity embedding [para 0036-0038; 0054-0065; 0066-0073 – plurality of embeddings]; determining, using the first attention mechanism, at least one attention score for each entity embedding [para 0036-0038; 0054-0065; 0066-0073 – softmax processing… probability distribution indicative of one or more intent]; and selecting a first entity embedding corresponding to the first entity based at least in part on the attention score for the first entity embedding [para 0028; 0036-0038; 0054-0065; 0066-0073 – softmax processing… probability distribution indicative of one or more intent].
Regarding claim 11, Shabat teaches sending first entity data representing that the first audio data comprises the mention of the first entity to a large language model (LLM)  [para 0068-0069 – LLM processing]; and sending a first transcription of the speech to the LLM [para 0068-0069 – textual data input to LLM].
Regarding claim 12, Shabat teaches generating first entity data representing the determination that the first audio data comprises the mention of the first entity [para 0036-0038; 0042-005; 0054-0065; 0066-0073]; determining second entity data representing a determination that a transcription of the first audio data generated by a first automatic speech recognition (ASR) model comprises a second entity different from the first entity [para 0036-0038; 0042-005; 0054-0065; 0066-0073 –intent 206/intent 207]; and updating the ASR model based at least in part on a difference between the first entity and the second entity [para 0040-0041; 0069 – domain specific training].
Regarding claims 13-17 and 20, the claims are directed to a system reciting features similar in scope and content to method claims 4-8 and 11-12, and are rejected under similar rationale.

Claims 9-10 and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Shabat in view of Xu et al (“Adaptive Contextual Biasing for Transducer Based Streaming Speech recognition”, 2023), hereinafter Xu.

Regarding claims 9 and 18, Shabat teaches a trained encoder model [Fig 1A – trainer 110].  Shabat fails to teach the details of the training, but Xu teaches training an encoder model to generate the first modified embedding using a cross-modality alignment loss, wherein the cross-modality alignment loss trains the encoder model to identify multi-modal embeddings that represent audio mentioning a second entity and ground truth entity data for the second entity [section 2.1 – predictor entity detector continuing to 2.3 Adaptive Contextual Inference – cross-entropy loss], and suggests the techniques provide improved performance in speech recognition.  Therefore, one having ordinary skill in the art at the time of the invention would have recognized the advantages of implementing the encoder training techniques suggested by Xu, and the results would have been predictable and provided an improved performance of the ASR of the system.

Regarding claims 10 and 19, the combination of Shabat and Xu teaches training an encoder model to generate the first modified embedding using a discriminative loss, wherein the discriminative loss trains the encoder model to generate similar multi-modal embedding data for different input audio samples that refer to the same entity and different multi-modal embedding data for different input audio samples that refer to different entities [section 2.1 – predictor entity detector continuing to 2.3 Adaptive Contextual Inference – entity predictor based].
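To make the claimed behavior of a discriminative loss concrete (similar embeddings for audio samples that refer to the same entity, dissimilar embeddings for different entities), here is a minimal sketch of one common formulation, a pairwise margin-based contrastive loss. This is an assumption chosen for illustration only: neither the application nor Xu is confirmed to use this formula, and the function name, the `margin` parameter, and the toy data are hypothetical.

```python
import numpy as np

def pairwise_discriminative_loss(embeddings, entity_ids, margin=1.0):
    """Hypothetical contrastive loss averaged over all pairs of samples.

    Same-entity pairs are penalized by their squared distance (pull together);
    different-entity pairs are penalized only when closer than `margin`
    (push apart).
    """
    total, pairs = 0.0, 0
    n = len(entity_ids)
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(embeddings[i] - embeddings[j])
            if entity_ids[i] == entity_ids[j]:
                total += dist ** 2
            else:
                total += max(0.0, margin - dist) ** 2
            pairs += 1
    return total / pairs if pairs else 0.0

# Toy entity labels for three audio samples (made up for illustration).
ids = ["lamp", "lamp", "thermostat"]

# Embeddings that cluster by entity: the two "lamp" samples sit close together.
clustered = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])
# Embeddings that do not: a "lamp" sample sits next to the "thermostat" sample.
mixed = np.array([[0.0, 0.0], [3.0, 3.0], [0.1, 0.0]])

loss_clustered = pairwise_discriminative_loss(clustered, ids)
loss_mixed = pairwise_discriminative_loss(mixed, ids)
print(loss_clustered < loss_mixed)
```

Minimizing a loss of this shape rewards the clustered arrangement, which is the property the claim language describes.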


Allowable Subject Matter
Claims 1-3  would be allowable if the rejections under 35 USC 101 can be overcome.
The following is a statement of reasons for the indication of allowable subject matter:  the combination of the prior art of Shabat and Xu fails to specifically teach or disclose  
generating a first averaged modified embedding across a temporal dimension of a first modified embedding, where the first modified embedding is generated by using a first multi-head attention mechanism, wherein the first multi-head attention mechanism uses first query data generated from first embedding data and first key data and first value data generated from a second embedding data; 
generating a second averaged modified embedding across the temporal dimension of the second modified embedding, where the second modified embedding is generated using a second multi-head attention mechanism, wherein the second multi-head attention mechanism uses second query data generated from second embedding data and second key data and second value data generated from the first embedding data; 
generating first multi-modal embedding data using the first averaged modified embedding and the second averaged modified embedding; 
generating, by inputting the first multi-modal embedding data into at least one linear layer, first score data indicating a correspondence between the first audio signal and the first entity; 
 determining that the first speech mentions the first device based on the first score data; and executing a first action to control the first device in response to the first request.
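For readers mapping the limitations listed above onto an implementation, the following NumPy sketch walks through the same sequence: two cross-attention passes in opposite directions, averaging over the temporal dimension, combining the two averages, and scoring with a linear layer. It is a sketch under stated assumptions, not the applicant's implementation: the random stand-in embeddings, the weight initialization, the choice of concatenation to combine the averages, the sigmoid, and the 0.5 threshold are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query_src, kv_src, params, num_heads):
    """Queries come from query_src; keys and values come from kv_src."""
    w_q, w_k, w_v, w_o = params
    d_model = w_q.shape[1]
    d_head = d_model // num_heads

    def split_heads(x):  # (T, d_model) -> (num_heads, T, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(query_src @ w_q)
    k = split_heads(kv_src @ w_k)
    v = split_heads(kv_src @ w_v)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (H, Tq, Tk)
    heads = weights @ v                                            # (H, Tq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(query_src.shape[0], d_model)
    return merged @ w_o                                            # (Tq, d_model)

def init_attention(rng, d_model):
    # Hypothetical random projections standing in for trained weights.
    return tuple(rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                 for _ in range(4))

def score_entity(audio_emb, text_emb, attn_audio_to_text, attn_text_to_audio,
                 w, b, num_heads=4):
    # First modified embedding: queries from audio, keys/values from text.
    first_mod = multi_head_cross_attention(audio_emb, text_emb,
                                           attn_audio_to_text, num_heads)
    # Second modified embedding: queries from text, keys/values from audio.
    second_mod = multi_head_cross_attention(text_emb, audio_emb,
                                            attn_text_to_audio, num_heads)
    # Average each modified embedding across its temporal (sequence) axis.
    first_avg = first_mod.mean(axis=0)
    second_avg = second_mod.mean(axis=0)
    # Combine into one multi-modal vector (concatenation is an assumption).
    multimodal = np.concatenate([first_avg, second_avg])
    # One linear layer followed by a sigmoid gives a score in [0, 1].
    return 1.0 / (1.0 + np.exp(-(multimodal @ w + b)))

rng = np.random.default_rng(0)
d_model = 16
audio_emb = rng.normal(size=(50, d_model))  # stand-in for acoustic encoder output
text_emb = rng.normal(size=(3, d_model))    # stand-in for text encoder output
attn_a2t = init_attention(rng, d_model)
attn_t2a = init_attention(rng, d_model)
w = rng.normal(scale=0.1, size=2 * d_model)
score = score_entity(audio_emb, text_emb, attn_a2t, attn_t2a, w, b=0.0)
mentions_entity = bool(score > 0.5)         # hypothetical decision threshold
print(0.0 <= score <= 1.0)
```

Note that the two attention passes produce sequences of different lengths (one per audio frame, one per text token), which is why each is averaged over time before the two are combined.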
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANGELA A ARMSTRONG whose telephone number is (571)272-7598. The examiner can normally be reached M,T,TH,F 11:30-8:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

ANGELA A. ARMSTRONG
Primary Examiner
Art Unit 2659





Prosecution Timeline

Nov 03, 2023

Application Filed

Apr 15, 2026

Non-Final Rejection mailed — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

17/924,466, Patent 12640146: INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD AND RECORDING MEDIUM. Granted May 26, 2026 (3y 6m to grant)
18/183,522, Patent 12640140: ELECTRONIC APPARATUS AND CONTROLLING METHOD THEREOF. Granted May 26, 2026 (3y 2m to grant)
18/132,793, Patent 12626704: DETECTING VISUAL ATTENTION DURING USER SPEECH. Granted May 12, 2026 (3y 1m to grant)
18/089,392, Patent 12608566: Method and Apparatus for Selecting Sample Corpus Used to Optimize Translation Model. Granted Apr 21, 2026 (3y 3m to grant)
18/240,480, Patent 12602547: DOMAIN ADAPTING GRAPH NETWORKS FOR VISUALLY RICH DOCUMENTS. Granted Apr 14, 2026 (2y 7m to grant)

Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 74%
With Interview (+8.2%): 82%
Median Time to Grant: 3y 10m (~1y 3m remaining)
PTA Risk: Low

Based on 646 resolved cases by this examiner. Grant probability derived from career allowance rate.
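The headline percentages can be reproduced from the counts shown on this page. The sketch below assumes, without confirmation from the page, that the interview figure is obtained by adding the 8.2-point lift to the rounded allowance rate; the variable names are illustrative only.

```python
# Counts and lift as displayed on this page.
granted = 480
resolved = 646
interview_lift_points = 8.2

# Career allowance rate as a percentage of resolved cases.
allowance_rate = 100.0 * granted / resolved
displayed_rate = round(allowance_rate)

# Assumption: the lift is added, in percentage points, to the rounded rate.
displayed_with_interview = round(displayed_rate + interview_lift_points)

print(f"{allowance_rate:.1f}% -> displayed {displayed_rate}%")
print(f"with interview: displayed {displayed_with_interview}%")
```

Under that assumption the results match the 74% and 82% figures shown above; adding the lift to the unrounded rate instead would give about 82.5%.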