Prosecution Insights
Last updated: April 19, 2026
Application No. 18/756,800

STREAMING SPEECH-TO-TEXT SYSTEM

Non-Final OA: §103, §112
Filed: Jun 27, 2024
Examiner: MUELLER, PAUL JOSEPH
Art Unit: 2657
Tech Center: 2600 — Communications
Assignee: Amazon Technologies, Inc.
OA Round: 1 (Non-Final)
Grant Probability: 76% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 0m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 76% (97 granted / 128 resolved; +13.8% vs TC avg, above average)
Interview Lift: +34.6% among resolved cases with interview (strong)
Typical Timeline: 3y 0m avg prosecution; 25 currently pending
Career History: 153 total applications across all art units

Statute-Specific Performance

§101: 13.2% (-26.8% vs TC avg)
§103: 62.2% (+22.2% vs TC avg)
§102: 7.4% (-32.6% vs TC avg)
§112: 14.8% (-25.2% vs TC avg)
Deltas are measured against Tech Center average estimates. Based on career data from 128 resolved cases.
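As a quick consistency check, the per-statute deltas above all imply the same Tech Center baseline. A small sketch, assuming the delta is simply the examiner's rate minus the TC average (that convention is an assumption, not stated by the dashboard):

```python
# Recover the implied Tech Center average from each per-statute rate and
# its "vs TC avg" delta, assuming delta = examiner rate - TC average.
rates = {
    "101": (13.2, -26.8),
    "103": (62.2, +22.2),
    "102": (7.4, -32.6),
    "112": (14.8, -25.2),
}

for statute, (examiner_pct, delta_pct) in rates.items():
    tc_avg = examiner_pct - delta_pct
    # Every statute implies the same ~40% baseline estimate.
    print(f"§{statute}: examiner {examiner_pct}% vs TC avg {tc_avg:.1f}%")
```

Under that assumption, all four statutes point to a single 40% Tech Center average estimate, which suggests the dashboard uses one common baseline rather than per-statute averages.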

Office Action

Grounds of rejection: §103, §112
DETAILED ACTION

Introduction
This office action is in response to Applicant's submission filed on June 27, 2024. Claims 1-20 are pending in the application. As such, claims 1-20 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings
The drawings were received on June 27, 2024. These drawings have been accepted and considered by the Examiner.

Claim Objections
Claims 3, 11 and 12 are objected to because of the following informalities: Claim 3, line 2 reads "the compute instance"; Examiner believes this to be a clerical error intended to read "the determined compute instance", for consistency with the entire claim set. Claim 11, line 3 reads "the compute instance"; Examiner believes this to be a clerical error intended to read "the determined compute instance", for consistency with the entire claim set. Claim 12, line 2 reads "the compute instance"; Examiner believes this to be a clerical error intended to read "the determined compute instance", for consistency with the entire claim set. Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 17 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. Claim 17, line 7 recites the limitation "the request". There is insufficient antecedent basis for this limitation in the claim. Claim 17, line 8 recites the limitation "the transcript". There is insufficient antecedent basis for this limitation in the claim.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 4-7, 10, 13-14 and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over Stefani et al. (US Patent No. 10777186 B1), hereinafter Stefani, in view of Farinelli et al. (US Patent Pub. No. 20200251115 A1), hereinafter Farinelli.
Regarding claims 1, 4 and 17, Stefani teaches a computer-implemented method and system (Stefani in [col 11 lines 34-51] teaches using a computer system for a method) comprising:

[claim 17 only] a first one or more computing devices to implement a storage service in a multi-tenant provider network (Stefani in [col 14 lines 50-63] teaches providing multiple computation resources (e.g., VMs) to customers); and

[claim 17 only] a second one or more computing devices to implement a speech-to-text service in the multi-tenant provider network, the speech-to-text service including instructions that upon execution cause the speech-to-text service to (Stefani in [col 11 lines 34-51] teaches the operations are performed under the control of one or more computer systems configured with executable instructions and are implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware or combinations thereof, and the code is stored on a computer-readable storage medium, for example, in the form of a computer program comprising instructions executable by one or more processors):

receiving a request to perform speech-to-text using a speech-to-text service to generate a transcript from an audio stream (Stefani in [col 9 lines 25-36] teaches receiving a request to perform ASR, and in [col 9 ln 62 – col 10 ln 9] teaches transcribing the audio data stream);

performing speech-to-text according to the request using the speech-to-text service to generate the transcript from the audio stream by: determining a compute instance to send the request to based, at least in part, on availability information maintained, by a backend of the speech-to-text service, in a distributed routing cache for a plurality of compute instances and types of speech-to-text processing indicated by the request (Stefani in [col 9 lines 25-36] teaches using a load balancer to determine availability of services in a fleet of services and selecting one, in [col 10 lines 36-52] teaches using a backend system, and in [col 4 lines 17-37] teaches choosing a specific type of decoder based on the type of audio stream);

sending the request to the determined compute instance (Stefani in [col 9 ln 62 – col 10 ln 9] teaches connecting to the decoder host using the reference provided by the host management service and beginning to stream audio data for transcription); and

processing the request using the determined compute instance to generate the transcript (Stefani in [col 9 ln 62 – col 10 ln 9] teaches connecting to the decoder host using the reference provided by the host management service and beginning to stream audio data for transcription); and

providing the transcript as indicated by the request (Stefani in [col 9 ln 62 – col 10 ln 9] teaches the decoder host can return the transcription over the same bi-directional connection to the frontend service, which may return the transcription to the client device via the load balancer).

Stefani does not teach, however Farinelli teaches, [wherein the determined compute instance is to utilize a model cache to] dynamically switch speech-to-text models (Farinelli in [0002] teaches ranking multiple speech-to-text models under varying audio distortion types to continually select the most accurate model at time t). Farinelli is considered to be analogous to the claimed invention because it is in the same field of using speech-to-text models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Stefani further in view of Farinelli to allow for continually selecting the most accurate model. Motivation to do so would be to provide high quality speech recognition at the specified quality level of the audio source (Farinelli [0018]).
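The routing limitation at the heart of claims 1, 4 and 17 can be illustrated with a minimal sketch. The cache layout, instance names, and selection heuristic below are hypothetical, invented for illustration; they are not taken from the claims or from Stefani:

```python
# Hypothetical sketch of the claimed routing step: select a compute
# instance from an (in-memory stand-in for a distributed) routing cache,
# based on slot availability and the speech-to-text operation types the
# request asks for. All names and the max-free-slots heuristic are
# illustrative assumptions.
routing_cache = [
    {"instance": "i-01", "free_slots": 0, "ops": {"asr", "punctuation"}},
    {"instance": "i-02", "free_slots": 3, "ops": {"asr"}},
    {"instance": "i-03", "free_slots": 1, "ops": {"asr", "diarization"}},
]

def route(requested_ops):
    """Return the instance with the most free slots that supports every
    requested operation type, or None if no instance qualifies."""
    candidates = [e for e in routing_cache
                  if e["free_slots"] > 0 and requested_ops <= e["ops"]]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e["free_slots"])["instance"]

print(route({"asr"}))                 # i-02: most free slots for plain ASR
print(route({"asr", "diarization"}))  # i-03: only instance with diarization
```

The point of the sketch is only that routing depends on both availability information and the requested operation types, which is what distinguishes the claimed cache lookup from a plain load balancer.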
Regarding claims 2, 5 and 19, Stefani, as modified above, teaches the computer-implemented method and system of claims 1, 4 and 17. Stefani further teaches wherein the request includes one or more of: an indication of a language being used; an indication of a latency that is acceptable (Stefani in [col 5 lines 7-32] teaches the user can indicate an acceptable latency time, which the decoder host can convert to an input window length); an indication of where the audio stream is located; [claims 5 and 19 only] an indication of one or more custom models to use for speech-to-text operations; and/or an indication of the types of speech-to-text operations to perform.

Regarding claim 6, Stefani, as modified above, teaches the computer-implemented method of claim 4. Stefani, as modified above, does not teach, however Farinelli teaches, wherein the indication of the types of speech-to-text operations to perform includes an automatic speech recognition (ASR) operation to be performed by an ASR model (Farinelli in [0002] teaches ranking multiple speech-to-text models under varying audio distortion types to continually select the most accurate model at time t, and in [0029] teaches using multiple types of speech-to-text models). Farinelli is considered to be analogous to the claimed invention because it is in the same field of using speech-to-text models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Stefani, as modified above, further in view of Farinelli to allow for continually selecting the most accurate model. Motivation to do so would be to provide high quality speech recognition at the specified quality level of the audio source (Farinelli [0018]).

Regarding claim 7, Stefani, as modified above, teaches the computer-implemented method of claim 4. Stefani further teaches wherein the indication of the types of speech-to-text operations to perform includes a punctuation operation to add punctuation to the transcript to be performed by a punctuation model (Stefani in [col 2 lines 1-20] teaches the results are then punctuated and normalized, and the resulting transcript is then streamed back to the user over the bi-directional connection).

Regarding claim 10, Stefani, as modified above, teaches the computer-implemented method of claim 4. Stefani further teaches wherein the indication of the types of speech-to-text operations to perform includes a speech segmentation model to identify boundaries of at least words in the audio stream (Stefani in [col 2 lines 1-20] teaches the streaming ASR engine can analyze chunks of the audio data stream using an acoustic model to divide the audio data into words, and a language model to identify sentences made of the words spoken in the audio file).

Regarding claims 13 and 18, Stefani, as modified above, teaches the computer-implemented method and system of claims 4 and 17. Stefani further teaches wherein the speech-to-text service supports a plurality of languages and types of models (Stefani in [col 3 lines 23-33] teaches various languages may be supported, including English and Spanish speech-to-text conversion).

Regarding claim 14, Stefani, as modified above, teaches the computer-implemented method of claim 4.
Stefani further teaches wherein the distributed routing cache for a plurality of compute instances is maintained by a backend of the speech-to-text service, which updates the distributed routing cache for a plurality of compute instances upon model availability (Stefani in [col 12 lines 8-17] teaches the reference associated with the decoder host can be requested from a host management service, the host management service caching a list of decoder hosts and their availability, and after the reference has been provided, the list of decoder hosts can be updated to indicate the decoder host is in use).

Regarding claim 16, Stefani, as modified above, teaches the computer-implemented method of claim 4. Stefani further teaches further comprising: autoscaling compute instances based at least in part on information in the distributed routing cache for busy versus free slots of each available compute instance (Stefani in [col 10 lines 53-67] teaches the host fleet monitor can configure an autoscaling policy for the decoder hosts. The autoscaling policy may increase or decrease the number of decoder hosts in the decoder host fleet based on utilization. In some embodiments, the autoscaling policy may define that the decoder fleet is to scale up when utilization is greater than 85%, and to scale down when utilization is less than 55%. The autoscaling policies may define the number of hosts to add or remove, which may be the same or different numbers of hosts, and how long to wait between scaling up or scaling down before checking utilization and scaling again. In some embodiments, the autoscaling policy may cause the decoder fleet to be scaled based on predicted future demand).

Claims 3 and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Stefani, in view of Farinelli, in further view of Boldyrev et al. (US Patent Pub. No. 20120166645 A1), hereinafter Boldyrev.
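The Stefani autoscaling policy cited against claim 16 (scale up above 85% utilization, scale down below 55%) can be sketched in a few lines. Only the two thresholds come from the quoted passage; the function shape and slot accounting are illustrative assumptions:

```python
# Hypothetical sketch of a threshold-based autoscaling decision over
# busy vs. free slots, using the 85%/55% utilization bounds quoted from
# Stefani [col 10 lines 53-67]. Everything besides those two thresholds
# is an illustrative assumption.
SCALE_UP_UTIL = 0.85
SCALE_DOWN_UTIL = 0.55

def scaling_decision(busy_slots, total_slots):
    utilization = busy_slots / total_slots
    if utilization > SCALE_UP_UTIL:
        return "scale_up"
    if utilization < SCALE_DOWN_UTIL:
        return "scale_down"
    return "hold"

print(scaling_decision(9, 10))   # 90% busy -> scale_up
print(scaling_decision(5, 10))   # 50% busy -> scale_down
print(scaling_decision(7, 10))   # 70% busy -> hold
```

The band between the two thresholds prevents the fleet from oscillating: between 55% and 85% utilization the policy holds rather than reacting to every fluctuation.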
Regarding claim 3, Stefani, as modified above, teaches the computer-implemented method of claim 1. Stefani, as modified above, does not teach, however Boldyrev teaches, further comprising: re-balancing one or more models of the compute instance by storing state of, and terminating, at least one model and restoring state of, and starting, at least one model (Boldyrev in [0074] teaches the pausing and resumption may be performed in response to an unsatisfactory level of capabilities, for example to perform load balancing and optimize computation cost). Boldyrev is considered to be analogous to the claimed invention because it is in the same field of load balancing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Stefani, as modified above, further in view of Boldyrev to allow for pausing and resumption to perform load balancing. Motivation to do so would be to allow a system to minimize or significantly improve data migration within a computational architecture by providing multi-level distributed computations, such that data can be migrated to the closest possible computation level with minimized or improved cost (Boldyrev [0003]).

Regarding claim 11, Stefani, as modified above, teaches the computer-implemented method of claim 4. Stefani further teaches re-balancing based at least in part on speech-to-text traffic (Stefani in [col 11 lines 4-20] teaches identifying spikes in traffic associated with the current transcription jobs). Stefani, as modified above, does not teach, however Boldyrev teaches, further comprising: re-balancing [based at least in part on speech-to-text traffic] one or more models of the compute instance by storing state of, and stopping, at least one model and restoring state of, and re-starting, at least one model (Boldyrev in [0074] teaches the pausing and resumption may be performed in response to an unsatisfactory level of capabilities, for example to perform load balancing and optimize computation cost, and in [0086] teaches executable states can be transferred to the next computational branch [this maps to storing and restoring state]). Boldyrev is considered to be analogous to the claimed invention because it is in the same field of load balancing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Stefani, as modified above, further in view of Boldyrev to allow for pausing and resumption to perform load balancing. Motivation to do so would be to allow a system to minimize or significantly improve data migration within a computational architecture by providing multi-level distributed computations, such that data can be migrated to the closest possible computation level with minimized or improved cost (Boldyrev [0003]).

Regarding claim 12, Stefani, as modified above, teaches the computer-implemented method of claim 11. Stefani, as modified above, teaches that the state is stored (see claim 11).
Stefani further teaches wherein [the state is stored] to local memory of the compute instance (Stefani in [col 15 lines 13-33] teaches a virtualized data store gateway may be provided at the customer network that may locally cache at least some data, for example frequently-accessed or critical data, and that may communicate with the storage service via one or more communications channels to upload new or modified data from a local cache so that the primary store of data (virtualized data store) is maintained).

Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Stefani, in view of Farinelli, in further view of Jung et al. (US Patent Pub. No. 20220254351 A1), hereinafter Jung.

Regarding claim 8, Stefani, as modified above, teaches the computer-implemented method of claim 4. Stefani, as modified above, does not teach, however Jung teaches, wherein the indication of the types of speech-to-text operations to perform includes a diarization operation to add one or more speakers to the transcript to be performed by a diarization model (Jung in [0005] teaches performing a speech-based speaker diarization). Jung is considered to be analogous to the claimed invention because it is in the same field of using diarization models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Stefani, as modified above, further in view of Jung to allow for performing a speech-based speaker diarization. Motivation to do so would be that, by using a speaker change status based on text in which context is considered for correcting the speaker diarization, it is possible to resolve recognition errors found in existing speaker diarization technologies (Jung [0056]).

Regarding claim 9, Stefani, as modified above, teaches the computer-implemented method of claim 8. Stefani, as modified above, does not teach, however Jung teaches, wherein the indication of the types of speech-to-text operations to perform includes a speaker error correction operation to correct alignment of the one or more speakers to the transcript to be performed by a speaker error correction model (Jung in [0005] teaches correcting speaker diarization that may correct a point of a speaker change error by detecting a speaker change based on recognized text after performing a speech-based speaker diarization). Jung is considered to be analogous to the claimed invention because it is in the same field of using diarization models. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Stefani, as modified above, further in view of Jung to allow for performing a speech-based speaker diarization. Motivation to do so would be that, by using a speaker change status based on text in which context is considered for correcting the speaker diarization, it is possible to resolve recognition errors found in existing speaker diarization technologies (Jung [0056]).

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Stefani, in view of Farinelli, in further view of Pai Brahmavar Pattanshet et al. (US Patent Pub. No. 20240263958 A1), hereinafter Pai.

Regarding claim 15, Stefani, as modified above, teaches the computer-implemented method of claim 4. Stefani, as modified above, does not teach, however Pai teaches, wherein the distributed routing cache has a plurality of slot types based on complexity of the STT operations to perform (Pai in [0094] teaches determining one or more candidate content slots based on complexity). Pai is considered to be analogous to the claimed invention because it is in the same field of slot management.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Stefani, as modified above, further in view of Pai to allow for determining one or more candidate content slots based on complexity. Motivation to do so would be to allow for determining a time during the route to render the content item and allow for more relevant content to be displayed to a user at the desired time (Pai [0045]).

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Stefani, in view of Farinelli, in further view of Modi et al. (US Patent Pub. No. 20210392136 A1), hereinafter Modi.

Regarding claim 20, Stefani, as modified above, teaches the system of claim 17. Stefani, as modified above, does not teach, however Modi teaches, further comprising: a chat service to receive the audio stream (Modi in [0078] teaches a system which uses a voice chat messaging service, and transcribing an audio stream). Modi is considered to be analogous to the claimed invention because it is in the same field of transcription. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Stefani, as modified above, further in view of Modi to allow for using a voice chat messaging service and transcribing an audio stream. Motivation to do so would be that, in response to receiving validation-related information, the application component can allow the communication device and/or associated user access to the service and the application component, in accordance with the subscriber or account status, service plan, etc., associated with the communication device and/or user (Modi [0034]).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL J. MUELLER, whose telephone number is (571) 272-1875. The examiner can normally be reached M-F 9:00am-5:00pm (Eastern).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Daniel C. Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PAUL J. MUELLER/
Examiner, Art Unit 2657

Prosecution Timeline

Jun 27, 2024
Application Filed
Feb 12, 2026
Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597419: NATURAL LANGUAGE PROCESSING APPARATUS AND NATURAL LANGUAGE PROCESSING METHOD (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596867: Detecting Computer-Generated Hallucinations using Progressive Scope-of-Analysis Enlargement (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596886: PERSONALIZED RESPONSES TO CHATBOT PROMPT BASED ON EMBEDDING SPACES BETWEEN USER AND SOCIETY (granted Apr 07, 2026; 2y 5m to grant)
Patent 12579378: USING LLM FUNCTIONS TO EVALUATE AND COMPARE LARGE TEXT OUTPUTS OF LLMS (granted Mar 17, 2026; 2y 5m to grant)
Patent 12562174: NOISE SUPPRESSION LOGIC IN ERROR CONCEALMENT UNIT USING NOISE-TO-SIGNAL RATIO (granted Feb 24, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 76%
With Interview: 99% (+34.6%)
Median Time to Grant: 3y 0m
PTA Risk: Low
Based on 128 resolved cases by this examiner. Grant probability derived from career allow rate.
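The headline figure checks out arithmetically: the 97/128 career record shown under Examiner Intelligence rounds to the 76% displayed here. A quick sketch (the with-interview 99% is taken as reported, since the dashboard does not state how the +34.6% lift combines with the base rate):

```python
# Verify that the displayed 76% grant probability matches the examiner's
# career record of 97 granted out of 128 resolved cases.
granted, resolved = 97, 128
allow_rate = granted / resolved
print(f"career allow rate: {allow_rate:.1%}")  # 75.8%, displayed as 76%
```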
