Last updated: May 29, 2026

Application No. 18/589,118

SYSTEMS AND TECHNIQUES FOR RECOGNIZING DICTATION COMMANDS

Non-Final OA §103

Filed

Feb 27, 2024

Priority

Jun 02, 2023 — provisional 63/470,719

Examiner

PATEL, SHREYANS A

Art Unit

2659

Tech Center

2600 — Communications

Assignee

Apple Inc.

OA Round

2 (Non-Final)

Interview Optional

+7.7% interview lift. An interview has already been conducted in this application's prosecution history. This examiner has an 89% grant rate with a +7.7% interview lift. Because an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.

Based on 406 resolved cases, 2023–2026

Examiner Intelligence

PATEL, SHREYANS A

Grants 89% — above average

Career Allowance Rate

361 granted / 406 resolved

+26.9% vs TC avg

Moderate +8% lift

+7.7%

Interview Lift

Fast prosecutor

2y 0m

Avg Prosecution

26 currently pending

Career history

449

Total Applications

across all art units

Statute-Specific Performance

§101

11.0%

-29.0% vs TC avg

§103

67.1%

+27.1% vs TC avg

§102

11.6%

-28.4% vs TC avg

§112

0.8%

-39.2% vs TC avg

Based on career data from 406 resolved cases

Office Action

§103

DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments with respect to 35 U.S.C. 101 rejection of claims 1 and 20-21 have been considered and found persuasive, and the rejection has been withdrawn.
Applicant's arguments with respect to the 35 U.S.C. 103 rejection of claims 1, 5, 20 and 21 have been considered but are not found to be persuasive for the following reasons. Applicant argues that Wutte in view of Amento does not explicitly teach “in accordance with a determination that the first audio input satisfies a predetermined condition.” Examiner respectfully disagrees because Amento explicitly teaches, in [0029], determining a condition based on confidence relative to a predetermined threshold (the transcript displayer uses word confusion data “to provide a visual indication with respect to words having a confidence score less than a predetermined threshold”). Claim 5 is further rejected under Amento. Therefore, the claims stand rejected. See the detailed rejection below.
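The threshold condition cited from Amento [0029] can be pictured with a short Python sketch. The data layout (a list of word and confidence pairs), the function name `flag_low_confidence`, and the 0.5 threshold are illustrative assumptions. None of them comes from Amento or from the application.

```python
def flag_low_confidence(words, threshold=0.5):
    """Return the words whose confidence score is below the threshold.

    `words` is a list of (word, confidence) pairs; this layout and the
    default threshold are assumptions made for illustration only.
    """
    return [word for word, score in words if score < threshold]


# Hypothetical recognizer output: each word paired with a confidence score.
hypothesis = [("this", 0.95), ("invasion", 0.32), ("is", 0.90), ("new", 0.88)]
print(flag_low_confidence(hypothesis))
```

A display component could then highlight the returned words, which is roughly the kind of visual indication the cited paragraph describes.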

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-21 are rejected under 35 U.S.C. 103 as being unpatentable over Wutte (US 7,027,985) in view of Amento et al. (US 2007/0208567).

Claims 1, 20 and 21,
Wutte teaches a non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device with a display, cause the electronic device to: receive a first audio input ([Figs. 1-2] [col. 4 lines 22-30] processor, memory, device; receiving an utterance: an analog microphone signal MS, which contains speech information SI, is applied when a user speaks into the microphone, and an A/D converter produces digital audio data AD, which contains the speech information SI);
after receiving the first audio input, receive a second audio input ([col. 7 lines 33-44] a subsequent spoken input after first text is displayed (after first text information TI1 is displayed); “therefore, the user speaks the second text information TI2 = ‘Replace Invasion with Invention’ of the spoken command SB into the microphone 2”); 
display, via the display, one or more words transcribed from at least a portion of the first audio input, wherein the one or more words are transcribed using a language model ([col. 5 lines 4-9] [col. 4 lines 44-48] a language model as word-sequence probability information (in the speech model memory 8, probability information WI about the sequence of words customary in language is stored, and the recognizer evaluates the probability information WI to determine the sequence that has the greatest overall probability));
in accordance with a determination, based on the updated language model, that the second audio input includes a valid command for the one or more words transcribed from the at least a portion of the first audio input ([Summary] [col. 4 line 60 to col. 5 line 3] [col. 5 lines 26-37] determining whether the recognized second input has the command structure “replace A with B”: checking whether the recognized text information TI has the structure of a four-word spoken command SB = “replace A with B” and executing the next steps only when the structure is found):
execute the valid command to modify the display of the one or more words transcribed from the at least a portion of the first audio input ([col. 6 lines 29-40] [col. 9 lines 1-8] editing means 11, replacing, the erroneously recognized TW with the new TN and in example: the eighth word “invasion” is replaced with “invention”); and 
in accordance with a determination that the second audio input does not include a valid command for the one or more words transcribed from the at least a portion of the first audio input: forgo executing, based on the second audio input, a command for the one or more words transcribed from the at least a portion of the first audio input ([col. 5 line 58 to col. 6 line 9] [col. 7 lines 16-26] forgoing command execution when the structure is not found (“providing for producing the text information TI, unmodified as the first text information TI1 when the structure of the four-word spoken command SB is not found”; for example, because there is no “replace” or “with” in the dictation, the replace means “deliver the text information TI as first text information TI1”)).
The difference between the prior art and the claimed invention is that Wutte does not explicitly teach in accordance with a determination that the first audio input satisfies a predetermined condition: generate a textual representation of the at least a portion of the first audio input; and update the language model based on the textual representation.
Amento teaches in accordance with a determination that the first audio input satisfies a predetermined condition ([0029] determining a condition based on confidence relative to a predetermined threshold: transcript displayer uses word confusion data “to provide a visual indication with respect to words having a confidence score less than a predetermined threshold”): 
generate a textual representation of the at least a portion of the first audio input ([0029] [0051] generating/displaying transcript text from recognized words (“transcript displayer 204 may receive best hypothesis words ... to generate a display of a transcript” and “display a transcript on a display device”)); and
update the language model based on the textual representation ([0034] [0048] [0052] using correction feedback to update models (including language models) to improve future accuracy (“provide correction information to ASR 202, such that ASR 202 may update its language and acoustical models to improve speech recognition accuracy; ASR 202 may use this information to update its language and acoustical models and send feedback data ... adjust language and acoustical models”)).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Wutte with teachings of Amento by modifying the speech recognition method with a replace command as taught by Wutte to include in accordance with a determination that the first audio input satisfies a predetermined condition: generate a textual representation of the at least a portion of the first audio input; and update the language model based on the textual representation as taught by Amento for the benefit of updating its language and acoustical models to improve speech recognition accuracy (Amento [0034]).
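The command-structure check that the rejection attributes to Wutte can be illustrated with a minimal Python sketch. It assumes whitespace-tokenized text and case-insensitive matching. The function name `apply_replace_command` and these simplifications are hypothetical and are not taken from Wutte or from the application.

```python
def apply_replace_command(first_text: str, second_text: str) -> tuple[str, bool]:
    """Try to apply a spoken "replace A with B" command to first_text.

    Returns (text, executed). When the command structure is not found, or
    the target A does not occur in first_text, first_text is returned
    unmodified and executed is False.
    """
    words = second_text.split()
    lowered = [w.lower() for w in words]

    # First command part: the word at the first position must be "replace".
    if len(words) < 4 or lowered[0] != "replace":
        return first_text, False

    # Third command part: "with" at the third or a later position.
    try:
        with_idx = lowered.index("with", 2)
    except ValueError:
        return first_text, False

    target = lowered[1:with_idx]        # second part: text to be replaced (A)
    replacement = words[with_idx + 1:]  # fourth part: new text (B)
    if not replacement:
        return first_text, False

    # Search the first text for the target and replace its first occurrence.
    first_words = first_text.split()
    n = len(target)
    for i in range(len(first_words) - n + 1):
        if [w.lower() for w in first_words[i:i + n]] == target:
            edited = first_words[:i] + replacement + first_words[i + n:]
            return " ".join(edited), True
    return first_text, False


print(apply_replace_command("the patent for this invasion",
                            "Replace Invasion with Invention"))
```

When `executed` is False, a caller would treat the second input as ordinary dictation rather than as a command, which corresponds to the "forgo executing" branch mapped above.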

Claim 2,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the textual representation is different from the one or more words transcribed from the at least a portion of the first audio input ([Abstract] first text and second text are different wherein the first text is in error and the second text is being modified).

Claim 3,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein: the textual representation includes a plurality of candidate transcriptions determined based on the at least a portion of the first audio input; and the one or more words transcribed from the at least a portion of the first audio input comprise a first candidate transcription of the plurality of candidate transcriptions ([Fig. 4] determining alternative text information parts and choosing the one with the highest overall probability for the command).

Claim 4,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: forgo displaying the textual representation ([Abstract] if the structure of the four-word command is not found in the text information, then the first text information is delivered unmodified (no command executed)).

Claim 5,
Amento further teaches the non-transitory computer-readable storage medium of claim 1, wherein the first audio input satisfies the predetermined condition when first speech input in the first audio input is followed by a predetermined period of non-speech ([0029] determining a condition based on confidence relative to a predetermined threshold: transcript displayer uses word confusion data “to provide a visual indication with respect to words having a confidence score less than a predetermined threshold”).

Claim 6,
Amento further teaches the non-transitory computer-readable storage medium of claim 1, wherein the first audio input satisfies the predetermined condition when the first audio input is determined to correspond to a first topic and the second audio input is determined to correspond to a second topic different from the first topic ([0008-0010] updating the processing based on confidence/semantic context of a transcript and subsequent user interactions; treating different semantic contexts/topic as conditions that trigger different handling for later inputs).

Claim 7,
Amento further teaches the non-transitory computer-readable storage medium of claim 1, wherein updating the language model based on the textual representation includes: adjusting one or more weights associated with the language model, wherein the one or more weights correspond to at least one word represented by the textual representation ([0008-0010] providing correction information back to the speech processing module to improve accuracy (LM/acoustic model are updated by the corrected text)).

Claim 8,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: determine, based on the updated language model, that the second audio input includes the valid command for the one or more words transcribed from the at least a portion of the first audio input, including: processing the second audio input based on the updated language model to recognize at least one word of the one or more words transcribed from the at least a portion of the first audio input ([Fig. 2] [col. 5 lines 26-57] parsing the four-word command (“replace A with B”); locating A in the first text and performing editing to the second text).

Claim 9,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: determine, based on the updated language model, that the second audio input includes the valid command for the one or more words transcribed from the at least a portion of the first audio input, including: detecting, based on a predefined syntax structure, a first candidate command for the one or more words transcribed from the at least a portion of the first audio input ([col. 5 lines 26-57] predefined syntax for the command and a structure checking of the positions SB1…SB4).

Claim 10,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: determine, based on the updated language model, that the second audio input includes the valid command for the one or more words transcribed from the at least a portion of the first audio input, including: determining that the second audio input includes a second candidate command; and determining, based on the updated language model, that the second candidate command corresponds to at least one target word of the one or more words transcribed from the at least a portion of the first audio input ([col. 5 line 58 to col. 6 line 40] checking the second text for structure and then search the first text for the target word (TW) to be replaced by TN; multiple candidate parses and target word identification).

Claim 11,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: display, via the display, one or more words transcribed from the second audio input; and after displaying the one or more words transcribed from the second audio input: in accordance with a determination that the second audio input includes a third candidate command for the one or more words transcribed from the at least a portion of the first audio input, cease to display, via the display, the one or more words transcribed from the second audio input; and after the determination that the second audio input includes the third candidate command: in accordance with a determination that the third candidate command does not correspond to a target word of the one or more words transcribed from the at least a portion of the first audio input: display, via the display, the one or more words transcribed from the second audio input ([Abstract] recognizing the first text information (TI1) and the second text information (TI2), checking whether a word at the first position in the second text information (TI2) matches a first spoken command part (SB1) and checking whether a word at the third or further position in the second text information (TI2) matches a third spoken command part (SB3) while, in the event of a positive result of the checks, the next step of the text editing method is executed, checking whether the first text information (TI1) contains the erroneously recognized text information part (TW) which is featured by a second spoken command part (SB2) in the second text information (TI2) between the first spoken command part (SB1) and the third spoken command (SB3) while, in the event of a positive result of the check, the next step of the text editing method is executed, replacing the erroneously recognized text information part (TW) in the first text information TI1 with new text information part (TN) which the second text information (TI2) contains as a fourth spoken command part (SB4) of the four-word spoken command (SB) after the third spoken command part (SB3)).

Claim 12,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: in accordance with the determination that the second audio input includes the valid command for the one or more words transcribed from the at least a portion of the first audio input: forgo displaying, via the display, one or more words transcribed from the second audio input ([Abstract] applying the command to modify the first text; the command text itself need not persist on screen after validation; if the structure of the four-word command is not found in the text information, then the first text information is delivered unmodified (no command executed)).

Claim 13,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: in accordance with the determination that the second audio input does not include a valid command for the one or more words transcribed from the at least a portion of the first audio input: display, via the display, one or more words transcribed from the second audio input ([col. 10 lines 38-62] when command structure fails or target word not found, no replacement occurs and the original text remains; the command speech is handled as ordinary text when invalid).

Claim 14,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: after receiving the second audio input, receive a third audio input; in accordance with the determination that the second audio input includes the valid command for the one or more words transcribed from the at least a portion of the first audio input: perform speech recognition on the third audio input without using one or more words transcribed from the second audio input; and in accordance with the determination that the second audio input does not include the valid command for the one or more words transcribed from the at least a portion of the first audio input: perform speech recognition on the third audio input using the one or more words transcribed from the second audio input ([Abstract] receiving speech information (SI) and recognizing the first text information (TI1) and the second text information (TI2), checking whether a word at the first position in the second text information (TI2) matches a first spoken command part (SB1) and checking whether a word at the third or further position in the second text information (TI2) matches a third spoken command part (SB3) while, in the event of a positive result of the checks, the next step of the text editing method is executed, checking whether the first text information (TI1) contains the erroneously recognized text information part (TW) which is featured by a second spoken command part (SB2) in the second text information (TI2) between the first spoken command part (SB1) and the third spoken command (SB3) while, in the event of a positive result of the check, the next step of the text editing method is executed, replacing the erroneously recognized text information part (TW) in the first text information TI1 with new text information part (TN) which the second text information (TI2) contains as a fourth spoken command part (SB4) of the four-word spoken command (SB) after the third spoken command part (SB3)).
In addition, Amento teaches a framework that conditions subsequent processing on earlier outcomes by sending correction feedback to the model and changing the handling of later inputs, using or ignoring the second-input words as context within the loop (Amento [0008-0010]).

Claim 15,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the valid command requests to add a first word to the one or more words transcribed from the at least a portion of the first audio input ([Abstract] [Claims 3-5] inserting/adding/replacing with a word).

Claim 16,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the valid command requests to replace a second word of the one or more words transcribed from the at least a portion of the first audio input with a third word different from the second word ([col. 5 lines 26-57] replace command (“REPLACE A WITH B”); SB1…SB4).

Claim 17,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the valid command requests to delete a fourth word of the one or more words transcribed from the at least a portion of the first audio input ([col. 5 lines 26-27] [Abstract] SB = “REPLACE”; SB1…SB4).

Claim 18,
Amento further teaches the non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions, which when executed by the one or more processors, cause the electronic device to: while receiving the first audio input and while receiving the second audio input, display, via the display, a dictation user interface; and in response to receiving the second audio input: in accordance with a determination that the second audio input corresponds to text for dictation: display, via the display, the dictation user interface with a first appearance; and in accordance with a determination that the second audio input corresponds to a command associated with previously dictated text: display, via the display, the dictation user interface with a second appearance different from the first appearance ([0008-0010] UI elements that visually indicate confidence and correction mode and provides editing options lists when in correction/command context; this is a distinct UI state from plain dictation).

Claim 19,
Wutte further teaches the non-transitory computer-readable storage medium of claim 1, wherein the second audio input is received after an endpoint is determined for second speech in the first audio input ([col. 10 lines 51-63] a spoken command part may also be formed by a pause/endpoint when the four-word spoken command is uttered).

Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Primary Examiner
Art Unit 2653



/SHREYANS A PATEL/Examiner, Art Unit 2659

Prosecution Timeline

Dec 12, 2025

Examiner Interview Summary

Dec 29, 2025

Response Filed

Feb 03, 2026

Final Rejection mailed — §103

Mar 26, 2026

Examiner Interview Summary

Mar 26, 2026

Applicant Interview (Telephonic)

Mar 30, 2026

Response after Non-Final Action

Apr 28, 2026

Request for Continued Examination

Apr 30, 2026

Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

18/132,165

Patent 12608559

METHOD AND SYSTEM FOR ENHANCING A MUTIMODAL INPUT CONTENT

3y 0m to grant Granted Apr 21, 2026

18/696,802

Patent 12609128

METHOD FOR IMPROVING FAR-FIELD SPEECH INTERACTION PERFORMANCE, AND FAR-FIELD SPEECH INTERACTION SYSTEM

2y 0m to grant Granted Apr 21, 2026

17/934,906

Patent 12586597

ENHANCED AUDIO FILE GENERATOR

3y 6m to grant Granted Mar 24, 2026

18/744,449

Patent 12586561

TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM, AND A METHOD OF CALCULATING AN EXPRESSIVITY SCORE

1y 9m to grant Granted Mar 24, 2026

17/983,671

Patent 12548549

ON-DEVICE PERSONALIZATION OF SPEECH SYNTHESIS FOR TRAINING OF SPEECH RECOGNITION MODEL(S)

3y 3m to grant Granted Feb 10, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

2-3

Expected OA Rounds

89%

Grant Probability

97%

With Interview (+7.7%)

2y 0m (~0m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 406 resolved cases by this examiner. Grant probability derived from career allowance rate.
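The projection figures can be reproduced from the counts reported on this page. The short Python check below assumes that the interview lift is added to the grant rate in percentage points. That reading is inferred from the displayed numbers and is not stated by the source.

```python
granted, resolved = 361, 406  # career counts reported for this examiner

# Career allowance rate in percent.
grant_rate = 100 * granted / resolved

# Assumption: the +7.7% lift is additive in percentage points.
interview_lift = 7.7
with_interview = grant_rate + interview_lift

print(f"grant rate: {grant_rate:.1f}%, with interview: {with_interview:.1f}%")
```

Rounded to whole percentages, these values match the 89% and 97% shown above.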