Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-3, 5-13, and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Pi (US 7,437,286), hereinafter Pi, in view of Kulis (US 2010/0241963), hereinafter Ku, and further in view of Torok (US 2014/0180697), hereinafter Tor.
Regarding claim 1
Pi teaches:
A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
outputting, from an assistant-enabled device, a first text-to-speech (TTS) utterance generated from a first output transcription comprising a sequence of terms (Pi: Col 3:4-3:9: voice output device plays voice prompt to a user; said voice prompt generated by a TTS engine to generate speech output);
while outputting the first TTS utterance from the assistant-enabled device (Pi: Col 4:45-4:65, 5:50-5:55: system monitors an input signal while the TTS is delivered, output, etc.):
for each respective term of the sequence of terms, determining a corresponding playback status of the respective term (Pi: 4:45-4:65, 5:50-5:55, 6:49-55; Fig 3: the system effectively monitors the waveform of the TTS output by monitoring the input signal, thereby determining frames of the output signal to detect echo frame-by-frame and estimate a time delay based thereon for cancellation of the output signal at the input; the framewise monitoring effectively arrives at a term-by-term monitoring of the waveform);
receiving a barge-in utterance spoken by a user (Pi: Abstract: system monitors and responds to a voice barge in from the user); and
identifying, based on the corresponding playback status of portions of the output, a portion audibly output from the assistant-enabled device as synthesized speech before the user spoke the barge-in utterance (Pi: Col 1:47-1:53, 4:45-4:65, 5:50-5:55, 6:49-55; Fig 3: the system effectively monitors the waveform of the TTS output by monitoring the input signal, thereby determining frames of the output signal to detect echo frame-by-frame and estimate a time delay based thereon for cancellation of the output signal at the input; the framewise monitoring effectively arrives at a term-by-term monitoring of the waveform. Incident upon a particular portion of an output waveform, or frame thereof, the system ceases output and records the user input for the purpose of providing a second response, output, etc. to a user, wherein the subset of terms comprises terms output to the user before the user barge-in utterance; that is, the terms output before the utterance are known, and are used to subtract the user speech from particular portions of the output speech to improve the clarity, accuracy, etc. of the input speech, such as for aiding response thereto);
determining, based on the identified subset of terms, a second output transcription responsive to the barge-in utterance spoken by the user (Pi: Col 5:6-5:22; Fig 3: such as a response based on barge-in user speech recognized in response to the barge-in utterance at 340); and
outputting, from the assistant-enabled device, a second TTS utterance generated from the second output transcription (Pi: Col 5:6-5:22, 6:23-6:36; Fig 3: such as a prompt output to a user in response to commands, etc. recognized in the barge-in utterance).
Pi does not explicitly teach tracking progress term by term to delineate between terms output before the barge-in and terms not output due to occurring after the barge-in, such as for determining context and a per-term playback status which indicates whether the respective term has been audibly output from the assistant-enabled device as synthesized speech, to thereby determine a subset of terms which have been audibly output as synthesized speech.
In a related field of endeavor Ku teaches a system, method, and device for generation of interactive audio (Ku: Abstract) comprising:
outputting utterances generated from a selected and segmented first output transcription comprising a sequence of terms (Ku: ¶ 77, 79, 87, 88; Fig 2A, 2B);
while outputting the first utterance from a device: for each respective term of the sequence of terms, determining a corresponding playback status of the respective term (Ku: ¶ 80: a segmentation engine generates a list of timestamps corresponding to time boundaries of each word in speech-based audio and generates metadata thereof, thereby segmenting the output content in concert with stamped, marked, etc. time boundaries), wherein the playback status indicates whether the respective term has been audibly output from the assistant-enabled device as synthesized speech (Ku: ¶ 63, 80, 236, 244, etc.; Fig 16A: words stored in concert with stamps, markers, etc. to maintain a current playback position robust to pause operations such that the system operates to return to a tracked pause point, effectively maintaining whether words in a transcript have been output or are yet to be output by maintaining relations of word stamps, markers, etc. to a pause point time value);
receiving utterances spoken by a user (Ku: ¶ 236; Fig 16A, etc.: such as a user command which instantiates the Figure 16A, etc. process);
identifying, based on the corresponding playback status of each respective term of the sequence of terms, a subset of terms which have been audibly output as synthesized speech from a device before the user spoke the barge-in utterance (Ku: ¶ 80, 166, 167, etc.; Fig 16A, etc.: a segmentation engine generates a list of timestamps corresponding to time boundaries of each word in speech based audio and generates metadata thereof; the system handles a request during playback by pausing and entering into a handle command process such as that of Fig 16A and subsequently operates to resume playback without discontinuity such as from a paused point);
determining, based on the identified subset of terms, a second output transcription responsive to the utterance spoken by the user (Ku: ¶ 80, 166, 210, etc.; Fig 11, 16A, etc.: such as by providing a voice response to the command, or voice output of the resumed content after response to the voice command); and outputting, from the assistant-enabled device, a second TTS utterance generated from the second output transcription (Ku: ¶ 80, 162, 166, 167, 210, etc.; Fig 11, 16A-16D, etc.: such as by providing a voice response to the command, or voice output of the resumed content after response to the voice command).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Ku-taught timestamped word boundaries to improve the Pi-taught tracking of waveform output, thereby improving a system and method for providing a response to a user based on tracking the words output with respect to a user barge-in. The combination thereby determines which terms have been presented prior to a pause of playback dictated by a user entry of instructions, as discussed by Pi, and determines a word-border timestamp for the barge-in utterance and word borders, timestamps, etc. denoting which portions of a prompt have and have not been output to a user, for at least the purpose of explicitly tracking output in the playback of a prompt with respect to a user utterance, such as to determine a context for a next, subsequent, etc. user interaction; one of ordinary skill in the art would have expected only predictable results therefrom.
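As a purely illustrative aside, the timestamp-based delineation rationale discussed above can be sketched in a few lines of Python. The sketch is hypothetical and drawn from no cited reference: the function name, the per-term start-time list, and the example values are all assumptions for illustration only.

```python
from bisect import bisect_right

def terms_output_before_barge_in(terms, term_start_times, barge_in_time):
    """Given per-term playback start timestamps (Ku-style word boundaries,
    sorted ascending) and the time at which a barge-in utterance was
    detected (Pi-style input monitoring), return the subset of terms
    already audibly output when the barge-in occurred."""
    # A term is treated as output if its playback started at or before
    # the barge-in timestamp.
    n = bisect_right(term_start_times, barge_in_time)
    return terms[:n]

# Hypothetical example: a barge-in at t=1.4s falls between the start
# times of "weather" (0.4s) and "today" (1.5s).
terms = ["the", "weather", "today", "is", "sunny"]
starts = [0.0, 0.4, 1.5, 1.9, 2.2]
print(terms_output_before_barge_in(terms, starts, 1.4))  # → ['the', 'weather']
```

The sketch is offered only to make the claimed delineation concrete; none of the cited references is represented as disclosing this particular implementation.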
Pi in view of Ku does not explicitly teach the recited term-level monitoring for the purpose of determining, based on the identified subset of terms, a second output responsive to the barge-in, for output of a prompt based thereon to a user.
In a related field of endeavor Tor teaches a system and method for generating a TTS presentation of a user text for audible presentation to a user (Tor: Abstract; ¶ 28, etc.; Fig 1: system generates a TTS presentation of a set of user tasks), comprising a sequence of terms with corresponding indicators (Tor: ¶ 28, 48: system generates text for audio presentation based on a user request and further generates data regarding the output of the presentation, such as beginning and end points for items in the set of tasks); while outputting, the system maintains an identifier tracking which element of the presentation is currently being output (Tor: ¶ 31: “As each element is presented and an identifier is encountered, the identifier may be stored such that the identifier may be transmitted to the speech service,”); such that during or subsequent to output of audio the system remains operative to receive user voice input, such as a barge-in or response input to the output audio (Tor: ¶ 12), and the system determines markers for particular portions of the output audio to which the user response input refers, such as in concert with a last received marker indicating an element which was presented during or prior to the user voice input (Tor: ¶ 20, 21), and thereby determines a subsequent, second, etc. output based thereon (Tor: ¶ 41, 45: system determines intent of the user voice input and responds by executing a particular application, generating a relevant response, etc.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to improve the Pi in view of Ku system and method to include tracking identifiers of per-item playback states as taught or suggested by Tor, for at least the purpose of flagging particular output words, phrases, sentences, etc. of a voice output to a user, such as for returning to an appropriate place in a barged-in-upon output after responding appropriately to the user barge-in; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 2
Pi in view of Ku in view of Tor teaches or suggests:
The computer-implemented method of claim 1, wherein the operations further comprise: receiving an initial utterance spoken by the user; and determining the first output transcription based on the initial utterance (Pi: such as operating the input/output loop of Figure 2); (Tor: ¶ 28, 48: system generates text for audio presentation based on a user request and further generates data regarding the output of the presentation, such as beginning and end points for items in the set of tasks). The claim is considered obvious over Pi as modified by Ku and Tor as addressed in the base claim, as it would have been obvious to apply the further teaching of Pi, Ku, and/or Tor to the modified device of Pi, Ku, and Tor; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 3
Pi in view of Ku in view of Tor teaches or suggests:
The computer-implemented method of claim 1, wherein the operations further comprise determining the first output transcription without receiving an initial utterance spoken by the user (Ku: ¶ 215; Fig 13: such as in response to a user interface element); (Ku: ¶ 266: such as by instructing the system to generate a response based on manual entry upon a user interface element). The claim is considered obvious over Pi as modified by Ku and Tor as addressed in the base claim, as it would have been obvious to apply the further teaching of Pi, Ku, and/or Tor to the modified device of Pi, Ku, and Tor; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 5
Pi in view of Ku in view of Tor teaches or suggests:
The computer-implemented method of claim 1, wherein, while outputting the first TTS utterance from the assistant-enabled device, the operations further comprise:
identifying, based on the corresponding playback status of each respective term of the sequence of terms, a second subset of terms from the sequence of terms not output by the assistant-enabled device before the user spoke the barge-in utterance (Pi: 4:45-4:65, 5:50-5:55, 6:49-55; Fig 3: such as by monitoring the waveform of the prompt output played with respect to a pause necessitated by a barge-in signal); (Ku: ¶ 63, 80, 162, 166, 210, etc.; Fig 11, 16A-16D, etc.: such as by determining a remainder of a paused media subsequent to a user command, such as from a timestamp position); and in response to receiving the barge-in utterance, terminating output of the second subset of terms (Ku: ¶ 63, 80, 162, 166, 210, etc.; Fig 11, 16A-16D, etc.: such as when a user input skips a media or continues from a media to a response to the user utterance). The claim is considered obvious over Pi as modified by Ku and Tor as addressed in the base claim, as it would have been obvious to apply the further teaching of Pi, Ku, and/or Tor to the modified device of Pi, Ku, and Tor; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 6
Pi in view of Ku in view of Tor teaches or suggests:
The computer-implemented method of claim 1, wherein receiving the barge-in utterance spoken by the user occurs: after the assistant-enabled device begins outputting the first TTS utterance; and before the assistant-enabled device finishes outputting the first TTS utterance (Pi: 4:45-4:65, 5:50-5:55, 6:49-55; Fig 3: such as by monitoring the waveform of the prompt output played with respect to a pause necessitated by a barge in signal); (Ku: ¶ 63, 80, 162, 166, 210 etc.; Fig 11, 16A-16D, etc.: such as by determining a remainder of a paused media subsequent to a user command; such as from a timestamp position). The claim is considered obvious over Pi as modified by Ku and Tor as addressed in the base claim as it would have been obvious to apply the further teaching of Pi, Ku, and/or Tor to the modified device of Pi, Ku, and Tor; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 7
Pi in view of Ku in view of Tor teaches or suggests:
The computer-implemented method of claim 1, wherein the operations further comprise:
determining, based on the subset of terms, a context of the barge-in utterance, wherein determining the second output transcription is further based on the context of the barge-in utterance (Pi: Col 5:15-5:23; Fig 3: such as by speech recognition sufficient to generate a prompt based on at least the content of the recognized speech and additionally on parameters thereof); (Ku: Figs 16A-16D: such as directly, wherein the system shifts context based on a user command corresponding thereto, e.g. change publication or section, tag, get status, etc.); (Tor: ¶ 28, 41, 45, 48: system determines intent of the user voice input and responds by executing a particular application, generating a relevant response, etc., such as by generating text for audio presentation based on a user request and intent thereof, and with respect to data regarding an output of the system with respect to the user voice input, such as by maintaining beginning and end points for items in the system output). The claim is considered obvious over Pi as modified by Ku and Tor as addressed in the base claim, as it would have been obvious to apply the further teaching of Pi, Ku, and/or Tor to the modified device of Pi, Ku, and Tor; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 8
Pi in view of Ku in view of Tor teaches or suggests:
The computer-implemented method of claim 1, wherein the operations further comprise:
assigning a corresponding playback timestamp to each respective term of the sequence of terms as the respective term is output from the assistant-enabled device (Pi: 4:45-4:65, 5:50-5:55, 6:49-55; Fig 3: such as by monitoring the waveform of the prompt output played with respect to a pause necessitated by a barge-in signal); (Ku: ¶ 62, 63, 80, 162, 166, 167, 210, etc.; Fig 11, 16A-16D, etc.: such as by managing output and interrupt with respect to time-based word boundary markers, timestamps, etc., such as to resume a paused media after responding to a user); and determining a barge-in timestamp of the barge-in utterance as the assistant-enabled device receives the barge-in utterance (Pi: 4:45-4:65, 5:50-5:55, 6:49-55; Fig 3: such as by monitoring the waveform of the prompt output played with respect to a pause necessitated by a barge-in signal); (Ku: ¶ 62, 63, 80, 162, 166, 167, 210, etc.; Fig 11, 16A-16D, etc.: such as by managing output and interrupt with respect to time-based word boundary markers, timestamps, etc., such as to resume a paused media after responding to a user). The claim is considered obvious over Pi as modified by Ku and Tor as addressed in the base claim, as it would have been obvious to apply the further teaching of Pi, Ku, and/or Tor to the modified device of Pi, Ku, and Tor; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 9
Pi in view of Ku in view of Tor teaches or suggests:
The computer-implemented method of claim 8, wherein identifying the subset of terms is further based on the corresponding playback timestamp of each respective term of the sequence of terms and the barge-in timestamp. Please see claim 1 supra and additionally: Pi: 4:45-4:65, 5:50-5:55, 6:49-55; Fig 3: such as by monitoring the waveform of the prompt output played with respect to a pause necessitated by a barge in signal; Ku: ¶ 62, 63, 80, 162, 166, 167, 210 etc.; Fig 11, 16A-16D, etc.: such as by managing output and interrupt with respect to time based word boundary markers, timestamps, etc., such as to resume a paused media after responding to a user; Tor: ¶ 28, 41, 45, 48; system determines intent of the user voice input and responds by executing a particular application, generating a relevant response, etc. such as by generating text for audio presentation based on a user request, intent thereof and with respect to data regarding an output of the system with respect to the user voice input such as by maintaining beginning and end points for items in the system output. The claim is considered obvious over Pi as modified by Ku and Tor as addressed in the base claim as it would have been obvious to apply the further teaching of Pi, Ku, and/or Tor to the modified device of Pi, Ku, and Tor; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 10
Pi in view of Ku in view of Tor teaches or suggests:
The computer-implemented method of claim 1, wherein the barge-in utterance comprises a hotword-free utterance. Neither Pi, Ku, nor Tor discusses hotword-type commands in the user input, merely parameters, context, etc. Pi in view of Ku in view of Tor does not explicitly teach the system, method, etc. wherein the utterance does not contain hotwords; however, Examiner considers such a solution obvious to try. Consider that Pi in view of Ku in view of Tor has recognized the need to contextually respond to a user based on context at the time of a user utterance, that the predictable solutions proffered by Pi and Tor rely on speech recognition and natural language understanding, and that the solution of Ku relies on a hotword-oriented command structure. Therefore it would have been obvious to try to effect a system with one solution, i.e., a menu of commands, hotwords, etc.; a separate solution, i.e., based on natural language understanding; or indeed a hybrid system combining elements of both. Certainly one of ordinary skill in the art could have pursued the known potential solutions with a reasonable expectation of success and without undue experimentation. The claim is considered obvious over Pi as modified by Ku and Tor as addressed in the base claim, as it would have been obvious to apply the further teaching of Pi, Ku, and/or Tor to the modified device of Pi, Ku, and Tor; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 11—the claim is considered to recite substantially similar subject matter to that of claim 1 and is similarly rejected.
Regarding claim 12—the claim is considered to recite substantially similar subject matter to that of claim 2 and is similarly rejected.
Regarding claim 13—the claim is considered to recite substantially similar subject matter to that of claim 3 and is similarly rejected.
Regarding claim 15—the claim is considered to recite substantially similar subject matter to that of claim 5 and is similarly rejected.
Regarding claim 16—the claim is considered to recite substantially similar subject matter to that of claim 6 and is similarly rejected.
Regarding claim 17—the claim is considered to recite substantially similar subject matter to that of claim 7 and is similarly rejected.
Regarding claim 18—the claim is considered to recite substantially similar subject matter to that of claim 8 and is similarly rejected.
Regarding claim 19—the claim is considered to recite substantially similar subject matter to that of claim 9 and is similarly rejected.
Regarding claim 20—the claim is considered to recite substantially similar subject matter to that of claim 10 and is similarly rejected.
Response to Arguments
Applicant’s arguments in concert with claim amendments, see Remarks and Claims, filed 1/6/26, with respect to the rejection(s) of claim(s) 1-20 under 35 USC 103 over Pi in view of Kulis in view of Baeuml have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Pi, Kulis, and Torok.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL C MCCORD whose telephone number is (571)270-3701. The examiner can normally be reached 7:30-6:30 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, CAROLYN EDWARDS can be reached at (571) 270-7136. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PAUL C MCCORD/ Primary Examiner, Art Unit 2692