DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Specification
The disclosure is objected to because of the following informalities:
In paragraph 0023, lines 2-3, “a smart speaker that any sounds in the surrounding area” should read “a smart speaker that detects any sounds in the surrounding area”.
In paragraph 0031, lines 5-6, “er 210” should read “hotworder 210”.
Appropriate correction is required.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-5, 9-15, and 19-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Kim et al. (US Patent No. 10,127,911), hereinafter Kim.
Regarding claim 1, Kim discloses a computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations (Column 4, line 61 - Column 5, line 3, "In some examples, a non-transitory computer-readable storage medium of memory 250 can be used to store instructions (e.g., for performing some or all of process 300, 400, 500, 600, or 700, described below) for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device, and execute the instructions.") comprising:
receiving audio data characterizing an utterance comprising a hotword followed by multiple terms (Column 6, lines 28-48, "In some examples, process 300 can be performed by a system similar or identical to system 100 having a user device similar or identical to user device 102 configured to implement a virtual assistant capable of continuously (or intermittently over an extended period of time) monitoring an audio input for a receipt of a trigger phrase that initiates activation of the virtual assistant. For example, a user device implementing the virtual assistant can continuously or intermittently monitor sounds, speech, and the like detected by a microphone of the user device without performing an action, such as performing a task flow, generating an output response in an audible (e.g., speech) and/or visual form, or the like, in response to the monitored sounds and speech. However, in response to detecting the trigger phrase, the virtual assistant can perform a speaker identification process to ensure that the speaker of the trigger phrase is the intended operator of the virtual assistant. Upon verification of the identity of the speaker, the virtual assistant can be activated, causing the virtual assistant to process a subsequently received word or phrase and to respond accordingly."; Monitoring an audio input for a receipt of a trigger phrase and processing a subsequently received phrase reads on receiving audio data characterizing an utterance comprising a hotword followed by multiple terms.);
processing the audio data to: identify, using a hotword model, a presence of the hotword in the audio data (Column 6, lines 58-66, "At block 304, speech-to-text conversion can be performed on the audio input received at block 302 to determine whether the audio input includes user speech containing a predetermined trigger phrase. The trigger phrase can include any desired set of one or more predetermined words, such as “Hey Siri.” The trigger phrase can be used to activate the virtual assistant and signal to the virtual assistant that a user input, such as a request, command, or the like, will be subsequently provided."; Determine whether the audio input includes user speech containing a predetermined trigger phrase reads on identify a presence of the hotword in the audio data.);
and determine, using a speaker identification model, that the utterance characterized by the audio data was spoken by a particular person (Column 9, lines 15-20, "At block 502, the user device can perform a speaker identification process on the audio input received at block 302 of process 300 to determine whether the speaker is a predetermined user (e.g., an authorized user of the device). Any desired speaker identification process can be used, such as an i-vector speaker identification process."; Performing a speaker identification process on the received audio input to determine whether the speaker is a predetermined user reads on determining that the utterance characterized by the audio data was spoken by a particular person.);
based on identifying the presence of the hotword in the audio data and determining that the utterance characterized by the audio data was spoken by the particular user, performing speech recognition on the audio data to generate a transcription of the utterance (Column 6, lines 58-61, "At block 304, speech-to-text conversion can be performed on the audio input received at block 302 to determine whether the audio input includes user speech containing a predetermined trigger phrase."; Column 8, lines 23-38, "At block 404, the user device can activate the virtual assistant by processing audio input received subsequent to the audio input containing the trigger phrase. For example, block 404 can include receiving the subsequent audio input, performing speech-to-text conversion on the subsequently received audio input to generate a textual representation of user speech contained in the subsequently received audio input, determining a user intent based on the textual representation, an acting on the determined user intent by performing one or more of the following: identifying a task flow with steps and parameters designed to accomplish the determined user intent; inputting specific requirements from the determined user intent into the task flow; executing the task flow by invoking programs, methods, services, APIs, or the like; and generating output responses to the user in an audible (e.g., speech) and/or visual form."; Column 12, lines 1-12, "At block 602, the user device can perform a speaker identification process on the audio input received at block 302 of process 300 in a manner similar or identical to that of block 502 of process 500. If it is determined that the speaker of the audio input is the predetermined user represented by the speaker profile, then process 600 can proceed to block 604 without adding the audio input to a speaker profile in a manner similar or identical to block 402 of process 400 or block 504 of process 500. 
At block 604, the virtual assistant can be activated and subsequently received audio input can be processed in a manner similar or identical to block 404 of process 400 or block 506 of process 500."; Performing speech-to-text conversion on the subsequently received audio input to generate a textual representation of user speech contained in the subsequently received audio input after determining that the audio input includes user speech containing a predetermined trigger phrase and determining that the speaker of the audio input is the predetermined user represented by the speaker profile reads on performing speech recognition on the audio data to generate a transcription of the utterance based on identifying the presence of the hotword in the audio data and determining that the utterance characterized by the audio data was spoken by the particular user.);
processing, using a command identifier, the transcription of the utterance to determine that the multiple terms of the utterance comprise a command directed toward a user computing device (Column 6, lines 63-66, "The trigger phrase can be used to activate the virtual assistant and signal to the virtual assistant that a user input, such as a request, command, or the like, will be subsequently provided."; Column 8, lines 23-38, "At block 404, the user device can activate the virtual assistant by processing audio input received subsequent to the audio input containing the trigger phrase. For example, block 404 can include receiving the subsequent audio input, performing speech-to-text conversion on the subsequently received audio input to generate a textual representation of user speech contained in the subsequently received audio input, determining a user intent based on the textual representation, an acting on the determined user intent by performing one or more of the following: identifying a task flow with steps and parameters designed to accomplish the determined user intent; inputting specific requirements from the determined user intent into the task flow; executing the task flow by invoking programs, methods, services, APIs, or the like; and generating output responses to the user in an audible (e.g., speech) and/or visual form."; Determining a user intent based on the textual representation, where the user intent is a command, reads on processing the transcription of the utterance to determine that the multiple terms of the utterance comprise a command directed toward a user computing device.);
and based on determining that the multiple terms of the utterance comprise the command, initiating performance of the command using the user computing device (Column 8, lines 23-38, "At block 404, the user device can activate the virtual assistant by processing audio input received subsequent to the audio input containing the trigger phrase. For example, block 404 can include receiving the subsequent audio input, performing speech-to-text conversion on the subsequently received audio input to generate a textual representation of user speech contained in the subsequently received audio input, determining a user intent based on the textual representation, an acting on the determined user intent by performing one or more of the following: identifying a task flow with steps and parameters designed to accomplish the determined user intent; inputting specific requirements from the determined user intent into the task flow; executing the task flow by invoking programs, methods, services, APIs, or the like; and generating output responses to the user in an audible (e.g., speech) and/or visual form."; Acting on the determined user intent by executing the task flow reads on initiating performance of the command using the user computing device based on determining that the multiple terms of the utterance comprise the command.).
Regarding claim 2, Kim discloses the computer-implemented method as claimed in claim 1.
Kim further discloses:
wherein the audio data is captured by a microphone residing on the user computing device (Column 6, lines 50-53, "At block 302 of process 300, an audio input including user speech can be received at a user device. In some examples, a user device (e.g., user device 102) can receive the audio input including user speech via a microphone (e.g., microphone 230).").
Regarding claim 3, Kim discloses the computer-implemented method as claimed in claim 1.
Kim further discloses:
wherein the speaker identification model is trained to recognize speech spoken by the particular person (Column 7, lines 21-30, "At block 305, the user device can generate a speaker profile, selectively perform speaker recognition using the speaker profile, and selectively activate the virtual assistant in response to positively identifying the speaker using speaker recognition. In some examples, the speaker profile can generally include one or more voice prints generated from an audio recording of a speaker's voice. The voice prints can be generated using any desired speech recognition technique, such as by generating i-vectors to represent speaker utterances."; Column 7, lines 37-48, "Specifically, at block 306, the user device can select one of multiple modes in which to operate. In some examples, the multiple modes can include a speaker profile building mode (represented by block 308) in which a speaker's voice can be modeled to generate a speaker profile, a speaker profile modifying mode (represented by block 310) in which a speaker profile can be used to verify the identity of a user and in which the speaker profile can be updated based on newly received user speech, and a static speaker profile mode in which an existing speaker profile can be used to verify the identity of a user and in which the speaker profile may not be changed based on newly received user speech."; Modeling a speaker's voice to generate a speaker profile to selectively perform speaker recognition reads on training the speaker identification model to recognize speech spoken by the particular person.).
Regarding claim 4, Kim discloses the computer-implemented method as claimed in claim 3.
Kim further discloses:
wherein the speaker identification model is trained on previously collected speech data for the particular person (Column 7, lines 21-30, "At block 305, the user device can generate a speaker profile, selectively perform speaker recognition using the speaker profile, and selectively activate the virtual assistant in response to positively identifying the speaker using speaker recognition. In some examples, the speaker profile can generally include one or more voice prints generated from an audio recording of a speaker's voice. The voice prints can be generated using any desired speech recognition technique, such as by generating i-vectors to represent speaker utterances."; Generating voice prints from an audio recording of a speaker's voice reads on the speaker identification model being trained on previously collected speech data for the particular person.).
Regarding claim 5, Kim discloses the computer-implemented method as claimed in claim 4.
Kim further discloses:
wherein the previously collected speech data characterizes utterances of various phrases the particular person is requested to repeat (Column 1, lines 33-38, "Some natural language processing systems can perform speaker identification to verify the identity of a user. These systems typically require the user to perform an enrollment process during which the user speaks a series of predetermined words or phrases to allow the natural language processing system to model the user's voice."; Performing an enrollment process during which the user speaks a series of predetermined words or phrases reads on the previously collected speech data characterizing utterances of various phrases the particular person is requested to repeat.).
Regarding claim 9, Kim discloses the computer-implemented method as claimed in claim 1.
Kim further discloses:
wherein the data processing hardware resides on the user computing device (Column 3, lines 62-66, "Although the functionality of the virtual assistant is shown in FIG. 1 as including both a client-side portion and a server-side portion, in some examples, the functions of the assistant can be implemented as a standalone application installed on a user device."; Column 4, lines 7-10, "FIG. 2 is a block diagram of a user-device 102 according to various examples. As shown, user device 102 can include a memory interface 202, one or more processors 204, and a peripherals interface 206.").
Regarding claim 10, Kim discloses the computer-implemented method as claimed in claim 1.
Kim further discloses:
wherein the user computing device comprises a smart phone, a laptop computer, a desktop computer, a smart speaker, or a smart watch (Column 3, lines 22-33, "As shown in FIG. 1, in some examples, a virtual assistant can be implemented according to a client-server model. The virtual assistant can include a client-side portion executed on a user device 102, and a server-side portion executed on a server system 110. User device 102 can include any electronic device, such as a mobile phone, tablet computer, portable media player, desktop computer, laptop computer, PDA, television, television set-top box, wearable electronic device, or the like, and can communicate with server system 110 through one or more networks 108, which can include the Internet, an intranet, or any other wired or wireless public or private network.").
Regarding claim 11, arguments analogous to claim 1 are applicable. In addition, Kim discloses a system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations (Column 4, line 61 - Column 5, line 3, "In some examples, a non-transitory computer-readable storage medium of memory 250 can be used to store instructions (e.g., for performing some or all of process 300, 400, 500, 600, or 700, described below) for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device, and execute the instructions.") comprising the steps of claim 1.
Regarding claim 12, arguments analogous to claim 2 are applicable.
Regarding claim 13, arguments analogous to claim 3 are applicable.
Regarding claim 14, arguments analogous to claim 4 are applicable.
Regarding claim 15, arguments analogous to claim 5 are applicable.
Regarding claim 19, arguments analogous to claim 9 are applicable.
Regarding claim 20, arguments analogous to claim 10 are applicable.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 6-8 and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Kim in view of Foerster et al. (US Patent No. 9,418,656), hereinafter Foerster.
Regarding claim 6, Kim discloses the computer-implemented method as claimed in claim 1, but does not specifically disclose: wherein processing the audio data to identify the presence of the hotword in the audio data comprises: processing, using the hotword model, the audio data to compute a hotword confidence score reflecting a likelihood that the audio data includes the hotword; and identifying, using the hotword model, the presence of the hotword based on the hotword confidence score.
Foerster teaches:
wherein processing the audio data to identify the presence of the hotword in the audio data comprises: processing, using the hotword model, the audio data to compute a hotword confidence score reflecting a likelihood that the audio data includes the hotword (Column 4, lines 25-31, "The audio subsystem of the computing device 115 provides the processed audio data 120 to a first stage of the hotworder. The first stage hotworder 125 may be a “coarse” hotworder. The first stage hotworder 125 performs a classification process that may be informed or trained using known utterances of the hotword, and computes a likelihood that the utterance 110 includes a hotword."; Column 5, lines 15-17, "Based on the classification process performed by the first stage hotworder 125, the first stage hotworder 125 computes a hotword confidence score.");
and identifying, using the hotword model, the presence of the hotword based on the hotword confidence score (Column 6, lines 51-60, "In some implementations, the speaker identification module 150 transmits a signal to the second stage hotworder 145 or to the first stage hotworder 125 indicating that the speaker identity confidence score satisfies a threshold and to cease storing or forwarding of the audio data 120. For example, the second stage hotworder 145 determines that the utterance 110 likely includes the hotword “OK computer,” and the second stage hotworder 145 transmits a signal to the first stage hotworder 125 instructing the first stage hotworder 125 to cease storing the audio data 125 into memory 140."; Indicating that the speaker identity confidence score satisfies a threshold reads on identifying the presence of the hotword based on the hotword confidence score.).
Foerster is considered to be analogous to the claimed invention because it is in the same field of speech recognition. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Kim to incorporate the teachings of Foerster to compute a hotword confidence score indicating the likelihood that an utterance includes a hotword and indicate that the speaker identity confidence score satisfies a threshold. Doing so would allow for discerning when an utterance is directed at the system as opposed to being directed at an individual present in the environment (Foerster; Column 1, lines 44-67).
Regarding claim 7, Kim in view of Foerster discloses the computer-implemented method as claimed in claim 6.
Foerster further teaches:
wherein identifying the presence of the hotword based on the hotword confidence score comprises identifying the presence of the hotword based on determining that the hotword confidence score satisfies a hotword confidence score threshold (Column 6, lines 51-60, "In some implementations, the speaker identification module 150 transmits a signal to the second stage hotworder 145 or to the first stage hotworder 125 indicating that the speaker identity confidence score satisfies a threshold and to cease storing or forwarding of the audio data 120. For example, the second stage hotworder 145 determines that the utterance 110 likely includes the hotword “OK computer,” and the second stage hotworder 145 transmits a signal to the first stage hotworder 125 instructing the first stage hotworder 125 to cease storing the audio data 125 into memory 140."; Indicating that the speaker identity confidence score satisfies a threshold reads on identifying the presence of the hotword based on determining that the hotword confidence score satisfies a hotword confidence score threshold.).
Foerster is considered to be analogous to the claimed invention because it is in the same field of speech recognition. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Kim in view of Foerster to further incorporate the teachings of Foerster to indicate that the speaker identity confidence score satisfies a threshold. Doing so would allow for discerning when an utterance is directed at the system as opposed to being directed at an individual present in the environment (Foerster; Column 1, lines 44-67).
Regarding claim 8, Kim in view of Foerster discloses the computer-implemented method as claimed in claim 6.
Foerster further teaches:
wherein the hotword confidence score is computed without the hotword model performing speech recognition on the audio data (Column 4, lines 25-31, "The audio subsystem of the computing device 115 provides the processed audio data 120 to a first stage of the hotworder. The first stage hotworder 125 may be a “coarse” hotworder. The first stage hotworder 125 performs a classification process that may be informed or trained using known utterances of the hotword, and computes a likelihood that the utterance 110 includes a hotword."; Computing the likelihood that an utterance includes a hotword by performing a classification process trained using known utterances of the hotword reads on computing the hotword confidence score without the hotword model performing speech recognition on the audio data.).
Foerster is considered to be analogous to the claimed invention because it is in the same field of speech recognition. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Kim in view of Foerster to further incorporate the teachings of Foerster to compute the likelihood that an utterance includes a hotword by performing a classification process trained using known utterances of the hotword. Doing so would allow for discerning when an utterance is directed at the system as opposed to being directed at an individual present in the environment (Foerster; Column 1, lines 44-67).
Regarding claim 16, arguments analogous to claim 6 are applicable.
Regarding claim 17, arguments analogous to claim 7 are applicable.
Regarding claim 18, arguments analogous to claim 8 are applicable.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Rao et al. (US Patent No. 10,462,545)
Parthasarathi et al. (US Patent No. 10,373,612)
Piersol et al. (US Patent No. 10,192,546)
Sharifi et al. (US Patent No. 9,747,926)
Sharifi (US Patent No. 9,318,107)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to James Boggs whose telephone number is (571)272-2968. The examiner can normally be reached M-F 8:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JAMES BOGGS/Examiner, Art Unit 2657