Last updated: April 19, 2026
Application No. 18/338,432
COGNITIVE ASSISTANT VOICE AMELIORATION MODEL

Non-Final OA §103
Filed
Jun 21, 2023
Examiner
HUTCHESON, CODY DOUGLAS
Art Unit
2659
Tech Center
2600 — Communications
Assignee
International Business Machines Corporation
OA Round
3 (Non-Final)
Interview Optional

— +47.1% interview lift. This examiner has a relatively high allow rate; a written response may suffice.
Based on 24 resolved cases, 2023–2026
Examiner Intelligence

HUTCHESON, CODY DOUGLAS View full profile →
Grants 62% of resolved cases
Career Allow Rate
15 granted / 24 resolved
+0.5% vs TC avg
Strong +47% interview lift
Without
With
+47.1%
Interview Lift
resolved cases with interview
Typical timeline
2y 10m
Avg Prosecution
34 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101
33.9%
-6.1% vs TC avg
§103
40.9%
+0.9% vs TC avg
§102
14.8%
-25.2% vs TC avg
§112
7.5%
-32.5% vs TC avg
Black line = Tech Center average estimate • Based on career data from 24 resolved cases
Office Action

§103
DETAILED ACTION

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 01/20/2026 has been entered.
 
Response to Arguments
1. Regarding the objections to claims 2, 9, and 16, Applicant has amended each claim to address the minor informalities. Accordingly, the objections have been withdrawn.

	2. Regarding the rejection of:
Claims 1, 4-5, 8, 11-12, 15, and 18-19 under 35 U.S.C. § 103 as being unpatentable over Pichaimurthy in view of Olivieri, 
Claims 2-3, 9-10, and 16-17 under 35 U.S.C. § 103 as being unpatentable over Pichaimurthy in view of Olivieri and further in view of Ingel
Claims 6, 13, and 20 under 35 U.S.C. § 103 as being unpatentable over Pichaimurthy in view of Olivieri and further in view of Kimbley and Kim.

Applicant's arguments have been fully considered but they are not persuasive. 

Applicant argues that the cited prior art of record does not specifically teach “in response to determining that at least one of the one or more frequency metrics of the current waveform fail a threshold degree of match to at least one corresponding frequency metric of the baseline voice waveform, augmenting the at least one of the one or more frequency metrics of the current voice waveform…and notifying the user, in real-time, of the augmenting…during the online conference call” (see pgs. 11-13 of Remarks). The Examiner respectfully disagrees. Under the broadest reasonable interpretation, Pichaimurthy in view of Olivieri teaches these limitations. Pichaimurthy discloses a step of retrieving a voice profile associated with a user comprising a base frequency and voice characteristics (para. 0034-0036), and further discloses a step of determining that a captured second voice signal does not have a matching base frequency and voice characteristics (para. 0036). Determining if a captured second voice signal matches a base frequency and voice characteristics of a stored voice profile signal reads on the BRI of “in response to determining that at least one of the one or more frequency metrics of the current waveform fail a threshold degree of match to at least one corresponding frequency metric of the baseline voice waveform…”. While Pichaimurthy discloses the determination step, Oliveri teaches the augmentation and notifying steps. Under the broadest reasonable interpretation of the claim, the remaining limitation first requires that the current voice waveform frequency metrics be augmented (made greater/added to in some manner). Oliveri teaches a step of adding synthesized audio which has a close match of speech characteristics for audio frames that have been interrupted (see para. 0159, para. 0054, 0062; Fig. 4b). Adding synthesized audio frames to a current audio stream to augment the stream with speech resembling voice characteristics of the user reads on the BRI of “…augmenting the at least one of the one or more frequency metrics of the current voice waveform…”. Furthermore, Olivieri teaches selecting displaying an avatar concurrently with synthesized audio output during an online conference call such that the avatar is only present when synthesized/augmented audio is played and is absent when non-augmented speech is played (see para. 0003, 0071, 0078). Selectively displaying an avatar on the screen concurrently with augmented speech reads on the BRI of “notifying the user, in real-time, of the augmenting…during the online conference call.”.
The combined teachings of Pichaimurthy and Olivieri read on the above limitations. Hence, Applicants arguments are not persuasive.
Regarding the argument that Olivieri does not specifically disclose “…wherein the augmenting comprises aligning the at least one of the one or more frequency metrics of the current voice waveform to match the at least one corresponding frequency metric of the baseline voice waveform…” (see pg. 12, lines 5-9 of Remarks), Applicant’s arguments have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


3. Claims 1, 4-5, 8, 11-12, 15, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Pichaimurthy et al. (US PGPUB No. 2022/0383888, hereinafter Pichaimurthy) in view of Olivieri et al. (US PGPUB No. 2022/0246133, hereinafter Olivieri) and further in view of Chen et al. (US 2023/0343351 A1, hereinafter Chen).

Regarding claim 1, Pichaimurthy discloses identifying a current waveform of a user, via performance of spectral analysis of captured speech output of the user being transmitted during an online conference call (para. 0048 “At 804, control circuitry 406, using audio processing circuitry 408, identifies a base frequency of Voice.sub.current. For example, control circuitry 406 may analyze a frequency spectrum of Voice.sub.current to determine a primary harmonic frequency of the voice. At 806, control circuitry 406 determines a plurality of voice characteristics, such as pitch, intonation, accent, loudness, and rate. For example, control circuitry 406 may compare vowel sounds spoken in Voice.sub.current with a set of known accents to determine an accent with which the speech represented by Voice.sub.current was spoken. Audio amplitude may be analyzed to determine loudness. Patterns of changes in loudness and frequency may be used to determine an intonation.”), wherein the current voice waveform comprises one or more frequency metrics (para. 0035 “During a second conference call in which the user is participating, subsequent to the first conference call, audio input circuitry 400 receives 422 a second voice signal. Audio input circuitry 400 transmits 424 the second voice signal to control circuitry 406, where audio processing circuitry 408 receives and analyzes the second voice signal.”; para. 0035-0036 “Audio processing circuitry 408 requests 426 the voice profile of the user from memory 412 and receives 428, in response to the request, the voice profile of the user. … Once the voice profile of the user has been retrieved, audio processing circuitry 408 compares the base frequency and voice characteristics of the voice represented by the second voice signal to the voice profile of the user.”; para. 0034 “Audio processing circuitry 408 transmits 410 the base frequency and voice characteristics to memory 412 for storage in a voice profile associated with the user”); comparing the one or more frequency metrics of the current waveform to one or more corresponding frequency metrics of a baseline voice waveform of the user (para. 0035-0036 “Audio processing circuitry 408 requests 426 the voice profile of the user from memory 412 and receives 428, in response to the request, the voice profile of the user. … Once the voice profile of the user has been retrieved, audio processing circuitry 408 compares the base frequency and voice characteristics of the voice represented by the second voice signal to the voice profile of the user.”; para. 0034 “Audio processing circuitry 408 transmits 410 the base frequency and voice characteristics to memory 412 for storage in a voice profile associated with the user”); and in response to determining that at least one of the one or more frequency metrics of the current waveform fail a threshold degree of match to at least one corresponding frequency metric of the baseline voice waveform (para. 0036 “If the base frequency and voice characteristics of the second voice signal do not match the voice profile of the user, audio processing circuitry 408 prevents transmission of the second voice signal into the conference call.”).
Pichaimurthy performs the step of determining if a threshold degree of match is failed, but does not specifically disclose [in response to determining…] augmenting the at least one of the one or more frequency metrics of the current voice waveform via a generative artificial intelligence (AI) voice model trained to produce speech which mimics a voice and speaking style of the user…and notifying the user, in real-time, of the augmenting of the current voice waveform during the online conference call.
Olivieri teaches in response to determining that an interruption has occurred during an online meeting (para. 0159 “The method 800 further includes selectively generating an output based on the text stream in response to an interruption in the speech audio stream, at 806. For example, the interruption manager 164 of FIG. 1 selectively generates the synthesized speech audio stream 133 based on the text stream 121 in response to an interruption in the speech audio stream 111, as described with reference to FIG. 1.”), augmenting the at least one of the one or more frequency metrics of the current voice waveform via a generative artificial intelligence (AI) voice model trained to produce speech which mimics a voice and speaking style of the user… (para. 0054 “For example, the metadata stream 123 indicates intonations (e.g., emotion, pitch, tone, or a combination thereof) of the speech detected in the audio input 153.”; para. 0062 “The text-to-speech converter 166 uses the speech model 131 to generate a synthesized speech audio stream 133 by performing text-to-speech conversion on the text stream 121, the metadata stream 123, or both. …According to some aspects, the speech model 131 is updated so that subsequent text-to-speech conversion using the speech model 131 is more likely to generate synthesized speech that is a closer match of speech characteristics of the user 142.”; Fig. 7B, synthetic speech “133” used to augment media stream “109” for lost media frames (para. 0149-0155))) notifying the user, in real-time, of the augmenting of the current voice waveform during the online conference call (para. 0071 “In a particular aspect, the interruption manager 164 selectively displays the avatar 135 concurrently with providing the synthesized speech audio stream 133 as the audio output 143 to the speaker 154. For example, interruption manager 164 refrains from displaying the avatar 135 while providing the speech audio stream 111 as the audio output 143 to the speaker 154. As another example, the interruption manager 164 displays the avatar 135 while providing the synthesized speech audio stream 133 as the audio output 143 to the speaker 154. To illustrate, the GUI generator 168 updates the GUI 145 to display the avatar 135 instead of the video stream 113 while the synthesized speech audio stream 133 is output as the audio output 143 for playout by the speaker 154…”; para. 0078 “In a particular aspect, the interruption manager 164, in response to detecting that the interruption has ended, sends a first request to the GUI generator 168 to update the GUI 145 to indicate that the speech model 131 is not being used to output synthetic speech audio (e.g., the avatar 135 is not speaking). …For example, the GUI generator 168, in response to receiving the second request, updates the GUI 145 to refrain from displaying the avatar 135.”).
Pichaimurthy and Olivieri are considered to be analogous to the claimed invention as they both are in the same field of performing speech processing during online meetings. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Pichaimurthy to incorporate the teachings of Olivieri in order to augment captured speech output with AI generated voice mimicking the voice and style of the user and to notify the user in real-time of the augmenting of the current voice waveform during the online conference call. Doing so would be beneficial, as this would allow for a continuous audio output stream to be heard by the receiving user even if audio interrupts like frame loss occur, leading to a better user experience (para. 0034).
Pichaimurthy in view of Olivieri does not specifically disclose wherein the augmenting comprises aligning the at least one of the one or more frequency metrics of the current voice waveform to match the at least one corresponding frequency metric of the baseline waveform.
Chen teaches wherein the augmenting comprises aligning the at least one of the one or more frequency metrics of the current voice waveform to match the at least one corresponding frequency metric of the baseline waveform (para. 0030 “The gain module employs a machine learning gain model to adjust a strength of signals (or gains) across different frequencies using a non-linear function that handles different types of facial masks and produces a same consistent audio output resembling original voice signals (e.g., unaffected by a facial mask).”; para. 0047 “Initially, original voice signals 410 from a user that does not employ a facial mask (e.g., the original voice signals are unaffected by a facial mask, etc.) are captured by microphone or other sound sensing device 1524 to generate original voice or audio samples 412 corresponding to the original voice signals.”).
Pichaimurthy, Oliveri, and Chen are considered to be analogous to the claimed invention as they are all in the same field of speech processing for online meetings. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Pichaimurthy in view of Olivieri to incorporate the teachings of Chen in order to specifically augment by aligning the at least one of the one or more frequency metrics of the current voice waveform to match the at least one corresponding frequency metric of the baseline waveform. Doing so would help to remove noise in speech cause by any type of facial or other covering affecting voice signals (para. 0046).
	
Regarding claim 4, Pichaimurthy in view of Olivieri and Chen discloses wherein the speech output of the user is captured via a microphone of a computing device of the user during participation by the user in the online conference call (Pichaimurthy, para. 0033 “Alternatively, audio input circuitry 400 may include a microphone through which audio information is captured directly. Audio input circuitry 400 may convert the audio to a digital format such as WAV. Audio input circuitry 400 transmits 404 the first voice signal to control circuitry 406.”; para. 0024 “FIG. 1 shows (a) an exemplary scenario in which unwanted sounds are present during a conference call, and (b) data associated with each sound, in accordance with some embodiments of the disclosure. User P1 100 is participating in a conference call using media device 102.”; Fig. 1, 102).

Regarding claim 5, Pichaimurthy in view of Olivieri and Chen discloses evaluating a current network connection of the computing device to identify one or more current network metrics (Olivieri, Evaluation of current network connection is determined by observing network metric of duration of last received audio frame: para. 0069 “In a particular aspect, the interruption manager 164 detects the interruption in the speech audio stream 111 in response to determining that no audio frames of the speech audio stream 111 are received within a threshold duration of a last received audio frame of the speech audio stream 111. … In a particular aspect, the interruption manager 124 detects network issues in response to receiving the interruption notification from the device 104.) comparing the one or more current network metrics with one or more specified corresponding minimum threshold values (Olivieri, para. 0069 “In a particular aspect, the interruption manager 164 detects the interruption in the speech audio stream 111 in response to determining that no audio frames of the speech audio stream 111 are received within a threshold duration of a last received audio frame of the speech audio stream 111. For example, the last received audio frame of the speech audio stream 111 is received at a first receipt time at the device 104. The interruption manager 164 detects the interruption in response to determining that no audio frames of the speech audio stream 111 are received within a threshold duration of the first receipt time”); and in response to determining that at least one of the one or more current network metrics falls below at least one specified corresponding minimum threshold value (para. 0069 “In a particular aspect, the interruption manager 164 sends an interruption notification to the device 102. In a particular aspect, the interruption manager 164 sends an interruption notification to the device 102. In a particular aspect, the interruption manager 124 detects network issues in response to receiving the interruption notification from the device 104.”), augmenting the captured speech output via the generative AI voice model trained to produce speech which mimics a voice and a speaking style of the user (para. 0090 “The interruption manager 164, in response to detecting the interruption, provides the text stream 121, the metadata stream 123, the annotated text stream 137, or a combination thereof, to the text-to-speech converter 166. The text-to-speech converter 166 generates the synthesized speech audio stream 133 by using the speech model 131 to perform a text-to-speech conversion based on the text stream 121, the metadata stream 123, the annotated text stream 137, or a combination thereof, as described with reference to FIG. 1. The interruption manager 164, in response to the interruption, provides the synthesized speech audio stream 133 as the audio output 143 to the speaker 154, halts playback of the speech audio stream 111, halts playback of the video stream 113, displays the avatar 135, displays a particular representation of the avatar 135, displays the text stream 121, displays the annotated text stream 137, or a combination thereof, as described with reference to FIG. 1”).
Pichaimurthy, Olivieri, and Chen are considered to be analogous to the claimed invention as they are in the same field of performing speech processing during online meetings. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Olivieri in order to augment captured speech output with AI generated voice mimicking the voice and style of the user in response to determine a network metric falls below a threshold value. Doing so would be beneficial, as network metrics provide an indication of the current network connection, and augmenting speech when a network issue is detected allows for a continuous audio stream to be outputted to a receiving user even when audio frames are not being received, improving user experience (para. 0069, 0034).

Regarding claim 8, claim 8 is a system claim with limitations similar to those recited in method claim 1, and is thus rejected under similar rationale.
Additionally, Pichaimurthy in view of Olivieri and Chen discloses A computer system (Pichairmurthy, Fig. 4), the computer system comprising: one or more processors (Oliveri, Fig. 18, “1806”; para. 0171-0172 “In an illustrative implementation, the device 1800 may perform one or more operations described with reference to FIGS. 1-17. [0172] In a particular implementation, the device 1800 includes a processor 1806 (e.g., a central processing unit (CPU)). The device 1800 may include one or more additional processors 1810 (e.g., one or more DSPs).”), one or more computer-readable memories (Oliveri, Fig. 18, “1886”; para. 0173 “The device 1800 may include a memory 1886”), one or more computer-readable tangible storage medium (Olivieri, para. 0180 “In some implementations, a non-transitory computer-readable medium (e.g., a computer-readable storage device, such as the memory 1886)”), and program instructions stored on at least one of the one or more tangible storage medium (Oliveri, para. 0180 “In some implementations, a non-transitory computer-readable medium (e.g., a computer-readable storage device, such as the memory 1886) includes instructions (e.g., the instructions 1856)…”) for execution by at least one of the one or more processors via at least one of the one or more memories, wherein the computer system is capable of performing a method comprising (Oliveri, para. 0180 “…that, when executed by one or more processors (e.g., the one or more processors 1810 or the processor 1806), cause the one or more processors to…”; para. 0173 “The memory 1886 may include instructions 1856, that are executable by the one or more additional processors 1810 (or the processor 1806) to implement the functionality described with reference to the meeting manager 162, the interruption manager 164, the GUI generator 168, or a combination thereof. In a particular aspect, the memory 1886 stores program data 1858 used or generated by the meeting manager 162, the interruption manager 164, the GUI generator 168, or a combination thereof.”).

Regarding claim 11, claim 11 is rejected for analogous reasons to claim 4.
Regarding claim 12, claim 12 is rejected for analogous reasons to claim 5.

Regarding claim 15, claim 15 is a computer program product claim with limitations similar to those recited in method claim 1, and is thus is rejected under similar rationale. 
Additionally, Pichaimurthy in view of Oliveri and Chen discloses A computer program product, the computer program product comprising: one or more computer-readable tangible storage medium (Oliveri, Fig. 18, 1886; para. 0180 “In some implementations, a non-transitory computer-readable medium (e.g., a computer-readable storage device, such as the memory 1886)”) and program instructions stored on at least one of the one or more tangible storage medium, the program instructions executable by a processor capable of performing a method, the method comprising (Oliveri, para. 0180 “…includes instructions (e.g., the instructions 1856) that, when executed by one or more processors (e.g., the one or more processors 1810 or the processor 1806), cause the one or more processors to…”).

Regarding claim 18, claim 18 is rejected for analogous reasons to claim 4.
Regarding claim 19, claim 19 is rejected for analogous reasons to claim 5.

4. Claims 2-3, 9-10, and 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Pichaimurthy in view of Olivieri and Chen, and in further view of Ingel et al. (US PGPUB No. 2022/0070550, hereinafter Ingel).

Regarding claim 2, Pichaimurthy in view of Olivieri and Chen discloses training the generative AI voice model using a corpus of user-specific training material which comprises a plurality of audio recordings and corresponding text transcriptions of the user (Olivieri, para. 0062 “In a particular aspect, the interruption manager 164 trains (e.g., generates or updates) the speech model 131 based on speech detected in the speech audio stream 111 (e.g., prior to an interruption in the speech audio stream 111) during the online meeting. To illustrate, the text-to-speech converter 166 is configured to use the speech model 131 to perform text-to-speech conversion. In a particular aspect, the interruption manager 164 receives (e.g., during the first operating mode of the device 102) or generates (e.g., during the second operating mode of the device 102) the text stream 121, the metadata stream 123, or both, corresponding to the speech audio stream 111. The text-to-speech converter 166 uses the speech model 131 to generate a synthesized speech audio stream 133 by performing text-to-speech conversion on the text stream 121, the metadata stream 123, or both. The interruption manager 164 uses training techniques to update the speech model 131 based on a comparison of the speech audio stream 111 and the synthesized speech audio stream 133. In an illustrative example in which the speech model 131 includes an artificial neural network, the interruption manager 164 uses backpropagation to update weights and biases of the speech model 131. According to some aspects, the speech model 131 is updated so that subsequent text-to-speech conversion using the speech model 131 is more likely to generate synthesized speech that is a closer match of speech characteristics of the user 142.”) and wherein the baseline voice waveform comprises a plurality of baseline frequency metrics (para. 0034 “Audio processing circuitry 408 analyzes the first voice signal to identify a base frequency of the voice represented by the first voice signal, as well as other voice characteristics such as pitch, intonation, accent, loudness, and rate. Audio processing circuitry 408 transmits 410 the base frequency and voice characteristics to memory 412 for storage in a voice profile associated with the user.”).
Pichaimurthy, Olivieri, and Chen are considered to be analogous to the claimed invention as they are in the same field of performing speech processing during online meetings. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Olivieri in order to train the generative AI model using a user-specific training corpus. Doing so would be beneficial, as this would ensure that synthesized speech closely resembles the user’s own speech, leading to a more consistent sounding output stream when switching between actual and augmented speech, leading to a better user experience.
Pichaimurthy in view of Olivieri and Chen does not specifically disclose identifying the baseline voice waveform of the user based, via performance of spectral analysis of the plurality of audio recordings of the user-specific training material.
Ingel teaches identifying the baseline voice waveform of the user based, via performance of spectral analysis of the plurality of audio recordings of the user-specific training material (para. 0139 “The term “voice profile” also known as “audioprint,” “acoustic fingerprint,” and “voice signature,” refers to a condensed digital summary of the specific acoustic features of a sound-emanating object (e.g., individuals and also inanimate objects) deterministically generated from a reference audio signal. … A common technique for determining a voice profile from a reference media stream is using a time-frequency graph called a spectrogram. Specifically, voice profile determination module 406 may determine the voice profile for each one or more individuals speaking in the received media stream by extracting spectral features, also referred to as spectral attributes, spectral envelope, or spectrogram from an audio sample of a single individual.”; para. 0165 “For example, step 442 may analyze the source audio data or the component of the source audio data to determine characteristics of a voice of a speaker producing speech in the source audio data or the component of the source audio data, and the voice profile may be based on the determined characteristics of the voice. In another example, step 442 may analyze the historic audio recordings or components of historic audio recordings to determine characteristics of a voice of a speaker producing speech in the historic audio data or the component of the historic audio data, and the voice profile may be based on the determined characteristics of the voice.”).
Pichaimurthy, Olivieri, Chen, and Ingel are considered to be analogous to the claimed invention as Pichaimurthy, Olivieri, and Chen are in the same field of speech processing during online meetings, and Ingel is in the same field of determining baseline waveforms of speakers. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Ingel in order to specifically determine the baseline voice waveform by performing spectrogram analysis. Doing so would be beneficial, as spectrograms provide spectral information that can be used for mimicking a user’s voice, improving the baseline voice waveform.

Regarding claim 3, Pichaimurthy in view of Olivieri and Chen, and further in view of Ingel discloses wherein the baseline voice waveform of the user represents averages of individual frequency metrics within the audio recordings of the user-specific training material (Ingel, para. 0140 “Consistent with embodiments of the present disclosure, the output voice profile may be a vector of numbers. …In a first example embodiment, the computer-based model may detect and output various statistical characteristics of the captured audio such as average loudness or average pitch of the audio…Such parameters may be used to form an output voice profile comprising a set of numbers forming a vector.”; para. 0165 “For example, step 442 may analyze the source audio data or the component of the source audio data to determine characteristics of a voice of a speaker producing speech in the source audio data or the component of the source audio data, and the voice profile may be based on the determined characteristics of the voice. In another example, step 442 may analyze the historic audio recordings or components of historic audio recordings to determine characteristics of a voice of a speaker producing speech in the historic audio data or the component of the historic audio data, and the voice profile may be based on the determined characteristics of the voice.”).
Pichaimurthy, Olivieri, Chen, and Ingel are considered to be analogous to the claimed invention as Pichaimurthy, Olivieri, and Chen are in the same field of speech processing during online meetings, and Ingel is in the same field of determining baseline waveforms of speakers. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Ingel in order to have the baseline voice waveform represent averages of individual frequency metrics. Doing so would be beneficial, as averaging would make the resulting baseline voice waveform less susceptible to outlier speech fragments of the user, ensuring that the resulting frequency metrics reflect the central tendency of the user’s voice, resulting in more accurate baseline voice waveforms.

Regarding claim 9, claim 9 is rejected for analogous reasons to claim 2.
Regarding claim 10, claim 10 is rejected for analogous reasons to claim 3.
Regarding claim 16, claim 16 is rejected for analogous reasons to claim 2.
Regarding claim 17, claim 17 is rejected for analogous reasons to claim 3.

5. Claims 6, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Pichaimurthy in view of Olivieri and Chen, and further in view of Kimbley (NPL Article (Kimbley IT), “Use This Tool to Get the Perfect Video and Sound in a Google Meet Video Call”) and Kim et al. (NPL, “’I Can’t Talk Now’: Speaking with Voice Output Communication Aid Using Text-to-Speech Synthesis During Multiparty Video Conference”, hereinafter Kim).

Regarding claim 6, Pichaimurthy in view of Olivieri and Chen discloses evaluating a network connection … of the computing device (Olivieri, Evaluation of current network connection is determined by observing network metric of duration of last received audio frame: para. 0069 “In a particular aspect, the interruption manager 164 detects the interruption in the speech audio stream 111 in response to determining that no audio frames of the speech audio stream 111 are received within a threshold duration of a last received audio frame of the speech audio stream 111. … In a particular aspect, the interruption manager 124 detects network issues in response to receiving the interruption notification from the device 104.”); and in response to … metrics of the network connection being below minimum threshold values (Olivieri, para. 0069 “In a particular aspect, the interruption manager 164 detects the interruption in the speech audio stream 111 in response to determining that no audio frames of the speech audio stream 111 are received within a threshold duration of a last received audio frame of the speech audio stream 111. … In a particular aspect, the interruption manager 124 detects network issues in response to receiving the interruption notification from the device 104.; para. 0075 “In another a particular aspect, the interruption manager 164 detects that the interruption has ended and sends an interruption ended notification to the device 102. For example, the interruption manager 164 detects that the interruption has ended in response to determining that an available communication bandwidth of a communication link with the device 102 is greater than a threshold.”), substituting speech output of the user during the online conference call via the generative AI voice model (Olivieri, para. 0090 “The interruption manager 164, in response to detecting the interruption, provides the text stream 121, the metadata stream 123, the annotated text stream 137, or a combination thereof, to the text-to-speech converter 166. The text-to-speech converter 166 generates the synthesized speech audio stream 133 by using the speech model 131 to perform a text-to-speech conversion based on the text stream 121, the metadata stream 123, the annotated text stream 137, or a combination thereof, as described with reference to FIG. 1. The interruption manager 164, in response to the interruption, provides the synthesized speech audio stream 133 as the audio output 143 to the speaker 154, halts playback of the speech audio stream 111, halts playback of the video stream 113, displays the avatar 135, displays a particular representation of the avatar 135, displays the text stream 121, displays the annotated text stream 137, or a combination thereof, as described with reference to FIG. 1”; Fig 4B), wherein the substituting comprises: … providing the … text input to the generative AI voice model for speech output (Olivieri, para. 0090 “The interruption manager 164, in response to detecting the interruption, provides the text stream 121, the metadata stream 123, the annotated text stream 137, or a combination thereof, to the text-to-speech converter 166.”), in a voice and speaking style of the user (Olivieri, para. 0070 “As another example, the synthesized speech audio stream 133 based on the annotated text stream 137 (e.g., the text stream 121 and the metadata stream 123) corresponds to speech indicated by the text stream 121 having speech characteristics of the user 142 as represented by the speech model 131 with intonations indicated by the metadata stream 123”), via text-to-speech synthesis (Olivieri, para. 0090 “The text-to-speech converter 166 generates the synthesized speech audio stream 133 by using the speech model 131 to perform a text-to-speech conversion based on the text stream 121”).
Pichaimurthy, Olivieri, and Chen are considered to be analogous to the claimed invention as they are in the same field of speech processing in online meetings. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Olivieri in order to substitute speech output with synthesized speech in response to network metrics not meeting minimum threshold values. Doing so would be beneficial, as network issues can lead to missing portions of speech, and by substituting the missing portions of speech, the resulting output will be more continuous and less disruptive of the flow of conversation, leading to improved user experience (para. 0034).
Pichaimurthy in view of Olivieri and Chen does not specifically disclose prior to participation by the user in the online conference call, evaluating [a network connection] and microphone performance of the computer device; and in response to a detected microphone failure [or in response metrics of the network connection being below minimum threshold values…].
Kimbley teaches prior to participation by the user in the online conference call, evaluating [a network connection] (pg. 1, “For Google Meet users, Google has added testing functionally to the “green room” (the area you go to before hitting “join meeting”…”; third bullet point: “Google Meet will now scan your settings and should return with four green ticks. Showing…that your internet connection is sufficient”) and microphone performance of the computer device (pg. 1, “There are not tools for you to use before joining a video call to make sure your mic is working…”; third bullet point: “Google Meet will now scan your settings and should return with four green ticks. Showing you have the right Mic…”); and in response to a detected microphone failure [or in response metrics of the network connection being below minimum threshold values…] (third bullet point: “Google Meet will now scan your settings and should return with four green ticks…”; Microphone and Internet connection are tested, with green ticks indicating success, and absence of a green tick indicates failure).
Pichaimurthy, Olivieri, Chen, and Kimbley are considered to be analogous to the claimed invention as they are all in the same field of speech processing in online meetings. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Kimbley in order to evaluate a network connection and a microphone performance prior to a user participating in an online meeting, and to determine if either the network or the microphone are not satisfactory. Doing so would be beneficial, as weak network connections can lead to loss in audio frames (Olivieri, para. 0003) and failed microphones can lead to an inability to capture speech output, and therefore by checking both conditions prior to a user joining a meeting, disruptive speech output can be minimized, leading to a better user experience.
Pichaimurthy in view of Olivieri and Chen and further in view of Kimbley does not specifically disclose receiving text input from the user while participating in the online conference call; [providing] received [text input].
 Kim teaches receiving text input from the user while participating in the online conference call (Fig. 2(a) “Input Page”; Fig. 2a caption: “(a) On the Input Page, PX can send the content to be spoken and use the Speak Immediately mode”; section 3.1 “With this mode activated, each time the user presses the space bar in the input field, the entered text up to that point…”); [providing] received [text input] (section 3.1 “…the entered text up to that point are queued for speech synthesis. In other words, every typed word in queued immediately for speech synthesis.”).
Pichaimurthy, Olivieri, Chen, Kimbley, and Kim are considered to be analogous to the claimed invention as they are all in the same field of speech processing in online meetings. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Kim in order to synthesize speech using text input received from a user. Doing so would be beneficial, as doing so would enable participation by users in the online conference without functioning microphones, and participation by users who would otherwise by uncomfortable with speaking (Kim, Fig. 1), which improves user experience.

Regarding claim 13, claim 13 is rejected for analogous reasons to claim 6.
Regarding claim 20, claim 20 is rejected for analogous reasons to claim 6.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Lynch et al. (US 2022/0343934 A1): adjusting amplitudes of audio frequencies to compensate for a face covering relative to what the amplitudes would be had user not been wearing a mask (Fig. 2 and para. 0023)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CODY DOUGLAS HUTCHESON whose telephone number is (703)756-1601. The examiner can normally be reached M-F 8:00AM-5:00PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached at (571)-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/CODY DOUGLAS HUTCHESON/Examiner, Art Unit 2659  

/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659
Read full office action
Prosecution Timeline

Jun 21, 2023
Application Filed
May 28, 2025
Non-Final Rejection — §103
Aug 18, 2025
Interview Requested
Sep 02, 2025
Examiner Interview Summary
Sep 02, 2025
Response Filed
Sep 02, 2025
Applicant Interview (Telephonic)
Oct 09, 2025
Final Rejection — §103
Dec 17, 2025
Response after Non-Final Action
Jan 20, 2026
Request for Continued Examination
Jan 27, 2026
Response after Non-Final Action
Feb 25, 2026
Non-Final Rejection — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/330,472
Patent 12603096
VOICE ENHANCEMENT METHODS AND SYSTEMS
2y 5m to grant Granted Apr 14, 2026
18/545,677
Patent 12591750
GENERATIVE LANGUAGE MODEL UNLEARNING
2y 5m to grant Granted Mar 31, 2026
18/163,230
Patent 12579447
TECHNIQUES FOR TWO-STAGE ENTITY-AWARE DATA AUGMENTATION
2y 5m to grant Granted Mar 17, 2026
18/217,880
Patent 12537018
METHOD AND SYSTEM FOR PREDICTING A MENTAL CONDITION OF A SPEAKER
2y 5m to grant Granted Jan 27, 2026
17/877,543
Patent 12530529
DOMAIN-SPECIFIC NAMED ENTITY RECOGNITION VIA GRAPH NEURAL NETWORKS
2y 5m to grant Granted Jan 20, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.
Prosecution Projections

3-4
Expected OA Rounds
62%
Grant Probability
99%
With Interview (+47.1%)
2y 10m
Median Time to Grant
High
PTA Risk
Based on 24 resolved cases by this examiner. Grant probability derived from career allow rate.
COGNITIVE ASSISTANT VOICE AMELIORATION MODEL

Interview Optional

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

Precedent Cases

Applications granted by this same examiner with similar technology

AI Strategy Recommendation

Prosecution Projections

Ready to respond to this office action?

Sign in with your work email