Last updated: May 29, 2026
Application No. 18/363,101
VISUAL REPRESENTATION OF TEXT ON VOICE MODULATION GRAPH

Non-Final OA §103
Filed: Aug 01, 2023
Examiner: HUTCHESON, CODY DOUGLAS
Art Unit: 2659
Tech Center: 2600 (Communications)
Assignee: Cisco Technology Inc.
OA Round: 3 (Non-Final)
Interview Optional

This examiner has a 63% grant rate with a +51.7% interview lift. An interview has already been conducted in this application's prosecution history, so the recommendation is a written response with narrowed claims based on precedent claim evolution patterns.
Based on 27 resolved cases, 2023–2026
Examiner Intelligence

HUTCHESON, CODY DOUGLAS
Grants 63% of resolved cases
Career Allowance Rate
17 granted / 27 resolved
+1.0% vs TC avg
Strong +52% interview lift
Interview Lift: +51.7% (resolved cases with interview compared with those without)
Typical timeline
2y 8m
Avg Prosecution
21 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101: 11.4% (-28.6% vs TC avg)
§103: 82.9% (+42.9% vs TC avg)
§102: 3.8% (-36.2% vs TC avg)
§112: 1.9% (-38.1% vs TC avg)
Based on career data from 27 resolved cases
Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 01/28/2026 has been entered.

Response to Arguments
1. Regarding the rejection under 35 U.S.C. § 103, Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Objections
2. Claim 7 is objected to because of the following informalities:
In claim 7, “further comprising: Training the machine learning model”, the capitalization should be removed so that the limitation reads “further comprising: training the machine learning model”.
  Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


3. Claims 1, 3-5, 7-8, 10-12, 14-15, and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Myron & Mitchell (US 2023/0409804 A1, hereinafter Myron) in view of Rubin & Moorer (US 2017/0169840 A1, hereinafter Rubin) and further in view of Kamran et al. (WO 2023/041763 A1, hereinafter Kamran).

 	Regarding claim 1, Myron discloses A computer-implemented method comprising: obtaining audio that includes words spoken by a user (para. 0062 “The process 500 begins at block 510, where audio data is received for a spoken utterance. The audio data can be received in various formats, and the audio data can be recorded or received in real time…For example, the audio data can comprise speech of a mobile device user who is generating a message or post using a speech-to-text feature.”); generating a modulation graph representative of the audio (Fig. 3, 330 and 340, lines below text; para. 0050 “Encoded text representation 330 includes a visual representation of a speaker's volume measured and/or detected by the disclosed system in a spoken utterance. In the depicted example, the visual representation is a geometric element comprising a graph of a line that rises when the speaker's volume increases and falls when the speaker's volume decreases.”; para. 0051 “Encoded text representation 340 includes a visual representation of a speaker's pitch or inflection measured and/or detected by the disclosed system in a spoken utterance. In the depicted example, the visual representation is a geometric element comprising a rectangle. Each rectangle corresponds to a syllable in the spoken utterance. Thus, the width of each rectangle corresponds to a syllable in the text transcription of the spoken utterance. The height of each rectangle is based on the pitch or inflection of the corresponding syllable, with a taller rectangle indicating a higher pitch/inflection and a shorter rectangle indicating a lower pitch/inflection.”); obtaining text of the words spoken by the user from the audio (para. 0064 “At block 530, the system generates an encoded text representation of the spoken utterance. 
The encoded text representation comprises a text transcription of the spoken utterance”); displaying the text of the words within the modulation graph of the audio to generate a text-enhanced modulation graph in which the words being spoken are displayed at a location within the modulation graph the corresponds to the audio of the words being spoken (see Fig. 3, 330 and 340; para. 0051 “Encoded text representation 340 includes a visual representation of a speaker's pitch or inflection measured and/or detected by the disclosed system in a spoken utterance. In the depicted example, the visual representation is a geometric element comprising a rectangle. Each rectangle corresponds to a syllable in the spoken utterance. Thus, the width of each rectangle corresponds to a syllable in the text transcription of the spoken utterance. The height of each rectangle is based on the pitch or inflection of the corresponding syllable, with a taller rectangle indicating a higher pitch/inflection and a shorter rectangle indicating a lower pitch/inflection.”; para. 0065 “At block 540, the system causes display of the generated encoded text representation. The representation can be displayed, for example, in a GUI on a display screen of the mobile device of the user who provided the audio data at block 510.”); receiving an input from the user to perform one or more actions with respect to the text-enhanced modulation graph (para. 0065 “In some implementations, the GUI is integrated into an application, such as a messaging application, a social media application, or a browser. The user can review the generated encoded text representation, such as to determine whether the user's intended meaning is consistent with the encoded text representation. 
The user can edit the encoded text representation, such as by correcting errors in a transcription, modifying the visual representation, or repeating the process 500 with a new spoken utterance.“); performing the one or more actions based on the input (para. 0065 “The user can edit the encoded text representation, such as by correcting errors in a transcription, modifying the visual representation, or repeating the process 500 with a new spoken utterance. When the user is satisfied with the generated encoded text representation, the user can indicate approval (e.g., by selecting a button or icon) and/or use the generated encoded text representation.”); training a machine learning model using the text-enhanced modulation graph…to visually identify unwanted noises in the text-enhanced modulation graph (para. 0057 “At block 430, the system generates a training dataset using the received audio data and the received encoded text representation.”; para. 0058 “At block 440, the system trains a machine learning model using the generated training dataset to generate encoded text representations based on received audio data.”; Fig. 3, resulting output of model is an encoded text representation (transcription + visual representation: para. 0035 “An encoded text representation comprises a text transcription of a spoken utterance and a visual representation of a nonverbal characteristic of the spoken utterance.”) which visually identifies unwanted noises (undesired audio characteristics, such as sarcastic tone: para. 0044 “In some implementations, the message composition module 240 can generate suggestions and/or prompts to facilitate composition of messages. 
For example, the message composition module 240 can detect that an encoded text representation indicates anger or sarcasm in a spoken utterance, and the message composition module 240 can suggest changes to the spoken utterance that would cause the spoken utterance to be perceived as less angry or sarcastic (e.g., “lower your volume,” “increase your pitch,” “try changing your inflection”).”)) and training the machine learning model trained to visually identify the unwanted noises in the text-enhanced modulation graph (see above mapping; para. 0064 “At block 530, the system generates an encoded text representation of the spoken utterance. The encoded text representation comprises a text transcription of the spoken utterance included in the audio data and a visual representation of the nonverbal characteristic identified at block 520. The visual representation can include a variation in a text attribute (e.g., font, font size, font color, italics, subscript, superscript, all caps, bold, underline) and/or a geometric element, as described herein. To generate the encoded text representation, the disclosed system uses a trained machine learning model.”).
	Myron does not specifically disclose the [training a machine learning model using…] the input from the user [to visually identify unwanted noises…].
	Rubin teaches [training a machine learning model using…] the input from the user (para. 0040 “The hybrid waveform system can use a variety of machine learning methods. For example, the hybrid waveform system can employ machine learning to identify waveform patterns. Using the identified patterns, the hybrid waveform system can better match audio to text, descriptive text, symbols, graphics, etc.”; para. 0041 “As still another example, the hybrid waveform system can learn words and names when a user replaces waveforms in the hybrid waveform with the words and text. As such, each time a user replaces a waveform in the hybrid waveforms with text, the hybrid waveform system can create an association between the waveform or corresponding audio and the provided text. Further, the hybrid waveform system can combine learning from multiple user edits over time as well as combine edits from multiple users. Using the aggregate learning, the hybrid waveform system can better recognize patterns and improve accuracy when generating hybrid waveforms.”).
Myron and Rubin are considered to be analogous to the claimed invention as they are both in the same field of generating visual representations of user speech. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Myron to incorporate the teachings of Rubin in order to train a machine learning model using the input from the user for visually identifying noise. Doing so would be beneficial, as this would help to improve accuracy when generating the text-enhanced modulation graphs (Rubin, para. 0041).
	Myron in view of Rubin discloses training the machine learning model trained to visually identify the unwanted noises in the text-enhanced modulation graph (see above mapping). However, Myron in view of Rubin does not specifically disclose removing, by the machine learning model, unwanted noises in subsequent audio…
Kamran teaches removing, by the machine learning model, unwanted noises in subsequent audio based on training the machine learning model…( pg. 7 Lines 8-10 “The machine learning algorithm may be pre-trained and/or be trained “online”, i.e. while the user uses the audio signal circuitry, such that the machine learning algorithm may be able to learn the preferences of the user, i.e. what the user perceives as disturbing and what not…”; pg. 12 Lines 11-19 “According to the present disclosure, a personalized machine learning algorithm or DNN may be trained. Hence, in some embodiments, a user input is acquired, for example based on a user interface…for learning e.g. to cancel sounds from last minute’s audio (or any other time period)…”; pg. 8 Lines 10-14 “When the audio signal circuitry obtains an audio signal, due to its pattern, e.g. during a first few frames, the disturbing sound, i.e. the predicted audio signal property, may be identified, such that counter-measures can be taken by the audio signal circuitry for cancelling out the noise…”).
Myron, Rubin, and Kamran are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Myron in view of Rubin to incorporate the teachings of Kamran in order to remove, by the machine learning model, unwanted noises in subsequent audio based on the training of the machine learning model. Doing so would be beneficial, as this would allow the user to perceive only the audio they want to hear instead of noise (Kamran, pg. 6 Lines 29-33 and pg. 7 Lines 1-2).

Regarding claim 3, Myron in view of Rubin and Kamran discloses wherein displaying includes displaying the text of the words within the modulation graph on a user interface that includes options for selecting the one or more actions to perform (Myron, para. 0065 “At block 540, the system causes display of the generated encoded text representation. The representation can be displayed, for example, in a GUI on a display screen of the mobile device of the user who provided the audio data at block 510.”; para. 0043 “The user can then view the displayed draft message in a GUI to determine whether the encoded text representation correctly captures nonverbal information in the spoken utterance (e.g., the user's intended meaning and/or sentiment). The user can then edit the draft message as appropriate and/or indicate approval of the message (e.g., by selecting a button or icon) and send the message to a mobile device of a recipient of the message.”).

	Regarding claim 4, Myron in view of Rubin and Kamran discloses wherein the one or more actions include marking noise in the audio (Rubin, para. 0036-0037 “Accordingly, as shown, the hybrid waveform 110 provides a user with recognizable words in an audio segment along with waveforms from non-speech-recognizable audio (e.g., music, sound effects, background noise, unlearned words, etc.)…In some embodiments, a user can edit portions of the hybrid waveform 110…As shown, a user may label and/or replace waveforms with text or descriptive text. Additionally, a user may label and/or replace a waveform in the hybrid waveform 110 with a tag, a symbol, a graphic, an emoji, etc. For example, the user may use an emoji in place of the descriptive text to indicate the introduction music.”), editing the audio (Rubin, para. 0052 “Using the edit menu option 412, the user can apply formatting changes to the hybrid waveform 410. For example, the user can select the word “welcome” and choose the option within the dropdown menu of the edit menu option 412 to lengthen the duration of the audio (i.e., “Change Length” 414). In response, the hybrid waveform system can apply the associated formatting style to the word “welcome” (e.g., the hybrid waveform system can italicize the word “welcome” in the hybrid waveform 410 based on the italics formatting style being associated with audio length modifications). Additionally, and/or alternatively, the user can directly apply a formatting style to selected text in the hybrid waveform 410, such as by using shortcuts or by applying other format styling options. For example, the user can directly apply formatting changes to the text and the hybrid waveform system can make the corresponding acoustic changes to the audio segment.”), and deleting a portion of the audio (Rubin, para. 0050 “For instance, upon the user selecting the selected text 408, the user may delete the selection. 
In one or more embodiments, in response to the user deleting the selected text 408, the hybrid waveform system removes the words “Uhhh” from the hybrid waveform 410. The hybrid waveform system also removes the corresponding audio portion from the audio segment.”).
Myron, Rubin, and Kamran are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Rubin in order to have the one or more actions include marking noise in the audio, editing the audio, and deleting a portion of the audio. Doing so would be beneficial, as this would allow for the user to make edits to the audio content through direct interactions with the text itself, which is easier for users than waveform-based editing (Rubin, para. 0011).
	
	Regarding claim 5, Myron in view of Rubin and Kamran discloses training a machine learning model based on the input to remove the identified sounds in subsequently obtained audio in real-time (Kamran, pg. 7 Lines 8-10 “The machine learning algorithm may be pre-trained and/or be trained “online”, i.e. while the user uses the audio signal circuitry, such that the machine learning algorithm may be able to learn the preferences of the user, i.e. what the user perceives as disturbing and what not…”; pg. 12 Lines 11-19 “According to the present disclosure, a personalized machine learning algorithm or DNN may be trained. Hence, in some embodiments, a user input is acquired, for example based on a user interface…for learning e.g. to cancel sounds from last minute’s audio (or any other time period)…”; pg. 8 Lines 10-14 “When the audio signal circuitry obtains an audio signal, due to its pattern, e.g. during a first few frames, the disturbing sound, i.e. the predicted audio signal property, may be identified, such that counter-measures can be taken by the audio signal circuitry for cancelling out the noise…”).
Myron, Rubin, and Kamran are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Kamran in order to train a machine learning model based on the input to remove the identified sounds in subsequently obtained audio in real-time. Doing so would be beneficial, as this would allow for the user to only perceive audio which they want to hear instead of noise (Kamran, pg. 6 Lines 29-33 and pg. 7 Lines 1-2).

	Regarding claim 6, Myron in view of Rubin and Kamran discloses wherein the modulation graph displays a stress or a pitch of words in a plurality of languages (Myron, para. 0039 “”; para. 0051 “Encoded text representation 340 includes a visual representation of a speaker's pitch or inflection measured and/or detected by the disclosed system in a spoken utterance. In the depicted example, the visual representation is a geometric element comprising a rectangle. Each rectangle corresponds to a syllable in the spoken utterance. Thus, the width of each rectangle corresponds to a syllable in the text transcription of the spoken utterance. The height of each rectangle is based on the pitch or inflection of the corresponding syllable, with a taller rectangle indicating a higher pitch/inflection and a shorter rectangle indicating a lower pitch/inflection.”).

	Regarding claim 7, Myron in view of Rubin and Kamran discloses training a machine learning model using the modulation graph to learn different dialects, sounds, and patterns (Myron, para. 0057 “At block 430, the system generates a training dataset using the received audio data and the received encoded text representation.”; para. 0058 “At block 440, the system trains a machine learning model using the generated training dataset to generate encoded text representations based on received audio data.”; para. 0060 “In some implementations, a model can be retrained to maintain accuracy over time, such as to account for changes in language or usage (e.g., new words, slang), culture, habits or patterns, accents, and so forth.”).

	Regarding claim 8, claim 8 is a device claim with limitations similar to those recited in method claim 1, and is thus rejected under similar rationale.
	Additionally, Myron discloses A device (Fig. 6, para. 0068) comprising: a memory (Fig. 6, 606, 610), and one or more processors coupled to the memory (Fig. 6, 602, 616, coupled via 616) and configured to (para. 0073).
	
	Regarding claim 10, claim 10 is rejected for analogous reasons to claim 3.
	Regarding claim 11, claim 11 is rejected for analogous reasons to claim 4.
	Regarding claim 12, claim 12 is rejected for analogous reasons to claim 5.
	Regarding claim 13, claim 13 is rejected for analogous reasons to claim 6.
	Regarding claim 14, claim 14 is rejected for analogous reasons to claim 7.

	Regarding claim 15, claim 15 is a non-transitory computer readable storage medium claim with limitations similar to those recited in method claim 1, and is thus rejected under similar rationale.
	Additionally, Myron discloses One or more non-transitory computer readable storage media (para. 0071 “The machine-readable medium 626 can be non-transitory or comprise a non-transitory device. In this context, a non-transitory storage medium can include a device that is tangible, meaning that the device has a concrete physical form, although the device can change its physical state. Thus, for example, non-transitory refers to a device remaining tangible despite this change in state.”) encoded with instructions that, when executed by one or more processors, cause the one or more processors to (para. 0071 “The machine-readable (storage) medium 626 can include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the computing system 600.”; para. 0073 “The computer programs typically comprise one or more instructions (e.g., instructions 604, 608, 628) set at various times in various memory and storage devices in computing device(s). When read and executed by the processor 602, the instruction(s) cause the computing system 600 to perform operations to execute elements involving the various aspects of the disclosure.”).
	
	Regarding claim 17, claim 17 is rejected for analogous reasons to claim 3.
	Regarding claim 18, claim 18 is rejected for analogous reasons to claim 4.
	Regarding claim 19, claim 19 is rejected for analogous reasons to claim 5.
	Regarding claim 20, claim 20 is rejected for analogous reasons to claim 6.
 
4. Claims 2, 6, 9, 13, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Myron in view of Rubin and Kamran, and further in view of Schlippe et al. (US 2018/0068662 A1, hereinafter Schlippe).

Regarding claim 2, Myron in view of Rubin and Kamran discloses a height or width…displayed within the text-enhanced modulation graph [which] indicates a stress or pitch with which the words or syllables within the words are spoken (see Fig. 3, 340; para. 0051 “Encoded text representation 340 includes a visual representation of a speaker's pitch or inflection measured and/or detected by the disclosed system in a spoken utterance. In the depicted example, the visual representation is a geometric element comprising a rectangle. Each rectangle corresponds to a syllable in the spoken utterance. Thus, the width of each rectangle corresponds to a syllable in the text transcription of the spoken utterance. The height of each rectangle is based on the pitch or inflection of the corresponding syllable, with a taller rectangle indicating a higher pitch/inflection and a shorter rectangle indicating a lower pitch/inflection.”).
Myron in view of Rubin and Kamran does not specifically disclose that this height or width be of letters of the text of the words displayed within the text-enhanced modulation graph.
Schlippe teaches a height or width of letters of the text of the words displayed within the text-enhanced modulation graph indicates a stress or pitch with which the words or syllables within the words are spoken (Schlippe, para. 0065 “Layer one 360 illustrates acoustic (speed) feature values as vectors 301, 302, 303, and 304 which include, in this example, acoustic feature values for loudness and pitch, and also includes a parameter value for a speed parameter. As discloses herein, referring to FIG. 1, the loudness and pitch feature values can be extracted from the audio speech signal 201 by a feature extractor 120.”; para. 0067 “An acoustic pitch parameter can be transformed into a visual shaping parameter with a corresponding horizontal stroke weight.”; para. 0075 “For example, referring to the pitch dimension as characterized by the Low-High dimension line 522, when words are spoken with a low pitch, a horizontal stroke weight can be reduced (e.g., character shape 502) as compared to the reference shape 500. In cases when the words are spoken with a high pitch, a horizontal stroke weight can be increased (e.g., character shape 504) as compared to the reference shape 500.”).
Myron, Rubin, Kamran, and Schlippe are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Schlippe in order to specifically have the height or width of the letters of the text indicate the stress. Doing so would be beneficial, as this would provide information related to how the user is speaking which can be utilized for hearing impaired users or language learners to learn how their speech is being perceived (Schlippe, para. 0008).

Regarding claim 6, Myron in view of Rubin and Kamran discloses wherein the modulation graph displays a stress or a pitch of words… (Myron, para. 0039 “”; para. 0051 “Encoded text representation 340 includes a visual representation of a speaker's pitch or inflection measured and/or detected by the disclosed system in a spoken utterance. In the depicted example, the visual representation is a geometric element comprising a rectangle. Each rectangle corresponds to a syllable in the spoken utterance. Thus, the width of each rectangle corresponds to a syllable in the text transcription of the spoken utterance. The height of each rectangle is based on the pitch or inflection of the corresponding syllable, with a taller rectangle indicating a higher pitch/inflection and a shorter rectangle indicating a lower pitch/inflection.”).
Myron in view of Rubin and Kamran does not specifically disclose to display a stress or pitch of words in a plurality of languages.
Schlippe teaches to display a stress or pitch of words in a plurality of languages (Figs. 7A, 7B and 8 (para. 0075, 0085, 0092) show pitch information in Spanish, English, and German, respectively; para. 0027 “A grapheme functions as a representation of a written character of a particular language (e.g., a symbol). It is the smallest unit of the writing system of a language. Each individual grapheme may or may not carry meaning by itself. Each individual grapheme may or may not correspond to a single phoneme of a spoken language. Example graphemes can include, but are not limited to, alphabetic letters, typographic ligatures, Chinese characters, Hangul characters, Arabic characters, Cyril characters, numerical digits, punctuation marks, and other individual symbols. Stated in another way, a grapheme can be a symbol that functions as a character in a specific language's written tradition (orthography). In most cases, a grapheme can represent (is a symbol for) a single character. In some cases, however, a grapheme can represent two or more characters. In some cases, a grapheme can represent a character and an associated accent mark. For example, in Spanish “ch” is considered a single unit, so a single grapheme is used in Spanish to represent “ch”. In English, however, “ch” may not be considered a single unit, so a grapheme that represents “c” is used along with another grapheme that represents “h”. In another example, referring to the alphabet for Northern Sami, “á” is sorted and counted separately, and represented with a grapheme different from the grapheme used to represent “a”. In Danish, however, because “á” is just considered a variant of “a”, the same grapheme is used to represent both characters. In some implementations, a transcription can be in the original language of the audio speech signal.”).
Myron, Rubin, Kamran, and Schlippe are considered to be analogous to the claimed invention as they are all in the same field of speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Schlippe in order to specifically have the stress or pitch be displayed in a plurality of languages. Doing so would be beneficial, as this would provide information related to how the user is speaking which can be utilized by language learners to learn how their speech is being perceived (Schlippe, para. 0008).

	Regarding claim 9, claim 9 is rejected for analogous reasons to claim 2.
	Regarding claim 13, claim 13 is rejected for analogous reasons to claim 6.
	Regarding claim 16, claim 16 is rejected for analogous reasons to claim 2.
	Regarding claim 20, claim 20 is rejected for analogous reasons to claim 6.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Lee & Yu (US 2021/0090592 A1): enhancing transcript with volume information (Fig. 4)
Dontcheva et al. (US 2024/0127858 A1): transcript text with visualization representation of speech (Fig. 10)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CODY DOUGLAS HUTCHESON whose telephone number is (703)756-1601. The examiner can normally be reached M-F 8:00AM-5:00PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571)-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/CODY DOUGLAS HUTCHESON/Examiner, Art Unit 2659   

/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659
Prosecution Timeline

Aug 28, 2025: Response Filed
Nov 04, 2025: Final Rejection mailed (§103)
Jan 13, 2026: Interview Requested
Jan 27, 2026: Examiner Interview Summary
Jan 27, 2026: Applicant Interview (Telephonic)
Jan 28, 2026: Request for Continued Examination
Jan 30, 2026: Response after Non-Final Action
Apr 03, 2026: Non-Final Rejection mailed (§103, current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/094,556
Patent 12626715
ROLE SEPARATION METHOD, ELECTRONIC DEVICE, AND COMPUTER STORAGE MEDIUM
3y 4m to grant Granted May 12, 2026
18/421,318
Patent 12614036
INTELLIGENT DETECTION OF BIAS WITHIN AN ARTIFICIAL INTELLIGENCE MODEL
2y 3m to grant Granted Apr 28, 2026
18/330,472
Patent 12603096
VOICE ENHANCEMENT METHODS AND SYSTEMS
2y 10m to grant Granted Apr 14, 2026
18/545,677
Patent 12591750
GENERATIVE LANGUAGE MODEL UNLEARNING
2y 3m to grant Granted Mar 31, 2026
18/163,230
Patent 12579447
TECHNIQUES FOR TWO-STAGE ENTITY-AWARE DATA AUGMENTATION
3y 1m to grant Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 63%
With Interview (+51.7%): 99%
Median Time to Grant: 2y 8m (~0m remaining)
PTA Risk: High
Based on 27 resolved cases by this examiner. Grant probability derived from career allowance rate.
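As a rough check, the career allowance rate can be recomputed from the counts reported on this page (17 granted out of 27 resolved). The sketch below is illustrative only: the function name `allowance_rate` is hypothetical, and because the page does not state how the 99% with-interview figure is derived, that figure is not reproduced.

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Return the share of resolved cases that were granted, as a percentage."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return 100.0 * granted / resolved


# Counts taken from the examiner summary: 17 granted / 27 resolved.
rate = allowance_rate(17, 27)
print(f"{rate:.1f}%")
```

Rounded to a whole percent, this agrees with the 63% grant probability shown above.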