Prosecution Insights
Last updated: April 19, 2026
Application No. 18/248,887

Embedded Dictation Detection

Status: Non-Final OA (§103)
Filed: Apr 13, 2023
Examiner: TENGBUMROONG, NATHAN NARA
Art Unit: 2654
Tech Center: 2600 — Communications
Assignee: Solventum Intellectual Properties Company
OA Round: 3 (Non-Final)
Grant Probability: 43% (Moderate)
Expected OA Rounds: 3-4
Time to Grant: 3y 0m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 43% (6 granted / 14 resolved; -19.1% vs TC avg)
Interview Lift: +75.0% (strong), comparing resolved cases with vs. without an interview
Typical Timeline: 3y 0m average prosecution
Currently Pending: 34 applications
Career History: 48 total applications across all art units

Statute-Specific Performance

§101: 27.2% (-12.8% vs TC avg)
§103: 54.3% (+14.3% vs TC avg)
§102: 14.8% (-25.2% vs TC avg)
§112: 3.2% (-36.8% vs TC avg)
Tech Center average is an estimate • Based on career data from 14 resolved cases

Office Action

Basis: §103
DETAILED ACTION Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Continued Examination Under 37 CFR 1.114 A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/21/2025 has been entered. Response to Amendment Claim 3 is amended. Claims 1-8, 10-13, 15-18, 26, and 33-35 are presented for examination. Response to Arguments Rejection under 35 U.S.C. 103 Applicant’s arguments have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 1 and 4-8 are rejected under 35 U.S.C. 103 as being unpatentable over Binder et al. (US 20200279576 A1; hereinafter referred to as Binder) in view of Balasubramaniam et al. (US 20210272571 A1; hereinafter referred to as Balasubramaniam) and Zimmerman et al. (US 20210272571 A1; hereinafter referred to as Zimmerman). Regarding claim 1, Binder discloses: a computer-implemented method for processing audio data, the method comprising: accessing an initial first classification model stored in a computer memory of one or more computing devices, the initial first classification model having been trained ([0141] sound detectors (e.g., the sound-type detector 404 and/or the trigger sound detector 406) may be configured to compare a representation of a sound input (e.g., the sound or utterance provided by a user) to one or more reference representations… the reference representation is adjusted (or created) as part of a voice enrollment or “training” procedure, where a user outputs the trigger sound several times so that the device can adjust (or create) the reference representation) to identify one or more audio segments that are indicative of dictation ([0009] a different type of sound detector (e.g., one that uses less power than the trigger sound detector) is used to monitor an audio channel to determine whether the sound input corresponds to a certain type of sound. Sounds are categorized as different “types” based on certain identifiable characteristics of the sounds. For example, sounds that are of the type “human voice” have certain spectral content, periodicity, fundamental frequencies, etc. Other types of sounds (e.g., whistles, hand claps, etc.) have different characteristics. Sounds of different types are identified using audio and/or signal processing techniques. 
Dictation is a specific type of sound that is human speech.) without using automatic speech recognition… ([0110] the sound-type detector 404 includes a “voice activity detector” (VAD)); receiving the audio data from a recording device ([0026] an electronic device includes a sound receiving unit configured to receive sound input); storing the audio data in the computer memory ([0012] In some implementations, sound inputs are stored in memory as they are received and passed to an upstream detector so that a larger portion of the sound input can be analyzed); analyzing the audio data using the initial first classification model to identify one or more segments in the audio data that are indicative of dictation ([0110] the sound-type detector 404 generates a spectrogram of a received sound input (e.g., using a Fourier transform), and analyzes the spectral components of the sound input to determine whether the sound input is likely to correspond to a particular type or category of sounds (e.g., human speech)), wherein only segments with a determined probability above a predetermined threshold ([0011] a sound detector that uses less power than the sound-type detector is used to monitor an audio channel to determine whether a sound input satisfies a predetermined condition, such as an amplitude (e.g., volume) threshold. This sound detector may be referred to herein as a noise detector. When the noise detector detects a sound that satisfies the predetermined threshold, the noise detector initiates the sound-type detector to further process and/or analyze the sound) are identified as dictation… ([0013] upon a determination that the sound input includes the predetermined content, initiating a speech-based service… In some implementations, speech-based service is a dictation service). Binder does not explicitly, but Balasubramaniam teaches: wherein the initial first classification model is further configured to: analyze one or more low-level acoustic features of the audio data, and analyze one or more higher-level features derived from the one or more low-level acoustic features… ([0281] Low-level feature extraction focuses on the extraction of acoustic information such as cepstral features and other voice parameters from raw audio. High-level feature extraction focuses on the extraction of phonetic, prosodic, or lexical information and other speaker-related characteristics. The output after a low-level feature extraction is fed as the input for high-level feature extraction); Binder and Balasubramaniam are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder to combine the teachings of Balasubramaniam because doing so would help differentiate speakers using a speaker clustering based on analyzing different low-level audio features, improving dictation processing and transcription generation (Balasubramaniam [0273] speaker identification subsystem may be configured to process audio of a conversation between healthcare professional and a patient and determine an identity for the healthcare professional (e.g. Dr. Smith) and an identity for the patient (e.g. John Doe). The speaker identification subsystem may be used together with speaker clustering functionalities to provide improved audio processing and conversation transcript). 
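Purely as an illustrative sketch of the first classification model described above (the non-ASR dictation detector of claim 1), the following Python outline computes low-level acoustic features per frame, aggregates them into higher-level segment features, and keeps only segments whose dictation probability exceeds a predetermined threshold. The frame sizes, feature choices, and logistic stand-in are assumptions for illustration; none of this is drawn from Binder, Balasubramaniam, or the application itself.

```python
import numpy as np

FRAME = 400           # 25 ms frames at an assumed 16 kHz sample rate
HOP = 160             # 10 ms hop
SEGMENT_FRAMES = 100  # roughly 1 s segments

def low_level_features(audio):
    """Per-frame low-level acoustic features: log energy and zero-crossing count."""
    feats = []
    for start in range(0, len(audio) - FRAME, HOP):
        frame = audio[start:start + FRAME]
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)
        zero_crossings = np.count_nonzero(np.diff(np.sign(frame)))
        feats.append((log_energy, zero_crossings))
    return np.array(feats)

def higher_level_features(frame_feats):
    """Aggregate frame-level features into segment-level (higher-level) features:
    the mean captures level, the variance captures temporal change within the segment."""
    segments = []
    for start in range(0, len(frame_feats) - SEGMENT_FRAMES + 1, SEGMENT_FRAMES):
        seg = frame_feats[start:start + SEGMENT_FRAMES]
        segments.append(np.concatenate([seg.mean(axis=0), seg.var(axis=0)]))
    return np.array(segments)

def dictation_probabilities(segment_feats, weights, bias):
    """Toy stand-in for a trained classifier: logistic regression over segment features."""
    return 1.0 / (1.0 + np.exp(-(segment_feats @ weights + bias)))

def identify_dictation_segments(audio, weights, bias, threshold=0.8):
    """Keep only segments whose dictation probability exceeds the predetermined threshold."""
    probs = dictation_probabilities(
        higher_level_features(low_level_features(audio)), weights, bias)
    return [i for i, p in enumerate(probs) if p > threshold]
```

In practice the logistic stand-in would be a trained classification neural network, but the thresholded, per-segment output is the part that tracks the claim language: only segments above the threshold are treated as dictation and passed downstream.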
The combination of Binder and Balasubramaniam does not explicitly, but Zimmerman teaches: accessing an initial second classification model stored in the computer memory of the one or more computing devices, the initial second classification model having been trained to process the audio data ([col 10, lines 27-31] the feature set and predictor parameters may be adapted from the speaker-independent CR predictor by using the speaker's dictations in the CR predictor builder 28, for example, using a classification framework such as ANN training) using automatic speech recognition… ([col 7, lines 62-66] The ASR module 60 uses the ASR models 69, stored in the memory 62 to compute a transcribed text, along with associated ASR output data such as word lattices, word alignments and energy values from the digital audio file); and analyzing the one or more identified segments that are indicative of dictation using the initial second classification model to extract one or more features from the one or more identified segments ([col 4, lines 23-26] an automatic speech recognition (ASR) system is supplemented by a correction rate predictor that is statistically trained and developed from a set of features extracted from a set of dictations) for inclusion in an electronic health record ([col 2, lines 14-16] these transcriptions are considered to be drafts manually edited by MTs before the final documents are uploaded into the electronic medical record). Binder, Balasubramaniam, and Zimmerman are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder and Balasubramaniam to combine the teachings of Zimmerman because doing so would allow for the use of automatic speech recognition to assist in extracting features from an audio segment that contains dictation, reducing human workload (Zimmerman [col 3, lines 48-56] Time and cost of editing automatically generated medical transcription documents can be reduced. Transcriptionist editing time can be reduced. Transcriptionist fatigue in editing transcribed documents can be reduced. Stress associated with typing/editing, including physical stress, can be reduced. Drafts of medical transcription documents can be generated for speakers with lower training data requirements. Portions of dictations can be drafted, as opposed to all of a dictation or none of a dictation). Regarding claim 4, the combination of Binder, Balasubramaniam, and Zimmerman teaches: the method of claim 1. Balasubramaniam further teaches: wherein the low-level feature extraction is configured to extract one or more features, the one or more features including at least one of: one or more mel-frequency cepstral coefficients, one or more Fourier-transformed features, one or more log-mel features, one or more power features, or one or more zero-crossing counts ([0312] the low-level features were extracted using the 0.25 ms with 10 ms overlap frames (i.e. to make the context the same (200 ms)) were used to generate 39 MFCCs (13 static coefficients, 13 delta coefficients, 13 delta-delta coefficients) each. As the MFCC feature vector describes only the power spectral envelope of a single frame, it seems like speech would also have information in the dynamics). Regarding claim 5, the combination of Binder, Balasubramaniam, and Zimmerman teaches: the method of claim 1. 
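As a concrete illustration of the claim 4 feature set just mapped (mel-frequency cepstral coefficients, log-mel features, power, and zero-crossing counts), a minimal extraction sketch follows. It assumes the librosa library and illustrative frame parameters, neither of which appears in the cited references.

```python
import librosa
import numpy as np

def claim4_style_features(path):
    """Illustrative extraction of several low-level features named in claim 4."""
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160)
    log_mel = librosa.power_to_db(mel)                  # log-mel features
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)   # mel-frequency cepstral coefficients
    power = librosa.feature.rms(y=y, frame_length=400, hop_length=160)          # per-frame power
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=400, hop_length=160)  # zero crossings
    return np.vstack([mfcc, log_mel, power, zcr])
```

Any of these frame-level matrices could then feed the segment-level aggregation shown in the earlier sketch.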
Balasubramaniam further teaches: aggregating the results of analyzing the one or more low-level features, pitch estimation, speaker distance, and change of speaker into one or more higher-level features ([0213] At a second stage 712, voice feature aggregation via speaker modeling is performed on the extracted lower level features to generate a voice feature-based representation (e.g. feature representation 322 of FIG. 3)). Regarding claim 6, the combination of Binder, Balasubramaniam, and Zimmerman teaches: the method of claim 5. Balasubramaniam further teaches: wherein the one or more higher-level features represents a temporal change in a respective low-level feature ([0283] The use of the attention mechanism of the present disclosure may provide a way to search for the information related to the speaker and thus obtain better high-level embeddings, including better performance on speaker identification using the extracted embeddings. Further, the obtained higher-level embeddings are more discriminative than the existing model. The higher level features use the low-level features as input and inherently show how features would change over time.). Regarding claim 7, the combination of Binder, Balasubramaniam, and Zimmerman teaches: the method of claim 1. Balasubramaniam further teaches: wherein the initial first classification model is a first classification neural network ([0123] The neural network model 320 may include a generic audio embedding model. The generic audio embedding model may differentiate a wide range of voice characteristics and capture acoustic and linguistic content. The neural network model 320 may include a large-scale audio classification model such as AlexNet, VGG, Inception, or ResNet). Regarding claim 8, the combination of Binder, Balasubramaniam, and Zimmerman teaches: the method of claim 7. Balasubramaniam further teaches: wherein the first classification neural network comprises at least one of: one or more feedforward layers; one or more convolutional neural network layers; one or more long short-term memory layers; or one or more conditional random field layers ([0282] To help the sincnet discover more meaningful filters in the input layer, the present disclosure provides an approach which adds an attention-based LSTM layer to the sincnet implementation. This architecture may achieve better representation of the speech embeddings). Claims 2 is rejected under 35 U.S.C. 103 as being unpatentable over Binder in view of Balasubramaniam and Zimmerman, as applied to claims 1 and 4-8 above, and further in view of Gelfenbeyn et al. (US 20160027440 A1; hereinafter referred to as Gelfenbeyn). Regarding claim 2, the combination of Binder, Balasubramaniam, and Zimmerman teaches: the method of claim 1. Balasubramaniam further teaches: wherein analyzing the audio data using the initial first classification model comprises: analyzing the one or more low-level acoustic features, the one or more low-level acoustic features including at least one of: pitch estimation, speaker distance, or change of speaker ([0111] Speaker change detection may include detecting speaker change points in the audio data 312 and portioning the audio data 312 according to the speaker change points. 
The segmenter module 314 may use audio characteristics to determine speaker change points); providing results of the one or more low-level acoustic features analysis ([0212, 0214] At a first stage 710, an acoustic feature extraction process is applied to the audio utterances to generate lower level acoustic features… Stages 710 and 712 may be performed via a neural network model such as CNN model (e.g. neural network model 320 of FIG. 3)) to the initial first classification model… ([0123] The neural network model 320 may include a generic audio embedding model. The generic audio embedding model may differentiate a wide range of voice characteristics and capture acoustic and linguistic content. The neural network model 320 may include a large-scale audio classification model). The combination of Binder, Balasubramaniam, and Zimmerman does not explicitly, but Gelfenbeyn teaches: determining, using the initial first classification model, one or more probabilities for one or more segments in the audio data, wherein the one or more probabilities represent a likelihood that each of the one or more segments in the audio data is dictation ([0065] If the recognized input meets predetermined criteria (e.g., have weight, probability or confidence level higher than a predetermined threshold)); and selecting the one or more segments with the determined probability higher than the predetermined threshold as the identified one or more segments in the audio data that are indicative of dictation ([0065] the very first part of the audio input (in other words, “beginning part” of the user audio input) may be recognized using one default speech recognizer 220 (e.g., a free-dictation recognizer) to generate a recognized input. If the recognized input meets predetermined criteria (e.g., have weight, probability or confidence level higher than a predetermined threshold), the same default speech recognizer 220 may be selected for recognizing the first part of the user input. Alternatively, another speech recognizer 220 (e.g., rule-based speech recognizer) may be used to process the first part of the user audio input. The predetermined criteria can be audio with a certain word pattern like dictation.). Binder, Balasubramaniam, Zimmerman, and Gelfenbeyn are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder, Balasubramaniam, and Zimmerman to combine the teachings of Gelfenbeyn because doing so would improve the reliability and accuracy of recognizing specific voice commands and patterns like dictation, leading to better dictation analysis (Gelfenbeyn [0009] this technology overcomes at least some drawbacks of the prior art systems improving reliability and accuracy for automatic recognition of user voice commands and, thereby, enhancing overall user experience of using CIS, chat agents and similar digital personal assistant systems). Claims 3 is rejected under 35 U.S.C. 103 as being unpatentable over Binder in view of Balasubramaniam and Zimmerman, as applied to claims 1 and 4-8 above, and further in view of Asano (US 20040054531 A1) and Varerkar et al. (US 20170280235 A1; hereinafter referred to as Varerkar). Regarding claim 3, the combination of Binder, Balasubramaniam, and Zimmerman teaches: the method of claim 1. 
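Claim 2, addressed above, lists pitch estimation among the low-level acoustic features. As a hedged illustration of what such a feature might look like, here is a crude autocorrelation pitch estimator; the frame handling and the 0.3 voicing threshold are assumptions, not taken from the record.

```python
import numpy as np

def estimate_pitch(frame, sr=16000, fmin=75.0, fmax=400.0):
    """Crude autocorrelation pitch estimate for one frame (1-D numpy array).
    Returns 0.0 for frames judged unvoiced."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    if hi >= len(ac) or ac[0] <= 0:
        return 0.0
    lag = lo + int(np.argmax(ac[lo:hi]))
    # Treat weak autocorrelation peaks as unvoiced frames.
    return sr / lag if ac[lag] > 0.3 * ac[0] else 0.0
```

In the claimed method, per-frame pitch tracks of this kind would be among the low-level features summarized into higher-level features (for example, pitch variability) before classification.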
The combination of Binder, Balasubramaniam, and Zimmerman does not explicitly, but Asano teaches: wherein analyzing the audio data using the initial first classification model comprises: estimating speaker distance using stereoscopic information received from the recording device… ([0083] On the basis of the image signals received from the CCD cameras 22L and 22R, the distance calculator 47 performs stereoscopic processing (processing on the basis of stereoscopic matching) to determine the distance from the microphone 21 to a sound source, such as a user uttering a speech, included in the images taken by the CCD cameras 22L and 22R. Data indicating the calculated distance is supplied to the speech recognition unit 41B). Binder, Balasubramaniam, Zimmerman, and Asano are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder, Balasubramaniam, and Zimmerman to combine the teachings of Asano because doing so would allow for the use of acoustic models that use stereoscopic information to determine speaker distance for speech recognition, leading to more accurate speech recognition at different speaker distances and improved dictation recognition (Asano [0137] because speech recognition is performed on the basis of a set of acoustic models produced by learning speech data acquired in an acoustic environment similar to that in which a user actually utters, the accuracy of the speech recognition is improved). The combination of Binder, Balasubramaniam, Zimmerman, and Asano does not explicitly, but Varerkar teaches: to differentiate dictation from non-dictation segments based on speaker proximity ([0021] the microphone array 118 can be used together with an angular information and triangulated distances to identify multiple users/speakers as well as changes in the user/speaker. An envelope may be created based on the location of the speaker. As used herein, an envelope in case of speech usages indicates a set/collection of relevant audio data from an audio stream. For example, the envelope may be a boundary or a signal which would help the speech application identify transition from one user's speech to the other or in some cases can distinguish a user's speech from some other sound source present at the same time (could be noise or could be another user's voice)). Binder, Balasubramaniam, Zimmerman, Asano, and Varerkar are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder, Balasubramaniam, Zimmerman, and Asano to combine the teachings of Varerkar because doing so would allow for different speech features to be analyzed and used for improving accuracy of audio classification, leading to improved dictation determination (Varerkar [0051] The present techniques result in improved and more deterministic results for speech usages and enhance the user experience. The present techniques also result in more accurate angular/location information from speech when compared with existing algorithms. Additionally, the present techniques enhance the accuracy of speech applications in ideal circumstances). Claims 10-13 and 15-16 are rejected under 35 U.S.C. 
103 as being unpatentable over Binder in view of Balasubramaniam and Zimmerman, as applied to claims 1 and 4-8 above, and further in view of Pinter et al. (US 20180308565 A1; hereinafter referred to as Pinter). Regarding claim 10, the combination of Binder, Balasubramaniam, and Zimmerman teaches: the method of claim 1. Zimmerman further teaches: adapting the initial second classification model using the received input as additional training data ([col 11, lines 54-63] The models can be built using a small amount of training data, or with no training data (i.e., the models can be speaker-dependent or speaker-independent, for example). Also, at stage 212, a correction rate predictor is built for the speaker to whom the dictation is associated. The CRP builder 28 builds the CR-predictor based on existing draft transcriptions and the associated edited transcriptions completed by a particular speaker. The CRP builder 28 can build the CR-predictor substantially simultaneously with the CR-predictor being built by the ASR module. The edited dictation transcripts can be used as training data.); and storing, in the computer memory of the one or more computing devices, the adapted second classification model ([col 7, lines 19-22] These models are stored in the database 40 so that they may be accessed and used by the automatic transcription device 30 to create a draft transcription for a subsequent dictation). The combination of Binder, Balasubramaniam, and Zimmerman does not explicitly, but Pinter teaches: generating a graphical user interface that presents the one or more identified segments that are indicative of dictation ([0026] the system may only populate the SOAP note with content from a specified participant such as, for example and without imputing limitation, the physician. In some examples, the audio can be processed by multiple neural networks or preprocessed by various services) to a user and including one or more of the features extracted from the one or more identified segments ([0028] the system may diarize audio and process speaker identity as further context and input for the deep learning neural network. In some examples, dedicated microphones on both the patient-side and physician-side of the system can inform the system which speaker is associated with what audio content through, for example, dedicated and predefined audio channels); presenting the graphical user interface on at least one display devices in communication with the one or more computing devices ([0035] Upon completion of the live encounter with the patient, the physician can end the audio and/or video session. The video window closes and, in the case of a robotic patient-side endpoint, the patient-side tele-presence device may navigate back to its dock. The physician-side interface may display a patient record (e.g., within a clinical documentation tool). In some examples, the generated SOAP note may be displayed next to the patient record. The SOAP note may be editable so the physician can make changes to the SOAP note); receiving input from the one more computing devices in communication with the respective display devices on which the graphical user interface was presented, the received input representing agreement or disagreement with a plurality of extracted features from the one or more identified segments ([0030] the physician may choose to add or change certain things in a live SOAP note as it is generated. The physician input can be integrated as another data source in the neural network. 
In some examples, the physician input can be used to update the neural network while the SOAP note is generated and thus increase the quality of the generated SOAP note as the encounter progresses); storing, in the computer memory of the one or more computing devices, the information in an electronic health record based on the received input… ([0035] the physician may sign the note and click a “Send” button to automatically insert the SOAP note into an EMR for that patient). Binder, Balasubramaniam, Zimmerman, and Pinter are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder, Balasubramaniam, and Zimmerman to combine the teachings of Pinter because doing so would allow for the use of a graphical user interface for a user to interact with in order to provide feedback regarding determined dictations, leading to more accurate health records and dictation determinations using the user feedback (Pinter [0035] The physician-side interface may display a patient record (e.g., within a clinical documentation tool). In some examples, the generated SOAP note may be displayed next to the patient record. The SOAP note may be editable so the physician can make changes to the SOAP note. When satisfied, the physician may sign the note and click a “Send” button to automatically insert the SOAP note into an EMR for that patient. Further, as discussed above, the physician changes to the generated SOAP note can be fed back into the neural network in order to further improve SOAP note generation. In some examples, the neural network can train a physician-specific model based on multiple SOAP note changes received from a particular physician). Regarding claim 11, the combination of Binder, Balasubramaniam, Zimmerman, and Pinter teaches: the computer-implemented method of claim 10. Pinter further teaches: wherein the plurality of extracted features are a plurality of textual representations corresponding to a plurality of identified spoken words in the one or more identified segments ([0052] The spoken content may be processed to identify a portion for insertion into a SOAP note (operation 404). The identified portion may be provided to the SOAP note generator 216 as input into the deep learning neural network 224 or, in some examples, may be provided to the SOAP text generator 226. Nevertheless, the identified portion of spoken content may be converted into SOAP note data (operation 406). In some examples, the SOAP note data may be able to be directly inserted into the SOAP note 202 (e.g., as string variables and the like)). Regarding claim 12, the combination of Binder, Balasubramaniam, Zimmerman, and Pinter teaches: the computer-implemented method of claim 10. Pinter further teaches: adapting the initial first classification model using the received input as additional training data ([0037] neural network output data may include a SOAP note produced from the encounter. The SOAP note may be cleaned and curated by a third party or the responsible physician. 
In some examples, the SOAP note can be provided back to the neural network as, for example, further training data in order to improve the accuracy of the neural network for later encounter); and storing, in the computer memory of the one or more computing devices, the adapted first classification model ([0037] the neural network feedback process 228 may perform model updates as a background process on a mirror version of the deep learning neural network 224 and directly update the deep learning neural network 224 once the mirror version has converged on an updated model. Updating the neural network requires storing the model.). Regarding claim 13, the combination of Binder, Balasubramaniam, Zimmerman, and Pinter teaches: the computer-implemented method of claim 12. Pinter further teaches: wherein the received input is implicit input derived from one or more actions of at least one human user using the one or more computing devices ([0027] the system can automatically fill in the SOAP note. A deep learning neural network or other trained machine learning model analyzing the encounter can run concurrent to the encounter and update itself using automatic and/or physician-provided feedback. In some examples, early entries in the SOAP note may be inaccurate, but later entries will become increasingly correct as greater context becomes available throughout the encounter. Information input by the physician at a later time can implicitly agree or disagree with earlier inputs.). Regarding claim 15, the combination of Binder, Balasubramaniam, Zimmerman, and Pinter teaches: the computer-implemented method of claim 12. Binder further teaches: training a new first classification model, using the adapted first classification model, that is configured to process audio data to identify one or more audio segments that are indicative of dictation ([0144] these sound inputs are used to adapt the voice trigger system 400 only if a certain conditions or combinations of conditions are met. For example, in some implementations, the sound inputs are used to adapt the voice trigger system 400 when a predetermined number of sound inputs are received in succession (e.g., 2, 3, 4, 5, or any other appropriate number), when the sound inputs are sufficiently similar to the reference representation, when the sound inputs are sufficiently similar to each other, when the sound inputs are close together (e.g., when they are received within a predetermined time period and/or at or near a predetermined interval), and/or any combination of these or other conditions) without using automatic speech recognition ([0159] the first sound detector is a voice-activity detector that is configured to determine whether the sound input includes frequencies that are characteristic of a human voice (or other features, aspects, or properties of the sound input that are characteristic of a human voice)); storing the new first classification model in the computer… ([0141] the reference representation is adjusted (or created) as part of a voice enrollment or “training” procedure, where a user outputs the trigger sound several times so that the device can adjust (or create) the reference representation. 
The device can then create a reference representation using that person's actual voice); receiving audio data from a recording device ([0026] an electronic device includes a sound receiving unit configured to receive sound input); storing the received audio data in the computer memory ([0012] In some implementations, sound inputs are stored in memory as they are received and passed to an upstream detector so that a larger portion of the sound input can be analyzed); analyzing the stored audio data using the new first classification model ([0143] the device 104 (and/or associated devices or services) adjusts the reference representation after each successful triggering event. In some implementations, the device 104 analyzes the sound input associated with each successful triggering event and determines if the reference representations should be adjusted based on that input (e.g., if certain conditions are met), and only adjusts the reference representation if it is appropriate to do so. In some implementations, the device 104 maintains a moving average of the reference representation over time) to identify one or more segments in the audio data that are indicative of dictation… ([0110] the sound-type detector 404 generates a spectrogram of a received sound input (e.g., using a Fourier transform), and analyzes the spectral components of the sound input to determine whether the sound input is likely to correspond to a particular type or category of sounds (e.g., human speech)). Zimmerman further teaches: training a new second classification model, using the adapted second classification model, that is configured to process audio data using automatic speech recognition ([col 12, lines 4-9] the system 10 is monitored to detect whether there is sufficient training data to build an updated ASR model for a speaker. For example, the models in the ASR module can be updated when a speaker completed a particular number of dictations in a specified amount of time. At stage 218, the model builder builds new models); storing the new second classification model in the computer memory… ([col 7, lines 23-28] Referring to FIG. 2, the CRP builder 28 includes a memory 50 and a CRP module (e.g., software) 52. The CRP module 52 includes memory and a processor for reading software code stored in the memory 50 and for executing instructions associated with this code for performing functions described below); and analyzing the one or more identified segments that are indicative of dictation using the new second classification model to extract one or more features from the one or more identified segments ([col 4, lines 23-26] an automatic speech recognition (ASR) system is supplemented by a correction rate predictor that is statistically trained and developed from a set of features extracted from a set of dictations). Regarding claim 16, it recites similar limitations as claim 10 and therefore is rejected similarly. The method of claim 10 adapts the second classification model and is an iterative process. Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Binder in view of Zimmerman. 
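Before the claim 17 analysis, a brief illustrative sketch of the feedback loop mapped for claims 10-15 above: extracted features are presented to a user, the user's agreement or disagreement is received, and the model is adapted with that input as additional training data. The nearest-centroid model below is a deliberately simple stand-in, and every name in it is an assumption; the application's models would be neural networks or ASR models.

```python
import numpy as np

class AdaptiveDictationClassifier:
    """Toy adaptive model: a nearest-centroid classifier refit whenever feedback arrives."""

    def __init__(self, dim):
        self.examples = []   # accumulated (feature_vector, is_dictation) training pairs
        self.centroids = {True: np.zeros(dim), False: np.zeros(dim)}

    def predict_is_dictation(self, features):
        features = np.asarray(features)
        dists = {label: np.linalg.norm(features - c) for label, c in self.centroids.items()}
        return min(dists, key=dists.get)

    def adapt(self, features, predicted_label, user_agrees):
        # GUI feedback (agreement or disagreement) becomes additional training data.
        true_label = predicted_label if user_agrees else not predicted_label
        self.examples.append((np.asarray(features), true_label))
        self._refit()

    def _refit(self):
        # Stand-in for re-fitting or fine-tuning the stored model.
        for label in (True, False):
            vecs = [f for f, l in self.examples if l == label]
            if vecs:
                self.centroids[label] = np.mean(vecs, axis=0)

def review(model, segment_features, predicted, user_agrees, ehr_store):
    """Collect user feedback on one presented segment, adapt the model, and store agreed results."""
    model.adapt(segment_features, predicted, user_agrees)
    if user_agrees:
        ehr_store.append(segment_features)  # stand-in for writing to the electronic health record
```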
Regarding claim 17, Binder discloses: a system for processing audio data, the system comprising: one or more computer processors communicatively coupled to a computer network ([0067] The network communications interface 308 enables communication between the digital assistant system 300 with networks); one or more computer storage devices accessible to the one or more computer processors ([0166] the method 600 is performed at an electronic device including one or more processors and memory storing instructions for execution by the one or more processors (e.g., the electronic device 104)), the one or more computer storage devices having stored thereon: an initial first classification model ([0151] The electronic device determines whether the sound input corresponds to a predetermined type of sound (506). As noted above, sounds are categorized as different “types” based on certain identifiable characteristics of the sounds) trained using a data structure comprising audio data that includes dictation ([0141] if an input representation matches the reference representation to a predetermined confidence level, the sound detector will determine that the sound input corresponds to a predetermined type of sound (e.g., the sound-type detector 404), or that the sound input includes predetermined content (e.g., the trigger sound detector 406). In order to tune the voice trigger system 400, in some implementations, the device adjusts the reference representation to which the input representation is compared. In some implementations, the reference representation is adjusted (or created) as part of a voice enrollment or “training” procedure, where a user outputs the trigger sound several times so that the device can adjust (or create) the reference representation) and non-dictation segments ([0142] In some implementations, only sound inputs that were determined to satisfy all or some of the triggering criteria with a certain confidence level are used to adjust the reference representation. Thus, when the voice trigger is less confident that a sound input corresponds to or includes a trigger sound, that voice input may be ignored for the purposes of adjusting the reference representation. This can be non-dictation.) 
to identify one or more audio segments that are indicative of dictation without using automatic speech recognition… ([0159] the first sound detector is a voice-activity detector that is configured to determine whether the sound input includes frequencies that are characteristic of a human voice (or other features, aspects, or properties of the sound input that are characteristic of a human voice)); and a recording device, wherein the one or more computer storage devices have computer instructions stored thereon that, when executed by the one or more computer processors, cause the one or more computer processors to perform operations comprising: receiving, over the computer network, the audio data from the recording device ([0026] an electronic device includes a sound receiving unit configured to receive sound input); storing the audio data in the one or more computer storage devices ([0012] In some implementations, sound inputs are stored in memory as they are received and passed to an upstream detector so that a larger portion of the sound input can be analyzed); analyzing the audio data using the initial first classification model to identify one or more segments in the audio data that are indicative of dictation, wherein segments not identified as dictation are excluded from conversion into text… ([0153] Upon a determination that the sound input includes the predetermined content, the electronic device initiates a speech-based service (514). In some implementations, the speech-based service is a voice-based digital assistant, as described in detail above. In some implementations, the speech-based service is a dictation service in which speech inputs are converted into text and included in and/or displayed in a text input field. If the sound input does not contain the predetermined output, it is excluded from text conversion.). Binder does not explicitly, but Zimmerman teaches: and an initial second classification model trained to process the audio data ([col 10, lines 27-31] the feature set and predictor parameters may be adapted from the speaker-independent CR predictor by using the speaker's dictations in the CR predictor builder 28, for example, using a classification framework such as ANN training) using the automatic speech recognition… ([col 7, lines 62-66] The ASR module 60 uses the ASR models 69, stored in the memory 62 to compute a transcribed text, along with associated ASR output data such as word lattices, word alignments and energy values from the digital audio file); and analyzing the one or more identified segments using the initial second classification model to extract one or more features from the one or more identified segments ([col 4, lines 23-26] an automatic speech recognition (ASR) system is supplemented by a correction rate predictor that is statistically trained and developed from a set of features extracted from a set of dictations). Binder and Zimmerman are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder to combine the teachings of Zimmerman because doing so would allow for the use of automatic speech recognition to assist in extracting features from an audio segment that contains dictation, reducing human workload (Zimmerman [col 3, lines 48-56] Time and cost of editing automatically generated medical transcription documents can be reduced. Transcriptionist editing time can be reduced. 
Transcriptionist fatigue in editing transcribed documents can be reduced. Stress associated with typing/editing, including physical stress, can be reduced. Drafts of medical transcription documents can be generated for speakers with lower training data requirements. Portions of dictations can be drafted, as opposed to all of a dictation or none of a dictation). Claims 18 is rejected under 35 U.S.C. 103 as being unpatentable over Binder in view of Zimmerman, as applied to claim 17 above, and further in view of Balasubramaniam and Gelfenbeyn. Regarding claim 18, the combination of Binder and Zimmerman teaches: the system of claim 17. The combination of Binder and Zimmerman does not explicitly, but Balasubramaniam teaches: wherein, when analyzing the audio data using the initial first classification model, the computer instructions cause the one or more computer processors to perform operations comprising: extracting one or more low-level acoustic features from the audio data, the low-level acoustic features comprising at least one of: pitch estimation, speaker distance, or change of speaker ([0111] Speaker change detection may include detecting speaker change points in the audio data 312 and portioning the audio data 312 according to the speaker change points. The segmenter module 314 may use audio characteristics to determine speaker change points); providing the low-level acoustic features ([0212, 0214] At a first stage 710, an acoustic feature extraction process is applied to the audio utterances to generate lower level acoustic features… Stages 710 and 712 may be performed via a neural network model such as CNN model (e.g. neural network model 320 of FIG. 3)) to the initial first classification model ([0123] The neural network model 320 may include a generic audio embedding model. The generic audio embedding model may differentiate a wide range of voice characteristics and capture acoustic and linguistic content. The neural network model 320 may include a large-scale audio classification model). Binder, Zimmerman, and Balasubramaniam are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder and Zimmerman to combine the teachings of Balasubramaniam because doing so would help differentiate speakers using a speaker clustering based on analyzing different low-level audio features, improving dictation processing and transcription generation (Balasubramaniam [0273] speaker identification subsystem may be configured to process audio of a conversation between healthcare professional and a patient and determine an identity for the healthcare professional (e.g. Dr. Smith) and an identity for the patient (e.g. John Doe). The speaker identification subsystem may be used together with speaker clustering functionalities to provide improved audio processing and conversation transcript). 
The combination of Binder, Zimmerman, and Balasubramaniam does not explicitly, but Gelfenbeyn teaches: determining, using the initial first classification model, one or more probabilities that respective segments of the audio data are indicative of dictation ([0065] If the recognized input meets predetermined criteria (e.g., have weight, probability or confidence level higher than a predetermined threshold)); and selecting the one or more segments having a probability greater than a predetermined threshold as the identified dictation segments ([0065] the very first part of the audio input (in other words, “beginning part” of the user audio input) may be recognized using one default speech recognizer 220 (e.g., a free-dictation recognizer) to generate a recognized input. If the recognized input meets predetermined criteria (e.g., have weight, probability or confidence level higher than a predetermined threshold), the same default speech recognizer 220 may be selected for recognizing the first part of the user input. Alternatively, another speech recognizer 220 (e.g., rule-based speech recognizer) may be used to process the first part of the user audio input. The predetermined criteria can be audio with a certain word pattern like dictation.). Binder, Zimmerman, Balasubramaniam, and Gelfenbeyn are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder, Zimmerman, and Balasubramaniam to combine the teachings of Gelfenbeyn because doing so would improve the reliability and accuracy of recognizing specific voice commands and patterns like dictation, leading to better dictation analysis (Gelfenbeyn [0009] this technology overcomes at least some drawbacks of the prior art systems improving reliability and accuracy for automatic recognition of user voice commands and, thereby, enhancing overall user experience of using CIS, chat agents and similar digital personal assistant systems). Claims 26 is rejected under 35 U.S.C. 103 as being unpatentable over Binder in view of Zimmerman, as applied to claim 17 above, and further in view of Pinter. Regarding claim 26, it recites similar limitations as claim 10 and therefore is rejected similarly. Pinter also teaches: wherein the one or more computer processors are in communication with one or more display devices ([0058] The computer system 500 can further include a communications interface 518 by way of which the computer system 500 can connect to networks and receive data useful in executing the methods and system set out herein as well as transmitting information to other devices. The computer system 500 may include an output device 504by which information can be displayed) and the computer instructions cause the one or more computer processors to perform operations ([0057] The processor 514 can include one or more internal levels of cache 516 and a bus controller or bus interface unit to direct interaction with the bus 502. Memory 508 may include one or more memory cards and a control circuit (not depicted), or other forms of removable memory, and may store various software applications including computer executable instructions). Claim 33 is rejected under 35 U.S.C. 103 as being unpatentable over Binder in view of Zimmerman and Pinter. 
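Claim 18, discussed just above, includes change of speaker among the low-level acoustic features provided to the first classification model. One common way to flag speaker-change points is to compare feature statistics of adjacent windows against a distance threshold; the sketch below illustrates that idea, with the window length and threshold chosen purely as assumptions rather than taken from Balasubramaniam.

```python
import numpy as np

def speaker_change_points(frame_feats, window=50, threshold=2.0):
    """Flag frame indices where adjacent windows of features differ sharply,
    a crude stand-in for the speaker-change detection described for claim 18.

    frame_feats: (n_frames, n_features) array, e.g. per-frame MFCCs.
    """
    changes = []
    for i in range(window, len(frame_feats) - window, window):
        left = frame_feats[i - window:i].mean(axis=0)
        right = frame_feats[i:i + window].mean(axis=0)
        if np.linalg.norm(left - right) > threshold:  # a large jump suggests a new speaker
            changes.append(i)
    return changes
```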
Regarding claim 33, Binder discloses: an apparatus for adaptive audio processing, comprising: means for receiving audio data from a multiple-microphone recording device ([0053] the user device 104 includes an audio subsystem 226 coupled to one or more speakers 228 and one or more microphones 230 to facilitate voice-enabled functions); means for storing the audio data in a memory ([0020] the method includes storing at least a portion of the sound input in memory); means for analyzing the audio data using a first classification model trained ([0141] sound detectors (e.g., the sound-type detector 404 and/or the trigger sound detector 406) may be configured to compare a representation of a sound input (e.g., the sound or utterance provided by a user) to one or more reference representations… the reference representation is adjusted (or created) as part of a voice enrollment or “training” procedure, where a user outputs the trigger sound several times so that the device can adjust (or create) the reference) to identify one or more segments indicative of dictation ([0009] a different type of sound detector (e.g., one that uses less power than the trigger sound detector) is used to monitor an audio channel to determine whether the sound input corresponds to a certain type of sound. Sounds are categorized as different “types” based on certain identifiable characteristics of the sounds. For example, sounds that are of the type “human voice” have certain spectral content, periodicity, fundamental frequencies, etc. Other types of sounds (e.g., whistles, hand claps, etc.) have different characteristics. Sounds of different types are identified using audio and/or signal processing techniques… Dictation is a specific type of sound that is human speech.) without using automatic speech recognition ([0110] the sound-type detector 404 includes a “voice activity detector” (VAD)). Binder does not explicitly, but Zimmerman teaches: means for extracting features from the one or more identified segments using a second classification model ([col 8, lines 19-27] Features are extracted from the ASR output data for a dictation while the dictation is being recognized. Features include measures of background noise, measures of overall audio quality such as noise/signal ratio or average spectrum, standard measures of per-word confidence (e.g., the percentage of word-lattice paths that contain a word over a given period of the audio), per-word confidence measures combined across all of or a portion of a dictation, and scores from models used in a speech recognition process) trained to process the audio data using automatic speech recognition… ([col 11, lines 29-34] The ASR module 60, with the local CRP module 64, computes a metric that indicates the quality of a dictation for draft creation. Features of the dictation are combined with a classifier to that is trained to predict the correction rate for the dictation); means for updating at least one of the first classification model and the second classification model based on the user input ([col 8-9, lines 64-6] The features are derived by re-computing the output data of the automatic transcription device 30 in a process in the CRP builder 28, preferably offline, or by storing this data at the time the dictation is being recognized by uploading it from the automatic transcription device 30 to the database 40. 
The correction rate predictor computed by the CRP builder 28 and used by the CRP module 64 is updated, e.g., periodically, as more dictations are gathered for the speaker over time. In this way, the models can track changes in the speaker's speaking style); and means for storing the updated at least one model in the memory ([col 9, lines 7-9] The CRP module 64 preferably uses the updated model for the next transcription to be analyzed from the particular speaker). Binder and Zimmerman are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder to combine the teachings of Zimmerman because doing so would allow for the use of automatic speech recognition to assist in extracting features from an audio segment that contains dictation, reducing human workload (Zimmerman [col 3, lines 48-56] Time and cost of editing automatically generated medical transcription documents can be reduced. Transcriptionist editing time can be reduced. Transcriptionist fatigue in editing transcribed documents can be reduced. Stress associated with typing/editing, including physical stress, can be reduced. Drafts of medical transcription documents can be generated for speakers with lower training data requirements. Portions of dictations can be drafted, as opposed to all of a dictation or none of a dictation). The combination of Binder and Zimmerman does not explicitly, but Pinter teaches: means for generating a graphical user interface ([0035] The physician-side interface may display a patient record (e.g., within a clinical documentation tool). In some examples, the generated SOAP note may be displayed next to the patient record. The SOAP note may be editable so the physician can make changes to the SOAP note) that presents the one or more identified segments ([0026] the system may only populate the SOAP note with content from a specified participant such as, for example and without imputing limitation, the physician. In some examples, the audio can be processed by multiple neural networks or preprocessed by various services) and the extracted features to a user ([0028] the system may diarize audio and process speaker identity as further context and input for the deep learning neural network. In some examples, dedicated microphones on both the patient-side and physician-side of the system can inform the system which speaker is associated with what audio content through, for example, dedicated and predefined audio channels); means for receiving user input indicating agreement or disagreement with the extracted features… ([0030] the physician may choose to add or change certain things in a live SOAP note as it is generated. The physician input can be integrated as another data source in the neural network. In some examples, the physician input can be used to update the neural network while the SOAP note is generated and thus increase the quality of the generated SOAP note as the encounter progresses). Binder, Zimmerman, and Pinter are considered analogous in the field of audio processing. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder and Zimmerman to combine the teachings of Pinter because doing so would allow for the use of a graphical user interface for a user to interact with in order to provide feedback regarding determined dictations, leading to more accurate health records and dictation determinations using the user feedback (Pinter [0035] The physician-side interface may display a patient record (e.g., within a clinical documentation tool). In some examples, the generated SOAP note may be displayed next to the patient record. The SOAP note may be editable so the physician can make changes to the SOAP note. When satisfied, the physician may sign the note and click a “Send” button to automatically insert the SOAP note into an EMR for that patient. Further, as discussed above, the physician changes to the generated SOAP note can be fed back into the neural network in order to further improve SOAP note generation. In some examples, the neural network can train a physician-specific model based on multiple SOAP note changes received from a particular physician). Claims 34 is rejected under 35 U.S.C. 103 as being unpatentable over Binder in view of Zimmerman and Pinter, as applied to claim 33 above, and further in view of Lavilla et al. (US 20200160845 A1; hereinafter referred to as Lavilla). Regarding claim 34, the combination of Binder, Zimmerman, and Pinter teaches: the apparatus of claim 33. The combination of Binder, Zimmerman, and Pinter does not explicitly, but Lavilla teaches: means for generating a graphical user interface that displays a plurality of identified dictation segments ([0058] nonlimiting example of a graphical user interface that displays labels that may be produced by operation 16 or downstream processing is shown in FIG. 4, described below. As previously noted, any type of content classifier(s) that produce semantic labels can be used in conjunction with the disclosed technologies to produce any type of semantic label. For example, the labels may indicate that the audio contains a particular dialect or a particular type of audio event), each associated with a time stamp ([0030] a start time of a segment may be defined by a start time or an end time of a particular class of speech content contained in the segment, and an end time of that same segment may be defined by a start time or an end time of the same content class or a different content class), and that enables selection of an individual segment for audio playback beginning at the corresponding time stamp ([0089] While this disclosure describes embodiments that analyze live audio streams, aspects of the disclosed technologies are equally applicable to other forms of audio data, including but not limited to pre-recorded audio stored in digital audio files. Also see Fig. 4.). Binder, Zimmerman, Pinter, and Lavilla are considered analogous in the field of audio processing. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder, Zimmerman, and Pinter to combine the teachings of Lavilla because doing so would help improve classification of dictation by allowing intervals of specific speech content to be identified in real-time, leading to more accurate classification (Lavilla [0109] the system generates corrective labels to replace the initial labels at times 29-35 seconds and 43-45 seconds. In this way, embodiments of the disclosed technologies can provide both improved speed and accuracy of speech content classification systems. These improvements may be particularly useful in live streaming environments in which the label output needs to be determined quickly in order to keep up with the live stream). Claim 35 is rejected under 35 U.S.C. 103 as being unpatentable over Binder in view of Zimmerman and Pinter, as applied to claim 33 above, and further in view of Hook et al. (US 20210193162 A1; hereinafter referred to as Hook). Regarding claim 35, the combination of Binder, Zimmerman, and Pinter teaches: the apparatus of claim 33. The combination of Binder, Zimmerman, and Pinter does not explicitly, but Hook teaches: means for identifying segments as dictation when a ratio of a probability output by a dictation-trained language model ([0008] speech or conversation detection involves applying one or more speech models to distinguish speech from noise. A speech model is, in some embodiments, derived through a machine learning process involving training the speech model on a plurality of speech samples and noise samples) to a probability output by a non-dictation-trained language model exceeds a predetermined threshold ([0108] The classification of the feature vector in 1006 may, for example, involve calculating both the probability that the audio signal represents speech and the probability that the audio signal represents noise. If the probability of speech is greater than the probability of noise, the speech model may output the probability of speech to indicate that the audio signal has been classified as speech. Dictation can be represented by speech and non-dictation can be represented by noise.). Binder, Zimmerman, Pinter, and Hook are considered analogous in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Binder, Zimmerman, and Pinter to combine the teachings of Hook because doing so would improve dictation classification by comparing different probabilities to determine whether a segment of audio contains speech or noise (Hook [0073] the processing unit440can determine the probability that the feature vector 432 represents speech and/or the probability that the feature vector 432 represents noise. In some embodiments, the processing unit440 may generate the result450 based on which of the two probabilities is greater). Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to Nathan Tengbumroong whose telephone number is (703)756-1725. The examiner can normally be reached Monday - Friday, 11:30 am - 8:00 pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. 
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /NATHAN TENGBUMROONG/Examiner, Art Unit 2654 /HAI PHAN/Supervisory Patent Examiner, Art Unit 2654
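As a closing technical aside, the claim 35 mapping above turns on whether the ratio of a dictation-trained model's probability to a non-dictation-trained model's probability exceeds a predetermined threshold, i.e. P_dict(x) / P_non-dict(x) > θ, or equivalently log P_dict(x) - log P_non-dict(x) > log θ. A minimal sketch of that test, with the threshold value assumed for illustration:

```python
import math

def is_dictation(log_p_dictation: float, log_p_non_dictation: float, threshold: float = 2.0) -> bool:
    """Claim 35-style test: the ratio of the dictation-model probability to the
    non-dictation-model probability must exceed a predetermined threshold
    (compared here in the log domain for numerical stability)."""
    return (log_p_dictation - log_p_non_dictation) > math.log(threshold)
```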

Prosecution Timeline

Apr 13, 2023: Application Filed
Apr 22, 2025: Non-Final Rejection — §103
Jun 02, 2025: Interview Requested
Jun 11, 2025: Examiner Interview Summary
Jun 11, 2025: Applicant Interview (Telephonic)
Jul 30, 2025: Response Filed
Sep 04, 2025: Final Rejection — §103
Oct 18, 2025: Interview Requested
Nov 07, 2025: Response after Non-Final Action
Nov 21, 2025: Request for Continued Examination
Dec 01, 2025: Response after Non-Final Action
Mar 05, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12530536
Mixture-Of-Expert Approach to Reinforcement Learning-Based Dialogue Management
2y 5m to grant • Granted Jan 20, 2026
Patent 12451142
NON-WAKE WORD INVOCATION OF AN AUTOMATED ASSISTANT FROM CERTAIN UTTERANCES RELATED TO DISPLAY CONTENT
2y 5m to grant • Granted Oct 21, 2025
Patent 12412050
MULTI-PLATFORM VOICE ANALYSIS AND TRANSLATION
2y 5m to grant • Granted Sep 09, 2025
Study what changed to get past this examiner. Based on 3 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 43%
With Interview: 99% (+75.0%)
Median Time to Grant: 3y 0m
PTA Risk: High
Based on 14 resolved cases by this examiner. Grant probability derived from career allow rate.
