Prosecution Insights
Last updated: April 19, 2026
Application No. 18/493,547

METHOD OF OPERATING SOUND RECOGNITION DEVICE IDENTIFYING SPEAKER AND ELECTRONIC DEVICE HAVING THE SAME

Non-Final OA: §101, §103, §112
Filed: Oct 24, 2023
Examiner: MEIS, JON CHRISTOPHER
Art Unit: 2654
Tech Center: 2600 (Communications)
Assignee: ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE
OA Round: 1 (Non-Final)
Grant Probability: 46% (Moderate)
OA Rounds: 1-2
To Grant: 3y 0m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 46% (grants 10 of 22 resolved cases); -16.5% vs Tech Center average
Interview Lift: strong, +59.0% on resolved cases with an interview
Avg Prosecution (typical timeline): 3y 0m; 30 applications currently pending
Career History: 52 total applications across all art units

Statute-Specific Performance

§101: 24.9% (-15.1% vs TC avg)
§103: 49.7% (+9.7% vs TC avg)
§102: 12.9% (-27.1% vs TC avg)
§112: 10.6% (-29.4% vs TC avg)
Tech Center averages are estimates. Based on career data from 22 resolved cases.

Office Action

Rejections: §101, §103, §112
Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . DETAILED ACTION Claims 1-14 are pending. Claims 1 and 10 are independent. This Application was published as US 20240203446. Apparent priority is 16 December 2022. The instant Application is directed to a method of speaker identification by fusing emotion information. Claim Interpretation The following is a quotation of 35 U.S.C. 112(f): (f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph: An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 
112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: sound recognition device in claims 1-14. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. Paragraphs [0034] – [0035] of the specification state: “[0034]…For example, the electronic device 100 may be implemented as one of various electronic devices that process sound signals, such as a smart phone, a tablet personal computer (PC), a mobile phone, a desktop PC, a laptop computer, and a personal digital assistant (PDA). [0035] The electronic device 100 may include a sound sensor 110 and a sound recognition device 120.” Therefore, “sound recognition device” is interpreted as being implemented by any electronic device that processes sound signals, such as a computer or mobile phone. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Claim Rejections - 35 USC § 112 The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention. Claim 2 rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. 
Claim 2 recites: “…a second segment determined to be the voice segment among the plurality of segments…”. However, claim 1 recites: “a first segment determined to be the voice segment among the plurality of segments…”. It is indefinite whether the first and second segments must be identical, but they are from different speakers. A suggested amendment is: “…a second segment determined to be a second voice segment among the plurality of segments…” Claim Rejections - 35 USC § 101 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. Claims 1-6 and 10-13 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more. The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception. Step 1: The independent Claims are directed to statutory categories: Claim 1 is a method claim and directed to the process category of patentable subject matter. Claim 10 is a device claim and directed to the machine or manufacture category of patentable subject matter. Step 2A, Prong One: Does the Claim recite a Judicially Recognized Exception? Abstract Idea? Are these Claims nevertheless considered Abstract as a Mathematical Concept (mathematical relationships, mathematical formulas or equations, mathematical calculations), Mental Process (concepts performed in the human mind (including an observation, evaluation, judgment, opinion), or Certain Methods of Organizing Human Activity (1-fundamental economic principles or practices (including hedging, insurance, mitigating risk), 2-commercial or legal interactions (including agreements in the form of contracts; legal obligations; advertising, marketing or sales activities or behaviors; business relations), 3- managing personal behavior or relationships or interactions between people (including social activities, teaching, and following rules or instructions) and fall under the judicial exception to patentable subject matter?) The rejected Claims recite Mental Processes. Step 2A, Prong Two: Additional Elements that Integrate the Judicial Exception into a Practical Application? Identifying whether there are any additional elements recited in the claim beyond the judicial exception(s), and evaluating those additional elements to determine whether they integrate the exception into a practical application of the exception. “Integration into a practical application” requires an additional element(s) or a combination of additional elements in the claim to apply, rely on, or use the judicial exception in a manner that imposes a meaningful limit on the judicial exception, such that the claim is more than a drafting effort designed to monopolize the exception. Uses the considerations laid out by the Supreme Court and the Federal Circuit to evaluate whether the judicial exception is integrated into a practical application. The rejected Claims do not include additional limitations that point to integration of the abstract idea into a practical application and are therefore directed to a Mental Process. Claim 1 is a generic automation of a mental process because a human agent can sense the emotional state of a speaker and use this information to identify the speaker. 
Prong Two of step 2A in the 101 analysis asks whether the abstract idea is integrated with a practical application. The answer is no in this instance because there is no technological solution in the Claim that “integrates” the abstract idea. The Claim only suggests that the abstract idea be applied. It does not describe an application. 1. A method of operating a sound recognition device communicating with a sound sensor configured to recognize an utterance sound of a first speaker to generate a sound signal, the method comprising: receiving the sound signal from the sound sensor; (agent listens to an audio file) dividing the sound signal into a plurality of segments; (agent writes down the times between sentences) determining whether each of the plurality of segments is a voice segment or a non-voice segment; (agent determines if there is speech in each segment) generating first emotion recognition information of the first speaker based on a first segment determined to be the voice segment among the plurality of segments; and (agent determines the speaker is angry) identifying the first speaker by fusing the first segment and the first emotion recognition information. (Agent recognizes the speaker as either Bob or Larry. Agent knows that Bob is always angry, and Larry is always happy, so agent identifies the speaker as Bob.) Step 2B: Search for Inventive Concept: Additional Element Do not amount to Significantly More: The limitations of "sound recognition device" and “sound sensor” are well-understood, routine, and conventional machine components that are being used for their well-understood, routine, and conventional and rather generic functions. These limitations are expressed parenthetically and lack nexus to the Claim language and as such are a separable and divisible mention to a machine. Additionally, the use of a sound sensor amounts to necessary data gathering. Accordingly, they are not sufficient to cause the Claim to amount to significantly more than the underlying abstract idea. The Dependent Claims do not add limitations that could help the Claim as a whole to amount to significantly more than the Abstract idea identified for the Independent Claim: 2. The method of claim 1, wherein the sound sensor is further configured to further recognize an utterance sound of a second speaker different from the first speaker to generate the sound signal, and wherein the method further comprising: generating second emotion recognition information of the second speaker based on a second segment determined to be the voice segment among the plurality of segments; and identifying the second speaker by fusing the second segment and the second emotion recognition information. (Agent listens to a second sentence in the recording, determines the speaker is happy, and identifies him as Larry.) 3. The method of claim 1, wherein the sound sensor is further configured to further recognize a non-utterance sound of the first speaker to generate the sound signal, and wherein the method further comprising: generating first situation recognition information based on a third segment determined to be the non-voice segment among the plurality of segments. (Agent hears a car horn in the background and determines that the speakers were driving a car.) 4. 
The method of claim 1, wherein the sound sensor is further configured to further recognize an ambient sound to generate the sound signal, and wherein the method further comprising: generating second situation recognition information based on a fourth segment determined to be the non-voice segment among the plurality of segments. (the car horn is an ambient sound) 5. The method of claim 4, wherein the second situation recognition information indicates one of a scene sound, an animal sound, a surrounding object sound, a music sound, and a natural sound. (the car horn is a surrounding object sound) 6. The method of claim 1, wherein the dividing of the sound signal into the plurality of segments includes: dividing the sound signal into a plurality of frames of a reference time unit; (agent divides the audio into minute frames) generating situation information of each of the plurality of frames; (agent determines if there are car noises in each frame) and generating the plurality of segments by grouping a series of frames having the same situation information among the plurality of frames. (agent groups the frames with car noises) The additional limitations introduced by the Dependent Claims are not sufficient as additional elements that integrate the judicial exception into a practical application or as additional elements that cause the Claim as a whole to amount to substantially more than the underlying abstract idea. With respect to Independent Claim 10, which has limitations similar to the limitations of Claim 1, the limitations of “electronic device” and “sound recognition device,” are expressed parenthetically and lack nexus to the Claim language and as such are a separable and divisible mention to a machine. Accordingly, they do not include additional limitations that cause the Claim as a whole to amount to more than the underlying abstract idea. The Dependent Claims 11-13 are similar to claims 1, 2, and 4, and do not add limitations that could integrate the judicial exception into a practical application or help the Claim as a whole to amount to significantly more than the Abstract idea identified for the Independent Claim. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. Claim(s) 1-2, 4-8 and 10-14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kaskari et al. (US 20220093106 A1) in view of Daimo (US 20220383880 A1). Regarding claim 1, Kaskari discloses: 1. 
A method of operating a sound recognition device (“[0047]… In some embodiments, the electronic system 610 may be, or may be coupled to, a mobile phone, a tablet, a laptop computer, a desktop computer, an automobile, a personal digital assistant (PDA), a television, a voice interactive device (e.g., a smart speaker, conference speaker system, etc.), a network or system access point, and/or other system of device configured to receive user voice input for authentication and/or identification.”) communicating with a sound sensor configured to recognize an utterance sound of a first speaker to generate a sound signal, the method comprising: receiving the sound signal from the sound sensor; ("[0025]... A process 100 includes receiving an audio input sample 110, representing a detected keyword uttered by a speaker. In some embodiments, the system includes one or more microphones sensing sound and converting the sound to electrical signals…" ) dividing the sound signal into a plurality of segments; determining whether each of the plurality of segments is a voice segment or a non-voice segment; ("[0025]...The received audio signal is processed through audio input circuitry and one or more digital audio processing systems, which may include a voice activity detector (VAD) configured to identify speech segments in the received audio signal…" ) generating first emotion recognition information of the first speaker based on a first segment determined to be the voice segment among the plurality of segments; and (Not explicitly disclosed) identifying the first speaker by fusing the first segment and the first emotion recognition information. ("[0039]...In step 476, the system extracts features from recorded speech segments and inputs the features to a trained neural network to generate embedding vectors. In step 478, the system computes a confidence score for one or more stored speaker ID centroids and the user embedding vectors, and in step 480, compares the confidence score with a threshold to decide whether the speaker belongs to a specific ID..." ) Kaskari does not explicitly disclose that emotion recognition is one of the features used to identify the speaker. Daimo discloses: generating first emotion recognition information of the first speaker based on a first segment determined to be the voice segment among the plurality of segments; and ("[0057] DNN 122, having received the connected plurality of frames of the MFCCs, outputs an emotion label of the highest probability as an estimation result of emotion estimator 12..." ) identifying the first speaker by fusing the first segment and the first emotion recognition information. ("[0058] Speaker identification processor 13 outputs, based on the acoustic feature value calculated from the utterance data, a score for identifying the speaker of the utterance data, using the estimation result of emotion estimator 12." See also, Fig. 4 shows that the Acoustic feature value and Estimation result are used to identify the speaker.) Kaskari and Daimo are considered analogous art to the claimed invention because they disclose methods of speaker identification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Kaskari to include emotion recognition as taught by Daimo. Doing so would have been beneficial to increase accuracy for an emotional utterance. (Daimo [0011]) Regarding claim 2, Kaskari discloses: 2. 
The method of claim 1, wherein the sound sensor is further configured to further recognize an utterance sound of a second speaker different from the first speaker to generate the sound signal, and wherein the method further comprising: generating second emotion recognition information of the second speaker based on a second segment determined to be the voice segment among the plurality of segments; and identifying the second speaker by fusing the second segment and the second emotion recognition information. ("[0006] In various embodiments, a method includes receiving a training batch of audio samples comprising a plurality of utterances for each of a plurality of speakers ... Computing the GNLL may include generating a centroid vector for each of a plurality of speakers, based at least in part on the embedding vectors..." ) Regarding claim 4, Kaskari discloses: 4. The method of claim 1, wherein the sound sensor is further configured to further recognize an ambient sound to generate the sound signal, (Kaskari discloses a microphone which can inherently recognize an ambient sound to generate a sound signal) and wherein the method further comprising: generating second situation recognition information based on a fourth segment determined to be the non-voice segment among the plurality of segments. ("[0029]... Other audio feature extraction approaches may also be used in various embodiments (e.g., features related to speech recognition, noise, music, etc.) to extract additional information from the audio sample as relevant to a particular implementation." – Noise and music are non-voice. A feature can be considered situation recognition information.) Regarding claim 5, Kaskari discloses: 5. The method of claim 4, wherein the second situation recognition information indicates one of a scene sound, an animal sound, a surrounding object sound, a music sound, and a natural sound. ("[0029]... Other audio feature extraction approaches may also be used in various embodiments (e.g., features related to speech recognition, noise, music, etc.) to extract additional information from the audio sample as relevant to a particular implementation." ) Regarding claim 6, Kaskari discloses: 6. The method of claim 1, wherein the dividing of the sound signal into the plurality of segments includes: dividing the sound signal into a plurality of frames of a reference time unit; ("[0026] The audio input sample 110 is fed to a neural network 120. In various embodiments, the input speech samples are derived from an audio signal in fixed length frames that are preprocessed for feature extraction (e.g., passing the audio signal through finite impulse response filter, partitioning the audio signal into frames, applying echo and noise cancellation/suppression, etc.), before input to the neural network 120." ) generating situation information of each of the plurality of frames; ("[0025]...The received audio signal is processed through audio input circuitry and one or more digital audio processing systems, which may include a voice activity detector (VAD) configured to identify speech segments in the received audio signal…" – voice activity is situation information. See also Fig. 1.) and generating the plurality of segments by grouping a series of frames having the same situation information among the plurality of frames. 
("[0039]...In step 474, the audio signals received from the microphones are processed to suppress noise, cancel echo, identify speech segments, enhance a speech target, and/or otherwise prepare the audio signal for input to a neural network trained for speech verification. In step 476, the system extracts features from recorded speech segments and inputs the features to a trained neural network to generate embedding vectors..." – the speech segments are grouped frames that contain voice activity.) Regarding claim 7, Kaskari discloses: 7. The method of claim 1, wherein the generating of the first emotion recognition information of the first speaker based on the first segment determined to be the voice segment among the plurality of segments includes: generating a SER (Speech Emotion Recognition) embedding vector based on the first segment; and generating the first emotion recognition information based on the SER embedding vector. ("[0006] In various embodiments, a method includes receiving a training batch of audio samples comprising a plurality of utterances for each of a plurality of speakers (e.g., a first number of speakers and a second number of utterances per speaker), extracting features from the audio samples to generate a batch of features, processing the batch of features using a neural network to generate a plurality of embedding vectors configured to differentiate audio samples by speaker, computing a generalized negative log-likelihood loss (GNLL) value for the training batch based, at least in part, on the embedding vectors, and modifying weights of the neural network to reduce the GNLL value." ) Kaskari does not disclose that the embedding vector is for emotion. Daimo discloses: Speech Emotion Recognition ("[0057] DNN 122, having received the connected plurality of frames of the MFCCs, outputs an emotion label of the highest probability as an estimation result of emotion estimator 12..." ) See claim 1 for motivation statement. Regarding claim 8, Kaskari discloses: 8. The method of claim 7, wherein the identifying of the first speaker by fusing the first segment and the first emotion recognition information includes: generating a SI (Speaker Identification) embedding vector based on the first segment and the SER embedding vector; and identifying the first speaker based on the SI embedding vector. ("[0006] In various embodiments, a method includes receiving a training batch of audio samples comprising a plurality of utterances for each of a plurality of speakers (e.g., a first number of speakers and a second number of utterances per speaker), extracting features from the audio samples to generate a batch of features, processing the batch of features using a neural network to generate a plurality of embedding vectors configured to differentiate audio samples by speaker, computing a generalized negative log-likelihood loss (GNLL) value for the training batch based, at least in part, on the embedding vectors, and modifying weights of the neural network to reduce the GNLL value." - Fig. 2 further shows fusion of features for embedding vectors. ) Claim 10 is a system claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale. Additionally, an electronic device of the Claim are taught by Kaskari (“ELECTRONIC SYSTEM 610”, Fig. 6) Regarding claim 11, Kaskari discloses: 11. 
The electronic device of claim 10, wherein the sound recognition device includes: a segment manager configured to receive the sound signal, to divide the sound signal into a plurality of segments, and to determine whether each of the plurality of segments is a voice segment or a non-voice segment; an emotion information recognition device configured to receive the first segment determined to be the voice segment, and to generate the first emotion recognition information of the first speaker based on the first segment; and a speaker identification device configured to receive the first segment and the first emotion recognition information, and to fuse the first segment and the first emotion recognition information to identify the first speaker. (See claim 1 for mapping of functions. Instant application discloses in [0034] that all the functions are performed by the electronic device, which may be a phone, computer, PDA, etc. Kaskari discloses: “[0047]… In some embodiments, the electronic system 610 may be, or may be coupled to, a mobile phone, a tablet, a laptop computer, a desktop computer, an automobile, a personal digital assistant (PDA), a television, a voice interactive device (e.g., a smart speaker, conference speaker system, etc.), a network or system access point, and/or other system of device configured to receive user voice input for authentication and/or identification.” Kaskari further discloses: “[0058] Where applicable, various embodiments provided by the present disclosure may be implemented using hardware, software, or combinations of hardware and software. Also, where applicable, the various hardware components and/or software components set forth herein may be combined into composite components comprising software, hardware, and/or both without departing from the scope of the present disclosure. Where applicable, the various hardware components and/or software components set forth herein may be separated into sub-components comprising software, hardware, or both without departing from the scope of the present disclosure. In addition, where applicable, it is contemplated that software components may be implemented as hardware components and vice versa.”) Claim 12 is a system claim with limitations corresponding to the limitations of Claim 2 and is rejected under similar rationale. Claim 13 is a system claim with limitations corresponding to the limitations of Claim 4 and is rejected under similar rationale. Claim 14 is a system claim with limitations corresponding to the limitations of Claim 8 and is rejected under similar rationale. Claim(s) 3 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kaskari in view of Daimo, in further view of Sun et al. (US 20130304478 A1). Regarding claim 3, Kaskari discloses: 3. The method of claim 1, wherein the sound sensor is further configured to further recognize a non-utterance sound of the first speaker to generate the sound signal, and (Kaskari discloses a microphone, which would inherently be configured to generate a sound signal from either an utterance or non-utterance sound.) wherein the method further comprising: generating first situation recognition information based on a third segment determined to be the non-voice segment among the plurality of segments. (Not explicitly disclosed.) Kaskari and Daimo do not disclose that a non-voice segment is used to generate situation recognition information. 
Sun discloses: wherein the method further comprising: generating first situation recognition information based on a third segment determined to be the non-voice segment among the plurality of segments. ("[0028] As a second example, the side information extractor 120 may function as a health detector that detects the user's health condition in making the utterance. For instance, the side information may indicate whether it sounds like that the user is coughing, snuffling, or having a running nose, or indicate that the user may be sick because there is a recent doctor's appointment in the calendar. As a third example, the side information extractor 120 may function as an emotion detector that detects the user's emotion in making the utterance. For instance, the side information may indicate whether the user is happy, angry, or sad. As a fourth example, the side information extractor 120 may function as an event detector and detect a recent event of the user." ) Kaskari, Daimo, and Sun are considered analogous art to the claimed invention because they disclose methods of speaker identification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Kaskari in view of Daimo to detect non-utterance sounds such as coughing, as taught by Sun. Doing so would have been beneficial to identify if the user is sick. (Sun [0028]) Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kaskari in view of Daimo, in further view of Gopinathan et al. (US 20150142446 A1). Regarding claim 9, Kaskari discloses: 9. The method of claim 1, further comprising: extracting a frequency domain feature and a time domain feature of the first segment, and wherein the frequency domain feature includes an MFCC (Mel-frequency cepstral coefficient) value, and ("[0029] In various embodiments, the extracted features may include features derived through one or more of modified group delay functions, spectral slope-based analysis, short-time Fourier transform analysis, cepstral analysis, complex cepstral analysis, linear prediction coefficients, linear prediction cepstrum coefficients, linear prediction cepstral coefficients, Mel frequency cepstral coefficients, discrete wavelet transform, perceptual linear prediction, Mel-scaled discrete wavelet analysis, and/or other audio feature analyses capable of generating features from audio input data to differentiate between a plurality of speakers. Other audio feature extraction approaches may also be used in various embodiments (e.g., features related to speech recognition, noise, music, etc.) to extract additional information from the audio sample as relevant to a particular implementation." ) wherein the time domain feature includes at least one of the loudness, speed, stress, pitch change, speech time, and pause time of the utterance sound. (not explicitly disclosed ) Kaskari and Daimo do not explicitly disclose time domain features include loudness, speed, stress, pitch change, speech time, and pause time. Gopinathan discloses: wherein the time domain feature includes at least one of the loudness, speed, stress, pitch change, speech time, and pause time of the utterance sound. ("[0049]...The process may extract primary features from the voice files that now contain only the customers' voices 601. 
The primary features are classified based on the domain they are extracted from with time domain primary features capturing the variation of amplitude with respect to time (for example, Amplitude, Sound power, Sound intensity, Zero crossing rate, Mean crossing rate, Pause length ratio, Number of pauses, Number of spikes, Spike length ratio)..." – amplitude, power, and intensity are all representations of loudness.) Kaskari, Daimo, and Gopinathan are considered analogous art to the claimed invention because they disclose methods of speaker identification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Kaskari in view of Daimo to use time domain features such as amplitude and pause length ratio, as taught by Gopinathan. Doing so would have been beneficial to provide additional input to predictive models. (Gopinathan [0015]) This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396. Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Cilingir et al. (US 20180366124 A1). Cilingir discloses a method of speaker identification using context information which can include emotional state. Any inquiry concerning this communication or earlier communications from the examiner should be directed to JON C MEIS whose telephone number is (703)756-1566. The examiner can normally be reached Monday - Thursday, 8:30 am - 5:30 pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /JON CHRISTOPHER MEIS/Examiner, Art Unit 2654 /HAI PHAN/Supervisory Patent Examiner, Art Unit 2654
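For orientation, below is a minimal, purely illustrative Python sketch of the pipeline as the rejection characterizes claims 1 and 6-8: divide the signal into frames of a reference time unit, label each frame voice or non-voice, group consecutive frames sharing the same label into segments, build an emotion-style embedding and a speaker-style embedding for a voice segment, and fuse them to score enrolled speakers. The helper names, the energy-threshold VAD, the toy embeddings, and the concatenate-and-cosine fusion are all assumptions made for illustration; none of this is the applicant's, Kaskari's, or Daimo's actual implementation.

```python
# Illustrative sketch only: a toy version of the pipeline the rejection maps
# (frames -> voice/non-voice labels -> segments -> emotion + speaker embeddings
# -> fusion -> speaker identification). Helper names and design choices
# (energy-threshold VAD, toy embeddings, cosine scoring) are assumptions.
import numpy as np

def frame_signal(signal, sr, frame_ms=20):
    """Divide the sound signal into fixed-length frames of a reference time unit."""
    hop = int(sr * frame_ms / 1000)
    n = len(signal) // hop
    return signal[: n * hop].reshape(n, hop)

def simple_vad(frames, threshold=1e-3):
    """Label each frame voice/non-voice from short-time energy (stand-in for a real VAD)."""
    energy = np.mean(frames ** 2, axis=1)
    return energy > threshold

def group_segments(labels):
    """Group consecutive frames with the same label into segments (claim 6 style)."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, bool(labels[start])))
            start = i
    return segments  # (first_frame, last_frame_exclusive, is_voice)

def ser_embedding(segment_frames):
    """Toy 'emotion' embedding from crude spectral statistics, not a trained SER model."""
    spec = np.abs(np.fft.rfft(segment_frames, axis=1)).mean(axis=0)
    return spec[:16] / (np.linalg.norm(spec[:16]) + 1e-9)

def speaker_embedding(segment_frames):
    """Toy 'speaker' embedding; a real system would use a trained neural network."""
    feats = np.concatenate([segment_frames.mean(axis=0)[:16], segment_frames.std(axis=0)[:16]])
    return feats / (np.linalg.norm(feats) + 1e-9)

def identify_speaker(segment_frames, enrolled):
    """Fuse speaker and emotion embeddings (concatenation) and score enrolled centroids."""
    fused = np.concatenate([speaker_embedding(segment_frames), ser_embedding(segment_frames)])
    fused = fused / (np.linalg.norm(fused) + 1e-9)
    scores = {name: float(np.dot(fused, c) / (np.linalg.norm(c) + 1e-9))
              for name, c in enrolled.items()}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    sr = 16000
    rng = np.random.default_rng(0)
    # One second of louder (voice-like) noise followed by one second of near-silence.
    signal = np.concatenate([0.1 * rng.standard_normal(sr), 1e-4 * rng.standard_normal(sr)])
    frames = frame_signal(signal, sr)
    segments = group_segments(simple_vad(frames))
    enrolled = {"speaker_a": rng.standard_normal(48), "speaker_b": rng.standard_normal(48)}
    for start, end, is_voice in segments:
        if is_voice:
            print(identify_speaker(frames[start:end], enrolled)[0])
```

In the application as claimed, the SER embedding would come from a trained speech-emotion model and the SI embedding is generated from the segment together with the SER embedding (claim 8); the concatenation used above is only a stand-in for that fusion step.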

Prosecution Timeline

Oct 24, 2023
Application Filed
Oct 22, 2025
Non-Final Rejection — §101, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603087
VOICE RECOGNITION USING ACCELEROMETERS FOR SENSING BONE CONDUCTION
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12579975
Detecting Unintended Memorization in Language-Model-Fused ASR Systems
Granted Mar 17, 2026 (2y 5m to grant)
Patent 12482487
MULTI-SCALE SPEAKER DIARIZATION FOR CONVERSATIONAL AI SYSTEMS AND APPLICATIONS
Granted Nov 25, 2025 (2y 5m to grant)
Patent 12475312
FOREIGN LANGUAGE PHRASES LEARNING SYSTEM BASED ON BASIC SENTENCE PATTERN UNIT DECOMPOSITION
Granted Nov 18, 2025 (2y 5m to grant)
Patent 12430329
TRANSFORMING NATURAL LANGUAGE TO STRUCTURED QUERY LANGUAGE BASED ON MULTI-TASK LEARNING AND JOINT TRAINING
Granted Sep 30, 2025 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 46%
With Interview: 99% (+59.0%)
Median Time to Grant: 3y 0m
PTA Risk: Low
Based on 22 resolved cases by this examiner. Grant probability derived from career allow rate.
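The dashboard does not state its exact formula, but as a rough illustration of how a career allow rate and an interview-lift figure of this kind are typically computed from resolved-case counts, here is a short Python sketch. The with/without-interview split below is hypothetical and is not taken from this examiner's record.

```python
# Purely illustrative arithmetic; hypothetical counts, not the dashboard's actual formula.
def allow_rate(granted, resolved):
    return granted / resolved if resolved else 0.0

# Hypothetical split of 22 resolved cases (10 granted) by whether an interview was held.
with_interview = allow_rate(granted=8, resolved=9)
without_interview = allow_rate(granted=2, resolved=13)
career = allow_rate(granted=10, resolved=22)           # "career allow rate"
lift = with_interview - without_interview              # percentage-point lift

print(f"career {career:.0%}, lift {lift:+.1%}, with interview {with_interview:.0%}")
```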
