Prosecution Insights
Last updated: April 19, 2026
Application No. 18/532,054

SPEAKER IDENTIFICATION METHOD, SPEAKER IDENTIFICATION DEVICE, AND NON-TRANSITORY COMPUTER READABLE RECORDING MEDIUM

Non-Final OA: §101, §103
Filed
Dec 07, 2023
Examiner
SOLAIMAN, FOUZIA HYE
Art Unit
2653
Tech Center
2600 — Communications
Assignee
Panasonic Intellectual Property Corporation of America
OA Round
1 (Non-Final)
Grant Probability: 67% (Favorable)
OA Rounds: 1-2
To Grant: 3y 0m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 67% (above average; 42 granted / 63 resolved; +4.7% vs TC avg)
Interview Lift: +55.5% (strong; allowance rate with an interview vs. without, among resolved cases with an interview)
Typical timeline: 3y 0m average prosecution; 16 applications currently pending
Career history: 79 total applications across all art units

Statute-Specific Performance

§101: 28.5% (-11.5% vs TC avg)
§103: 47.1% (+7.1% vs TC avg)
§102: 16.0% (-24.0% vs TC avg)
§112: 2.7% (-37.3% vs TC avg)
Tech Center averages are estimates • Based on career data from 63 resolved cases

Office Action

§101 §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

Acknowledgment is made of the information disclosure statements filed on 1/18/2024. It is noted by the Examiner that all of the references were considered.

Priority

The instant application, filed 04/04/2024, claims foreign priority to JP2021-097888, filed 06/11/2021, and claims the benefit of PCT/JP2022/020878, filed 5/19/2022.

Drawings

The drawings submitted on 12/07/2023 have been considered and accepted.

CLAIM INTERPRETATION

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always, linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.

Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C.
112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: “an acquisition part”, “first selection part”, “second selection part”, “a similarity calculation part” and “an output part” in claims 11, according to para [0044] in the filed application. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Claim Rejections - 35 USC § 101 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. Claims 1- 12 are rejected under 35 U.S.C. § 101 because the claims are directed to a judicial exception (an abstract idea) and do not recite additional elements that amount to significantly more. Summary of the statutory framework and guidance applied The Office applies the two-step framework for subject matter eligibility consistent with current USPTO guidance. Step 1: Determine whether the claim recites a statutory category (process, machine, manufacture, or composition of matter). Step 2A, Prong One: If the claim recites a statutory category, determine whether the claim recites a judicial exception (abstract idea, law of nature, or natural phenomenon). Step 2A, Prong Two: If a judicial exception is recited, determine whether the claim integrates the exception into a practical application. Step 2B: If the claim does not integrate the exception into a practical application, determine whether the claim recites additional elements that amount to significantly more than the judicial exception. 
Step 1 (statutory category)

Claims 1–10 are directed to a method (process). Claim 11 is directed to a device (machine). Claim 12 is directed to a non-transitory computer-readable medium (manufacture). Each of these falls within a statutory category.

Step 2A, Prong One (judicial exception)

The claims recite selection of phoneme sequences from a certain speaker, selection of databases associated with a specific speaker, and selection of a database based on calculated differences. These elements constitute a judicial exception in the form of abstract mathematical concepts and data processing.

Step 2A, Prong Two (integration into a practical application)

The claims do not integrate the judicial exception into a practical application. The additional elements recited—generic computing components (processors, memory, a computing device)—are described at a high level and perform conventional functions of receiving, inputting, and outputting data. The claims do not describe a specific improvement to the functioning of the computer or another technological improvement. They instead use generic computer implementation to carry out the abstract mathematical operations.

Step 2B (significantly more)

The claims recite no additional elements or combination of elements that amount to significantly more than the abstract idea. The claimed processors, memories, and device are well-understood, routine, and conventional components in the field of natural language processing. The claims do not recite a non-conventional arrangement of components or otherwise specify how the claimed elements effect a technological improvement.

Limitation-by-limitation analysis for independent claims

Claim 1 (method) – Limitation analysis

“Acquiring input utterance data being utterance data concerning an utterance of a certain utterer;” — This is data gathering/input identification and is a conventional, pre-solution activity. A human can gather or collect data.

“performing voice recognition from the input utterance data” — The claim recites recognizing the voice of the utterer, which the human mind can do.

“selecting, from among a plurality of registered utterance contents set in advance, a registered utterance content closest to a recognized utterance content indicated by a result of the voice recognition as a selected utterance content;” — A human can determine utterance content such as a phrase, phoneme sequence, accent, or tone of the user/utterer; this is human activity.

“selecting, from among a plurality of databases respectively associated with the registered utterance contents, a database associated with the selected utterance content, each of the databases storing a feature quantity of utterance data concerning a registered utterance content having been uttered by a registered utterer” — A human can create multiple databases/libraries based on different speakers and select the dictionary associated with each speaker. For example, if there are three speakers, a human can create three databases/libraries/dictionaries, each storing information associated with one speaker. A human can select a database based on user-specific content.

“calculating a similarity between a feature quantity of the input utterance data and the feature quantity stored in the selected database; and” — This recites a mathematical calculation which a human can perform using paper and pencil or in the human mind.

“identifying the certain utterer on the basis of the similarity, and outputting a result of the identification.” — A human can recognize a specific speaker in the human mind.
And outputting data is merely post-solution activity.

Conclusion for claim 1: The claim is directed to abstract mental and mathematical activity implemented on conventional computing; it does not integrate the judicial exception into a practical application and does not include additional elements that amount to significantly more.

Claim 11 (device) – Limitation analysis

The claim recites “an acquisition part”, “first selection part”, “second selection part”, “a similarity calculation part” and “an output part” in claim 11, which are generic hardware components (“[0044] The processor 3 includes, for example, a central processing unit, and has the acquisition part …”) configured to execute instructions that perform the same abstract data transformations and training steps recited in claim 1. Implementing the abstract method on generic hardware does not provide a practical application or significantly more.

Conclusion for claim 11: The device claim is ineligible for the same reasons as claim 1.

Claim 12 (computer-readable medium) – Limitation analysis

The claim recites instructions stored on a non-transitory medium that cause a processor to perform the same abstract steps recited in claim 1. The storage of instructions to perform an abstract method on a computer-readable medium does not render the subject matter eligible, because mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible.

Conclusion for claim 12: The claim is ineligible for the same reasons as claim 1.

Dependent claims (claims 2-10)

With respect to claim(s) 2, the claim(s) recite(s) “wherein, in the selecting of the selected utterance content, when the registered utterance contents include a registered utterance content identical to the recognized utterance content, the identical registered utterance content is selected as the selected utterance content.” A human can identify a phoneme sequence matched with a registered utterance content and select it. No additional limitations are present.

With respect to claim(s) 3, the claim(s) recite(s) “wherein, in the selecting of the selected utterance content, when the registered utterance contents include no registered utterance content identical to the recognized utterance content, the closest registered utterance content is selected as the selected utterance content.” A human can identify a phoneme-sequence or keyword mismatch and select the registered utterance content closest to the recognized sound elements. No additional limitations are present.

With respect to claim(s) 4, the claim(s) recite(s) “wherein, in the selecting of the selected utterance content, a registered utterance content which includes all sound elements of the recognized utterance content is selected from among the registered utterance contents.” A human can recognize all sound elements in the utterance, such as phonemes, sounds, sequences, vowels, and pronunciation. No additional limitations are present.

With respect to claim(s) 5, the claim(s) recite(s) “wherein, in the selecting of the selected utterance content, a registered utterance content which has configuration data closest to configuration data indicating a configuration of sound elements of the recognized utterance content is selected from among the registered utterance contents.” A human can recognize all registered/recognized sound elements of the phonetic representation and can analyze them. No additional limitations are present.
With respect to claim(s) 6 and 7, the claim(s) recite(s) “wherein the sound element includes a phoneme” and “wherein the sound element includes a vowel”, which reads on a human mind recognize vowel in the phoneme sequence. No additional limitations are present. With respect to claim(s) 8, the claim(s) recite(s) “wherein the sound element includes a phoneme sequence in each of n-syllabified phonemic units of an utterance content, "n" being an integer of two or larger.” Human identify how many syllable/s in the phoneme and represent them in the integer values. No additional limitations are present. With respect to claim(s) 9, the claim(s) recite(s) “wherein the configuration data includes a vector which is defined by allocation of a value corresponding to an occurrence frequency of one or more sound elements of the recognized utterance content or the registered utterance content to a positional arrangement of all sound elements set in advance.” Human can identify phoneme sequence arrangement and location of the vowel/consonant. No additional limitations are present. With respect to claim(s) 10, the claim(s) recite(s) “wherein the value corresponding to the occurrence frequency is defined by an occurrence frequency proportion of each of the one or more sound elements that occupies a total number of sound elements of the recognized utterance content or the registered utterance content.”, which reads on a human identifying number of times each phoneme occurs in the total phoneme, and define them in math equation. No additional limitations are present. These claims further do not remedy the judicial exception being integrated into a practical application and further fail to include additional elements that are sufficient to amount to significantly more than the judicial exception. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim/s 1-7, 9 11, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Hayakawa et al. US 20180075843 A1 in view of NISHIDA HIROTO, JP 2003152521 A . Regarding Claim 1 Hayakawa teaches: 1. An utterer identification method for an utterer identification device, comprising: acquiring input utterance data being utterance data concerning an utterance of a certain utterer; Hayakawa teaches (“[0077] By confirming a presented keyword and performing a prescribed input operation by the user, a device connected to the voice recognition device 1 or a device in which the voice recognition device 1 is implemented may execute an operation corresponding to the keyword. Alternatively, the user may utter a voice indicating approval or disapproval. By recognizing the voice, the voice recognition device 1 may determine approval or disapproval. 
When the voice recognition device 1 determines that the user has uttered a voice indicating approval, the device connected to the voice recognition device 1 or the device in which the voice recognition device 1 is implemented may execute an operation corresponding to the keyword.”) (“[0079] The voice section detection unit 21 detects a voice section from an input voice signal (step S101). With respect to each frame in the voice section, the feature extraction unit 22 calculates a feature vector that includes a plurality of feature amounts representing characteristics of the voice of a user (step S102).”) Hayakawa et al. US 20180075843 A1 performing voice recognition from the input utterance data; Hayakawa teaches (“[0077] By confirming a presented keyword and performing a prescribed input operation by the user, a device connected to the voice recognition device 1 or a device in which the voice recognition device 1 is implemented may execute an operation corresponding to the keyword. …”) Hayakawa et al. US 20180075843 A1 selecting, from among a plurality of registered utterance contents set in advance, a registered utterance content closest to a recognized utterance content indicated by a result of the voice recognition as a selected utterance content; Hayakawa teaches (“0023] Therefore, the voice recognition device extracts a common phoneme string from voices that are uttered repeatedly by the user, exemplifying a speaker, and compares the extracted phoneme string with the phoneme strings of the respective keywords registered in a keyword dictionary to select the most resembling keyword. The voice recognition device presents the selected keyword to the user. The keyword may be an individual word or a phrase including a plurality of words.”) (“[0027] The processing unit 13 includes, for example, one or a plurality of processors, a memory circuit, and a peripheral circuit. By performing voice recognition processing, the processing unit 13 selects one of the keywords registered in a keyword dictionary on the basis of the voice signal and outputs information representing the selected keyword via the communication interface unit 15. Alternatively, the processing unit 13 may display the selected keyword via a display device (not illustrated). Details of the voice recognition process performed by the processing unit 13 will be described later.”) (“claim 2… wherein selection of the predetermined number of keywords includes selecting the prescribed number of keywords among the plurality of keywords in descending order of the degree of similarity for each keyword.”) (“[0084] As described thus far, when no keyword is recognized among the keywords registered in the keyword dictionary from the voice that the user uttered, the voice recognition device extracts a common phoneme string that appears in common between maximum-likelihood phoneme strings of a plurality of voice sections that have been uttered repeatedly. The voice recognition device calculates degrees of similarity between the common phoneme string and the phoneme strings of the respective keywords registered in the keyword dictionary in accordance with the DP matching, identifies a keyword corresponding to a maximum value among the degrees of similarity, and presents the identified keyword to the user. 
Thus, even when the user does not correctly utter a keyword registered in the keyword dictionary and utters a different phrase each time, the voice recognition device may identify a keyword that the user intended to make the voice recognition device recognize. Therefore, even when the user does not remember a keyword correctly, the voice recognition device may prevent the user from uttering repeatedly to try to utter the keyword.”) (“[0079] The voice section detection unit 21 detects a voice section from an input voice signal (step S101). With respect to each frame in the voice section, the feature extraction unit 22 calculates a feature vector that includes a plurality of feature amounts representing characteristics of the voice of a user (step S102).”) Hayakawa et al. US 20180075843 A1 calculating a similarity between a feature quantity of the input utterance data and the feature quantity stored in the selected database; and Hayakawa teaches (“[0045] Specifically, with respect to each frame in the voice section, by inputting the feature vector of the frame to the GMM, the maximum-likelihood phoneme string search unit 23 calculates output probabilities of the respective HMM states corresponding to the respective phonemes for the frame. …”) (“[0079] The voice section detection unit 21 detects a voice section from an input voice signal (step S101). With respect to each frame in the voice section, the feature extraction unit 22 calculates a feature vector that includes a plurality of feature amounts representing characteristics of the voice of a user (step S102).”) (“[0052] For example, with respect to each keyword registered in the keyword dictionary, the determination unit 24 calculates a degree of similarity P between the phoneme string of the keyword and a maximum-likelihood phoneme string in accordance with, for example, the following equation.”) (“[0064] For example, with respect to each keyword registered in the keyword dictionary, the matching unit 26 calculates a Levenshtein distance (also referred to as edit distance) between the phoneme string corresponding to the keyword and the common phoneme string. In so doing, the matching unit 26 calculates a shortest path that minimizes a cumulative distance using the DP matching.”) (“[0069] With respect to each of the keywords registered in the keyword dictionary, the matching unit 26 calculates a degree of similarity between the phoneme string of the keyword and the common phoneme string using a result of the DP matching. For example, the matching unit 26 calculates a degree of similarity P in accordance with the following equation.”) by Hayakawa US 20180075843 A1 identifying the certain utterer on the basis of the similarity, and Hayakawa teaches the determination unit 24 determines whether or not the user uttered any keyword in the voice section and determines that the user has uttered a voice. (“[0050] By comparing the maximum-likelihood phoneme string of the voice section with the phoneme strings representing the utterances of the keywords registered in the keyword dictionary, the determination unit 24 determines whether or not the user uttered any keyword in the voice section.”) (“[077] … When the voice recognition device 1 determines that the user has uttered a voice indicating approval, the device connected to the voice recognition device 1 or the device in which the voice recognition device 1 is implemented may execute an operation corresponding to the keyword. 
…”) (“[0079] The voice section detection unit 21 detects a voice section from an input voice signal (step S101). With respect to each frame in the voice section, the feature extraction unit 22 calculates a feature vector that includes a plurality of feature amounts representing characteristics of the voice of a user (step S102).”) (“[0084] …The voice recognition device calculates degrees of similarity betw een the common phoneme string and the phoneme strings of the respective keywords registered in the keyword dictionary in accordance with the DP matching, identifies a keyword corresponding to a maximum value among the degrees of similarity, and presents the identified keyword to the user. …”) (“[0049] The maximum-likelihood phoneme string search unit 23 outputs the obtained maximum-likelihood phoneme string to the determination unit 24.”) (“[0050] By comparing the maximum-likelihood phoneme string of the voice section with the phoneme strings representing the utterances of the keywords registered in the keyword dictionary, the determination unit 24 determines whether or not the user uttered any keyword in the voice section.”) (“[0079] The voice section detection unit 21 detects a voice section from an input voice signal (step S101). With respect to each frame in the voice section, the feature extraction unit 22 calculates a feature vector that includes a plurality of feature amounts representing characteristics of the voice of a user (step S102).”) Hayakawa et al. US 20180075843 A1 outputting a result of the identification. Hayakawa teaches (“[0050] By comparing the maximum-likelihood phoneme string of the voice section with the phoneme strings representing the utterances of the keywords registered in the keyword dictionary, the determination unit 24 determines whether or not the user uttered any keyword in the voice section.”) (“[0027] … By performing voice recognition processing, the processing unit 13 selects one of the keywords registered in a keyword dictionary on the basis of the voice signal and outputs information representing the selected keyword via the communication interface unit 15. Alternatively, the processing unit 13 may display the selected keyword via a display device (not illustrated). …”) (“[0075] The presentation unit 27 presents the identified keyword to the user. For example, the presentation unit 27 makes a display device (not illustrated) connected to the voice recognition device 1 display the keyword. …”) (“[0102] … The control unit 114 displays an identified keyword received from the server 120 on a display (not illustrated) or plays back a synthesized voice signal corresponding to the identified keyword via a speaker (not illustrated).”) Hayakawa et al. US 20180075843 A1 Hayakawa does not explicitly teach plurality of databases. NISHIDA HIROTO teaches: selecting, from among a plurality of databases respectively associated with the registered utterance contents, a database associated with the selected utterance content, .NISHIDA HIROTO teaches (“[0041] FIG. 5 is a sequence diagram of the speech recognition system according to Embodiment 1 of the present invention. A detailed flow in the case where the user-specific voice recognition data is actually downloaded will be described with reference to the above drawings.” Page 6) (“[0085] Next, the process proceeds to step 1004, and the user creates a text of a person name to be registered in a telephone directory for using voice recognition.” Page 12, para 85) (“[0051] … Data Base 200 and Data Base 300 in FIG. 
1 can select different audio data according to the gender of the user using the user terminal device 1, select appropriate data according to the gender, and download the data to the user terminal device 1. The database of the host computer 3, which is the second computer, is configured as a kind of database divided for each gender. This allows the user to download a database of a type suitable for the user without needing to analyze the characteristics of his / her voice or specialized analysis knowledge.” Page 8, para 51.) by NISHIDA HIROTO, JP 2003152521 A NISHIDA HIROTO teaches: each of the databases storing a feature quantity of utterance data concerning a registered utterance content having been uttered by a registered utterer; .NISHIDA HIROTO teaches (“[0041] FIG. 5 is a sequence diagram of the speech recognition system according to Embodiment 1 of the present invention. A detailed flow in the case where the user-specific voice recognition data is actually downloaded will be described with reference to the above drawings.” Page 6) (“[0085] Next, the process proceeds to step 1004, and the user creates a text of a person name to be registered in a telephone directory for using voice recognition.” Page 12, para 85) (“[0051] … Data Base 200 and Data Base 300 in FIG. 1 can select different audio data according to the gender of the user using the user terminal device 1, select appropriate data according to the gender, and download the data to the user terminal device 1. The database of the host computer 3, which is the second computer, is configured as a kind of database divided for each gender. This allows the user to download a database of a type suitable for the user without needing to analyze the characteristics of his / her voice or specialized analysis knowledge.” Page 8, para 51.) by NISHIDA HIROTO, JP 2003152521 A NISHIDA HIROTO is considered to be analogous to the claimed invention because it relates to integrating special effects with a text source. Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Hayakawa to further incorporate the teachings of NISHIDA HIROTO in order to include selecting database based on different age speaker. One could have been motivated to do so because system has the effect of improving the recognition rate for voice differences due to age. (“[0012] According to a third aspect of the present invention, there is provided the voice recognition system according to the first aspect, wherein the database of the second computer has a database divided for each age. For example, it has the effect of improving the recognition rate for voice differences due to age.” Page 3, para 12) by NISHIDA HIROTO, JP 2003152521 A Claim 11 is a device claim with a limitation similar to the limitation of method Claim 1 and is rejected under similar rationale. Additionally, Regarding Claim 11 Hayakawa teaches: 11. An utterer identification device, comprising: an acquisition part that acquires input utterance data being utterance data concerning an utterance of a certain utterer . Hayakawa teaches an acquisition part , a recognition part, a first selection part, a second selection part , a similarity calculation part and an output part. All parts(i.e CPU). (“[0027] The processing unit 13 includes, for example, one or a plurality of processors, a memory circuit, and a peripheral circuit. 
By performing voice recognition processing, the processing unit 13 selects one of the keywords registered in a keyword dictionary on the basis of the voice signal and outputs information representing the selected keyword via the communication interface unit 15. …”) by Hayakawa US 20180075843 A1 Claim 12 is a non-transitory computer readable medium claim with a limitation similar to the limitation of method Claim 1 and is rejected under similar rationale. Regarding Claim 12 Hayakawa teaches: 12. A non-transitory computer readable recording medium storing an utterer identification program that causes a computer to serve as an utterer identification device, the utterer identification program comprising: causing the computer to execute: Hayakawa teaches (“A non-transitory computer-readable recording …”) by Hayakawa US 20180075843 A1 Regarding Claim 2 the combination teaches the method claim 1 as identified above. Hayakawa further teaches:2. The utterer identification method according to claim 1, wherein, in the selecting of the selected utterance content, when the registered utterance contents include a registered utterance content identical to the recognized utterance content, the identical registered utterance content is selected as the selected utterance content. Hayakawa teaches selects one of the keywords registered in a keyword dictionary on the basis of the voice signal. (“[0027] The processing unit 13 includes, for example, one or a plurality of processors, a memory circuit, and a peripheral circuit. By performing voice recognition processing, the processing unit 13 selects one of the keywords registered in a keyword dictionary on the basis of the voice signal and outputs information representing the selected keyword via the communication interface unit 15. Alternatively, the processing unit 13 may display the selected keyword via a display device (not illustrated). ...”) (“[0050] By comparing the maximum-likelihood phoneme string of the voice section with the phoneme strings representing the utterances of the keywords registered in the keyword dictionary, the determination unit 24 determines whether or not the user uttered any keyword in the voice section.”) by Hayakawa et al. US 20180075843 A1 Regarding Claim 3 the combination teaches the method claim 1 as identified above. Hayakawa further teaches: The combination teaches the method claim 1 as identified above. 3. The utterer identification method according to claim 1, wherein, in the selecting of the selected utterance content, when the registered utterance contents include no registered utterance content identical to the recognized utterance content, the closest registered utterance content is selected as the selected utterance content. Hayakawa teaches (“[0084] As described thus far, when no keyword is recognized among the keywords registered in the keyword dictionary from the voice that the user uttered, the voice recognition device extracts a common phoneme string that appears in common between maximum-likelihood phoneme strings of a plurality of voice sections that have been uttered repeatedly. The voice recognition device calculates degrees of similarity between the common phoneme string and the phoneme strings of the respective keywords registered in the keyword dictionary in accordance with the DP matching, identifies a keyword corresponding to a maximum value among the degrees of similarity, and presents the identified keyword to the user. 
Thus, even when the user does not correctly utter a keyword registered in the keyword dictionary and utters a different phrase each time, the voice recognition device may identify a keyword that the user intended to make the voice recognition device recognize. Therefore, even when the user does not remember a keyword correctly, the voice recognition device may prevent the user from uttering repeatedly to try to utter the keyword.”) Hayakawa et al. US 20180075843 A1 Regarding Claim 4 the combination teaches the method claim 1 as identified above. Hayakawa further teaches: 4. The utterer identification method according to claim 1, wherein, in the selecting of the selected utterance content, a registered utterance content which includes all sound elements of the recognized utterance content is selected from among the registered utterance contents. Hayakawa teaches (“[0027] The processing unit 13 includes, for example, one or a plurality of processors, a memory circuit, and a peripheral circuit. By performing voice recognition processing, the processing unit 13 selects one of the keywords registered in a keyword dictionary on the basis of the voice signal and outputs information representing the selected keyword via the communication interface unit 15. …”) (“[0050] By comparing the maximum-likelihood phoneme string of the voice section with the phoneme strings representing the utterances of the keywords registered in the keyword dictionary, the determination unit 24 determines whether or not the user uttered any keyword in the voice section.”) (“[0051] FIG. 3 is a diagram illustrating an example of a keyword dictionary. In keyword dictionary 300, with respect to each keyword, a character string representing a written form of the keyword and a phoneme string representing a pronunciation of the keyword are registered. For example, for a keyword “Jitaku e kaeru (Japanese pronunciation, meaning “Return my home” in English)”, a phoneme string “jitakuekaeru” of the keyword is registered.”) (“[0084] As described thus far, when no keyword is recognized among the keywords registered in the keyword dictionary from the voice that the user uttered, the voice recognition device extracts a common phoneme string that appears in common between maximum-likelihood phoneme strings of a plurality of voice sections that have been uttered repeatedly. The voice recognition device calculates degrees of similarity between the common phoneme string and the phoneme strings of the respective keywords registered in the keyword dictionary in accordance with the DP matching, identifies a keyword corresponding to a maximum value among the degrees of similarity, and presents the identified keyword to the user. …”) by Hayakawa et al. US 20180075843 A1 Regarding Claim 5 the combination teaches the method claim 1 as identified above. Hayakawa further teaches: 5. The utterer identification method according to claim 1, wherein, in the selecting of the selected utterance content, a registered utterance content which has configuration data closest to configuration data indicating a configuration of sound elements of the recognized utterance content is selected from among the registered utterance contents. Hayakawa teaches (“[0045] Specifically, with respect to each frame in the voice section, by inputting the feature vector of the frame to the GMM, the maximum-likelihood phoneme string search unit 23 calculates output probabilities of the respective HMM states corresponding to the respective phonemes for the frame. 
In addition, before inputting a feature vector into the GMM, the maximum-likelihood phoneme string search unit 23 may, for the feature vector calculated from each frame, perform normalization, referred to as cepstral mean normalization (CMN), in which, with respect to each dimension of the feature vector, a mean value is estimated and the estimated mean value is subtracted from a value at the dimension.”) (“[0080] On the basis of the feature vectors of the respective frames, the maximum-likelihood phoneme string search unit 23 searches for a maximum-likelihood phoneme string corresponding to a voice uttered in the voice section (step S103). On the basis of the maximum-likelihood phoneme string and a keyword dictionary, the determination unit 24 determines whether or not any keyword registered in the keyword dictionary is detected in the voice section (step S104). When any keyword is detected (Yes in step S104), the processing unit 13 outputs information representing the keyword and finishes the voice recognition process.”) by Hayakawa et al. US 20180075843 A1 Regarding Claim 9 the combination teaches the method claim 5 as identified above. Hayakawa further teaches: 9. The utterer identification method according to claim 5, wherein the configuration data includes a vector which is defined by allocation of a value corresponding to an occurrence frequency of one or more sound elements of the recognized utterance content or the registered utterance content to a positional arrangement of all sound elements set in advance. Hayakawa teaches (“[0071] Alternatively, the matching unit 26 may calculate the degree of similarity P based on a degree of coincidence between the phoneme string of the keyword of interest and the common phoneme string in accordance with the equation (2). In this case, C is the number of coincident phonemes between the common phoneme string and the phoneme string of the keyword of interest and D is the number of phonemes that are included in the phoneme string of the keyword of interest but not included in the common phoneme string. In addition, S is the number of phonemes that are included in the phoneme string of the keyword of interest and are different from the phonemes at corresponding positions in the common phoneme string.”) (“[0043] … The maximum-likelihood phoneme string is a phoneme string in which respective phonemes included in a voice are arranged in a sequence of utterances thereof and that are estimated to be most probable.”) (“In the equation, C is the number of coincident phonemes between the maximum-likelihood phoneme string and the phoneme string of the keyword of interest and D is the number of phonemes that are included in the phoneme string of the keyword of interest but not included in the maximum-likelihood phoneme string. 
In addition, S is the number of phonemes that are included in the phoneme string of the keyword of interest and are different from phonemes at corresponding positions in the maximum-likelihood phoneme string.”) (“[0056] When two or more maximum-likelihood phoneme strings have been saved in the storage unit 14, i.e., the user has uttered keywords repeatedly while no keyword has been recognized, the common phoneme string extraction unit 25 extracts a string in which common phonemes to the maximum-likelihood phoneme strings are arranged in a sequence of utterances (hereinafter, simply referred to as common phoneme string).”) (“[0058] After a phoneme(s) representing silence and/or a phoneme(s) that appear(s) in only either one of the maximum-likelihood phoneme strings has/have been deleted from the respective maximum-likelihood phoneme strings, the common phoneme string extraction unit 25 extracts coincident phonemes between the two maximum-likelihood phoneme strings in order from the heads of the two maximum-likelihood phoneme strings. The common phoneme string extraction unit 25 sets a string in which the extracted phonemes are arranged from the head as a common phoneme string.”) (“[0093] In the variation, the common phoneme string extraction unit 25 may extract a common phoneme string by extracting phonemes each of which is common to a majority of maximum-likelihood phoneme strings among three or more maximum-likelihood phoneme strings and arranging the extracted phonemes in a sequence of utterances. …”) by Hayakawa et al. US 20180075843 A1 Regarding Claim 6 the combination teaches the method claim 4 as identified above. Hayakawa further teaches: 6. The utterer identification method according to claim 4, wherein the sound element includes a phoneme. Hayakawa teaches (“[0023] Therefore, the voice recognition device extracts a common phoneme string from voices that are uttered repeatedly by the user, exemplifying a speaker, and compares the extracted phoneme string with the phoneme strings of the respective keywords registered in a keyword dictionary to select the most resembling keyword. The voice recognition device presents the selected keyword to the user. The keyword may be an individual word or a phrase including a plurality of words.”) by Hayakawa et al. US 20180075843 A1 Regarding Claim 7 the combination teaches the method claim 4 as identified above. Hayakawa further teaches: 7. The utterer identification method according to claim 4, wherein the sound element includes a vowel. the maximum-likelihood phoneme strings 401 and 402, each of the phonemes “sp”, “silB”, and “silE” is a phoneme representing silence. Hayakawa teaches (“[0059] FIG. 4 is a diagram illustrating an example of maximum-likelihood phoneme strings and a common phoneme string. Illustrated as FIG. 4, it is assumed that, in the first utterance, a user uttered, “Etto jitaku, ja nakatta, ie ni kaeru (Japanese pronunciation, meaning “Uh, my home, no, return to a house” in English)”. For the utterance, a maximum-likelihood phoneme string 401 is calculated. In the second utterance, it is assumed that, the user uttered, “Chigau ka. Jitaku, jibun no sunde iru tokoro, ni kaeru (Japanese pronunciation, meaning “No, that's wrong. My home, the place where I live, return there” in English)”. For the utterance, a maximum-likelihood phoneme string 402 is calculated. 
In the maximum-likelihood phoneme strings 401 and 402, each of the phonemes “sp”, “silB”, and “silE” is a phoneme representing silence.”) (“[0060] … a common phoneme string (“oitakuertknikaeuq”) 420 to be obtained.”) Notes: each of the phonemes “sp”, “silB”, and “silE” and “oitakuertknikaeuq” this common phoneme has vowel in it. by Hayakawa et al. US 20180075843 A1 Claim/s 8 are rejected under 35 U.S.C. 103 as being unpatentable over Hayakawa, and NISHIDA HIROTO in view of Ogawa et al. US 7657430 B2. Regarding Claim 8 the combination teaches the method claim 4 as identified above. The combination does not explicitly teach the sound element includes a phoneme sequence in each of n-syllabified phonemic units of an utterance content, "n" being an integer of two or larger. Ogawa teaches teaches: 8. The utterer identification method according to claim 4, wherein the sound element includes a phoneme sequence in each of n-syllabified phonemic units of an utterance content, "n" being an integer of two or larger. FIG. 1-2,Ogawa teaches (“(37) For example, three sounds "AKA", "AO", and "MIDORI" are input to the word extracting unit 2. The word extracting unit 2 classifies these three sounds to three corresponding clusters, an "AKA" cluster 21, an "AO" cluster 22, and a "MIDORI" cluster 23, respectively. Concurrently, the word extracting unit 2 assigns representative syllable sequences ("A/KA", "A/O", and "MI/DO/RI" in the case shown in FIG. 6) and IDs ("1", "2", and "3" in the case shown in FIG. 6) to the clusters.”) (“(47) The phonetic typewriter 45 further performs speech recognition of the input sound on a syllable basis using the feature parameters supplied from the feature extraction module 43 while referencing the acoustic model database 51, and then outputs the syllable sequence obtained by the speech recognition to both matching module 44 and network generating module 47. For example, from the utterance "WATASHINONAMAEWAOGAWADESU", a syllable sequence "WA/TA/SHI/NO/NA/MA/E/WA/O/GA/WA/DE/SU" is obtained. Any commercially available phonetic typewriter can be used as the phonetic typewriter 45.”) (“(52) The acoustic model database 51 stores an acoustic model representing acoustic features of individual phonemes and syllables of a language for the utterance to be recognized. For example, a Hidden Markov Model (HMM) may be used as an acoustic model. The dictionary database 52 stores a word dictionary describing information about pronunciations and a model describing chains of the phonemes and syllables for the words or phrases to be recognized.”) (“(70) Referring back to FIG. 10, at step S55, the phonetic typewriter 45 recognizes the feature parameters extracted by the feature extraction module 43 in the process of step S53 on a phoneme basis independently from the process of step S54, and outputs the acquired syllable sequence to the matching module 44. For example, when an utterance "WATASHINONAMAEWAOGAWADESU", where "OGAWA" is an unknown word, is input to the phonetic typewriter 45, the phonetic typewriter 45 outputs a syllable sequence "WA/TA/SHI/NO/NA/MA/E/WA/O/GA/WA/DE/SU". At step S55, a syllable sequence may be acquired using the processing result at step S54.”) (“7. 
A computer-readable recording medium storing a program, the program processing an input utterance and registering an unknown word contained in the input utterance into a dictionary database on the basis of the processing result, the program including the steps of: (a) recognizing the input utterance; (b) determining whether the recognition result of the input utterance obtained by step (a) contains an unknown word on the basis of an acoustic model representing acoustic features of individual phonemes and syllables of a language; (c) determining whether the recognition result determined at step (b) to contain an unknown word is rejected or not for acquisition and registering into the dictionary database; …”) by Ogawa et al. US 7657430 B2 Ogawa is considered to be analogous to the claimed invention because it relates to voice quality conversion devices. Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Hayakawa, and NISHIDA HIROTO to further incorporate the teachings of Ogawa in order to include Vowel feature in the system. One could have been motivated to do so because system recognition accuracy is improved.(“(128) … As can be seen from the experimental result in FIG. 18, the recognition accuracy was 48.5%, which is improved compared to that of 40.2% by the <OOV> pronunciation acquiring method by use of a sub-word sequence shown in FIG. 3. …” col. 15, lines 14-18) by Ogawa et al. US 7657430 B2 Claim/s 10 are rejected under 35 U.S.C. 103 as being unpatentable over Hayakawa, and NISHIDA HIROTO in view of Beach et al. US 10395640 B1 Regarding Claim 10 the combination teaches the method claim 9 as identified above. The combination does not explicitly teach an occurrence frequency proportion of each of the one or more sound elements that occupies a total number of sound elements. Beach teaches: 10. The utterer identification method according to claim 9, wherein the value corresponding to the occurrence frequency is defined by an occurrence frequency proportion of each of the one or more sound elements that occupies a total number of sound elements of the recognized utterance content or the registered utterance content. Beach teaches (“ For example, it is possible to generate a total count of how many times each phoneme occurs in the sample set (training text and audio pairs), how often the phoneme was recognized correctly, the phoneme's overall accuracy (as well as the accuracy in total—the average of all the phoneme's average accuracy), and the “incorrect phonemes” the “correct phoneme” was confused with. Additionally, aggregate statistics can be generated, such as the average accuracy of all phonemes. Additionally, the comparison may be used to identify the phonemes with the highest and lowest accuracies. …”) (“When this comparison is performed for a number of text and audio samples, it is possible to generate the statistics described above for each phoneme. For example, it is possible to generate a total count of how many times each phoneme occurs in the sample set (training text and audio pairs), how often the phoneme was recognized correctly, the phoneme's overall accuracy (as well as the accuracy in total—the average of all the phoneme's average accuracy) …” col. 9, lines 27-34) (“… (26) FIGS. 8A and 8B show exemplary charts generated using one possible measurement consistent with the technology of the present application. 
The charts are presented as simple graphs with the phoneme accuracy as the Y-axis and the 40 possible phonemes associated with the English language as the X-axis. The phoneme accuracy is presented as a simple percentage, which is generated by the number of correct phoneme identifications divided by the total number of phoneme presentations times. …” col. 9, lines 45-55) Beach et al. US 10395640 B1 Beach is considered to be analogous to the claimed invention because it relates to relates generally to speech recognition systems. Therefore, it would have been obvious for someone of ordinary skill in the art before the effective filing date of the claimed invention to modify Hayakawa, and NISHIDA HIROTO to further incorporate the teachings of Beach in order to include total number of sound element. One could have been motivated to do so because system recognize phoneme accuracy.(“(6) … The audio phoneme sequence and the text phoneme sequence are compared to determine a phoneme average accuracy.” col. 2, lines 42-43) Beach et al. US 10395640 B1 Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to FOUZIA HYE SOLAIMAN whose telephone number is (571)270-5656. The examiner can normally be reached M-F (8-5)AM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Paras D. Shah can be reached at (571) 270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /F.H.S./Examiner, Art Unit 2653 /Paras D Shah/Supervisory Patent Examiner, Art Unit 2653 01/30/2026
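To make the technical subject matter of the rejected claims concrete, the following is a minimal Python sketch of the claim 1 flow as the Office Action summarizes it: select the registered utterance content whose phoneme sequence is closest to the recognition result, select the database associated with that content, and then score the input utterance against the feature quantities enrolled in that database. Everything here is illustrative; the function names, the edit-distance matching (echoing the DP matching the rejection cites from Hayakawa), and the cosine scoring are assumptions, not the applicant's or the references' actual implementation.

```python
# Illustrative sketch only; names, matching method, and scoring are assumptions.
import math

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of phoneme strings)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def cosine(u, v):
    """Similarity between the input feature quantity and an enrolled feature quantity."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def identify_speaker(recognized_phonemes, input_feature, registered_contents, databases, threshold=0.7):
    """Claim 1 flow per the Office Action's summary:
    1) select the registered utterance content closest to the recognition result;
    2) select the database associated with that content;
    3) calculate similarity against each registered utterer's stored feature quantity;
    4) identify the utterer (or reject) based on the best similarity."""
    selected_content = min(registered_contents,
                           key=lambda content: edit_distance(recognized_phonemes, content))
    selected_db = databases[tuple(selected_content)]  # one database per registered content
    best_speaker, best_score = None, -1.0
    for speaker, enrolled_feature in selected_db:
        score = cosine(input_feature, enrolled_feature)
        if score > best_score:
            best_speaker, best_score = speaker, score
    return (best_speaker if best_score >= threshold else None), best_score
```

The content-dependent database selection (steps 1 and 2 above) is the limitation the Office Action states Hayakawa does not explicitly teach and instead supplies from NISHIDA.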
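Claims 9 and 10 define the “configuration data” of claim 5 as a vector laid out over a fixed arrangement of sound elements, where each value is the proportion of the content that a given sound element occupies (occurrence count divided by total sound elements). A short sketch of that vector construction, with a hypothetical phoneme inventory and illustrative names:

```python
# Illustrative sketch only; the phoneme inventory and names are hypothetical.
from collections import Counter

PHONEME_INVENTORY = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "r"]  # assumed fixed arrangement

def configuration_vector(phonemes):
    """Vector indexed by a fixed arrangement of all sound elements (claim 9), where each
    value is that element's occurrence-frequency proportion of the content (claim 10)."""
    counts = Counter(phonemes)
    total = len(phonemes) or 1
    return [counts[p] / total for p in PHONEME_INVENTORY]

# e.g. configuration_vector(["k", "a", "k", "a", "o"]) has nonzero entries at "a", "k", "o"
```

Under claim 5, the registered utterance content whose configuration vector is closest to that of the recognized content would then be selected; the distance measure is whatever the specification adopts, with Euclidean or cosine distance being typical assumptions.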

Prosecution Timeline

Dec 07, 2023
Application Filed
Jan 30, 2026
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12592217
SYSTEM AND METHOD FOR SPEECH PROCESSING
2y 5m to grant • Granted Mar 31, 2026
Patent 12579976
USER TERMINAL, DIALOGUE MANAGEMENT SYSTEM, CONTROL METHOD OF USER TERMINAL, AND DIALOGUE MANAGEMENT METHOD
2y 5m to grant • Granted Mar 17, 2026
Patent 12555563
SYSTEMS AND METHODS FOR CHARACTER-TO-PHONE CONVERSION
2y 5m to grant • Granted Feb 17, 2026
Patent 12542149
METHOD AND APPARATUS FOR IMPROVING SPEECH INTELLIGIBILITY IN A ROOM
2y 5m to grant • Granted Feb 03, 2026
Patent 12537017
COMPUTERIZED SCORING METHOD OF FEATURE EXTRACTION-BASED FOR COVERTNESS OF IMITATED MARINE MAMMAL SOUND SIGNAL
2y 5m to grant • Granted Jan 27, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 67%
With Interview (+55.5%): 99%
Median Time to Grant: 3y 0m
PTA Risk: Low
Based on 63 resolved cases by this examiner. Grant probability derived from career allow rate.
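The note above states that grant probability is derived from the career allow rate, and the interview figure is shown next to a +55.5% lift. The arithmetic below reproduces the career rate from the displayed counts and shows what the interview figures imply under one assumed reading of "lift" (a percentage-point gap between interviewed and non-interviewed allowance rates); the dashboard's actual model is not disclosed.

```python
# Rough reconstruction of the displayed projections; the "lift" reading is an assumption.
granted, resolved = 42, 63
career_allow_rate = granted / resolved      # ~0.667, shown as "67% Grant Probability"

with_interview_rate = 0.99                  # shown as "99% With Interview"
interview_lift = 0.555                      # shown as "+55.5%"
# If the lift is the gap between resolved cases with and without an interview,
# the without-interview rate is implied (it is not shown on the page):
implied_without_interview = with_interview_rate - interview_lift   # ~0.435

print(f"career allow rate: {career_allow_rate:.1%}")
print(f"implied allow rate without an interview (assumed model): {implied_without_interview:.1%}")
```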
