DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d).
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 09/03/2024 is being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-10 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Independent claims 1, 9, and 10 recite “recognizing a voice from an acoustic signal and obtaining a voice recognition result”, “analyzing the acoustic signal and extracting a feature quantity”, “clipping, from the voice recognition result, a feature-quantity sequence”, “calculating a second speaker embedding vector”, “calculating one or more similarity degrees”, “determining, based on the one or more similarity degrees, which speaker among the one or more registered speakers utters”, and “controlling, based on a registered speaker who is determined from the one or more similarity degrees and based on the voice recognition result, the device”. These limitations, as drafted, recite a process that, under a broadest reasonable interpretation, covers the abstract idea of “mental processes” because they cover concepts performed in the human mind, including observation, evaluation, judgment, and opinion. See MPEP 2106.04(a)(2). That is, other than reciting “a memory”, “one or more hardware processors”, “a voice recognizing unit”, “an analyzing unit”, “a clipping unit”, “an embedding vector calculating unit”, “a similarity degree calculating unit”, “a determining unit”, and “a device control unit”, nothing in the claimed elements precludes the steps from practically being performed by a person listening to an individual speak, analyzing the voice and using pen and paper to extract information from the voice, comparing the voice to another voice, determining if the voice is a registered voice, and then performing an arbitrary action with a device based on whether or not the voice is a registered voice.
This judicial exception is not integrated into a practical application because the additional elements “a memory”, “one or more hardware processors”, “a voice recognizing unit”, “an analyzing unit”, “a clipping unit”, “an embedding vector calculating unit”, “a similarity degree calculating unit”, “a determining unit”, and “a device control unit” are all recited at a high level of generality. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. Thus, the claims as a whole are directed to an abstract idea (Step 2A, prong two).
Claims 1, 9, and 10 do not include any additional elements that are sufficient to amount to significantly more than the judicial exception because, as discussed above with respect to integration of the abstract idea into a practical application, the additional elements of “a memory”, “one or more hardware processors”, “a voice recognizing unit”, “an analyzing unit”, “a clipping unit”, “an embedding vector calculating unit”, “a similarity degree calculating unit”, “a determining unit”, and “a device control unit” amount to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept (Step 2B).
Dependent claims 2-8 are directed to the displaying of information, similarity calculations, conversion of voice into a text string, and the registration of individuals in the system. That is, nothing in the claimed elements precludes the steps from practically being performed by a person listening to an individual speak, analyzing the voice and using pen and paper to extract information from the voice, comparing the voice to another voice, determining if the voice is a registered voice, and then performing an arbitrary action with a device based on whether or not the voice is a registered voice. Even when considered individually and in combination, the additional elements in claims 1-10 represent mere instructions to implement an abstract idea or other exception on a computer and insignificant extra-solution activity, which do not provide an inventive concept (Step 2B).
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-2 and 8-10 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Murata et al. (US Patent Application Publication No. 2024/0013791), hereinafter referred to as Murata.
Regarding claim 1, Murata discloses a voice recognition device comprising: a memory (Murata para [0032]) that is used to store a first speaker embedding vector of each of one or more given registered speakers (Murata Fig. 2 reference character 15b speaker vector extraction unit), and individual setting of each of the one or more registered speakers for use in controlling a device ("In an interaction with a smart speaker or the like, a speaker can be automatically verified by using an utterance log. Then, a family member can be specified from spoken voice, and information presentation or recommendation according to the speaker can be performed," Murata para [0002]);
and one or more hardware processors configured to function as: a voice recognizing unit that recognizes a voice from an acoustic signal and obtains a voice recognition result (Murata Fig. 2 reference character 15 control unit);
an analyzing unit that analyzes the acoustic signal and extracts a feature quantity indicating a feature of a waveform of the acoustic signal (Murata Fig. 2 reference character 15a acoustic feature extraction unit);
a clipping unit that, from the voice recognition result, clips a feature-quantity sequence included in an utterance section (Murata Fig. 2 reference character 15b speaker vector extraction unit);
an embedding vector calculating unit that calculates a second speaker embedding vector using the feature-quantity sequence (Murata Fig. 2 reference character 14b speaker vector extraction model);
a similarity degree calculating unit that calculates one or more similarity degrees for the second speaker embedding vector and one or more first speaker embedding vectors (Murata Fig. 2 reference character 14c speaker similarity calculation sub-model);
a determining unit that, based on the one or more similarity degrees, determines which speaker among the one or more registered speakers utters (Murata Fig. 3 reference character 15e estimation unit);
and a device control unit that, based on a registered speaker who is determined from the one or more similarity degrees and based on the voice recognition result, controls the device according to the individual setting read from the memory ("In an interaction with a smart speaker or the like, a speaker can be automatically verified by using an utterance log. Then, a family member can be specified from spoken voice, and information presentation or recommendation according to the speaker can be performed," Murata para [0002]).
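For illustration only, the mapped flow above (calculating similarity degrees between the second speaker embedding vector and each stored first speaker embedding vector, then determining the uttering registered speaker) can be sketched as follows. This is a minimal sketch under assumed details: the cosine metric, the speaker names, the vector dimension, and the threshold value are all hypothetical and are not taken from the claims or from Murata.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def identify_speaker(second_embedding, registered_embeddings, threshold=0.7):
    """Calculate one similarity degree per registered (first) embedding and
    return the best-matching speaker, or None if no degree meets the
    (hypothetical) threshold."""
    similarities = {
        name: cosine_similarity(second_embedding, first_embedding)
        for name, first_embedding in registered_embeddings.items()
    }
    best = max(similarities, key=similarities.get)
    if similarities[best] >= threshold:
        return best, similarities[best]
    return None, similarities[best]
```

A device control unit could then read the identified speaker's individual setting from memory and act on it; that step is omitted here.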
Regarding claim 2, Murata discloses all of the limitations of claim 1. Murata further discloses wherein the one or more hardware processors are configured to further function as a display control unit that displays, in a display device ("The output unit 12 is realized by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like," Murata para [0030]), information identifying the registered speaker who is determined from the one or more similarity degrees (Murata Fig. 2 reference character 12 output unit),
and when each of the one or more similarity degrees is equal to or smaller than a first threshold value (Murata para [0047]), displays, in the display device, information indicating that a reliability of an identification accuracy of the speaker is equal to or smaller than the first threshold value ("The output unit 12 is realized by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like," Murata para [0030] and Murata Fig. 2 reference character 12 output unit).
Regarding claim 8, Murata discloses all of the limitations of claim 1. Murata further discloses wherein the voice recognition result includes an acoustic score indicating a probability that a voice at each timing corresponds to each phoneme (Murata Fig. 7 reference character 14d phoneme recognition model and Murata Fig. 9 reference character 14c obtains the similarity, i.e. the probability that the voice corresponds to those phonemes),
and the embedding vector calculating unit calculates the second speaker embedding vector from the acoustic score at each timing and from a feature quantity at each timing included in the feature-quantity sequence (Murata Fig. 9 shows the embedded vectors being calculated (reference character 14b) for each phoneme and "Specifically, the speaker recognition apparatus 10 of the present embodiment calculates a speaker similarity by further using phonological information of a registered utterance and a verification utterance. Here, the phonological information is, for example, a phoneme sequence of an utterance. Alternatively, the phonological information may be a phoneme posterior probability sequence output as a latent variable, a phoneme bottleneck feature, or the like," Murata para [0059]).
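For illustration only, one way an embedding could be calculated from a per-timing acoustic score and a per-timing feature quantity, as claim 8 recites, is score-weighted pooling over frames. This sketch is an assumption for exposition and is not asserted to be the applicant's or Murata's actual method.

```python
def score_weighted_embedding(features, acoustic_scores):
    """Pool per-frame feature vectors into one embedding, weighting each
    frame (timing) by its acoustic score (e.g., a phoneme posterior).
    features: list of per-frame feature vectors; acoustic_scores: one
    score per frame."""
    total = sum(acoustic_scores)
    dim = len(features[0])
    pooled = [0.0] * dim
    for frame, score in zip(features, acoustic_scores):
        for i in range(dim):
            pooled[i] += score * frame[i]
    # Normalize by the total score so the result is a weighted average.
    return [value / total for value in pooled]
```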
As to claim 9, method claim 9 and system claim 1 are related as a system and a method of using the same, with each claimed method step corresponding to a function of the claimed system. Accordingly, claim 9 is rejected under the same rationale as applied above with respect to the system claim.
As to claim 10, computer program product claim 10 and system claim 1 are related as a system and a computer program product implementing the same, with each claimed element corresponding to a function of the claimed system. Accordingly, claim 10 is rejected under the same rationale as applied above with respect to the system claim.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Murata, in view of Hayakawa (US Patent No. 7,657,431).
Regarding claim 3, Murata discloses all of the limitations of claim 1. Murata further discloses the similarity degree calculating unit calculates a similarity degree with (Murata Fig. 2 reference character 14c speaker similarity calculation sub-model) a first speaker embedding vector of a registered speaker (Murata Fig. 2 reference character 14b speaker vector extraction model).
However, Murata fails to disclose wherein the memory further stores therein a combination of the registered speaker and a keyword, that the similarity degree is calculated with the first speaker embedding vector of a registered speaker for whom the keyword is included in the voice recognition result, and that the similarity degree is not calculated with the first speaker embedding vector of a registered speaker for whom the keyword is not included in the voice recognition result.
Hayakawa teaches a voice authentication method and system.
Hayakawa teaches wherein the memory further stores therein a combination of the registered speaker and a keyword (Hayakawa Fig. 1 reference characters 20, 21, and 15),
and for whom the keyword is included in the voice recognition result (Hayakawa Fig. 3 reference characters Op20, Op22, and Op23),
and does not calculate the similarity degree with the first speaker embedding vector of a registered speaker for whom the keyword is not included in the voice recognition result (Hayakawa Fig. 1 if the keyword is not input at reference character Op12, the flow would stop at that point, i.e. no similarity would be calculated).
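For illustration only, the keyword gating recited in claim 3 (calculating similarity degrees only for registered speakers whose stored keyword appears in the voice recognition result) can be sketched as follows; the speaker names and keywords shown are hypothetical.

```python
def gated_candidates(recognition_text, speaker_keywords):
    """Return only the registered speakers whose stored keyword appears in
    the voice recognition result; no similarity degree would be calculated
    for the speakers filtered out here."""
    return [
        speaker
        for speaker, keyword in speaker_keywords.items()
        if keyword in recognition_text
    ]
```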
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Murata’s teaching of a voice recognition system by including Hayakawa’s teaching of authenticating a speaker by their usage of a keyword. This allows for a secure and personalized method to access smart devices and services, reducing the risk of error in these types of systems. This would also enhance security of these systems by verifying both the voice print (i.e., the waveform) and the keyword itself.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Murata, in view of Kracun et al. (US Patent No. 12,027,160), hereinafter referred to as Kracun.
Regarding claim 4, Murata discloses all of the limitations of claim 1. Murata further discloses that, based on the language comprehension result, the similarity degree calculating unit selects one or more first speaker embedding vectors for which the similarity degree is to be calculated, and calculates one or more similarity degrees with the selected one or more first speaker embedding vectors (Murata Fig. 2 reference character 14c speaker similarity calculation sub-model).
However, Murata fails to disclose wherein the voice recognizing unit converts the voice recognition result into a character string, and further obtains, from the character string, a language comprehension result of being comprehended based on a language comprehension model.
Kracun teaches a system and method for using hotwords to improve performance of machine learning models and the determining of whether automated assistants are to be initiated.
Kracun teaches wherein the voice recognizing unit converts the voice recognition result into a character string ("For example, on-device speech recognizer 142 can process audio data that captures a spoken utterance, utilizing on-device speech recognition model 142A, to generate recognized text 143A that corresponds to the spoken utterance. On-device NLU engine 144 performs on-device natural language understanding, optionally utilizing on-device NLU model 144A, on the recognized text 143A to generate NLU data 145A," Kracun col. 14 lines 42-49),
and further obtains, from the character string, a language comprehension result of being comprehended based on a language comprehension model ("For example, on-device speech recognizer 142 can process audio data that captures a spoken utterance, utilizing on-device speech recognition model 142A, to generate recognized text 143A that corresponds to the spoken utterance. On-device NLU engine 144 performs on-device natural language understanding, optionally utilizing on-device NLU model 144A, on the recognized text 143A to generate NLU data 145A," Kracun col. 14 lines 42-49).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Murata’s teaching of a voice recognition system by including Kracun’s teaching of converting the voice recognition result into text. This allows for textual processing, i.e., inputting the converted text into natural language input or natural language understanding models and engines, which accept text rather than audio as input. This would make it more efficient and easier to generate a response to the initial voice/audio input.
Claims 5-7 are rejected under 35 U.S.C. 103 as being unpatentable over Murata, in view of Yamazaki et al. (US Patent No. 8,977,547), hereinafter referred to as Yamazaki.
Regarding claim 5, Murata discloses all of the limitations of claim 1. However, Murata fails to disclose wherein the one or more hardware processors are configured to further function as a registering unit that registers the one or more first speaker embedding vectors in the memory, and with respect to N number of utterances by a same speaker where N≥1, the registering unit calculates each first speaker embedding vector and registers, as the first speaker embedding vector of the same speaker, statistic of the each first speaker embedding vector.
Yamazaki teaches a voice recognition system for registration of voiced utterances.
Yamazaki teaches wherein the one or more hardware processors are configured to further function as a registering unit that registers the one or more first speaker embedding vectors in the memory (Yamazaki Fig. 2), and with respect to N number of utterances by a same speaker where N≥1, the registering unit calculates each first speaker embedding vector and registers, as the first speaker embedding vector of the same speaker, statistic of the each first speaker embedding vector (Yamazaki Fig. 2 reference characters ST4-ST9).
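For illustration only, registering a statistic of N per-utterance embedding vectors as a speaker's single first speaker embedding vector, as claim 5 recites, can be sketched with an element-wise mean. The choice of the mean as the statistic is an assumption for the sketch; the claim language does not fix a particular statistic.

```python
def mean_embedding(utterance_embeddings):
    """Element-wise mean of N per-utterance embedding vectors (N >= 1),
    which would be registered in memory as the speaker's first speaker
    embedding vector."""
    n = len(utterance_embeddings)
    dim = len(utterance_embeddings[0])
    return [
        sum(embedding[i] for embedding in utterance_embeddings) / n
        for i in range(dim)
    ]
```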
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Murata’s teaching of a voice recognition system by including Yamazaki’s teaching of registering an individual and their voice in a voice recognition system. Registering a voice in a voice recognition system provides personalized security, higher convenience, and better accessibility. This allows systems and devices, such as smart homes, to be personally tailored to those living in the home. This would result in faster, more secure communication and system functionality.
Regarding claim 6, Murata, in view of Yamazaki, discloses all of the limitations of claim 5. However, Murata fails to disclose wherein the registering unit prompts the same speaker to make repeated utterances, calculates a first speaker embedding vector corresponding to each utterance, and prompts stopping of utterance when dispersion of the each first speaker embedding vector becomes equal to or smaller than a second threshold value.
Yamazaki teaches wherein the registering unit prompts the same speaker to make repeated utterances (Yamazaki Fig. 2 reference characters ST2-ST6, keeps repeating until N greater than or equal to M and Fig. 3 reference character 15 re-utterance request unit),
calculates a first speaker embedding vector corresponding to each utterance, and prompts stopping of utterance when dispersion of the each first speaker embedding vector becomes equal to or smaller than a second threshold value ("Next, the voice input unit 11 stores the input registering utterance in the registering voice data storage unit 12 as registering voice data (step ST5)," Yamazaki col. 3 line 67 through col. 4 line 3 and Yamazaki col. 4 line 60 through col. 5 line 11).
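For illustration only, the stopping condition of claim 6 (prompting the speaker to stop once the dispersion of the collected embedding vectors falls to or below a second threshold value) can be sketched as follows; the dispersion measure used here (mean per-dimension variance) and the threshold are hypothetical choices for the sketch.

```python
def dispersion(embeddings):
    """Mean per-dimension variance across the collected embedding vectors
    (one assumed way to measure their dispersion)."""
    n = len(embeddings)
    dim = len(embeddings[0])
    means = [sum(e[i] for e in embeddings) / n for i in range(dim)]
    variances = [
        sum((e[i] - means[i]) ** 2 for e in embeddings) / n
        for i in range(dim)
    ]
    return sum(variances) / dim

def should_stop(embeddings, second_threshold):
    """Prompt stopping of utterance once at least two embeddings have been
    collected and their dispersion is at or below the threshold."""
    return len(embeddings) >= 2 and dispersion(embeddings) <= second_threshold
```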
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Murata’s teaching of a voice recognition system by including Yamazaki’s teaching of repeating utterances and utilization of a threshold value to determine when the registered voice is acceptable. Requiring the same speaker to make multiple or repeated utterances allows for the system to lessen the variation of what it has to register, resulting in higher accuracy and quicker registration times, thus streamlining the process. The usage of a threshold value to confirm when an utterance is acceptable for registration allows for improved accuracy of the utterance and its later usage to act as a passcode or security level. Setting a specific threshold allows the user to be able to control their own level of security, adding personalization as well.
Regarding claim 7, Murata, in view of Yamazaki, discloses all of the limitations of claim 5. However, Murata fails to disclose wherein, using a second speaker embedding vector having the similarity degree to be equal to or greater than a third threshold value, the registering unit updates a first speaker embedding vector having the similarity degree to be equal to or greater than the third threshold value.
Yamazaki teaches wherein, using a second speaker embedding vector having the similarity degree to be equal to or greater than a third threshold value, the registering unit updates a first speaker embedding vector having the similarity degree to be equal to or greater than the third threshold value (Yamazaki Fig. 5 reference characters ST7a-ST10-1 shows comparing the similarity to a threshold TI, and if no, it compares N to threshold Tn, where if no, it requests a re-utterance thereby updating the utterance data until the conditions are met).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Murata’s teaching of a voice recognition system by including Yamazaki’s teaching of using yet another threshold value to confirm if an utterance is acceptable for registration. The usage of a threshold value to confirm when an utterance is acceptable for registration allows for improved accuracy of the utterance and its later usage to act as a passcode or security level. Setting a specific threshold allows the user to be able to control their own level of security, adding personalization as well.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
US Patent No. 11,417,344
US Patent No. 10,446,155
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ADAM MICHAEL WEAVER whose telephone number is (571)272-7062. The examiner can normally be reached Monday-Friday, 8AM-5PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ADAM MICHAEL WEAVER/Examiner, Art Unit 2658
/RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658