DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
All objections/rejections not mentioned in this Office Action have been withdrawn by the Examiner.
Status of the Claims
Prior to entry of the amendment(s) and/or consideration of the argument(s), the status of the claims is as follows.
Claims 1-20 are pending.
Claims 1-2, 4-12, and 14-19 are rejected under 35 U.S.C. 102(a)(1) and 102(a)(2) as being anticipated by Khoury (U.S. Pat. App. Pub. No. 2021/0326421, hereinafter Khoury).
Claims 3, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Khoury as applied to claims 1, 11, and 18 above, and further in view of Rakshit (U.S. Pat. App. Pub. No. 2020/0341804, hereinafter Rakshit).
Response to Arguments
Applicant’s arguments regarding the prior art rejections under 35 U.S.C. §§ 102 and 103, see pages 8-10 of the Response, received on 09 September 2025, to the Non-Final Office Action dated 13 June 2025 (hereinafter the Response and the Office Action, respectively), have been fully considered.
With respect to the rejection of claim 1, and mutatis mutandis claims 11 and 18, under 35 U.S.C. §102 as being anticipated by Khoury, applicant asserts that Khoury fails to teach or suggest “clustering the utterances into a plurality of clusters, wherein each cluster included in the plurality of clusters is uttered by a different speaker,” as recited in the independent claims. (Response, pg. 8). Specifically, applicant asserts that, in light of the claim mapping, “to teach or suggest the above limitations of claim 1, Khoury would have to disclose clustering the utterances into a plurality of clusters, wherein each cluster included in the plurality of clusters is uttered by a different speaker. Importantly, Khoury contains no such teaching. In particular, Khoury merely discloses extracting features and embeddings and clustering similar embeddings extracted from the utterances, with each cluster representing a collection of utterances similar to a particular speaker,” thereby asserting that Khoury is limited to “clustering… based on similarity to a particular speaker.” (Response, pgs. 9-10, citing Khoury at [0119], [0134], [0137]). Respectfully, however, this interpretation of Khoury is unduly narrow and is not persuasive.
Khoury discloses the recited limitations. As explained in the Office Action, the analytics system can “cluster embeddings with voiceprints using sequential clustering algorithms” where the “clusters represent a collection of utterances similar to a particular speaker (e.g., a speaker cluster)” and “where a voiceprint represents the centroid for the speaker cluster.” (Office Action, pg. 3). Respectfully, this does not mean that clusters are generated and merged for only one speaker. On the contrary, Khoury is understood as merging clusters based on the described similarity for each “particular speaker,” which is, in and of itself, a well-known aspect of profile generation in the context of voiceprinting technology.
Though applicant cites paragraphs [0119], [0134], and [0137] of Khoury in support of the contention that the clustering in Khoury is performed only with respect to a particular speaker, implied to be a single central speaker, this argument is not supported by the related paragraphs describing the unsupervised clustering in Khoury. (See Khoury, ¶ [0132]-[0142]). For example, paragraphs [0132]-[0137] of Khoury describe various embodiments of the unsupervised clustering. Though Khoury does include embodiments that “receive a subscriber identifier forwarded from the content server 111,” where the system “queries the analytics database 104 and retrieves voiceprints associated with the received subscriber identifier,” this embodiment does not correspond to the embodiment relied on in the Office Action. For further clarity, an explanation of the relevant paragraphs is provided here.
At paragraphs [0132]-[0137], Khoury provides for individual-level clustering not only for a single subscriber but for numerous subscribers, as well as for non-subscribing parties to a conversation. (Khoury, [0132]-[0137]). Khoury, though including embodiments where “a subscriber identifier [is] forwarded from the content server 111” and voiceprints are associated with the subscriber identifier, indicates that “each voiceprint and associated unique speaker identifier is linked to at least one subscriber identifier.” (Khoury, [0132], emphasis added). Though the voiceprints are initially broadly linked based on available information in an inclusive fashion and the system “retrieves voiceprints associated with the received subscriber identifier,” this is not restricted to only one subscriber identifier (i.e., where there is more than one “subscriber identifier,” more than one subscriber identifier is expected to be received as “forwarded from the content server.”). (Id.)
In the alternative, “the analytics server 102 may not receive a subscriber identifier forwarded from the content server 111,” and instead of “evaluating the similarity of the embeddings and the voiceprints associated with putative speaker clusters… the analytics server 102 will evaluate the similarity of the embeddings with a set of voiceprints.” (Khoury, ¶ [0132]-[0133]). Khoury further explains that “The speech portions of the audio signal … may include utterances of multiple speakers” and the “analytics server 102 clusters inbound and/or stored embeddings by, for example, randomly generating a centroid and associating embeddings with the centroid” and “analytics server 102 may cluster the voiceprints” and “evaluate the similarity of the clusters to the voiceprints” and merge the similar voiceprints based on satisfying a threshold. (Khoury, ¶ [0134]-[0135]).
In addition, “the analytics server 102 may treat each embedding as a centroid” and cluster “the embeddings based on the distances of the centroid embedding to the other embeddings… using sequential clustering algorithms,” thus forming “a collection of utterances similar to a particular speaker (e.g., a speaker cluster) where a voiceprint represents the centroid for the speaker cluster,” and the “analytics server 102 identifies speaker clusters using speaker identifiers.” (Khoury, ¶ [0136]-[0137]). As such, though it is agreed that applicant is describing a possible embodiment, that embodiment is one alternative among many and is not understood as limiting the invention as a whole.
As indicated in the Office Action, the clustering and merging are performed for each of the particular speakers, not just one. From the perspective of any single merged cluster, the “remaining non-merged clusters for particular speakers” represent different speakers, each of which is clustered and merged with respect to its own profile. (Office Action, pgs. 3-4). Thus, Khoury starts with a plurality of clusters, generated based on “cluster[ing] embeddings with voiceprints using sequential clustering algorithms.” These clusters represent utterances for specified voiceprints, from each of the particular speakers. Khoury then merges voiceprints if “the similarity of the clusters satisfies one or more thresholds,” which results in a merged cluster for each particular speaker, and the resulting clusters that are merged separately or left unmerged (at least from the perspective of a target particular speaker) reflect different particular speakers. The merging generates a single cluster for each particular speaker, and each particular speaker is expected to undergo the merging process as part of the unsupervised clustering.
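By way of illustration only, and not as part of the rejection, the following minimal sketch reflects the Examiner's understanding of the sequential clustering and threshold-based merging discussed above with reference to Khoury at ¶ [0133]-[0137]. The function names, the cosine similarity metric, the data structures, and the threshold value are hypothetical assumptions and are not drawn from Khoury.

```python
# Illustrative sketch only; not code from Khoury. The similarity metric,
# threshold, and data structures are assumptions chosen to mirror the
# clustering/merging flow discussed above.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sequential_cluster_and_merge(embeddings, threshold=0.8):
    """Cluster utterance embeddings around centroids (voiceprints), then merge
    clusters whose voiceprints satisfy the similarity threshold. Each resulting
    cluster is read as the utterances of one particular speaker, so the
    remaining non-merged clusters correspond to different speakers."""
    clusters = []  # each cluster: {"centroid": voiceprint, "members": [embeddings]}
    for emb in embeddings:
        scores = [cosine(emb, c["centroid"]) for c in clusters]
        if scores and max(scores) >= threshold:
            best = clusters[int(np.argmax(scores))]
            best["members"].append(emb)
            best["centroid"] = np.mean(best["members"], axis=0)  # updated voiceprint
        else:
            clusters.append({"centroid": emb, "members": [emb]})  # new putative speaker
    # merge any pair of clusters whose voiceprints satisfy the threshold
    merged = []
    for c in clusters:
        for m in merged:
            if cosine(c["centroid"], m["centroid"]) >= threshold:
                m["members"].extend(c["members"])
                m["centroid"] = np.mean(m["members"], axis=0)  # e.g., averaged voiceprints
                break
        else:
            merged.append(c)
    return merged  # one merged cluster per particular speaker
```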
Though not considered necessary for the rejection, each of these clusters is respectively incorporated into a speaker profile, where “If the analytics server 102 determines that the maximum similarity score for a particular embedding fails the low similarity threshold, then the analytics server 102 determines that the speaker is likely a new unknown user. The analytics server 102 will generate a new speaker profile using the particular embedding,” which “includes a voiceprint, a speaker cluster (e.g., embeddings associated with the voiceprint), a speaker identifier, and metadata.” (Khoury, ¶ [0145]). To distinguish the particular speaker associated with each cluster, the system of Khoury considers all “particular speakers” as individuals, generates “merged” clusters for each particular speaker, and then determines authorization based on determining that the merged cluster, of a plurality of clusters, is associated with the authorized speaker. Therefore, the rejections of claims 1, 11, and 18 are maintained in light of applicant’s arguments, and the rejections are further clarified in the response below.
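For further illustration only, the following minimal sketch reflects the understanding discussed above: a merged cluster either matches an enrolled voiceprint or results in a new speaker profile for a likely new unknown user, and a command executes only when its cluster is attributed to an authorized speaker. The threshold value, profile fields, and function names are hypothetical and are not taken from Khoury; cosine() is reused from the sketch above.

```python
# Illustrative sketch only; reuses cosine() from the sketch above. The low
# similarity threshold and the profile fields are assumptions for illustration.
LOW_SIMILARITY_THRESHOLD = 0.5  # hypothetical value

def resolve_speaker(cluster, profiles):
    """Match a cluster's voiceprint against enrolled voiceprints; if the maximum
    similarity fails the low threshold, generate a new profile for a likely new
    unknown user and return its identifier."""
    scores = {sid: cosine(cluster["centroid"], p["voiceprint"]) for sid, p in profiles.items()}
    if not scores or max(scores.values()) < LOW_SIMILARITY_THRESHOLD:
        new_id = f"speaker-{len(profiles) + 1}"
        profiles[new_id] = {"voiceprint": cluster["centroid"],
                            "cluster": cluster,
                            "authorized": False,   # a new unknown user is not authorized
                            "metadata": {}}
        return new_id
    return max(scores, key=scores.get)

def execute_cluster_commands(cluster, speaker_id, profiles, execute):
    """Execute the audio commands in a cluster only if the cluster is attributed
    to an authorized speaker."""
    if profiles.get(speaker_id, {}).get("authorized"):
        for command in cluster.get("commands", []):
            execute(command)
```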
Applicant further argues that the rejections of dependent claims 2-10, 12-17, and 19-20 should be withdrawn for at least the same reasons as independent claims 1, 11, and 18. Applicant’s arguments in light of the amended claims are not persuasive for the same reasons as described above with reference to the independent claims. As such, the rejections of claims 2-10, 12-17, and 19-20 under 35 U.S.C. §102 and 35 U.S.C. §103 are maintained and further clarified in the rejections presented below.
The Applicant has not provided any further statement and, therefore, the Examiner directs the Applicant to the rationale below.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-2, 4-12, and 14-19 are rejected under 35 U.S.C. 102(a)(1) and 102(a)(2) as being anticipated by Khoury.
Regarding claim 1, Khoury discloses A computer-implemented method for processing audio commands for a virtual personal assistant (Systems and methods described with reference to an analytics system 101; Khoury, ¶ [0075]), the method comprising: segmenting speech detected in an audio input signal into a plurality of speech segments (the analytics system “may pre-process the audio data (e.g., filtering the audio signal to reduce noise, parsing the audio signal into frames or sub-frames, performing various normalizations or scaling operations), execute voice activity detection (VAD) software or VAD machine learning, and/or extract features (e.g., one or more spectro-temporal features) from portions (e.g., frames, segments)” as performed on “the audio signal.”; Khoury, ¶ [0116]) wherein the speech included in a given speech segment is uttered by a single speaker (The analytics system “may determine the beginning and end of an utterance” based on “speaker segmentation”, where an utterance refers to a single utterance by a single speaker; Khoury, ¶ [0119]); clustering the speech segments included in the plurality of speech segments into a plurality of clusters, (The analytics system can “cluster embeddings with voiceprints using sequential clustering algorithms” where the “clusters represent a collection of utterances similar to a particular speaker (e.g., a speaker cluster)” and “where a voiceprint represents the centroid for the speaker cluster.”; Khoury, ¶ [0136]-[0137]) wherein each cluster included in the plurality of clusters is uttered by a different speaker (“the analytics server 102,” in the absence of “a subscriber identifier,” may evaluate “the similarity of the embeddings and the voiceprints associated with putative speaker clusters” and “the analytics server 102 will evaluate the similarity of the embeddings with a set of voiceprints,” where the “speech portions of the audio signal… may include utterances of multiple speakers” and the “analytics server 102 clusters inbound and/or stored embeddings by, for example, randomly generating a centroid and associating embeddings with the centroid” where the “analytics server 102 may cluster the voiceprints” and “evaluate the similarity of the clusters to the voiceprints” and “may merge the two voiceprints into a single voiceprint (e.g., taking the average of the voiceprints)” if “the similarity of the clusters satisfies one or more thresholds”, and where clusters may be “based on relative distances between the embeddings and the centroid”, which is not based on a specific subscriber identifier.
As such, the merging is performed for each of the particular speakers, and from the perspective of any single merged cluster, the remaining non-merged clusters for particular speakers represent different speakers, each of which is clustered and merged with respect to its own profile; Khoury, ¶ [0133]-[0137]); determining that a first cluster included in the plurality of clusters is uttered by a first authorized speaker (The analytics system further “tracks the similarity scores between each of the embeddings and centroids in a similarity matrix,” where the voiceprint is the centroid of each speaker cluster and the “voiceprint associated with a speaker” is created “using a mature embedding cluster (e.g., a cluster based on enrollment embeddings)” containing “sufficient biometric information for the analytics server 102 to identify the speaker using the speaker identifier,” and where “the content server 111 references the output of the analytics server 102 to limit access a particular subscription” where the “analytics server 102 may use the speaker profile (and associated speaker identifier, speaker characteristics, speaker-independent characteristics, metadata) to determine whether the speaker is authorized to access the subscriber account based upon authentication rules and limitations associated with the particular subscriber identifier.{a first authorized speaker}”; Khoury, ¶ [0135], [0140], [0161]); and causing a first audio command included in the first cluster to execute (the analytics system further “authenticates a user according to the speaker profile of the known registered user 329” where “the downstream operations identify, authenticate and/or authorize a particular speaker using the speaker identifier” and “may perform different functions depending on whether the speaker is a known registered user or a new unknown user” including “execute, unlock functionality, or perform operations based on identified and/or authenticated registered users.”; Khoury, ¶ [0215]).
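Purely as a non-limiting illustration of the “similarity matrix” terminology quoted above, the following sketch computes similarity scores between utterance embeddings and speaker-cluster centroids (voiceprints). The cosine metric is an assumption; Khoury is not understood to require any particular similarity measure here.

```python
# Illustrative sketch only; the cosine similarity metric is an assumption.
import numpy as np

def similarity_matrix(embeddings, centroids):
    """Return a matrix whose entry [i, j] tracks the similarity between
    utterance embedding i and speaker-cluster centroid (voiceprint) j."""
    E = np.asarray(embeddings, dtype=float)
    C = np.asarray(centroids, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    return E @ C.T
```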
Regarding claim 2, the rejection of claim 1 is incorporated. Khoury discloses all of the elements of the current invention as stated above. Khoury further discloses further comprising: determining that a second cluster included in the plurality of clusters is uttered by a second authorized speaker (“the downstream operations identify, authenticate and/or authorize a particular speaker using the speaker identifier” as performed for each speaker, and “features may also include various data related to the users or speakers (e.g., subscriber identifiers, speaker identifiers, biometrics)” where the speaker/subscriber identifiers are plural, indicating that any or all speakers may be identified as authorized speakers/subscribers (thus authorized speakers), and where a plurality of authorized speakers necessarily includes a first authorized speaker and a second authorized speaker; Khoury, ¶ [0215], [0220]); and causing a second audio command present in the second cluster to execute (“downstream applications may execute, unlock functionality, or perform operations based on identified and/or authenticated registered users,” where the performance of operations for the second authorized user in response to “input audio signals to perform various operations” is a second audio command; Khoury, ¶ [0215], [0217]).
Regarding claim 4, the rejection of claim 1 is incorporated. Khoury discloses all of the elements of the current invention as stated above. Khoury further discloses further comprising: determining that a second cluster included in the plurality of clusters is uttered by a non-authorized speaker (As described with reference to an example, “a child may play a recording of the child’s parents speaking to overcome age-related restrictions of an over-the-top service” and, based on the speaker identification described above, the “system may detect that the child is playing the recording”, which, as understood in the context of the clustering of speakers around a voiceprint centroid, is the detection of a second cluster that is insufficiently similar to a cluster/voiceprint for an authorized user. Thus, the system is determining that the child/voice recording is from an “unauthorized user {non-authorized speaker}”; Khoury, ¶ [0065], [0067]); and preventing a second audio command included in the second cluster from executing (“The system may detect that the child is playing the recording and, instead of identifying the speaker profile of the child’s parent, generate an alert and/or transmit a signal to the over-the-top service indicating the child is attempting to spoof his or her parents to stop the service from providing age-restricted content to the child. Thus, the system may determine whether an individual is spoofing another individual and generate an alert to stop the service from providing unauthorized users with content.”; Khoury, ¶ [0067]).
Regarding claim 5, the rejection of claim 1 is incorporated. Khoury discloses all of the elements of the current invention as stated above. Khoury further discloses further comprising: determining that a second cluster included in the plurality of clusters is uttered by a non-authorized speaker (“the analytics server 102 (or content server 111) may determine whether the user authorized to watch content based on the identified users” where the system will “identify the content restrictions assigned to the user and applies the content restrictions on the content requested, queried, or presented to the user {uttered by a non-authorized speaker}” such as, for example, “the analytics server 102 may identify an eight-year-old boy watching television.”; Khoury, ¶ [0173], [0101]); determining that a mode is activated that allows audio commands from non-authorized speakers to execute (“The analytics server 102 identifies another speaker having an elevated privileges based upon the analysis of the audio signal, where the analytics server 102 has identified speaker profiles of two speakers having voiceprints matching the embeddings extracted for the two speakers. For example, the analytics server 102 identifies the child’s parent in the same audio signal as the eight-year-old boy,” where the presence of an authorized person may be considered a mode which allows for commands from the unauthorized person (e.g., the eight-year-old); Khoury, ¶ [0173]); and causing a second audio command included in the second cluster to execute (“The presence of the adult male in close proximity to the eight-year-old boy results in the analytics server 102 (or content server 111) authorizing the eight-year-old boy to watch certain content.”; Khoury, ¶ [0173]).
Regarding claim 6, the rejection of claim 1 is incorporated. Khoury discloses all of the elements of the current invention as stated above. Khoury further discloses wherein determining that the first cluster is uttered by the first authorized speaker comprises: comparing one or more voice features associated with a first speech segment included in the first cluster with a plurality of voice footprints associated with a plurality of authorized speakers (The analytics system further “tracks the similarity scores between each of the embeddings and centroids in a similarity matrix,” where the voiceprint is the centroid of each speaker cluster {a plurality of voice footprints} and where each instance of the “voiceprint associated with a speaker” is created “using a mature embedding cluster (e.g., a cluster based on enrollment embeddings)” containing “sufficient biometric information for the analytics server 102 to identify the speaker using the speaker identifier,” and where each of the associated voiceprints of the associated speaker profiles may be “authorized to access the subscriber account based upon authentication rules and limitations associated with the particular subscriber identifier.{associated with a plurality of authorized accounts}”; Khoury, ¶ [0135], [0140], [0161]); and determining that the one or more voice features associated with the first speech segment match a first voice footprint included in the plurality of voice footprints that is associated with the first authorized speaker (As previously indicated, Khoury discloses that any or all of the embeddings extracted from the speech segments {voice features associated with the first speech segment} may be “associated with a speaker” through the voiceprint, the voiceprint being among a plurality of “enrolled voiceprints” where “the analytics server 102 identifies the voiceprints associated with the received subscriber identifier (e.g., suspect voiceprints), creates a similarity matrix between pairs of embeddings, clusters similar embeddings based on the distances of each of the embeddings, creates a similarity matrix of similarity scores” and “analytics server 102 may use the speaker profile (and associated speaker identifier, speaker characteristics, speaker-independent characteristics, metadata) to determine whether the speaker is authorized to access the subscriber account... associated with the particular subscriber identifier {a first authorized speaker}”; Khoury, ¶ [0112]-[0113], [0161]).
Regarding claim 7, the rejection of claim 1 is incorporated. Khoury discloses all of the elements of the current invention as stated above. Khoury further discloses further comprising: prompting the first authorized speaker to speak one or more utterances (“The analytics server 102 may” create “the voiceprint during an active enrollment phase” where “the server may prompt the user for enrollment signals”; Khoury, ¶ [0167], [0196]); receiving an audio input signal of the first authorized speaker speaking the one or more utterances (“the server may prompt the user for enrollment signals” including “prompting the user to speak a particular phrase [or] prompting the user to speak naturally,” and, in response, “the server receives several enrollment signals” including “the enrollment utterances” which are in response to the particular phrase; Khoury, ¶ [0045], [0196]-[0197]); determining one or more voice features of the first authorized speaker based on the audio input signal (“the server extracts embeddings from the enrollment signal by applying a machine-learning architecture comprising various machine-learning models” where the embeddings “are mathematical representations of the biometric information (or features of the biometric information) in the enrollment signal”; Khoury, ¶ [0199]); generating a voice footprint associated with the first authorized speaker based on the one or more voice features (“the server creates a voiceprint for the user (sometimes referred to as an enrollee voiceprint)” by “statistically or algorithmically combin[ing] the enrollment embeddings to extract the voiceprint for the enrolled speaker-user.”; Khoury, ¶ [0203]); and storing the voice footprint of the first authorized speaker (“the server” maintains/accesses “the voiceprints” which are “stored in the speaker profile database.”; Khoury, ¶ [0210]).
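As a non-limiting illustration only of the enrollment flow quoted above (prompting, receiving enrollment utterances, extracting embeddings, combining them into a voiceprint, and storing it), the following sketch is provided. The callables and the storage structure are hypothetical stand-ins for the machine-learning architecture and speaker profile database described in Khoury.

```python
# Illustrative sketch only; prompt_fn, capture_fn, and extract_embedding_fn are
# hypothetical stand-ins, and the profile "database" is a plain dictionary.
import numpy as np

def enroll_speaker(prompt_fn, capture_fn, extract_embedding_fn, profile_db, speaker_id):
    """Prompt the user for enrollment utterances, extract an embedding from each
    enrollment signal, combine the embeddings into a voiceprint (voice footprint),
    and store the voiceprint in the speaker profile database."""
    prompt_fn("Please speak the enrollment phrase.")          # prompt the speaker
    enrollment_signals = capture_fn()                          # audio of the utterances
    embeddings = [extract_embedding_fn(sig) for sig in enrollment_signals]
    voiceprint = np.mean(np.asarray(embeddings, dtype=float), axis=0)  # combine embeddings
    profile_db[speaker_id] = {"voiceprint": voiceprint}        # store the voice footprint
    return voiceprint
```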
Regarding claim 8, the rejection of claim 7 is incorporated. Khoury discloses all of the elements of the current invention as stated above. Khoury further discloses further comprising: generating a user identifier for the first authorized speaker (“The server may generate a new speaker profile and speaker identifier using the particular voiceprint.”; Khoury, ¶ [0214]); and associating the user identifier with the voice footprint (“The speaker profile includes a voiceprint, a speaker cluster (e.g., embeddings associated with the voiceprint), a speaker identifier, and metadata.”; Khoury, ¶ [0214]).
Regarding claim 9, the rejection of claim 7 is incorporated. Khoury discloses all of the elements of the current invention as stated above. Khoury further discloses wherein the one or more utterances comprise a fixed utterance spoken by the first authorized speaker a specified number of times (“The server may employ any number of algorithms for determining whether the voiceprint is mature. For example, the server compares a maturity factor (e.g., number of utterances) against a preconfigured maturity threshold corresponding to the maturity factor (e.g., threshold number of utterances)” where “a maturity threshold” may be “based on the single maturity factor (e.g., a number of enrollment utterances 1202)”; Khoury, ¶ [0224]-[0225]).
Regarding claim 10, the rejection of claim 7 is incorporated. Khoury discloses all of the elements of the current invention as stated above. Khoury further discloses wherein the one or more utterances comprise an utterance spoken by the first authorized speaker for a specified duration of time (“the server may prompt the user for enrollment signals once or at a periodic time interval (e.g., every six months)” where the enrollment signal can include “prompting the user to speak a particular phrase”; Khoury, ¶ [0196]).
Regarding claim 11, Khoury discloses A device, comprising (Systems and methods described with reference to an analytics system 101, as implemented, at least in part, through an end-user device 114; Khoury, ¶ [0075], [0082]): a microphone (“Various types of inputs represent the ways that users interact with end-user devices 114, such as sound or audio data captured by a microphone of the end-user device 114”; Khoury, ¶ [0082]); one or more memories storing instructions; and one or more processors coupled to the one or more memories and, (The end user device 114 can include “a processor and non-transitory storage media” which are “configured to perform the various processes described herein”; Khoury, ¶ [0103]) when executing the instructions: segment speech detected in an audio input signal received via the microphone into a plurality of speech segments (the analytics system “may pre-process the audio data (e.g., filtering the audio signal to reduce noise, parsing the audio signal into frames or sub-frames, performing various normalizations or scaling operations), execute voice activity detection (VAD) software or VAD machine learning, and/or extract features (e.g., one or more spectro-temporal features) from portions (e.g., frames, segments)” as performed on “the audio signal.”; Khoury, ¶ [0116]) wherein the speech included in a given speech segment is uttered by a single speaker (The analytics system “may determine the beginning and end of an utterance” based on “speaker segmentation”, where an utterance refers to a single utterance by a single speaker; Khoury, ¶ [0119]); cluster the speech segments included in the plurality of speech segments into a plurality of clusters, (The analytics system can “cluster embeddings with voiceprints using sequential clustering algorithms” where the “clusters represent a collection of utterances similar to a particular speaker (e.g., a speaker cluster)” and “where a voiceprint represents the centroid for the speaker cluster.”; Khoury, ¶ [0136]-[0137]) wherein each cluster included in the plurality of clusters is uttered by a different speaker (“the analytics server 102,” in the absence of “a subscriber identifier,” may evaluate “the similarity of the embeddings and the voiceprints associated with putative speaker clusters” and “the analytics server 102 will evaluate the similarity of the embeddings with a set of voiceprints,” where the “speech portions of the audio signal … may include utterances of multiple speakers” and the “analytics server 102 clusters inbound and/or stored embeddings by, for example, randomly generating a centroid and associating embeddings with the centroid” where the “analytics server 102 may cluster the voiceprints” and “evaluate the similarity of the clusters to the voiceprints” and “may merge the two voiceprints into a single voiceprint (e.g., taking the average of the voiceprints)” if “the similarity of the clusters satisfies one or more thresholds”, and where clusters may be “based on relative distances between the embeddings and the centroid”, which is not based on a specific subscriber identifier.
As such, the merging is performed for each of the particular speakers, and from the perspective of any single merged cluster, the remaining non-merged clusters for particular speakers represent different speakers, each of which is clustered and merged with respect to its own profile; Khoury, ¶ [0133]-[0137]); determine that a first cluster included in the plurality of clusters is uttered by a first authorized speaker (The analytics system further “tracks the similarity scores between each of the embeddings and centroids in a similarity matrix,” where the voiceprint is the centroid of each speaker cluster and the “voiceprint associated with a speaker” is created “using a mature embedding cluster (e.g., a cluster based on enrollment embeddings)” containing “sufficient biometric information for the analytics server 102 to identify the speaker using the speaker identifier,” and where “the content server 111 references the output of the analytics server 102 to limit access a particular subscription” where the “analytics server 102 may use the speaker profile (and associated speaker identifier, speaker characteristics, speaker-independent characteristics, metadata) to determine whether the speaker is authorized to access the subscriber account based upon authentication rules and limitations associated with the particular subscriber identifier.{a first authorized speaker}”; Khoury, ¶ [0135], [0140], [0161]); and cause a first audio command included in the first cluster to execute (the analytics system further “authenticates a user according to the speaker profile of the known registered user 329” where “the downstream operations identify, authenticate and/or authorize a particular speaker using the speaker identifier” and “may perform different functions depending on whether the speaker is a known registered user or a new unknown user” including “execute, unlock functionality, or perform operations based on identified and/or authenticated registered users.”; Khoury, ¶ [0215]).
Regarding claim 12, the rejection of claim 11 is incorporated. Claim 12 is substantially the same as claim 2 and is therefore rejected under the same rationale as above.
Regarding claim 14, the rejection of claim 11 is incorporated. Claim 14 is substantially the same as claim 4 and is therefore rejected under the same rationale as above.
Regarding claim 15, the rejection of claim 11 is incorporated. Claim 15 is substantially the same as claim 5 and is therefore rejected under the same rationale as above.
Regarding claim 16, the rejection of claim 11 is incorporated. Claim 16 is substantially the same as claim 6 and is therefore rejected under the same rationale as above.
Regarding claim 17, the rejection of claim 11 is incorporated. Claim 17 is substantially the same as claim 7 and is therefore rejected under the same rationale as above.
Regarding claim 18, Khoury discloses One or more non-transitory computer-readable media storing program instructions that, when executed by one or more processors, cause the one or more processors to perform steps of (Systems and methods described with reference to an analytics system 101, and “the steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module which may reside on a computer-readable or processor-readable storage medium” which can be “designed to implement the systems and methods based on the description herein”; Khoury, ¶ [0075], [0423]-[0424]): segmenting speech detected in an audio input signal into a plurality of speech segments (the analytics system “may pre-process the audio data (e.g., filtering the audio signal to reduce noise, parsing the audio signal into frames or sub-frames, performing various normalizations or scaling operations), execute voice activity detection (VAD) software or VAD machine learning, and/or extract features (e.g., one or more spectro-temporal features) from portions (e.g., frames, segments)” as performed on “the audio signal.”; Khoury, ¶ [0116]) wherein the speech included in a given speech segment is uttered by a single speaker (The analytics system “may determine the beginning and end of an utterance” based on “speaker segmentation”, where an utterance refers to a single utterance by a single speaker; Khoury, ¶ [0119]); clustering the speech segments included in the plurality of speech segments into a plurality of clusters, (The analytics system can “cluster embeddings with voiceprints using sequential clustering algorithms” where the “clusters represent a collection of utterances similar to a particular speaker (e.g., a speaker cluster)” and “where a voiceprint represents the centroid for the speaker cluster.”; Khoury, ¶ [0136]-[0137]) wherein each cluster included in the plurality of clusters is uttered by a different speaker (“the analytics server 102,” in the absence of “a subscriber identifier,” may evaluate “the similarity of the embeddings and the voiceprints associated with putative speaker clusters” and “the analytics server 102 will evaluate the similarity of the embeddings with a set of voiceprints,” where the “speech portions of the audio signal … may include utterances of multiple speakers” and the “analytics server 102 clusters inbound and/or stored embeddings by, for example, randomly generating a centroid and associating embeddings with the centroid” where the “analytics server 102 may cluster the voiceprints” and “evaluate the similarity of the clusters to the voiceprints” and “may merge the two voiceprints into a single voiceprint (e.g., taking the average of the voiceprints)” if “the similarity of the clusters satisfies one or more thresholds”, and where clusters may be “based on relative distances between the embeddings and the centroid”, which is not based on a specific subscriber identifier.
As such, the merging is performed for each of the particular speakers, and from the perspective of any single merged cluster, the remaining non-merged clusters for particular speakers represent different speakers, each of which is clustered and merged with respect to its own profile; Khoury, ¶ [0133]-[0137]); determining that a first cluster included in the plurality of clusters is uttered by a first authorized speaker (The analytics system further “tracks the similarity scores between each of the embeddings and centroids in a similarity matrix,” where the voiceprint is the centroid of each speaker cluster and the “voiceprint associated with a speaker” is created “using a mature embedding cluster (e.g., a cluster based on enrollment embeddings)” containing “sufficient biometric information for the analytics server 102 to identify the speaker using the speaker identifier,” and where “the content server 111 references the output of the analytics server 102 to limit access a particular subscription” where the “analytics server 102 may use the speaker profile (and associated speaker identifier, speaker characteristics, speaker-independent characteristics, metadata) to determine whether the speaker is authorized to access the subscriber account based upon authentication rules and limitations associated with the particular subscriber identifier.{a first authorized speaker}”; Khoury, ¶ [0135], [0140], [0161]); and causing a first audio command included in the first cluster to execute (the analytics system further “authenticates a user according to the speaker profile of the known registered user 329” where “the downstream operations identify, authenticate and/or authorize a particular speaker using the speaker identifier” and “may perform different functions depending on whether the speaker is a known registered user or a new unknown user” including “execute, unlock functionality, or perform operations based on identified and/or authenticated registered users.”; Khoury, ¶ [0215]).
Regarding claim 19, the rejection of claim 18 is incorporated. Claim 19 is substantially the same as claim 2 and is therefore rejected under the same rationale as above.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 3, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Khoury as applied to claims 1, 11, and 18 above, and further in view of Rakshit (U.S. Pat. App. Pub. No. 2020/0341804, hereinafter Rakshit).
Regarding claim 3, the rejection of claim 1 is incorporated. Khoury discloses all of the elements of the current invention as stated above. However, Khoury fails to expressly recite further comprising: determining that a second cluster included in the plurality of clusters and uttered by a second authorized speaker has a higher priority than the first cluster; and causing a second audio command present in the second cluster to execute before the first audio command executes.
Rakshit teaches systems and methods for “managing multiple requests made to digital assistants on shared client devices.” (Rakshit, ¶ [0001]). Regarding claim 3, Rakshit teaches further comprising: determining that a second cluster included in the plurality of clusters and uttered by a second authorized speaker has a higher priority than the first cluster (“Priorities for the overlapping requests are determined by the computer system” and the “overlapping requests are processed by the computer system based on the priorities determined for the overlapping requests.”; Rakshit, ¶ [0028]); and causing a second audio command present in the second cluster to execute before the first audio command executes (“The priorities determined for the first request and the second request can result in sending the second response before sending the first response even though the first response is generated earlier.”; Rakshit, ¶ [0070]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the speaker analytics systems of Khoury to incorporate the teachings of Rakshit to include further comprising: determining that a second cluster included in the plurality of clusters and uttered by a second authorized speaker has a higher priority than the first cluster; and causing a second audio command present in the second cluster to execute before the first audio command executes. The automated assistant of Rakshit allows for priorities to be determined for “overlapping requests based on a set of priority considerations for the overlapping requests,” which allows the system to recognize user urgency and need over temporal order and results in a more desirable experience, and in many cases outcome, for multiple users, as recognized by Rakshit. (Rakshit, ¶ [0026]-[0028]).
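For illustration only, the following minimal sketch shows priority-ordered handling of overlapping requests of the kind taught by Rakshit, in which a later but higher-priority command’s response is sent before an earlier one. The class, field names, and priority convention are hypothetical and are not drawn from Rakshit.

```python
# Illustrative sketch only; not code from Rakshit. A lower priority value means
# higher priority here, which is an arbitrary convention for this example.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class PendingResponse:
    priority: int                       # determined from priority considerations
    arrival_order: int                  # used only to break ties
    payload: str = field(compare=False) # the response content

def dispatch(responses):
    """Send overlapping responses highest-priority first, regardless of the order
    in which the requests were received or the responses were generated."""
    heap = list(responses)
    heapq.heapify(heap)
    while heap:
        item = heapq.heappop(heap)
        print(f"sending: {item.payload}")

# usage: the second speaker's higher-priority response is sent first
dispatch([PendingResponse(priority=2, arrival_order=1, payload="first command response"),
          PendingResponse(priority=1, arrival_order=2, payload="second command response")])
```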
Regarding claim 13, the rejection of claim 11 is incorporated. Claim 13 is substantially the same as claim 3 and is therefore rejected under the same rationale as above.
Regarding claim 20, the rejection of claim 18 is incorporated. Claim 20 is substantially the same as claim 3 and is therefore rejected under the same rationale as above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Sohn (U.S. Pat. App. Pub. No. 2020/0051572) discloses systems and methods for registering a new user through authentication by a registered user.
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sean E. Serraguard whose telephone number is (313)446-6627. The examiner can normally be reached 07:00-17:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Sean E Serraguard/Patent Examiner, Art Unit 2657