Detailed Action
This communication is in response to the Arguments and Amendments filed on 11/14/2025.
Claims 1-20 are pending and have been examined.
Claims 1-20 are rejected.
Claims 1 and 11 are independent.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 8/14/2025, 10/23/2025, 11/17/2025, 12/04/2025, 1/07/2026, and 2/24/2026 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Arguments and Amendments
Applicant has amended the independent claims to include “dialogue between a caller and an agent by executing a voice activity detection engine for detecting instances of speech of the caller or the agent in the inbound audio data; detecting,” “executing the voice activity detection engine,” and “during the call.”
Regarding the Rejections under 35 U.S.C. 101, Applicant notes:
Independent claims 1 and 11, as amended, satisfy at least Prong 1 or Prong 2 of Step 2A. As discussed during the Examiner Interview, the claims are patent-eligible under Step 2A, Prong 1, at least because the claims describe operations that cannot practicably be performed in a human mind.
Examiner notes that no such agreement was reached during the interview.
Applicant notes the claims are patent-eligible under Step 2A, Prong 2, at least because they recite features that describe technical improvements to technological shortcomings described in the Specification and improve computing functioning of existing approaches.
Examiner notes the claims recite functions that can be performed in the human mind.
Applicant notes that the USPTO's 2024 Guidance Update and MPEP 2106.04(a)(2) explain that a claim does not recite a mental process when it contains features that cannot practically be performed in the human mind, for instance when the human mind is not equipped to perform the claim limitations. The amended claims include steps such as: segmenting call audio into agent and caller speech regions; determining response delays between question/answer pairs; calculating running statistical measures (variance, mean, interquartile range) of response delays; adjusting response delay calculations based on dialogue context; and comparing these delays to expected human response times derived from historical data.
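For illustration only, the running statistical measures recited above can be sketched in a few lines. This is a hypothetical sketch, not the Applicant's implementation: the function name, the delay values, and the choice of windowing are assumptions, not taken from the claims or the Specification.

```python
import statistics

def running_delay_stats(delays):
    """Illustrative sketch: running mean, variance, and interquartile
    range (IQR) of response delays between question/answer pairs.

    `delays` is an assumed list of response delays in seconds; none of
    these values come from the record in this Application.
    """
    stats_over_time = []
    for i in range(1, len(delays) + 1):
        window = delays[:i]  # all delays observed so far in the call
        mean = statistics.fmean(window)
        variance = statistics.pvariance(window)
        if len(window) >= 4:
            # quantiles(n=4) returns the three quartile cut points
            q1, _, q3 = statistics.quantiles(window, n=4)
            iqr = q3 - q1
        else:
            iqr = 0.0  # too few samples for a meaningful IQR
        stats_over_time.append({"mean": mean, "variance": variance, "iqr": iqr})
    return stats_over_time
```

Recomputing over the growing window keeps the sketch short; a streaming implementation would more likely update the mean and variance incrementally (e.g., Welford's method) as each new delay arrives.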
Examiner notes a human can perform these functions using pen and paper.
Applicant notes that the Specification describes a concrete technical problem arising in conventional call authentication and fraud detection systems, which are vulnerable to machine-generated speech (deepfakes) and other automated attacks that can defeat traditional speaker verification or call analytics. See, e.g., Specification at [0003]-[0005]. The Specification describes a technical solution that includes a multi-stage, computer-implemented process that implements particular computing operations for segmenting audio using a voice activity detection (VAD) engine, identifying speech segments associated with distinct speakers, determining timing between the speech segments, and statistically analyzing response delays to distinguish between human and machine-generated speech in real-world call center systems. See id. at [0006]-[0009], [0020], [0049]-[0050].
Examiner notes that the claims recite limitations that can be performed in the human mind.
Applicant notes that the amended claims reflect these technical improvements, reciting operations for implementing deepfake detection based on response delay analysis. See id. at [0018]-[0020], [0049]-[0050]. As shown in independent claim 1, the claims describe a particular sequence of technical operations, including: obtaining inbound audio data for a call; detecting agent and caller speech regions using a voice activity detection engine to generate speech segments of the agent and speech segments of the caller; determining a response delay between an agent's speech segment and a caller's speech segment; and comparing that response delay against an expected human response time to identify a deepfake of machine-generated speech. Id. at [0049]-[0050]. The claims therefore describe a technological solution to technological shortcomings discussed in the Specification.
Examiner notes that the claims recite limitations that can be performed in the human mind.
Applicant notes that the Examples of the Updated Guidance support this conclusion. The analysis of the pending Application is analogous to the eligible claims in USPTO Example 47, claim 3 (network anomaly detection), and Example 48, claim 2 (speech separation), where the claims were found to integrate the abstract idea into a practical application by reciting a specific improvement to a technical field. Here, the claims as a whole provide a particular solution to the technical problem of deepfake detection and an improvement to fraud detection technology that cannot practically be performed in a human mind.
Examiner notes the functions claimed can be performed in the human mind.
Applicant’s arguments with respect to claim(s) 1-20 have been considered but are moot because the new ground of rejection does not rely on the primary reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Hence, new grounds of rejection have been made over Dropuljic (US Patent No. US 12206820 B2), in view of Phatak (US Patent Application Publication No. US 2021/0280171 A1).
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding independent Claim 1, the claim recites “1. A computer-implemented method for detecting machine-based speech in calls, comprising:
obtaining, by a computer, inbound audio data for a call, including a plurality of speech segments corresponding to dialogue between a caller and an agent by executing a voice activity detection engine for detecting instances of speech of the caller or the agent in the inbound audio data;
detecting, by the computer, executing the voice activity detection engine, from the plurality of speech segments of the inbound audio data, a speech region including a first speech segment corresponding to the agent and a second speech segment corresponding to the caller;
determining, by the computer, a response delay between the first segment corresponding to the agent and the second segment corresponding to the caller;
identifying, by the computer, the caller as a deepfake during the call, in response to determining that the response delay fails to satisfy an expected response time for a human speaker.”
The limitations of “obtaining…”, “detecting…”, “determining…”, and “identifying…”, as drafted, cover a human mental activity or process.
More specifically, a human is capable of obtaining inbound audio data for a call, including a plurality of speech segments corresponding to a dialogue between a caller and an agent, using the human auditory system.
A human is capable of detecting, from the plurality of speech segments of the inbound audio data, a speech region including a first speech segment corresponding to the agent and a second speech segment corresponding to the caller, using the human auditory system and natural language processing within the human mind.
A human is capable of determining a response delay between the first segment corresponding to the agent and the second segment corresponding to the caller, using natural language processing and logic and reasoning within the human mind.
A human is capable of identifying the caller as a deepfake, in response to determining that the response delay fails to satisfy an expected response time for a human speaker, using logic and reasoning within the human mind.
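For illustration only, the claimed sequence of determining a response delay and comparing it to an expected human response time can be sketched as follows. This is a hypothetical sketch: the tolerance-band reading of "fails to satisfy," the function name, and all parameter values are assumptions, not taken from the claims or the record.

```python
def flag_deepfake(agent_end, caller_start, expected_delay, tolerance=0.5):
    """Hypothetical sketch of the claimed check.

    `agent_end` and `caller_start` are segment boundary timestamps in
    seconds (assumed to come from a VAD engine); `expected_delay` and
    `tolerance` are assumed values, not figures from the Application.
    """
    # Response delay between the agent's segment and the caller's segment.
    response_delay = caller_start - agent_end
    # One illustrative reading of "fails to satisfy": the delay falls
    # outside a tolerance band around the expected human response time.
    is_deepfake = abs(response_delay - expected_delay) > tolerance
    return response_delay, is_deepfake
```

A caller answering after an implausibly short (or long) pause relative to the expected human response time would be flagged; a delay inside the band would not.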
Regarding independent Claim 11, Claim 11 is a system claim with limitations similar to those of claim 1 and is rejected under the same rationale.
This judicial exception is not integrated into a practical application. In particular, claims 1 and 11 recite the additional elements of a “processor” and a “computer.” For example, paragraph [0339] of the as-filed specification describes a processor-executable software module which may reside on a computer-readable or processor-readable storage medium… instructions or data structures and that may be accessed by a computer or processor. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of a processor and a computer amount to a generic computer. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Further, the additional limitations in the claims noted above are directed towards insignificant extra-solution activity. The claims are not patent eligible.
With respect to claims 2 and 12, the claims relate to identifying, based on the plurality of timestamps, the response delay corresponding to a time difference between the first speech segment corresponding to a question by the agent and the second speech segment corresponding to an answer by the caller. This relates to a human performing natural language understanding and logic and reasoning within the human mind to identify a delay between a question by the agent and an answer by the caller. No additional limitations are present.
With respect to claims 3 and 13, the claims relate to detecting a plurality of timestamps defining the first speech segment and the second speech segment in the speech region. This relates to a human performing natural language understanding and logic and reasoning within the human mind or pen and paper to define speech segments with timestamps. No additional limitations are present.
With respect to claims 4 and 14, the claims relate to determining a plurality of statistical measures of response delays based on a plurality of timestamps derived from the plurality of speech segments of the inbound audio data. This relates to a human performing natural language understanding and logic and reasoning within the human mind, or using pen and paper, to compute statistical measures from timestamps. Statistical measures are a mathematical process. No additional limitations are present.
With respect to claims 5 and 15, the claims relate to at least one of: (i) a running variance, (ii) a running inter-quartile range, or (iii) a running mean. This is a mathematical process. No additional limitations present.
With respect to claims 6 and 16, the claims relate to determining the response delay by adjusting the response delay based on a context of the dialogue between the caller and the agent. This relates to a human performing natural language understanding and logic and reasoning within the human mind to adjust a response delay based on the context of a dialogue. No additional limitations present.
With respect to claims 7 and 17, the claims relate to extracting a transcription containing text in chronological sequence of each caller speech segment and each agent speech segment. This relates to a human performing natural language understanding and using pen and paper to extract a transcription. No additional limitations present.
With respect to claims 8 and 18, the claims relate to determining the expected response time for the human speaker based on historical data of audio data between callers and agents. This relates to a human using natural language understanding and logic and reasoning to determine an expected response time. No additional limitations present.
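For illustration only, deriving an expected response time from historical caller/agent data can be sketched as follows. This is a hypothetical sketch: the median-plus-interquartile-band approach, the function name, and the sample values are assumptions for illustration, not taken from the claims or the record.

```python
import statistics

def expected_response_time(historical_delays):
    """Illustrative sketch: derive an expected human response time and a
    tolerance band from historical response delays (seconds).

    `historical_delays` is assumed sample data; the quartile band is an
    assumption for illustration, not a parameter from the Application.
    """
    # Median of the historical delays serves as the expected value.
    expected = statistics.median(historical_delays)
    # The first and third quartiles bound the typical human range.
    q1, _, q3 = statistics.quantiles(historical_delays, n=4)
    return expected, (q1, q3)
```

An observed delay falling outside the (q1, q3) band would then be a candidate for the "fails to satisfy" determination recited in the independent claims.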
With respect to claims 9 and 19, the claims relate to identifying the caller as human, in response to determining that the response delay satisfies an expected response time for a human speaker. This relates to a human performing natural language understanding and logic and reasoning to determine, using the delay, that a caller is human. No additional limitations present.
With respect to claims 10 and 20, the claims relate to generating an indication indicating the caller as one of the deepfake or human. This relates to a human using pen and paper or the vocal system to indicate that a caller is a deepfake or human. No additional limitations present.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 8-11, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Dropuljic (US Patent No. US 12206820 B2), in view of Phatak (US Patent Application Publication No. US 2021/0280171 A1).
Regarding claim 1, DROPULJIC teaches 1. A computer-implemented method for detecting machine-based speech in calls, comprising:
obtaining, by a computer, inbound audio data for a call, including a plurality of speech segments corresponding to a dialogue between a caller and an agent; by executing a voice activity detection engine for detecting instances of speech of the caller or the agent in the inbound audio data; (see Dropuljic (7:64-8:33) “(33) The extended voice activity detection module 202 is a subsystem that is configured to receive an input audio signal associated with a call. The audio signal, or audio stream, may be from the origination device 106 or from the destination device 114, or it may be combined audio signal from both devices. In some embodiments, if the call includes audio and visual data, the audio signal may be separated from the visual data for analysis by the extended voice activity detection module 202. (34) The extended voice activity detection module 202 is further configured to analyze the received audio signal for the presence of speech or other sounds. In response to detecting the presence of speech, the extended voice activity detection module 202 outputs a speech signal, which may include one or more portions or intervals of the audio signal that include the detected speech. In response to detecting the presence of other sounds, the extended voice activity detection module 202 outputs audio characteristics. These audio characteristics may include information about the presence of other sounds in the audio signal, such as mumbling, background noise, melodies, or other audible characteristics. The audio characteristics may also include information regarding the absence of sounds and noises in the audio signal (e.g., the extended voice activity detection module 202 may output audio characteristics indicating that the audio signal is silence or too short of a sample to be analyzed for speech). 
(35) In various embodiments, the extended voice activity detection module 202 employs one or more machine learning or other artificial intelligence techniques to identify or extract features from the audio signal. In some embodiments, the extended voice activity detection module 202 may be derived from a combination of machine learning models, deep learning models, or other types of speech or audio-feature extraction systems (e.g., a Root Mean Square detector) that are trained using known speech or audio characteristic samples.”) detecting, by the computer, executing the voice activity detection engine, from the plurality of speech segments of the inbound audio data, a speech region including a first speech segment corresponding to the agent and a second speech segment corresponding to the caller; (see Dropuljic (8:6-33) “(34) The extended voice activity detection module 202 is further configured to analyze the received audio signal for the presence of speech or other sounds. In response to detecting the presence of speech, the extended voice activity detection module 202 outputs a speech signal, which may include one or more portions or intervals of the audio signal that include the detected speech. In response to detecting the presence of other sounds, the extended voice activity detection module 202 outputs audio characteristics. These audio characteristics may include information about the presence of other sounds in the audio signal, such as mumbling, background noise, melodies, or other audible characteristics. The audio characteristics may also include information regarding the absence of sounds and noises in the audio signal (e.g., the extended voice activity detection module 202 may output audio characteristics indicating that the audio signal is silence or too short of a sample to be analyzed for speech). 
(35) In various embodiments, the extended voice activity detection module 202 employs one or more machine learning or other artificial intelligence techniques to identify or extract features from the audio signal. In some embodiments, the extended voice activity detection module 202 may be derived from a combination of machine learning models, deep learning models, or other types of speech or audio-feature extraction systems (e.g., a Root Mean Square detector) that are trained using known speech or audio characteristic samples.”)
Dropuljic does not specifically teach determining, by the computer, a response delay between the first segment corresponding to the agent and the second segment corresponding to the caller; However Phatak does teach this limitation (see Phatak [0049] “The neural network architecture includes the task-specific models configured to extract corresponding speaker-independent embeddings and a DP vector based upon the speaker-independent embeddings. The server applies the task-specific models on the speech portions (or speech-only abridged audio signal) and then again on the non-speech portions (or speechless-only abridged audio signal). The analytics server 102 applies certain types of task-specific models (e.g., audio event neural network) to substantially all of input audio signal (or, at least, that has not been parsed by the VAD). For example, the task-specific models of the neural network architecture include a device-type neural network and an audio event neural network. In this example, the analytics server 102 applies the device-type neural network on the speech portions and then again on the non-speech portions to extract speaker-independent embeddings for the speech portions and the non-speech portions. The analytics server 102 then applies the device-type neural network on the input audio signal to extract an entire audio signal embedding.”) and identifying, by the computer, the caller as a deepfake during the call, (see Phatak [0163] Where the neural network architecture 1200 has detected spoofing in the inbound audio signal, the neural network architecture applies spoofing service recognition layers 1208. The spoofing service recognition layers 1208 is a multi-class classifier trained to determine a likely spoofing service used to generate the spoofed inbound signal. 
The spoof service recognition layers 1208 determines a spoofing service likelihood score based on a relative distance (e.g., similarity, difference) between the DP vector of the input audio signal and DP vectors of trained spoof service recognition classifications or stored spoof service recognition DP vectors. During the training phase, the spoofing service recognition layers 1208 generate predicted outputs (e.g., predicted classifications, predicted similarity score) for training audio signals, which are used to determine a level of error according to training labels that indicate the expected outputs. The hyper-parameters of the spoofing service recognition layers 1208 or other layers of the neural network architecture 1200 to minimize the level of error for the spoofing service recognition layers 1208. The spoofing service recognition layers 1208 determine the particular spoofing service applied to generate the spoofed inbound audio signal at test time, where a spoofing service likelihood score satisfies a spoof detection threshold.”) (see Phatak Abstract “The DP vector is a low dimensional representation of the each of the speaker-independent characteristics of the audio signal and applied in various downstream operations.”) in response to determining that the response delay fails to satisfy an expected response time for a human speaker. (see Phatak [0049] “The neural network architecture includes the task-specific models configured to extract corresponding speaker-independent embeddings and a DP vector based upon the speaker-independent embeddings. The server applies the task-specific models on the speech portions (or speech-only abridged audio signal) and then again on the non-speech portions (or speechless-only abridged audio signal). The analytics server 102 applies certain types of task-specific models (e.g., audio event neural network) to substantially all of input audio signal (or, at least, that has not been parsed by the VAD). 
For example, the task-specific models of the neural network architecture include a device-type neural network and an audio event neural network. In this example, the analytics server 102 applies the device-type neural network on the speech portions and then again on the non-speech portions to extract speaker-independent embeddings for the speech portions and the non-speech portions. The analytics server 102 then applies the device-type neural network on the input audio signal to extract an entire audio signal embedding.”) (see Phatak [0054] “During the training phase and, during the enrollment phase in some embodiments, one or more fully-connected layers, classification layers, and/or output layers of each task-specific model generate predicted outputs (e.g., predicted classifications, predicted speaker-independent feature vectors, predicted speaker-independent embeddings, predicted DP vectors, predicted similarity scores) for the training audio signals (or enrollment audio signals). Loss layers perform various types of loss functions to evaluate the distances (e.g., differences, similarities) between predicted outputs (e.g., predicated classifications) to determine a level error between the predicted outputs and corresponding expected outputs indicated by training labels associated with the training audio signals (or enrollment audio signals). The loss layers, or other functions executed by the analytics server 102, tune or adjust the hyper-parameters of the neural network architecture until the distance between the predicted outputs and the expected outputs satisfies a training threshold.”) (see Phatak [0046] As an example, the neural network architecture may comprise neural network layers for VAD operations that parse a set of speech portions and a set of non-speech potions from each particular input audio signal. 
The analytics server 102 may train the classifier of the VAD separately (as a distinct neural network architecture) or along with the neural network architecture (as a part of the same neural network architecture). When the VAD is applied to the features extracted from the input audio signal, the VAD may output binary results (e.g., speech detection, no speech detection) or contentious values (e.g., probabilities of speech occurring) for each window of the input audio signal, thereby indicating whether a speech portion occurs at a given window. The server may store the set of one or more speech portions, set of one or more non-speech portions, and the input audio signal into a memory storage location, including short-term RAM, hard disk, or one or more databases 104, 112.”) (see Phatak [0037] “…such as verifying the identity of the other end-user or indicating whether the other end-user is using a spoofing service.”)
Dropuljic and Phatak are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Dropuljic to incorporate the teachings of Phatak to include determining, by the computer, a response delay between the first segment corresponding to the agent and the second segment corresponding to the caller, and identifying, by the computer, the caller as a deepfake, in response to determining that the response delay fails to satisfy an expected response time for a human speaker. Doing so allows for verifying whether the caller is using a spoofing service, as recognized by Phatak in [0037].
Regarding independent Claim 11, Claim 11 is a system claim with limitations similar to those of claim 1 and is rejected under the same rationale. Furthermore, Dropuljic teaches A system for detecting machine-based speech in calls, comprising: a computer having one or more processors and configured to: (see Dropuljic (5:1-19) “(16) The term “call” or “phone call” or “telephone call” may be used interchangeably and as used herein refers to audio signals sent by a sender to a recipient, and may be used interchangeably with respect to “communication” herein unless the context clearly dictates otherwise. A call may include audio data alone or it may include audio and video data (e.g., as a video call). Moreover, a call may include audio data sent as a message, such as a voicemail or audio message, or an MMS message having audio data. The sender or recipient of a call may be a person, a machine, or an application, and may be referred to as the origination device and the destination device, respectively. Thus, calls may be communications sent by one person to another person, communications sent by a machine or application to a person, etc. Because calls can be bi-directional, the sender or origination device refers to the person, device, or entity that initiated the call and the recipient or destination device refers to the person, device, or entity that accepts a call initiated by the sender or origination device.”) (see Dropuljic (17:63-18:9) “(99) Memory 610 may include one or more various types of non-volatile and/or volatile storage technologies. Examples of memory 610 may include, but are not limited to, flash memory, hard disk drives, optical drives, solid-state drives, various types of random access memory (RAM), various types of read-only memory (ROM), other computer-readable storage media (also referred to as processor-readable storage media), or the like, or any combination thereof. 
Memory 610 may be utilized to store information, including computer-readable instructions that are utilized by CPU 622 to perform actions, including embodiments described herein. In various embodiments, CPU 622 or GPU 620, or some combination thereof, may perform embodiments described herein.”)
Regarding claim 8, DROPULJIC in view of Phatak teaches 8. The method of claim 1,
Furthermore, Phatak teaches, further comprising determining, by the computer, the expected response time for the human speaker based on historical data of audio data between callers and agents. (see Phatak [0160-0162] “Audio intake layers 1202 receive one or more input audio signals (e.g., training audio signals, enrollment audio signals, inbound audio signal) and perform various pre-processing operations, including applying a VAD on an input audio signal and extracting various types of features. The VAD detects speech portions and non-speech portions and outputs a speech-only audio signal and a speechless-only audio signal. The intake layers 1202 may also extract the various types of features from the input audio signal, the speech-only audio signal, and/or the speechless-only audio signal. [0161] DP vector layers 1204 are applied on the input audio signal. The DP vector layers include task-specific models that extract various speaker-independent embeddings from the input audio signal. The DP vector layers 1204 then extract the DP vector for the input audio signal by concatenating (or otherwise combining) the various speaker-independent embeddings. [0162] Spoof detection layers 1206 define a binary classifier trained to determine a spoof detection likelihood score. The spoof detection layers 1206 are trained to determine whether an audio source is spoofing an input audio signal (spoof detection) or that no spoofing has been detected. The spoof detection layers 1206 determines a likelihood score based on a relative distance (e.g., similarity, difference) between the DP vector of the input audio signal and DP vectors of trained spoof detection classifications or stored spoof detection DP vectors. 
During the training phase, the spoof detection layers 1206 generate predicted outputs (e.g., predicted classifications, predicted similarity score) for training audio signals, which are used to determine a level of error according to training labels that indicate the expected outputs. The loss function or other server-executed process adjusts the hyper-parameters of the spoof detection layers 1206 or other layers of the neural network architecture 1200 to minimize the level of error for the spoof detection layers 1206. The spoof detection layers 1206 determine whether a given inbound audio signal, at test time, has a spoof detection likelihood score that satisfies a spoof detection threshold.”)
Dropuljic and Phatak are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Dropuljic to incorporate the teachings of Phatak to include determining, by the computer, the expected response time for the human speaker based on historical data of audio data between callers and agents. Doing so allows for verifying whether the caller is using a spoofing service, as recognized by Phatak in [0037].
Regarding claim 9, DROPULJIC in view of Phatak teaches 9. The method of claim 1,
Furthermore, Phatak teaches, further comprising identifying, by the computer, the caller as human, in response to determining that the response delay satisfies an expected response time for a human speaker. (see Phatak [0160-0162] “Audio intake layers 1202 receive one or more input audio signals (e.g., training audio signals, enrollment audio signals, inbound audio signal) and perform various pre-processing operations, including applying a VAD on an input audio signal and extracting various types of features. The VAD detects speech portions and non-speech portions and outputs a speech-only audio signal and a speechless-only audio signal. The intake layers 1202 may also extract the various types of features from the input audio signal, the speech-only audio signal, and/or the speechless-only audio signal. [0161] DP vector layers 1204 are applied on the input audio signal. The DP vector layers include task-specific models that extract various speaker-independent embeddings from the input audio signal. The DP vector layers 1204 then extract the DP vector for the input audio signal by concatenating (or otherwise combining) the various speaker-independent embeddings. [0162] Spoof detection layers 1206 define a binary classifier trained to determine a spoof detection likelihood score. The spoof detection layers 1206 are trained to determine whether an audio source is spoofing an input audio signal (spoof detection) or that no spoofing has been detected. The spoof detection layers 1206 determines a likelihood score based on a relative distance (e.g., similarity, difference) between the DP vector of the input audio signal and DP vectors of trained spoof detection classifications or stored spoof detection DP vectors. 
During the training phase, the spoof detection layers 1206 generate predicted outputs (e.g., predicted classifications, predicted similarity score) for training audio signals, which are used to determine a level of error according to training labels that indicate the expected outputs. The loss function or other server-executed process adjusts the hyper-parameters of the spoof detection layers 1206 or other layers of the neural network architecture 1200 to minimize the level of error for the spoof detection layers 1206. The spoof detection layers 1206 determine whether a given inbound audio signal, at test time, has a spoof detection likelihood score that satisfies a spoof detection threshold.”)
DROPULJIC and Phatak are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of DROPULJIC to incorporate the teachings of Phatak to include identifying, by the computer, the caller as human, in response to determining that the response delay satisfies an expected response time for a human speaker. Doing so allows for verifying whether the caller is using a spoofing service, as recognized by Phatak in [0037].
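For illustration only, the recited determination (identifying the caller as human when the response delay satisfies an expected response time derived from historical caller/agent audio data) could be sketched as follows; the function names, data values, and tolerance factor are hypothetical and do not appear in DROPULJIC or Phatak.

```python
# Hypothetical sketch: classify a caller as human when the measured
# response delay falls within an expected human response-time window
# derived from historical response delays (mean +/- k standard deviations).
from statistics import mean, stdev

def expected_response_window(historical_delays, k=2.0):
    """Derive an expected human response-time window from historical data."""
    m, s = mean(historical_delays), stdev(historical_delays)
    return (m - k * s, m + k * s)

def classify_caller(response_delay, historical_delays):
    """Return "human" if the delay satisfies the expected window, else "deepfake"."""
    low, high = expected_response_window(historical_delays)
    return "human" if low <= response_delay <= high else "deepfake"
```

A deepfake or text-to-speech pipeline tends to respond either implausibly fast or with processing lag, which is why a two-sided window rather than a single threshold is sketched here.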
Regarding claim 10, DROPULJIC in view of Phatak teaches the method of claim 1.
Furthermore, Phatak teaches, further comprising generating, by the computer, an indication, for a user interface, indicating the caller as one of the deepfake or human. (see Phatak [0037] “Embodiments described with respect to FIG. 1 are merely examples employing the speaker-independent embeddings and deep-phoneprinting and are not necessarily limiting on other potential embodiments. The description of FIG. 1 mentions circumstances in which end-users place calls to the service provider system 110 through various communications channels to contact and/or interact with services offered by the service provider. But the operations and features of the various deep-phoneprinting implementations described herein may be applicable to many circumstances for evaluating speaker-independent aspects of an audio signal. For instance, the deep-phoneprinting audio processing operations described herein may be implemented within various types of devices and need not be implemented within a larger infrastructure. As an example, an IoT device 114d may implement the various processes described herein when capturing an input audio signal from an end-user or when receiving an input audio signal from another end-user via a TCP/IP network. As another example, end-user devices 114 may execute locally installed software implementing the deep-phoneprinting processes described herein allowing, for example, deep-phoneprinting processes in user-to-user interactions. A smartphone 114b may execute the deep-phoneprinting software when receiving an inbound call from another end-user to perform certain downstream operations, such as verifying the identity of the other end-user or indicating whether the other end-user is using a spoofing service.”) (see Phatak [0062] “The provider server 111 of the provider system 110 executes software processes for interacting with the end-users through the various channels.
The processes may include, for example, routing calls to the appropriate agent devices 116 based on an inbound caller's comments, instructions, IVR inputs, or other inputs submitted during the inbound call. The provider server 111 can capture, query, or generate various types of information about the inbound audio signal, the caller, and/or the end-user device 114 and forward the information to the agent device 116. A graphical user interface (GUI) of the agent device 116 displays the information to an agent of the service provider. The provider server 111 also transmits the information about the inbound audio signal to the analytics system 101 to preform various analytics processes on the inbound audio signal and any other audio data. The provider server 111 may transmit the information and the audio data based upon preconfigured triggering conditions (e.g., receiving the inbound phone call), instructions or queries received from another device of the system 100 (e.g., agent device 116, admin device 103, analytics server 102), or as part of a batch transmitted at a regular interval or predetermined time.”)
DROPULJIC and Phatak are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of DROPULJIC to incorporate the teachings of Phatak to include generating, by the computer, an indication, for a user interface, indicating the caller as one of the deepfake or human. Doing so allows for verifying whether the caller is using a spoofing service, as recognized by Phatak in [0037].
As to claim 18, claim 18 is a system claim with limitations similar to those of claim 8 and is rejected under the same rationale.
As to claim 19, claim 19 is a system claim with limitations similar to those of claim 9 and is rejected under the same rationale.
As to claim 20, claim 20 is a system claim with limitations similar to those of claim 10 and is rejected under the same rationale.
Claims 2-4 and 12-14 are rejected under 35 U.S.C. 103 as being unpatentable over Dropuljic (U.S. Patent No. 12,206,820 B2), in view of Phatak (U.S. Patent Application Publication No. 2021/0280171 A1), and further in view of Kolbegger (U.S. Patent Application Publication No. 2014/0270114 A1).
As to claim 2, DROPULJIC in view of Phatak teaches the method of claim 1.
DROPULJIC in view of Phatak do not specifically teach wherein determining the response delay includes identifying, by the computer, based on the plurality of timestamps, the response delay corresponding to a time difference between the first speech segment corresponding to a question by the agent and the second speech segment corresponding to an answer by the caller. However, Kolbegger does teach this limitation, (see Kolbegger [0066] “In FIG. 4B, a waveform 410B represents the time-in-speech on the agent channel and a waveform 420B represents the time-in-speech on the client channel. The waveform 410B shows a longer time-in-speech, which tends to indicate to the facility that the agent is asking preliminary information identifying the client such as, "With whom am I speaking to," "Can you spell that?", and/or "May I have your order number?" Moreover, the short pauses between the agent's and caller's time-in-speech tend to indicate to the facility that either the client is searching for information and/or that the agent is pulling up, looking for or reviewing information and/or putting the client on hold. The waveform 420B shows a shorter time-in-speech, which tends to indicate to the facility that the client is responding to the agents' preliminary questions.”) (see Kolbegger [0046] “At block 340, the clustered-frame representation or manifests for each channel are correlated by the facility 140 such that the visual representation of the client channel and the visual representation of the agent channel are presented together in one representation that depicts both channels simultaneously. The two channels may be correlated by a variety of different means, including syncing a start time for the client channel with a start time for the agent channel, by relying on embedded or captured time stamps at various points in the audio, etc.”)
DROPULJIC in view of Phatak and Kolbegger are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of DROPULJIC and Phatak to incorporate the teachings of Kolbegger to include determining the response delay by identifying, by the computer, based on the plurality of timestamps, the response delay corresponding to a time difference between the first speech segment corresponding to a question by the agent and the second speech segment corresponding to an answer by the caller. Doing so allows a particular speech or call feature to be more easily distinguished from another speech or call feature, as recognized by Kolbegger in [0044].
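As a non-limiting sketch of the limitation addressed above, a response delay may be computed as the time difference between the end of an agent speech segment (the question) and the start of the next caller speech segment (the answer), using VAD-derived timestamps. The data layout and function name below are hypothetical and are not drawn from any cited reference.

```python
# Hypothetical sketch: derive response delays from a chronological list of
# VAD speech segments, each a (speaker, start_sec, end_sec) tuple. A delay is
# counted whenever an agent segment is immediately followed by a caller segment.

def response_delays(segments):
    """Return caller answer delays, in seconds, for each agent->caller turn."""
    delays = []
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(segments, segments[1:]):
        if spk_a == "agent" and spk_b == "caller":
            # Time from the end of the agent's question to the start of
            # the caller's answer.
            delays.append(start_b - end_a)
    return delays
```

This assumes the two channels have already been correlated onto a common clock (e.g., by syncing start times or embedded timestamps, as Kolbegger [0046] describes).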
As to claim 3, DROPULJIC in view of Phatak teaches the method of claim 1.
DROPULJIC in view of Phatak do not specifically teach wherein detecting the speech region further includes detecting, by the computer, a plurality of timestamps defining the first speech segment and the second speech segment in the speech region. However, Kolbegger does teach this limitation, (see Kolbegger [0066] “In FIG. 4B, a waveform 410B represents the time-in-speech on the agent channel and a waveform 420B represents the time-in-speech on the client channel. The waveform 410B shows a longer time-in-speech, which tends to indicate to the facility that the agent is asking preliminary information identifying the client such as, "With whom am I speaking to," "Can you spell that?", and/or "May I have your order number?" Moreover, the short pauses between the agent's and caller's time-in-speech tend to indicate to the facility that either the client is searching for information and/or that the agent is pulling up, looking for or reviewing information and/or putting the client on hold. The waveform 420B shows a shorter time-in-speech, which tends to indicate to the facility that the client is responding to the agents' preliminary questions.”) (see Kolbegger, [0046] “At block 340, the clustered-frame representation or manifests for each channel are correlated by the facility 140 such that the visual representation of the client channel and the visual representation of the agent channel are presented together in one representation that depicts both channels simultaneously. The two channels may be correlated by a variety of different means, including syncing a start time for the client channel with a start time for the agent channel, by relying on embedded or captured time stamps at various points in the audio, etc.”)
DROPULJIC in view of Phatak and Kolbegger are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of DROPULJIC and Phatak to incorporate the teachings of Kolbegger to include wherein detecting the speech region further includes detecting, by the computer, a plurality of timestamps defining the first speech segment and the second speech segment in the speech region. Doing so allows a particular speech or call feature to be more easily distinguished from another speech or call feature, as recognized by Kolbegger in [0044].
As to claim 4, DROPULJIC in view of Phatak teaches the method of claim 1.
DROPULJIC in view of Phatak do not specifically teach wherein determining the response delay includes determining, by the computer, a plurality of statistical measures of response delays based on a plurality of timestamps derived from the plurality of speech segments of the inbound audio data. However, Kolbegger does teach this limitation, (see Kolbegger [0065] “FIG. 4B depicts a clustered-frame representation of an example channel interaction pattern 400B known as an "exchange of personal data." A measure of time 430B (e.g., in milliseconds, seconds, minutes, etc.) is represented along the x-axis. The "exchange of personal data" represented by the pattern 400B typically occurs at the beginning of a call and reflects a lot of back- and forth in time-of-speech between the agent-channel and the client-channel.”) (see Kolbegger [0066] “In FIG. 4B, a waveform 410B represents the time-in-speech on the agent channel and a waveform 420B represents the time-in-speech on the client channel. The waveform 410B shows a longer time-in-speech, which tends to indicate to the facility that the agent is asking preliminary information identifying the client such as, "With whom am I speaking to," "Can you spell that?", and/or "May I have your order number?" Moreover, the short pauses between the agent's and caller's time-in-speech tend to indicate to the facility that either the client is searching for information and/or that the agent is pulling up, looking for or reviewing information and/or putting the client on hold. The waveform 420B shows a shorter time-in-speech, which tends to indicate to the facility that the client is responding to the agents' preliminary questions.”) (see Kolbegger [0046] “At block 340, the clustered-frame representation or manifests for each channel are correlated by the facility 140 such that the visual representation of the client channel and the visual representation of the agent channel are presented together in one representation that depicts both channels simultaneously. The two channels may be correlated by a variety of different means, including syncing a start time for the client channel with a start time for the agent channel, by relying on embedded or captured time stamps at various points in the audio, etc.”)
DROPULJIC in view of Phatak and Kolbegger are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of DROPULJIC and Phatak to incorporate the teachings of Kolbegger to include determining, by the computer, a plurality of statistical measures of response delays based on a plurality of timestamps derived from the plurality of speech segments of the inbound audio data. Doing so allows a particular speech or call feature to be more easily distinguished from another speech or call feature, as recognized by Kolbegger in [0044].
As to claim 12, claim 12 is a system claim with limitations similar to those of claim 2 and is rejected under the same rationale.
As to claim 13, claim 13 is a system claim with limitations similar to those of claim 3 and is rejected under the same rationale.
As to claim 14, claim 14 is a system claim with limitations similar to those of claim 4 and is rejected under the same rationale.
Claims 5, 6, 15, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Dropuljic (U.S. Patent No. 12,206,820 B2), in view of Phatak (U.S. Patent Application Publication No. 2021/0280171 A1), further in view of Kolbegger (U.S. Patent Application Publication No. 2014/0270114 A1), and further in view of Gross (U.S. Patent Application Publication No. 2016/0328547 A1).
As to claim 5, DROPULJIC in view of Phatak and further in view of Kolbegger teaches the method of claim 4.
DROPULJIC in view of Phatak and further in view of Kolbegger do not specifically teach wherein the plurality of statistical measures further comprises at least one of: (i) a running variance, (ii) a running inter-quartile range, or (iii) a running mean. However, GROSS does teach this limitation (see GROSS [0179] “Through sufficient training samples the system should be able to identify, with controllable confidence levels, an appropriate set of sentences that are likely to weed out a machine imposter. Moreover candidate sentences which are confusing or take too long for a human can be eliminated as well. Again it is preferable that the challenge sentences include primarily samples that are rapidly processed and articulated as measured against a human reference set. The respective times required by human and machines can also be measured and compiled to determine minimum, maximum, average, mean and threshold times. For example, it may be desirable to select challenge sentences in which the time difference between human and machine articulations is greatest.”)
DROPULJIC in view of Phatak and Kolbegger, and Gross, are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of DROPULJIC, Phatak, and Kolbegger to incorporate the teachings of Gross to include wherein the plurality of statistical measures further comprises at least one of: (i) a running variance, (ii) a running inter-quartile range, or (iii) a running mean. Doing so allows for a better selection and optimization for discriminating against machines, as recognized by Gross in [0006].
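For illustration only, running statistical measures of the kind recited in claim 5 (a running mean, a running variance, and a running inter-quartile range over response delays observed during a call) could be maintained as sketched below. The class and method names are hypothetical; the variance update follows Welford's well-known online algorithm, and the quartile indices are a simplified estimate.

```python
# Hypothetical sketch: maintain running statistics over response delays.
class RunningDelayStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations (Welford)
        self.samples = []      # retained to compute the inter-quartile range

    def update(self, delay):
        """Incorporate one new response delay into the running measures."""
        self.n += 1
        d = delay - self.mean
        self.mean += d / self.n
        self.m2 += d * (delay - self.mean)
        self.samples.append(delay)

    @property
    def variance(self):
        # Sample variance; defined once at least two delays are observed.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def iqr(self):
        # Simplified quartile estimate using index positions.
        s = sorted(self.samples)
        q1 = s[len(s) // 4]
        q3 = s[(3 * len(s)) // 4]
        return q3 - q1
```

The online mean/variance update avoids re-scanning all prior delays on each turn, which is what makes the measures "running" as the call proceeds.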
Regarding claim 6, DROPULJIC in view of Phatak teaches the method of claim 1.
DROPULJIC in view of Phatak do not specifically teach wherein determining the response delay includes adjusting, by the computer, the response delay based on a context of the dialogue between the caller and the agent. However, GROSS does teach this limitation (see Gross [0083] “After understanding the sentence, the machine imposter may also have to annotate the output of the desired articulation with appropriate prosodic elements at step 530. Acoustical aspects of prosodic structure also include modulation of fundamental frequency (F0), energy, relative timing of phonetic segments and pauses, and phonetic reduction or modification. For example a phoneme may have a variable duration which is highly dependent on context, such as preceding and following phonemes, phrase boundaries, word stress, phrase boundaries, etc.”)
DROPULJIC in view of Phatak and Kolbegger, and Gross, are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of DROPULJIC, Phatak, and Kolbegger to incorporate the teachings of Gross to include determining the response delay by adjusting, by the computer, the response delay based on a context of the dialogue between the caller and the agent. Doing so allows for a better selection and optimization for discriminating against machines, as recognized by Gross in [0006].
As to claim 15, claim 15 is a system claim with limitations similar to those of claim 5 and is rejected under the same rationale.
As to claim 16, claim 16 is a system claim with limitations similar to those of claim 6 and is rejected under the same rationale.
Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Dropuljic (U.S. Patent No. 12,206,820 B2), in view of Phatak (U.S. Patent Application Publication No. 2021/0280171 A1), and further in view of Newstadt (U.S. Patent No. 10,657,971 B1).
Regarding claim 7, DROPULJIC in view of Phatak teaches the method of claim 1.
DROPULJIC in view of Phatak do not specifically teach further comprising extracting, by the computer, a transcription containing text in chronological sequence of each caller speech segment and each agent speech segment. However, Newstadt does teach this limitation (see Newstadt (12:55-13:14) “(59) As explained in connection with FIGS. 3 and 5, the systems and methods described herein may identify fraudulent and/or nuisance phone calls by using data collected by client applications installed within a community of users. The systems and methods described herein may collect data that captures identifying characteristics of the call as well as the user community's response to the call. In some embodiments, the systems and methods described herein may combine data across the community into a call reputation rating and then provide the reputation rating back to users to identify suspicious calls as the calls are received. In some embodiments, users may install the systems described herein as a client application on their device. In various embodiments, this client application may be a mobile application on a phone or a desktop application on a laptop or desktop. In some embodiments, the application may listen in on all audio or video calls made to the user's device. In some examples, the application may listen for the quality of the call, amount of time between answering the call and the caller speaking, time lag between the user speaking and the caller responding, background noise, and the tone, qualities, and emotional content of the caller's voice. In some embodiments, using voice recognition, the application may also monitor the content of the call itself, such as the specific words the caller is using as they initiate the conversation. For example, does the caller say, “Hi John,” or does the caller say, “for a small fee?””)
DROPULJIC in view of Phatak and Newstadt are in the same field of endeavor of signal processing; therefore, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of the combination of DROPULJIC and Phatak to incorporate the teachings of Newstadt to include extracting, by the computer, a transcription containing text in chronological sequence of each caller speech segment and each agent speech segment. Doing so improves the functioning of a computing device by detecting potentially malicious calls with increased accuracy, thus reducing the likelihood that the computing device's user will be victimized by malicious callers, as recognized by Newstadt in (4:6-9).
As to claim 17, claim 17 is a system claim with limitations similar to those of claim 7 and is rejected under the same rationale.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KRISTEN MICHELLE MASTERS whose telephone number is (703)756-1274. The examiner can normally be reached M-F 8:30 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Louis Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KRISTEN MICHELLE MASTERS/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659