DETAILED ACTION
This action is in response to the application filed 05/21/2024. Claims 1-20 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 20 is objected to because of the following informalities: a system claim cannot properly depend from a computer-readable storage medium claim. Appropriate correction is required.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5, 8-12, and 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Trim et al. (U.S. Pub. No. 2023/0239400, hereinafter “Trim”) in view of Chao (U.S. Pub. No. 2025/0141822).
Regarding Claim 1, Trim teaches
A method of determining (see Trim Abstract, Computer-implemented methods), by a cloud-based service provider, characteristics of a voice call in a telecommunications network (see Trim Paragraph [0003], The computer-implemented method may include one or more processors configured for receiving voice call data corresponding to an incoming telephone call placed to a user device, wherein the voice call data comprises caller voice data. Further, the computer-implemented method may include one or more processors configured for converting the caller voice data to caller text data comprising one or more text phrases. Further the computer-implemented method may include one or more processors configured for determining that the one or more text phrases satisfies a first condition, and responsive to determining that the one or more text phrases satisfies the first condition, transmitting a user alert to the user device and [0034], Server 125 can be a standalone computing device, a management server, a web server, or any other electronic device or computing system capable of receiving, sending, and processing data and capable of communicating with user device 120 and/or database 124 via network 110. In other embodiments, server 125 represents a server computing system utilizing multiple computers as a server system, such as a cloud computing environment), the method comprising:
accessing, by the cloud-based service provider, a voice call that is in process in the telecommunications network, wherein the voice call is serviced by a mobile operator of the telecommunications network (see Trim Figure 4, item 402, receive voice call data corresponding to an incoming telephone call, placed to a user device, wherein the voice call data comprises voice data, and Paragraph [0029], distributed data processing environment 100 includes user device 120, server 125, and database 124, interconnected over network 110. Network 110 operates as a computing network that can be, for example, a local area network (LAN), a wide area network (WAN), or a combination of the two, and can include wired, wireless, or fiber optic connections. In general, network 110 can be any combination of connections and protocols that will support communications between user device 120, server 125, and database 124);
for a specified segment of the voice call, converting, by the cloud-based service provider, a sample of the voice call to text (see Trim Figure 4, item 404, convert the caller voice data to caller text data comprising one or more text phrases, and Paragraph [0058], computer-implemented method 400 may include one or more processors configured to convert 404 the caller voice data to caller text data comprising one or more text phrases);
based on the converted sample of text, generating, by the cloud-based service provider, a prompt for input to a language model, wherein the prompt is usable to prompt the language model to determine a likelihood that the voice call meets one or more characteristics, wherein the prompt includes an example of a different voice call that meets the one or more characteristics (see Trim Figure 3 and Paragraph [0051], model 300 may include trained model 340 configured as a Natural Language Processing (NLP) engine to interpret call training data 310 phrases embedded as 768 dim BERT phrase embeddings 320 and active incoming call data received at user device 120. An NLP engine is a core component that interprets statements at any given time and converts the statements to structured inputs that the system can process. NLP engines may contain advanced machine learning algorithms to identify intent in caller statements and further matches caller intent to a list of available classifications determined and saved within the system. For example, NLP engines may use either finite state automatic models or deep learning models to generate system-generated responses to caller statements. NLP engine may include an intent classifier and an entity extractor, wherein the intent classifier may be configured to interpret the natural language of a statement and the entity extractor may be configured to extract key information or keywords from the statement, Figure 4, item 406, determine that the one or more text phrases satisfies a first condition, Paragraph [0055], the machine learning model (e.g., trained model 340) may include a deep learning model, as described above herein, wherein the deep learning models are trained on various features (e.g., sentence embeddings, syntactic features) configured to generate model output data in response to receiving and processing NL text data. 
The model output data may include a binary classification indicating whether the NL text data corresponds to a fraudulent call or corresponds to a legitimate call. This determination improves the user experience in situations where the user would prefer not to field numerous fraudulent calls, and Paragraph [0059], computer-implemented method 400 may include one or more processors configured to determine 406 that the one or more text phrases satisfies a first condition. Furthermore, computer-implemented method 400 may be configured for transmitting the caller text data to a pattern matching module, determining a phrase score for each of the one or more text phrases based on the one or more text phrases and a plurality of fraudulent phrases, and determining that the one or more text phrases satisfies the first condition. In an embodiment, a condition may include that the one or more text phrases exceeds a predetermined threshold based on the risk of the call. For example, if the phrase score for each of the one or more text phrases exceeds a predetermined threshold, then a risk score for the call may be determined based on an aggregation of the text phrase scores. Even further for example, if one of the text phrases has a risk score that exceeds a predetermined threshold, then the risk score for the call may be determined to be a critical risk score, regardless of the risk scores for the remaining text phrases);
inputting the prompt to the language model to determine the likelihood that the voice call meets the one or more characteristics (see Trim Figure 4, item 406 and item 408, determining that the one or more text phrases satisfies a first condition, responsive to determining that the one or more text phrases satisfies the first condition, transmitting a user alert to the user device, Paragraph [0052], model 300 may include one or more processors configured for identifying caller voice data corresponding to a natural language (NL) text in the text data. For example, an NLP engine may be configured to process the text data to identify NL text in the voice data and to process the text data to identify NL text in the caller voice data, and Paragraph [0053], model 300 may further include one or processors configured for processing, by trained model 300, the embedded call training data 310 to generate model output data corresponding to a voice phrase classification 350. Further, model 300 may be configured for determining the NL text as the call training data 310 that corresponds to a fraudulent call if the voice phrase classification 350 satisfies a condition. The voice phrase classification 350 may correspond to a first class (class 1) indicating that the call is a fraudulent call or a second class (class 0) indicating that the voice data corresponds to a legitimate call. A class 0 text data comprising a NL utterance might be a legitimate call from an agent of a user’s financial institution, insurance agency, or some other official entity through which user has an established relationship. Class 1 NL utterances are not legitimate calls because they were placed by unsolicited callers seeking to obtain information from the user that the user may not have authorized. A condition may include a binary classification or a score corresponding to a binary classification); and
initiating, by the cloud-based service provider, an action based on the determined likelihood (see Trim Figure 4, item 408, responsive to determining that the one or more text phrases satisfies the first condition, transmitting a user alert to the user device, Paragraph [0060], responsive to determining that the one or more text phrases satisfies a condition, computer-implemented method 400 may include one or more processors configured to transmit 408 a user alert to the user device. Furthermore, prior to transmitting the user alert to the user device, computer-implemented method 400 may be configured for placing the telephone call on hold and outputting the user alert to the user device, wherein the user alert may include a message to the user to not share user information with the caller on the incoming call, Paragraph [0061], responsive to determining that the one or more text phrases satisfies the first condition, computer-implemented method 400 may be configured for outputting a caller audible message to solicit caller information from a caller originating the incoming telephone call, receiving the caller information from the caller, and storing the caller information in a database, wherein the caller information may include one or more of a call back number and a caller entity name).
Trim discusses the possibility of multiple AI methods that can be implemented, which may include characteristics of a large language model (see Trim Paragraph [0044]); however, it does not expressly teach using a large language model.
However, Chao teaches
A large language model (see Chao Paragraph [0077], anti-spam engine 108 used in communications may include check component 202 and recommendation component 204. The anti-spam engine 108 may receive, among other things, text corresponding to messages from user devices, such as UE 102, within a network cell associated with a base station 104. For example, a user may receive a text message from an unknown source or one that includes suspicious attachments or links. The user may wish to confirm the text message is legitimate. To do so, the user may select the message or a portion of the message (i.e., text included in the message, an attachment included in the message, a link included in the message, or a combination thereof) to send to the anti-spam engine 108. In aspects, the anti-spam engine 108 is an AI-powered language model capable of generating a human-like response based on the message or a portion of the message selected by the user for analysis. Examples of such AI-powered language models include, but are not limited to, OPENAI GPT, GOOGLE LAMDA, MICROSOFT BING AI, DEEPMIND SPARROW, ANTHROPIC CLAUDE 2, GOOGLE BARD, META LLAMA, MICROSOFT NVIDIA MEGATRON-TURING, GPT 3.5/4 (text) CHAT COMPLETIONS (conversation) DALL-E2 (image) WHISPER (transcription and translation), and AWS SAGEMAKER)
Thus, Trim and Chao each disclose a language model used to identify fraud in communications. A person of ordinary skill in the art before the effective filing date of the claimed invention would have recognized that the large language model of Chao could have been substituted for the language model of Trim because both the large language model in Chao and the language model in Trim serve the purpose of automating the task of pattern matching and recognition. Furthermore, a person of ordinary skill in the art would have been able to carry out the substitution. Finally, the substitution achieves the predictable result of implementing artificial intelligence to perform pattern matching in order to automate the task and increase efficiency. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute the large language model of Chao for the language model of Trim according to known methods to yield the predictable result of implementing artificial intelligence to perform pattern matching and recognition.
Regarding Claim 2, Trim in view of Chao teaches
The method of claim 1, wherein the large language model is a generative pre-trained transformer (GPT) model (see Chao Paragraph [0077], anti-spam engine 108 used in communications may include check component 202 and recommendation component 204. The anti-spam engine 108 may receive, among other things, text corresponding to messages from user devices, such as UE 102, within a network cell associated with a base station 104. For example, a user may receive a text message from an unknown source or one that includes suspicious attachments or links. The user may wish to confirm the text message is legitimate. To do so, the user may select the message or a portion of the message (i.e., text included in the message, an attachment included in the message, a link included in the message, or a combination thereof) to send to the anti-spam engine 108. In aspects, the anti-spam engine 108 is an AI-powered language model capable of generating a human-like response based on the message or a portion of the message selected by the user for analysis. Examples of such AI-powered language models include, but are not limited to, OPENAI GPT, GOOGLE LAMDA, MICROSOFT BING AI, DEEPMIND SPARROW, ANTHROPIC CLAUDE 2, GOOGLE BARD, META LLAMA, MICROSOFT NVIDIA MEGATRON-TURING, GPT 3.5/4 (text) CHAT COMPLETIONS (conversation) DALL-E2 (image) WHISPER (transcription and translation), and AWS SAGEMAKER).
Regarding Claim 3, Trim in view of Chao teaches
The method of claim 1, wherein the characteristics are indicative of fraudulent activity (see Trim Figure 4, item 406 and item 408, determining that the one or more text phrases satisfies a first condition, responsive to determining that the one or more text phrases satisfies the first condition, transmitting a user alert to the user device, Paragraph [0052], model 300 may include one or more processors configured for identifying caller voice data corresponding to a natural language (NL) text in the text data. For example, an NLP engine may be configured to process the text data to identify NL text in the voice data and to process the text data to identify NL text in the caller voice data, and Paragraph [0053], model 300 may further include one or processors configured for processing, by trained model 300, the embedded call training data 310 to generate model output data corresponding to a voice phrase classification 350. Further, model 300 may be configured for determining the NL text as the call training data 310 that corresponds to a fraudulent call if the voice phrase classification 350 satisfies a condition. The voice phrase classification 350 may correspond to a first class (class 1) indicating that the call is a fraudulent call or a second class (class 0) indicating that the voice data corresponds to a legitimate call. A class 0 text data comprising a NL utterance might be a legitimate call from an agent of a user’s financial institution, insurance agency, or some other official entity through which user has an established relationship. Class 1 NL utterances are not legitimate calls because they were placed by unsolicited callers seeking to obtain information from the user that the user may not have authorized. A condition may include a binary classification or a score corresponding to a binary classification).
Regarding Claim 4, Trim in view of Chao teaches
The method of claim 1, wherein the action is one of sending an SMS message, terminating the call, and interjecting an audio message into the call (see Trim Figure 4, item 408, responsive to determining that the one or more text phrases satisfies the first condition, transmitting a user alert to the user device, Paragraph [0060], responsive to determining that the one or more text phrases satisfies a condition, computer-implemented method 400 may include one or more processors configured to transmit 408 a user alert to the user device. Furthermore, prior to transmitting the user alert to the user device, computer-implemented method 400 may be configured for placing the telephone call on hold and outputting the user alert to the user device, wherein the user alert may include a message to the user to not share user information with the caller on the incoming call, Paragraph [0061], responsive to determining that the one or more text phrases satisfies the first condition, computer-implemented method 400 may be configured for outputting a caller audible message to solicit caller information from a caller originating the incoming telephone call, receiving the caller information from the caller, and storing the caller information in a database, wherein the caller information may include one or more of a call back number and a caller entity name).
Regarding Claim 5, Trim in view of Chao teaches
The method of claim 1, wherein the action comprises sending information pertaining to the voice call to a third party (see Trim Paragraph [0011], If the call is determined to exceed a risk factor threshold, embodiments of the present invention may be configured to route the user to involve a software agent/chatbot to intercede the call and prevent the user from sharing confidential information or simply wasting time with the call).
Regarding Claim 8, it is rejected under the same rationale as Claim 1. The system can be found in Trim (Abstract, system).
Regarding Claim 9, Trim in view of Chao teaches
The computing system of claim 8, wherein the machine learning model is a large language model (LLM) (see Chao Paragraph [0077], anti-spam engine 108 used in communications may include check component 202 and recommendation component 204. The anti-spam engine 108 may receive, among other things, text corresponding to messages from user devices, such as UE 102, within a network cell associated with a base station 104. For example, a user may receive a text message from an unknown source or one that includes suspicious attachments or links. The user may wish to confirm the text message is legitimate. To do so, the user may select the message or a portion of the message (i.e., text included in the message, an attachment included in the message, a link included in the message, or a combination thereof) to send to the anti-spam engine 108. In aspects, the anti-spam engine 108 is an AI-powered language model capable of generating a human-like response based on the message or a portion of the message selected by the user for analysis. Examples of such AI-powered language models include, but are not limited to, OPENAI GPT, GOOGLE LAMDA, MICROSOFT BING AI, DEEPMIND SPARROW, ANTHROPIC CLAUDE 2, GOOGLE BARD, META LLAMA, MICROSOFT NVIDIA MEGATRON-TURING, GPT 3.5/4 (text) CHAT COMPLETIONS (conversation) DALL-E2 (image) WHISPER (transcription and translation), and AWS SAGEMAKER).
Regarding Claims 10-12, they are rejected under the same rationale as Claims 3-5, respectively. The system can be found in Trim (Abstract, system).
Regarding Claims 15-19, they are rejected under the same rationale as Claims 1-5, respectively. The computer-readable storage medium can be found in Trim (Paragraph [0067], computer-readable storage media).
Claims 6, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Trim et al. (U.S. Pub. No. 2023/0239400, hereinafter “Trim”) in view of Chao (U.S. Pub. No. 2025/0141822) and Qian et al. (CN Pub. No. 117634471, hereinafter “Qian”).
Regarding Claim 6, Trim in view of Chao teaches all the limitations of claim 1, but does not expressly teach
The method of claim 1, wherein the sample of the voice call is converted to text as a series of utterances delineated by a pause having a threshold duration.
However, Qian teaches
The method of claim 1, wherein the sample of the voice call is converted to text as a series of utterances delineated by a pause having a threshold duration (see Qian Page 6, As shown in Figure 4, in step 4, for the conversation content string processed in step 3, the audio file is scanned in chronological order, and the continuous non-speech pause duration is detected based on the voice activity detection algorithm. If the pause duration is greater than the preset pause threshold, it will be segmented into an independent sentence, and the text converted by automatic speech recognition will be segmented at the same time; if the length of the segmented sentence is greater than the preset sentence threshold, the segment will continue to be segmented. Among them, the pause threshold value range is 1 to 3 seconds, and the recommended value is 1.5 seconds; the sentence threshold value range is 30 to 120 words, and the recommended value is 60 words).
It would have been obvious to one of ordinary skill in the art before the effective filing date of
the claimed invention to combine the teaching of a method of detecting fraudulent activity in a voice call using machine learning (as taught in Trim in view of Chao), with converting speech to text and dividing speech segments according to pauses in the conversation (as taught in Qian), the motivation being to separate and process the speaker’s speech content in order to identify dialogue (see Qian Pages 3 - 4).
Regarding Claim 13, it is rejected under the same rationale as Claim 6. The system can be found in Trim (Abstract, system).
Regarding Claim 20, it is rejected under the same rationale as Claim 6. The computer-readable storage medium can be found in Trim (Paragraph [0067], computer-readable storage media).
Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Trim et al. (U.S. Pub. No. 2023/0239400, hereinafter “Trim”) in view of Chao (U.S. Pub. No. 2025/0141822) and Monteleone et al. (U.S. Pub. No. 2025/0190703, hereinafter “Monteleone”).
Regarding Claim 7, Trim in view of Chao teaches all the limitations of claim 1, but does not expressly teach
The method of claim 1, further comprising executing, by the cloud-based service provider, a bot configured to generate the prompt.
However, Monteleone teaches
The method of claim 1, further comprising executing, by the cloud-based service provider, a bot configured to generate the prompt (see Monteleone Paragraph [0054], A word analyzer 508 and a non-word analyzer 510 are then used to process the multi-mode user input when it is verbal and non-verbal communication, respectively. According to an exemplary embodiment, the word analyzer 508 and the non-word analyzer 510 normalize the multi-mode input for analysis by converting all input utterances (verbal and non-verbal communication) into vector representations (referred to hereinafter as ‘utterance vectors’). Text input is then directly analyzed. Whereas, speech and/or voice utterances are converted to text using an artificial intelligence (AI) speech-to-text service).
It would have been obvious to one of ordinary skill in the art before the effective filing date of
the claimed invention to combine the teaching of a method of detecting fraudulent activity in a voice call using machine learning (as taught in Trim in view of Chao), with implementing a bot to generate a prompt (as taught in Monteleone), the motivation being to implement task automation in order to provide more time for users to work on other less routine matters (see Monteleone Paragraph [0002]).
Regarding Claim 14, it is rejected under the same rationale as Claim 7. The system can be found in Trim (Abstract, system).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Refer to PTO-892, Notice of References Cited for a listing of analogous art.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CARISSA A JONES whose telephone number is (703) 756-1677. The examiner can normally be reached via telework, M-F, 6:30 AM - 4:00 PM CT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Duc Nguyen, can be reached at 571-272-7503. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CARISSA A JONES/Examiner, Art Unit 2691
/DUC NGUYEN/Supervisory Patent Examiner, Art Unit 2691