DETAILED ACTION
1. This action is responsive to Application No. 18/813,647, filed 8/23/2024. Claims 1-20 have been examined and are currently pending.
Notice of Pre-AIA or AIA Status
2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
3. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
4. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
5. Claims 1-8, 10, and 12-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Horton et al. (US 2020/0411013).
Regarding claim 1, Horton et al. (US 2020/0411013) teaches A method (para. 0065: system and method; para. 0133: processing engine for implementing the system and method; Fig. 5), comprising:
obtaining, by a computing system and from one or more sources that provide media files over one or more networks, a plurality of media files that each includes speech of one or more speakers (para. 0133/Fig. 5: processing engine for implementing; paras. 0065, 0069: persons participating; para. 0125: further embodiments provide enrollment and authentication of individuals both inside and outside secure facilities using audio data in any format, for example, live or pre-recorded video, audio, or any other type of call or recording, originating from any source, for example, land line or wireless live calls, pre-recorded calls, video or audio messages, and video or audio recordings of any type; para. 0136; para. 0140: audio files with speakers);
processing, by the computing system, the plurality of media files to generate indexed data, wherein the indexed data includes, for each media file of the plurality of media files, a corresponding embedding for each speaker of the one or more speakers identified in the media file and a corresponding transcript of speech for each language identified in the speech in the media file (para. 0131: transcribe and index audio data for keyword searching; para. 0140: the system may also provide robust transcription and indexing of audio files; files may be subject to segmentation analysis, voice print generation, speaker identification and assignment, and location in the biometric voice print database; para. 0020: biometric voice print (BVP); paras. 0038-0039: speaker identification to biometric voice prints; para. 0041: transcribing the audio files; para. 0043: transcribing the audio files to a text format may include using an automatic language detection process to distinguish languages spoken by the speakers; para. 0129: separate persons participating in the call, referred to as speakers, in the audio data; para. 0130: create individual biometric voice print databases; para. 0139: custom indices of the biometric voice print database 508 related to speaker identities, key words, or any other fields or records of the biometric voice print database; paras. 0141-0143: language detection; para. 0146: keyword indexing);
receiving, by the computing system, an indication of at least one of a selection of a particular speaker from the one or more speakers or a selection of a particular keyword from a plurality of keywords (para. 0093: matching 34 the speaker to a BVP by identifying all existing voice samples matched by the incoming voice sample; para. 0148: receiving words and phrases; para. 0149: a user may upload an audio file in which a speaker presents a threat to an emergency operator; the user may then request that the system identify the speaker from the pool of voice prints in the biometric voice print database 508 assigned to some or all speakers that meet certain criteria);
generating, by the computing system, one or more correlations based on the indexed data, wherein the one or more correlations include at least one of an association among the one or more speakers or an association among keywords detected in the transcripts as spoken by the one or more speakers (para. 0020: the detection system can be used to verify the identity of persons both inside and outside the facility and to detect known persons-of-interest (POI) in subsequent calls; paras. 0093, 0095: an occasion where this is useful is if the inmate is on a watch list and the agency wants to know who is leaving him a voice mail in real time; paras. 0146-0148: keywords; notifications when the established criteria is met; recognizing phrases and discussions; para. 0149: a user may upload an audio file in which a speaker presents a threat to an emergency operator and may then request that the system identify the speaker from the pool of voice prints in the biometric voice print database 508 assigned to some or all speakers that meet certain criteria); and
outputting, by the computing system, based on the one or more correlations, an indication regarding the one or more correlations (para. 0020: the detection system can be used to verify the identity of persons both inside and outside the facility and to detect known persons-of-interest (POI) in subsequent calls; the secure facility handling the call will be notified in real time if a caller is a known person inside the facility, a person-of-interest, a previous caller, or a new caller; para. 0039: the method may also include assigning speaker identification records to biometric voice prints with authenticated speakers, and storing the speaker identification records and the biometric voice prints in the biometric voice print database; para. 0142: the automatic language detection process provides the ability to offer notification, alerting and routing options based on the spoken languages, such as real time notification when a speaker utters certain words or phrases in a particular language; para. 0148: notifications and alerts).
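For illustration of the indexing step recited in claim 1, a minimal sketch follows. It is not part of Horton's disclosure; the diarization, embedding, language-detection, and transcription callables are hypothetical stand-ins for the voice print and transcription engines Horton describes (paras. 0131, 0140-0143).

    from dataclasses import dataclass

    @dataclass
    class IndexedFile:
        file_id: str
        speaker_embeddings: dict  # speaker_id -> embedding vector, one per identified speaker
        transcripts: dict         # language code -> transcript text, one per detected language

    def build_index(media_files, diarize, embed, detect_languages, transcribe):
        """Produce indexed data: per-speaker embeddings and per-language transcripts."""
        index = []
        for f in media_files:
            segments = diarize(f["audio"])  # [(speaker_id, audio_segment), ...]
            # In this simplified sketch, one embedding per speaker (last segment wins).
            embeddings = {spk: embed(seg) for spk, seg in segments}
            transcripts = {lang: transcribe(f["audio"], lang)
                           for lang in detect_languages(f["audio"])}
            index.append(IndexedFile(f["id"], embeddings, transcripts))
        return index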
Regarding claim 2, Horton teaches The method of claim 1, wherein processing the plurality of media files into indexed data includes clustering excerpts from the plurality of media files (para. 0120: the original call assigned to the calling number is then reassigned to the newly identified speaker, forming a call cluster; additional calls from this or other numbers identified as including this speaker may be added to the cluster with the goal of grouping the audio files by speaker).
Regarding claim 3, Horton teaches The method of claim 1, wherein processing the plurality of media files into indexed data includes:
extracting embeddings from overlapping windows from each of the plurality of media files (para. 0020: BVP; paras. 0038-0039: speaker ID; para. 0074: BVPs for the caller as well as the called party; Fig. 4; para. 0096: window; overlap); and
applying clustering to the embeddings (para. 0120: cluster; grouping the audio files by speaker, using the speaker embeddings/ID).
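The overlapping-window extraction and clustering mapped above for claims 2-3 may be illustrated by the following sketch; it is not from the record, and the window size, hop size, similarity threshold, and `embed` model are all assumptions.

    import numpy as np

    def windows(samples, win=32000, hop=16000):
        """Split audio into 2 s windows with 50% overlap (16 kHz sample rate assumed)."""
        return [samples[i:i + win] for i in range(0, max(1, len(samples) - win + 1), hop)]

    def cluster(embeddings, threshold=0.75):
        """Greedy clustering: join the first cluster whose seed embedding has
        cosine similarity above the threshold, else start a new cluster."""
        seeds, labels = [], []
        for e in embeddings:
            e = np.asarray(e) / np.linalg.norm(e)
            sims = [float(s @ e) for s in seeds]
            if sims and max(sims) >= threshold:
                labels.append(int(np.argmax(sims)))
            else:
                seeds.append(e)
                labels.append(len(seeds) - 1)
        return labels  # one speaker-cluster label per window embedding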
Regarding claim 4, Horton teaches The method of claim 1, further comprising:
generating master data by matching keywords included in a watchlist to the transcripts (paras. 0131, 0139, 0146-0148: keyword indexing; para. 0148: some exemplary topics and phrases may include “Sexual Misconduct”, “Threat Potential”, and “Drug Related”; the lexicons may be modified by the user to create custom and potentially shared templates from which notifications and alerts may be automatically generated when the established criteria is met; exemplary applications may include recognizing phrases such as “Call XXX-XXX0XXXX”, which may indicate that a prohibited 3-way call may be attempted, and recognizing discussions about the weather that may indicate planning for unrest, disruptions, riots or escapes).
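As an illustration of the watchlist-to-transcript matching mapped above, the following sketch matches lexicon phrases against the indexed transcripts and emits alert records (the same mechanism is relied upon for the keyword notification of claim 5 below). The names are hypothetical and reuse the IndexedFile structure sketched under claim 1.

    import re

    def match_watchlist(index, watchlist):
        """Return one hit record per watchlist phrase found in a transcript;
        each hit can drive a real-time notification or alert."""
        hits = []
        for entry in index:
            for lang, transcript in entry.transcripts.items():
                for phrase in watchlist:
                    if re.search(re.escape(phrase), transcript, re.IGNORECASE):
                        hits.append({"file": entry.file_id,
                                     "language": lang,
                                     "phrase": phrase})
        return hits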
Regarding claim 5, Horton teaches The method of claim 1, further comprising:
receiving, by the computing system, an indication of a selection of a particular keyword (paras. 0125, 0131, 0148-0149: receiving words and phrases), and
wherein outputting the indication includes generating the indication based on determining that at least one media file of the plurality of media files includes a speaker speaking the particular keyword (para. 0131; para. 0142: the automatic language detection process provides the ability to offer notification, alerting and routing options based on the spoken languages, such as real time notification when a speaker utters certain words or phrases in a particular language; para. 0148: notifications and alerts may be automatically generated when the established criteria is met, for example on recognizing phrases such as “Call XXX-XXX0XXXX”, which may indicate that a prohibited 3-way call may be attempted).
Regarding claim 6, Horton teaches The method of claim 1, wherein the indexed data includes, for each speaker, at least one of:
a gender identifier, or an identifier of the language spoken by the speaker (paras. 0141-0142: language detection).
Regarding claim 7, Horton teaches The method of claim 6, further comprising:
determining, by the computing system, a subset of the one or more speakers, where each speaker of the subset of the one or more speakers is associated with the particular speaker (para. 0120: grouping the audio files by speaker; paras. 0079-0084), and
wherein determining the subset includes:
determining that the speakers of the subset of one or more speakers and the particular speaker speak in a same media file (para. 0074: a need for the facility is the ability to identify the individuals receiving outgoing calls from within the facility who could possibly be involved in a criminal activity in collusion with the inmates; the disclosure describes an automated process to create BVPs for the caller as well as the called party in calls originating either inside or outside a facility, and to create BVPs from any source of audio data originating from inside or outside the facility; para. 0095: an occasion where this is useful is if the inmate is on a watch list and the agency wants to know who is leaving him a voice mail in real time; paras. 0098, 0140).
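The same-media-file association mapped above can be illustrated by a short sketch (hypothetical names, reusing the IndexedFile structure from the claim 1 sketch): speakers are associated with the particular speaker when they appear in a common file, as with Horton's caller/called-party pairs (para. 0074).

    def associated_speakers(index, particular_speaker):
        """Collect every speaker who shares at least one media file
        with the particular speaker (co-occurrence association)."""
        subset = set()
        for entry in index:
            speakers = set(entry.speaker_embeddings)
            if particular_speaker in speakers:
                subset |= speakers - {particular_speaker}
        return subset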
Regarding claim 8, Horton teaches The method of claim 1, further comprising:
receiving, by the computing system, an indication of a selection of a particular speaker of the one or more speakers (paras. 0125, 0149: audio of a speaker); and
identifying, by the computing system, one or more media files that include speech by the particular speaker (para. 0149: request the system return all calls that originate from the secure facility in which the person of interest is a speaker).
Regarding claim 10, Horton teaches The method of claim 1,
wherein the indexed data includes, for a media file of the plurality of media files, respective identifiers for multiple speakers of the one or more speakers that speak in the media file (para. 0140: indexing of audio files with speakers; speaker identification; paras. 0038-0039: speaker ID; para. 0094: the system may be able to identify one or both speakers during a call, which permits the system to operate in real time and to continue to confirm speaker identification throughout the duration of the call), and
wherein generating the one or more correlations based on the indexed data comprises identifying an association among the multiple speakers based on the identifiers for the multiple speakers that speak in the media file (paras. 0094, 0140, 0149).
Regarding claim 12, Horton teaches The method of claim 1, wherein processing the plurality of media files to generate the indexed data includes:
processing the plurality of media files to generate, for each media file, respective embeddings for one or more speakers having speech in the media file (paras. 0079-0084: processing selected segments to create a BVP for the target person); and
matching speaker embeddings included in a watchlist to the embeddings for the one or more speakers having speech in the media file (paras. 0020-0021, 0074, 0094-0095, 0149: identifying a person of interest/those on a watchlist).
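The watchlist-embedding matching mapped above may be illustrated by the following sketch; it is not from the record, and the cosine-similarity threshold is an assumption, standing in for Horton's BVP matching (para. 0093).

    import numpy as np

    def match_embeddings(index, watchlist, threshold=0.8):
        """Compare each per-file speaker embedding against enrolled watchlist
        embeddings (voice prints) by cosine similarity; report matches."""
        matches = []
        for entry in index:
            for spk, emb in entry.speaker_embeddings.items():
                e = np.asarray(emb) / np.linalg.norm(emb)
                for name, ref in watchlist.items():
                    r = np.asarray(ref) / np.linalg.norm(ref)
                    if float(e @ r) >= threshold:
                        matches.append((entry.file_id, spk, name))
        return matches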
Regarding claim 13, Horton teaches The method of claim 1, wherein, for each media file of the plurality of media files, the corresponding one or more keywords identified in the speech in the media file are present in a transcript of the media file (paras. 0145-0148: transcription; keywords).
Regarding claim 14, Horton teaches The method of claim 1, wherein the media file is a first media file, wherein the indication is a first indication, and further comprising:
receiving, by the computing system, a second media file of a speaker for enrollment, wherein the second media file includes speech of at least one speaker (paras. 0065, 0125, 0140: receiving media files; paras. 0079-0084: creating a BVP);
processing, by the computing system, the second media file, wherein processing the second media file includes:
extracting an embedding of the at least one speaker from the second media file (paras. 0077, 0079-0084: creating a BVP; para. 0140), and
matching the embedding to a cluster of one or more clusters of a plurality of speakers, wherein each cluster of the plurality of clusters corresponds to a respective speaker of a plurality of speakers (para. 0120: clustering by speaker based on BVP/ID); and
outputting, by the computing system, based on matching the embedding to the cluster, a second indication that includes an indication of a match between the at least one speaker and a speaker of the plurality of speakers (paras. 0020, 0074; para. 0094: the system may be able to identify one or both speakers during a call, which permits the system to operate in real time and to continue to confirm speaker identification throughout the duration of the call; para. 0120; para. 0149: the disclosed embodiments facilitate user requests for biometric analysis of an audio file where a speaker is believed to be a current inmate, former inmate, an associate of a current or former inmate, a person promoting criminal activity, or generally a person of interest; for example, a user may upload an audio file in which a speaker presents a threat to an emergency operator and then request that the system identify the speaker from the pool of voice prints in the biometric voice print database 508 assigned to some or all speakers that meet certain criteria, such as inmates released over a specified time frame; as another example, the user may provide an audio file including speech from a person of interest and request the system return all calls that originate from the secure facility in which the person of interest is a speaker).
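The enrollment-and-match flow mapped above for claim 14 may be sketched as follows; this is illustrative only, and `embed` and the threshold are hypothetical stand-ins for Horton's BVP generation and matching.

    import numpy as np

    def enroll_and_match(audio, embed, cluster_centroids, threshold=0.8):
        """Extract an embedding from the newly enrolled file and report the
        best-matching existing speaker cluster if it clears the threshold."""
        e = embed(audio)
        e = np.asarray(e) / np.linalg.norm(e)
        best, best_sim = None, -1.0
        for speaker, centroid in cluster_centroids.items():
            c = np.asarray(centroid) / np.linalg.norm(centroid)
            sim = float(e @ c)
            if sim > best_sim:
                best, best_sim = speaker, sim
        return (best, best_sim) if best_sim >= threshold else (None, best_sim)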
Regarding claim 15, Horton teaches The method of claim 1, wherein the plurality of media files include media files with audio events, and wherein generating the correlations includes generating correlations that include associations among the audio events (paras. 0125, 0140: obtaining audio and identifying data/relationships/associations of the content).
Regarding claim 16, Horton teaches A computing system, comprising:
memory; and
one or more programmable processors in communication with the memory and configured to:
obtain, from one or more sources that provide media files over one or more networks, a plurality of media files that each includes speech of one or more speakers;
process the plurality of media files to generate indexed data, wherein the indexed data includes, for each media file of the plurality of media files, a corresponding embedding for each speaker of the one or more speakers identified in the media file and a corresponding transcript of speech for each language identified in the speech in the media file;
receive an indication of at least one of a selection of a particular speaker from the one or more speakers or a selection of a particular keyword from a plurality of keywords;
generate one or more correlations based on the indexed data, wherein the one or more correlations include at least one of an association among the one or more speakers or an association among keywords detected in the transcripts as spoken by the one or more speakers; and
output, based on the one or more correlations, an indication regarding the one or more correlations.
Claim 16 recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.
Claim 17 recites limitations similar to claim 2 and is rejected for similar rationale and reasoning.
Claim 18 recites limitations similar to claim 3 and is rejected for similar rationale and reasoning.
Claim 19 recites limitations similar to claim 4 and is rejected for similar rationale and reasoning.
Regarding claim 20, Horton teaches Non-transitory computer-readable media comprising instructions that, when executed by one or more processors, cause the one or more processors to:
obtain, from one or more sources that provide media files over one or more networks, a plurality of media files that each includes speech of one or more speakers;
process the plurality of media files to generate indexed data, wherein the indexed data includes, for each media file of the plurality of media files, a corresponding embedding for each speaker of the one or more speakers identified in the media file and a corresponding transcript of speech for each language identified in the speech in the media file;
receive an indication of at least one of a selection of a particular speaker from the one or more speakers or a selection of a particular keyword from a plurality of keywords;
generate one or more correlations based on the indexed data, wherein the one or more correlations include at least one of an association among the one or more speakers or an association among keywords detected in the transcripts as spoken by the one or more speakers; and
output, based on the one or more correlations, an indication regarding the one or more correlations.
Claim 20 recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.
Claim Rejections - 35 USC § 103
6. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
7. Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Horton et al. (US 2020/0411013) in view of Mathew (US 2020/0076845).
Regarding claim 9, Horton teaches The method of claim 1, but does not specifically teach wherein the one or more sources comprise a Clearnet site and a darknet site. Mathew (US 2020/0076845) teaches one or more sources comprising a Clearnet site and a darknet site (para. 0022: deep web, dark web, surface web).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Mathew into the system of Horton so that additional media can be obtained by crawling additional spaces of the web. Horton already teaches accessing media for transcription, identification, and indexing, and one of ordinary skill could look to Mathew to provide access to additional sources with a reasonable expectation of success, the incorporation further improving the detection and prevention of potential threats.
8. Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Horton et al. (US 2020/0411013) in view of Karas et al. (US 2003/0088397).
Regarding claim 11, Horton teaches The method of claim 1, further comprising: outputting, by the computing system, a graphical user interface (GUI), wherein the GUI includes {one or more visual representations of the} one or more correlations (Fig. 5; paras. 0021, 0134: user interface; paras. 0136, 0141-0142, 0148: notifications and alerts; para. 0149);
Horton does not, however, specifically teach one or more visual representations of the one or more correlations. Karas teaches one or more visual representations of the one or more correlations (paras. 0041, 0051: display to allow a user to navigate and utilize the indexed information stream data; paras. 0060-0061, 0072).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the visual representations of Karas into the system of Horton to allow a user to better observe the correlations, helping to detect and eliminate potential threats.
Conclusion
9. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: See PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541. The examiner can normally be reached Monday-Friday 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached on 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SHAUN ROBERTS/Primary Examiner, Art Unit 2655