DETAILED ACTION
Introduction
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. A request for continued examination (RCE) under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application on 03/02/2026 after the final rejection of 08/28/2025. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. The Applicant’s RCE submission is therefore entered. No claims have been added or cancelled, but claims 1, 5, 8, 10, 13, 17 and 20 have been amended in this submission. Thus, claims 1-20 are currently pending for reconsideration and are examined below.
Response to Arguments
2. With regards to the double patenting rejection, the Applicant requests that the rejection be kept in abeyance until the prior-art issues are resolved. The Examiner acknowledges this position; however, the rejection is repeated here for completeness of the record.
Applicant’s arguments with regards to the prior art have been fully considered but are moot in view of the new grounds of rejection necessitated by the amendments presented in this response.
Information Disclosure Statement
3. The Information Disclosure Statement (IDS) filed on 03/02/2026 has been considered and is in compliance with the provisions of 37 CFR 1.97.
Double Patenting
4. The non-statutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper time-wise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees. A non-statutory obviousness-type double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); and In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed e-terminal disclaimer (e-TD) in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on a non-statutory double patenting ground provided the conflicting application or patent either is shown to be commonly owned with this application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. Effective January 1, 1994, a registered attorney or agent of record may sign an e-terminal disclaimer. An e-terminal disclaimer signed by the assignee must fully comply with 37 CFR 3.73(b).
Claims 1-20 of the instant application are rejected on the ground of non-statutory obviousness-type double patenting as being unpatentable over claims 1-19 of U.S. Patent # 12014750. Although the conflicting claims are not identical, they are not patentably distinct from each other because the claims of the present application are broader in scope than those of U.S. Patent # 12014750, and hence the claims of U.S. Patent # 12014750 anticipate the claims of the present application. That is, the claims of U.S. Patent # 12014750 contain every limitation of the claims of the present application, or the claims of the present application are obvious variants thereof. It should be noted that this is a non-provisional non-statutory obviousness-type double patenting rejection because the conflicting claims have in fact been patented.
As an example, claim 1 of the instant application and claim 1 of U.S. Patent # 12014750 both teach steps for extracting an audio segment from a video track, detecting voice activity to identify starting and ending timestamps of voice, transcribing the at least one audio segment to identify/separate the audio of at least one speaker, and scoring the audio of the at least one speaker to identify interactions of interest. One of ordinary skill in the art would recognize that it would have been obvious to replace the method claim of U.S. Patent # 12014750 with the system claim of the instant application in order to have a patent with wider applicability and freedom to operate. Also, removal of any additional steps is obvious; see In re Karlson, 136 USPQ 184 (CCPA 1963): "Omission of an element and its function is an obvious expedient if the remaining elements perform the same functions as before."
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
5. Claims 1-5, 7-8, 10-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Childress (U.S. Patent Application Publication # 2019/0096428) in view of Qu (U.S. Patent Application Publication # 2020/0273450 A1).
With regards to claim 1, Childress teaches a system of using machine natural language processing to analyze language in transcribed camera footage comprising a non-transitory, processor-readable medium that implements an audio and language analyzer and that stores instructions that, when executed by a processor, cause the processor to extract an audio stream from a body camera video track (Paragraphs 4-5, teach various hardware implementations including a computer and a computer-readable medium. Figure 3 and para 16, teach that an audio file associated with a video file of interest can be transcribed into text and thereafter analyzed using tone analysis. Para 37 and figure 2, teach a sentiment module of the server that is configured to perform an automated natural language text analysis. Para 39, further teaches that the recording data is in the form of audio-video data recorded by a public safety officer's body camera);
transcribe the audio stream to produce a transcription (Figure 3 and para 16, teach that an audio file associated with a video file of interest can be transcribed into text and thereafter analyzed using tone analysis. Paragraphs 45-47 and figure 3, teach that data gathered is utilized to identify persons of interest in need of training. Training algorithms may be utilized to identify persons of interest and to identify baseline metrics);
score the audio of the at least one speaker after separation of the audio based on the at least one segment to identify interactions of interest (Paragraphs 18, 43 and 48, teach that sentiment analysis-based classification may be converted into numeric values, e.g., negative values may represent negative sentiment and more positive values may represent more positive sentiment. With these numeric values, a score can be associated with an interaction between a person of interest and the public. Scoring can be accomplished by first partitioning each interaction into “n” sections, where a first section represents the start of the interaction, and the “nth” section represents the end of the interaction. Next, sentiment analysis can be run on each section to obtain a “sentiment score” for each section. Thereafter, an overall “sentiment score” may be created by combining the sum of the sections to determine overall sentiment of the conversation and the sum of the change in sentiment between each section to determine how the sentiment changes throughout the conversation. Weights for combining data may be determined in a learning phase using supervised/semi-supervised learning techniques. By monitoring personnel's sentiment score over time, a business/government entity can determine when individuals engage the public in a negative manner);
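By way of illustration only, the sectioned scoring approach described in the cited paragraphs of Childress could be sketched as follows; the function name, the sentiment callable, and the combination weights below are hypothetical and are not taken from the reference, which describes learning such weights in a supervised/semi-supervised learning phase.
    def interaction_score(section_texts, sentiment, w_sum=0.5, w_delta=0.5):
        # section_texts: the "n" sections of one interaction's transcript.
        # sentiment: a callable mapping text to a numeric value (negative = negative tone).
        # w_sum / w_delta: hypothetical combination weights.
        scores = [sentiment(text) for text in section_texts]
        overall = sum(scores)  # overall sentiment of the conversation
        drift = sum(b - a for a, b in zip(scores, scores[1:]))  # change in sentiment between sections
        return w_sum * overall + w_delta * drift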
provide the at least one segment as input to a voice detection model to identify audio of at least one speaker from the audio stream (Para 40, teaches that the sentiment module will automatically detect a beginning and an ending of an interaction event. Figure 3 and para 16, teach that an audio file associated with a video file of interest can be transcribed into text and thereafter analyzed using tone analysis, including a model such as the IBM Watson tone analyzer. Paragraphs 45-47 and figure 3, teach that data gathered is utilized to identify persons of interest in need of training. Training algorithms may be utilized to identify persons of interest and to identify baseline metrics);
Childress may not explicitly detail the limitation of assigning each word from the transcription a start time and a stop time. This is taught by Qu (Para 79, teaches labelling the start and end timestamp of each unigram, i.e., a word);
Qu also teaches extracting at least one segment from the transcription based on the start time and the stop time for each word from the transcription (Para 80, teaches a script segmenting module that is used to segment the script into structural blocks of a first type, such as, for example, clauses, phrases and/or sentences, comprising the script).
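By way of illustration only, segmentation driven by per-word start and stop times could be sketched as follows. Qu's module segments the script into clauses, phrases and/or sentences; the simplified sketch below instead cuts segments at pauses between word timestamps, purely to illustrate how per-word start/stop times can drive segment extraction, and its data layout and pause threshold are hypothetical.
    def segment_by_word_times(words, max_gap=0.75):
        # words: list of dicts {"word": str, "start": float, "stop": float},
        # i.e., a transcription in which every word carries a start and stop time.
        # A new segment is started whenever the silence between consecutive words
        # exceeds max_gap seconds; returns (start, stop, text) tuples.
        segments, current = [], []
        for w in words:
            if current and w["start"] - current[-1]["stop"] > max_gap:
                segments.append((current[0]["start"], current[-1]["stop"],
                                 " ".join(x["word"] for x in current)))
                current = []
            current.append(w)
        if current:
            segments.append((current[0]["start"], current[-1]["stop"],
                             " ".join(x["word"] for x in current)))
        return segments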
Childress and Qu can be considered as analogous art as they belong to a similar field of endeavor in speech processing. It would thus have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to advantageously combine the teachings of Qu (use of per-word start and end timestamps to select segments of a voice input) with those of Childress (use of body camera footage to analyze a speech-based interaction between two parties) so as to provide an improved visual analytics system that allows users to understand good prosodic patterns in high-quality speech samples, and thus to discover good narration strategies (Qu, para 13).
With regards to claim 2, Childress teaches the system of claim 1 wherein the at least one speaker is a figure of authority, including one of a police officer, emergency technician, guard, soldier, doctor or first responder (Paragraphs 2 and 38, teach that the at least one speaker can be a public safety officer such as a police officer).
With regards to claim 3, Childress teaches the system of claim 2 wherein the interactions of interest include whether the figure of authority is escalating or de-escalating a situation (Para 14, teaches that video footage of a person of interest with a history of negative interactions with the public may be labeled as “escalating”. Likewise, video footage of a person of interest with a history of positive or de-escalating interactions with the public may be labeled as “de-escalating”. Para 42, teaches that tone data gathered is analyzed by sentiment module to detect escalation or de-escalation patterns in order to obtain escalation data).
With regards to claim 4, Childress teaches the system of claim 2 wherein the interactions of interest include whether the figure of authority is using respectful language or negative language (Paragraphs 41-42, teach that said sentiment module conducts a sentiment analysis of audio data associated with an interaction event in order to obtain tone data. The sentiment module analyzes audio data for tone type and for change in tone over time. Tone type can be a positive tone or a negative tone. For example, negative tone types that may be identified include tone types that indicate that a person is scared, angry, or worried. Audio data is first converted into text data, and then the sentiment module analyzes the text data for tone type and for change in tone over time. The tone data gathered is analyzed by sentiment module to detect escalation or de-escalation patterns in order to obtain escalation data. By way of example, the amount of a particular type of tone e.g., positive or negative in each section of an interaction event may be analyzed by sentiment module to determine if the amount of the particular type of tone is increasing or decreasing from one section to the next. By way of example, if positive words or phrases increase over time in an interaction event, sentiment module may determine that a de-escalation pattern has occurred. In aspects, analyses of change in tone over time from a beginning to an end of a conversation, as well as the combination of tones being used, can indicate an escalating or de-escalating situation).
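By way of illustration only, the detection of escalation or de-escalation patterns from the change in tone across sections of an interaction, as described above, could be sketched as follows; the threshold value and labels are hypothetical.
    def escalation_pattern(section_tones, threshold=0.0):
        # section_tones: numeric tone values for successive sections of one
        # interaction event (more positive = more positive tone).
        changes = [b - a for a, b in zip(section_tones, section_tones[1:])]
        trend = sum(changes)
        if trend > threshold:
            return "de-escalating"  # tone becomes more positive over the event
        if trend < -threshold:
            return "escalating"     # tone becomes more negative over the event
        return "neutral"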
With regards to claim 5, Childress teaches the system of claim 2 wherein the at least one segment is at least one first segment, the non-transitory, processor-readable medium further storing instructions to cause the processor to re-segment the transcription based on the audio of the at least one speaker to produce at least one second segment, the audio of the at least one speaker being scored based on the at least one second segment to identify the interactions of interest (Para 48, teaches that escalation data from one group of participants can be compared to a second group of participants. Escalation data from client facing personnel in a first service area may be compared to escalation data from client facing personnel in a second service area to identify any differences in average sentiments scores with respect to the first and second service areas. Sentiment module is configured to combine sentiment scores for each of the plurality of sections “n” in an interaction event to determine an overall sentiment score for the interaction event. An interaction event can be divided into “n” sections with each section representing a five-minute recording interval and “n” sections can each represent a predetermined number of characters in a text file).
With regards to claim 7, Childress teaches the system of claim 2 wherein the figure of authority is anonymously identified based on voice quality (Para 40, teaches that sentiment module identifies one or more interaction events within the recording data. The sentiment module will automatically detect a beginning and an ending of an interaction event based on whether audio data of recording data meets a predetermined threshold value e.g., the sound level of the audio data exceeds a predetermined value).
With regards to claim 8, Childress teaches the system of claim 1 wherein the transcription identifies whether audio of at least one other speaker is included on the at least one segment (Paragraphs 15-16, teach that data-points constituting an interaction between a person of interest and the public can be identified within the classified video data. A sentiment-analysis program is used to track how a tone progresses through the course of an interaction between a person of interest and the public).
With regards to claim 10, Childress teaches the system of claim 1 wherein the system is further operable to identify events that may have occurred in the body camera video track based on language cues in the at least one segment (Para 16, teaches a sentiment-analysis program that is used to track how a tone progresses through the course of an interaction between a person of interest and the public. Tone is intended to mean an indication of a mental state of a person or persons. By way of example, a tone can indicate an emotion, such as joy, cheerfulness, anger, fear, sadness and disgust. Tone analysis may also be referred to as sentiment analysis. An audio file associated with a video file of interest can be transcribed into text, and thereafter analyzed using tone analysis. Different tone analysis methods, such as semantic or linguistic analysis, can be carried out. Linguistic analysis can be used to detect and interpret emotional, social and language cues found in text).
With regards to claim 11, Childress teaches the system of claim 10 wherein the system is further operable to: compress the body camera video track based on the events (Paragraphs 39-40, teach that the recording data could be in the form of audio-video data recorded by a public safety officer's body camera. It should be understood that the recording data may contain data that is not relevant to the escalation detection system, e.g., audio-video data recorded by a public safety officer's body camera while they are driving between events may not contain any interaction events between the officer and a member of the public. Accordingly, the sentiment module identifies one or more interaction events within the recording data. An interaction event may be identified in a number of ways, e.g., the sentiment module will automatically detect a beginning and an ending of an interaction event based on whether audio data of the recording data meets a predetermined threshold value, e.g., if the sound level of the audio data exceeds a predetermined value. The beginning and end of an interaction event will be automatically determined by the sentiment module based on the presence of one or more participants within a video frame, i.e., one or more participants are visually recorded for a predetermined amount of time. Manual detection of an interaction event, alone or in combination with automated techniques using the sentiment module, is utilized to identify a beginning and an end of an interaction event).
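By way of illustration only, identifying interaction events from a sound-level threshold, and keeping only those spans of the footage, could be sketched as follows; the frame granularity and the threshold are hypothetical.
    def interaction_events(frame_levels, frame_seconds=1.0, level_threshold=0.2):
        # frame_levels: one audio level (e.g., RMS) per frame of the recording.
        # Returns (start_s, end_s) spans whose sound level exceeds the threshold;
        # frames outside these spans could be dropped to compress the video track.
        events, start = [], None
        for i, level in enumerate(frame_levels):
            if level > level_threshold and start is None:
                start = i * frame_seconds                  # beginning of an interaction event
            elif level <= level_threshold and start is not None:
                events.append((start, i * frame_seconds))  # end of the interaction event
                start = None
        if start is not None:
            events.append((start, len(frame_levels) * frame_seconds))
        return events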
With regards to claim 12, Childress may not explicitly detail the limitation wherein the system operates in real-time. However, Qu teaches this aspect (see para 55).
Childress and Qu can be considered as analogous art as they belong to a similar field of endeavor in speech processing. It would thus have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to advantageously combine the teachings of Qu (use of per-word start and end timestamps to select segments of a voice input) with those of Childress (use of body camera footage to analyze a speech-based interaction between two parties) so as to provide an improved visual analytics system that allows users to understand good prosodic patterns in high-quality speech samples, and thus to discover good narration strategies (Qu, para 13).
With regards to claims 13-17 and 19-20, these are computer readable medium (CRM) claims corresponding to apparatus claims 1-5 and 7-8. The two sets of claims are related as a CRM and an apparatus using the same, with each claimed system element's function corresponding to a claimed CRM step. Accordingly, claims 13-17 and 19-20 are rejected under the same rationale as applied above with respect to apparatus claims 1-5 and 7-8. Para 5 of Childress teaches such a computer readable medium.
6. Claims 6 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Childress in view of Qu and further in view of Garrido (U.S. Patent Application Publication # 2018/0350389).
With regards to claim 6, although Childress and Qu teach the system of claims 1 and 2, they may not explicitly detail the limitation wherein scoring the audio includes an analysis for word disfluencies or filler words to analyze speaker confidence. However, Garrido teaches this aspect (Paragraphs 25-38 and figures 1-2, teach that individual speaking traits may not be equally indicative of a particular subjective criteria for which the spoken input is being evaluated, e.g. if the subjective criteria, for which the spoken input is being evaluated, is confidence, then the use of filler sounds, as would be reflected by the filler sounds score can be more consequential, to a determination of whether the spoken input sounds confident or not, than the vocal fry score. Conversely, if the subjective criteria, for which the spoken input is being evaluated, is listener engagement, or entertainment, then filler sounds, such as reflected by the filler sounds score can be less significant than tag questions reflected by the tag questions score).
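By way of illustration only, weighting a filler-sound score more heavily than another speaking trait when evaluating the confidence criterion, as discussed for Garrido above, could be sketched as follows; the filler list and the weights are hypothetical and are not taken from the reference.
    FILLERS = {"um", "uh", "er", "like", "basically"}

    def filler_score(transcript_words):
        # Fraction of transcribed words that are filler sounds or disfluencies.
        if not transcript_words:
            return 0.0
        hits = sum(1 for w in transcript_words if w.lower() in FILLERS)
        return hits / len(transcript_words)

    def confidence_score(transcript_words, vocal_fry_score, w_filler=0.8, w_fry=0.2):
        # Higher is more confident; the filler-sound score is weighted more heavily
        # than the vocal-fry score, mirroring the rationale in the paragraph above.
        return 1.0 - (w_filler * filler_score(transcript_words) + w_fry * vocal_fry_score)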
Childress, Qu and Garrido can be considered as analogous art as they belong to a similar field of endeavor in speech processing. It would thus have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to advantageously combine the teachings of Garrido (use of scoring that accounts for word disfluencies or filler words to analyze speaker confidence) with those of Childress and Qu so as to provide feedback regarding whether the user speaking meets subjective criteria, such as whether they are speaking in a persuasive or confident manner (Garrido, para 2).
With regards to claim 18, this is a CRM claim corresponding to apparatus claim 6. The two claims are related as a CRM and an apparatus using the same, with each claimed system element's function corresponding to a claimed CRM step. Accordingly, claim 18 is rejected under the same rationale as applied above with respect to apparatus claim 6. Para 5 of Childress teaches such a computer readable medium.
7. Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Childress in view of Qu and further in view of Bonser (U.S. Patent Application Publication # 2018/0350389).
With regards to claim 9, although Childress and Qu teach the system of claims 1 and 2, they may not explicitly detail the limitation wherein the audio of the at least one other speaker is either selectively removed or analyzed by the system. However, Bonser teaches this aspect (Para 12, teaches a parsing process in which an audio stream is analyzed to produce a speaker record automatically identifying one or more portions of the audio stream that correspond to at least one known speaker profile).
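By way of illustration only, producing a speaker record that tags portions of an audio stream against a known speaker profile, so that other speakers' audio can be selectively removed or analyzed separately, could be sketched as follows; the embedding and similarity functions are assumed to be supplied by the caller and are not drawn from Bonser, which does not specify the matching mechanism in the cited passage.
    def speaker_record(portions, embed, profile_embedding, similarity, threshold=0.8):
        # portions: list of (start_s, end_s, audio) portions of the audio stream.
        # embed / similarity: caller-supplied speaker-embedding and similarity functions.
        # Returns (start_s, end_s, matches_profile) records; non-matching portions can
        # then be either removed from further processing or analyzed separately.
        records = []
        for start, end, audio in portions:
            score = similarity(embed(audio), profile_embedding)
            records.append((start, end, score >= threshold))
        return records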
Childress, Qu and Bonser can be considered as analogous art as they belong to a similar field of endeavor in conversation analysis systems. It would thus have been obvious to one having ordinary skill in the art, before the effective filing date of the claimed invention, to advantageously combine the teachings of Bonser (audio of the at least one other speaker is either selectively removed or analyzed by the system) with those of Childress and Qu so as to provide an improved system of generating a transcript of a multi-party communication (Bonser, para 7).
Conclusion
8. The following prior art, made of record but not relied upon, is considered pertinent to applicant's disclosure: Zhang (U.S. Patent Application Publication # 2022/0277752 A1), Chen (U.S. Patent Application Publication # 2021/0304628 A1). These references are also included in the PTO-892 form attached to this Office action.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. If you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). In case you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NEERAJ SHARMA whose contact information is given below. The examiner can normally be reached on Monday to Friday 8 am to 5 pm. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Louis-Desir can be reached on 571-272-7799 (Direct Phone). The fax number for the organization where this application or proceeding is assigned is 571-273-8300.
/NEERAJ SHARMA/
Primary Examiner, Art Unit 2659
571-270-5487 (Direct Phone)
571-270-6487 (Direct Fax)
neeraj.sharma@uspto.gov (Direct Email)