DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendments filed 02/27/2026 have been accepted and considered in this office action. Claims 1-2, 4-10 have been considered. Claim 3 has been cancelled.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1-2 and 4-10 have been considered but are moot in view of new grounds of rejection necessitated by the applicant’s amendments to the claims.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1, 2, and 4-10 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a mental process without significantly more.
Independent claims 1, 9, and 10 recite a process that, as drafted under its broadest reasonable interpretation (BRI), covers measuring utterance timings, the times at which the speakers in the conversation switch, and the utterance lengths for each speaker, and evaluating the state of the conversation using the lengths of overlapping speech segments or segments of silence. For example, under the BRI the method claim relates to: while receiving real-time input of voice data comprising audio data, picked up by at least one sound pickup device, of a conversation including utterances of multiple speakers (a human can mentally receive real-time voice/audio data input and act as a sound pickup device, or record via ChatGPT and a generic computer microphone component):
estimating a starting time and an ending time of [[an]] each utterance of each main speaker among the multiple speakers, based on the voice data relating to a conversation that includes utterances of multiple speakers (a human can time the utterances of 2 or more speakers mentally or with stopwatches);
identifying a timing of a switch between the main speakers based on the estimated starting time and the estimated ending time (human can calculate this timing of switch mentally or with pen and paper); [[and]]
evaluating a state of the conversation by inputting dialogue information from before and after the identified timing of the switch into a trained model, the dialogue information being acquired based on the voice data, and the trained model being trained using training data that adopts the dialogue information as input data and adopts, as correct data, a label indicating the state of the conversation before and after the timing of the switch (a person can mentally evaluate the empathy/state of a conversation based on dialogue, apply labels to it mentally, and reach a conclusion/alert a bystander); and
generating and outputting evaluation data including a result of the evaluating, in response to detecting, based on an output of the trained model, that a predetermined alert condition regarding the state of the conversation has been satisfied (Human can use output from trained ChatGPT model to generate and output evaluation data based on their own mental predetermined alert condition regarding state of conversation),
wherein the dialogue information includes at least one of (i) a length of an overlapping segment in which utterance segments of the main speakers overlap or a length of a silent segment in which utterance segments of the main speakers do not overlap, and includes and (ii) a length of the utterance segment of each of the main speakers (human can time these segments, mentally or writing them down with pen and paper) and
wherein the evaluating computes the length of the overlapping segment by subtracting the estimated starting time of an utterance segment of one of the main speakers after the timing of the switch, from the estimated ending time of an utterance segment of another one of the main speakers before the timing of the switch (a human can subtract starting and ending times mentally or with a pen and paper in order to compute the length of overlapping segments).
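For illustration only, the subtraction recited in the final wherein clause can be sketched in Python as follows. The function names, the example timestamps, and the treatment of a negative difference as a silent gap are assumptions made for this sketch; they do not appear in the claims or in the cited references.

```python
def overlap_length(prev_end: float, next_start: float) -> float:
    """Ending time of the utterance segment before the switch minus the
    starting time of the utterance segment after the switch (seconds)."""
    return prev_end - next_start


def describe_switch(prev_end: float, next_start: float) -> str:
    """Assumption for this sketch: a positive difference is read as an
    overlapping segment, a negative difference as a silent segment."""
    diff = overlap_length(prev_end, next_start)
    if diff > 0:
        return "overlap"
    if diff < 0:
        return "silence"
    return "none"


# Hypothetical timestamps in seconds.
overlap_example = overlap_length(prev_end=12.5, next_start=12.0)
silence_example = overlap_length(prev_end=10.0, next_start=11.0)
print(overlap_example, describe_switch(12.5, 12.0))
print(silence_example, describe_switch(10.0, 11.0))
```

The sketch involves a single subtraction and a sign comparison per switch between speakers.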
As described above, these limitations can be carried out as a series of mental steps. The judicial exception is not integrated into a practical application because the only additional elements recited are a system comprising a computer processor and memory, which is general purpose hardware being used as a tool to implement the mental process, and non-transitory computer-readable program code, which is a conventional component that utilizes the basic functions of a computer.
The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, as described above, the only additional elements recited are a system comprising a computer processor and memory, which is general purpose hardware being used as a tool to implement the mental process, and non-transitory computer-readable program code, which is a conventional component that utilizes the basic functions of a computer.
The remaining dependent claims fail to add patent eligible subject matter to independent claim 1:
Claim 2 simply adds requirements for what counts as a main speaker utterance, which a human can check mentally or with a pen and paper through comparisons.
Claim 4 simply adds labeling the name of the coercive speaker and outputting the probability that the speakers speak coercively, which are classification/likelihood estimations a human can perform mentally or with a pen and paper based on observation of a conversation.
Claim 5 simply adds labeling whether the conversation is active and outputting a degree of activity, which a human can perform mentally when listening to a conversation.
Claim 6 simply adds outputting alert information in association with identified timing, which a human can perform by shouting during that identified time.
Claims 7 and 8 simply add iterative checking against a specific threshold which is basic counting and comparison that a human can do mentally or with a pen and paper.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim(s) 1-2, 4-10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Chawla et al. (hereinafter Chawla) (US 20220319535 A1) in view of Wang et al. (hereinafter Wang) (US 20200219517 A1).
Regarding claim 1, Chawla teaches: A non-transitory computer readable medium including computer executable instructions, wherein the instructions which, when executed by a hardware processor of a computer, cause the processor to perform processes a method comprising (Chawla, P[0004]):
while receiving real-time input of voice data comprising audio data, picked up by at least one sound pickup device, of a conversation including utterances of multiple speakers (Chawla, P[0012]: "receive audio data identifying a conversation including a plurality of speakers."):
estimating a starting time and an ending time of [[an]] each utterance of each main speaker among the multiple speakers, based on the voice data (Chawla, Abstract: "identify a plurality of speaker segments", P[0013]: "identifying start and end times of a speaker in the audio data.") relating to a conversation that includes utterances of multiple speakers;
[[and]]
evaluating a state of the conversation by inputting dialogue information from before and after the identified timing of the switch into a trained model, the dialogue information being acquired based on the voice data (Chawla, P[0012]: "utilizes machine learning models… calculate an empathy score based on the analysis of the plurality of speaker segments both audio and textually" (the "empathy score" here is the conversation state. By analyzing the "plurality of segments" the model inherently looks at the dialogue surrounding the transitions)), and the trained model being trained using training data that adopts the dialogue information as input data and adopts, as correct data, a label indicating the state of the conversation before and after the timing of the switch (Chawla, P[0050]: "The customer system may utilize the empathy score as additional training data for retraining the one or more of the rectification models, thereby increasing the quantity of training data" ("supervised learning" where the evaluated state (label) is fed back into the model as training data)); and
generating and outputting evaluation data including a result of the evaluating, in response to detecting, based on an output of the trained model, that a predetermined alert condition regarding the state of the conversation has been satisfied (Chawla, P[0049]: "may perform one or more actions based on the empathy score. In some implementations, the one or more actions include the customer system providing the empathy score for display, scheduling training" ("scheduling training" or "providing a refund" is the functional output of an alert condition being met)),
wherein the dialogue information includes at least one of (i) a length of an overlapping segment in which utterance segments of the main speakers overlap or a length of a silent segment in which utterance segments of the main speakers do not overlap, and includes and (ii) a length of the utterance segment of each of the main speakers (Chawla, P[0028]: "The plurality of errors may include… an overlapping speaker error (e.g., a percentage of time that a speaker, of multiple speakers of a speaker segment, does not get labeled)" ("percentage of time" (length) of overlap used as feature/error metric)) and
Chawla does not teach:
identifying a timing of a switch between the main speakers based on the estimated starting time and the estimated ending time
wherein the evaluating computes the length of the overlapping segment by subtracting the estimated starting time of an utterance segment of one of the main speakers after the timing of the switch, from the estimated ending time of an utterance segment of another one of the main speakers before the timing of the switch
However, Wang teaches:
identifying a timing of a switch between the main speakers based on the estimated starting time and the estimated ending time (Wang, P[0032]: "entry zt indicates whether or not a speaker change occurs at the corresponding embedding entry xt at time t… the diarization results 280 may predict a speaker change value 255 for each fixed-length segment 220. In the example shown, the speaker change values 255 may be represented as a sequence of change point indicators Z=(z.sub.1, z.sub.2, . . . , z.sub.T), where entry zt indicates whether or not a speaker change occurs at the corresponding embedding entry xt at time t.", Wang, P[0036]: "the size of each window 215 may be 240 milliseconds (ms) and the fixed overlap between each sliding window 215 may be 50-percent (50%)" (Wang describes a specific sequence of binary indicators that are uniquely determined by the speaker labels, and zt is a binary indicator that triggers at a specific time t. Wang explicitly ties this indicator to a specific time t, and Wang's segments (windows) are defined by specific millisecond durations and offsets, which constitute the start and end times of observation));
wherein the evaluating computes the length of the overlapping segment by subtracting the estimated starting time of an utterance segment of one of the main speakers after the timing of the switch, from the estimated ending time of an utterance segment of another one of the main speakers before the timing of the switch (Wang, P[0036]: "sliding windows 215 having a fixed size and a fixed overlap. For instance, the size of each window 215 may be 240 milliseconds (ms) and the fixed overlap between each sliding window 215 may be 50-percent (50%)" (a fixed overlap between windows of fixed size is the functional disclosure of the length calculation. The overlap length is needed to determine the next window's start position, so the system inherently performs the subtraction when calculating the overlap length or percentage (a 240 ms window with 50% overlap has a 120 ms overlap, derived from the starting and ending boundaries))).
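The 120 ms figure in the parenthetical above can be checked with a short sketch. The 240 ms window size and 50% overlap are the example values quoted from Wang, P[0036]; the function name, the choice of three windows, and the start time of 0 ms are assumptions made for this sketch. It illustrates the arithmetic only and is not a representation of how Wang's system is implemented.

```python
def window_bounds(size_ms: int, overlap_fraction: float, count: int) -> list[tuple[int, int]]:
    """Return (start, end) boundaries in ms for `count` sliding windows of
    fixed size and fixed fractional overlap, starting at 0 ms (assumed)."""
    step_ms = int(size_ms * (1 - overlap_fraction))
    return [(i * step_ms, i * step_ms + size_ms) for i in range(count)]


# Example values quoted from Wang, P[0036]: 240 ms windows, 50% overlap.
windows = window_bounds(240, 0.5, 3)

# Overlap between consecutive windows: ending boundary of the earlier
# window minus starting boundary of the later window.
overlaps = [windows[i][1] - windows[i + 1][0] for i in range(len(windows) - 1)]

print(windows)
print(overlaps)
```

Under these assumptions, each pair of consecutive windows overlaps by 120 ms, which matches the value stated in the rejection.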
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chawla in view of Wang. Doing so would have provided the speaker diarization and change-point detection methods of Wang (Wang, Abstract) with the empathetic state evaluation system of Chawla (Chawla, Abstract) and doing so would have improved the computational efficiency and accuracy of the state evaluation by ensuring the machine learning model is inputting dialogue information precisely “before and after” a verified speaker transition, resulting in a more robust and reliable conversation analysis system capable of triggering alert information with higher temporal resolution and mathematical certainty.
Regarding claim 2, Chawla in view of Wang teaches the medium according to claim 1
Chawla further teaches:
wherein the estimating estimates comprises estimating, as the utterances of the main speakers each utterance of each main speaker, an utterance which has an whose utterance segment having (i) has a length equal to or above a threshold (Chawla, P[0035], (Levenshtein thresholding is used to filter data/segments))
Chawla does not teach:
and whose utterance segment (ii) is not included [[in]] within another utterance segment
However, Wang teaches:
and whose utterance segment (ii) is not included [[in]] within another utterance segment (Wang, P[0024]: "interleaves the states of different RNN instances (i.e., different speakers) in the time domain.", Wang, P[0021], (describes a speech segmentation module which is configured to remove non-speech parts; a nested utterance is filtered/segmented (functionally reads on removing) as a separate "non-homogenous" part, effectively not including it in the primary speaker's segment)).
Regarding claim 4, Chawla in view of Wang teaches the medium according to claim 1
Chawla further teaches:
wherein: the label is a name of a main speaker who speaks coercively (Chawla, P[0048]: "The empathy score may provide an indication of whether one of the plurality of speakers, associated with the empathy score, is empathetic, neutral, or non-empathetic. The customer system may determine an empathy score for each speaker segment" ("non-empathetic" state reads on coercive speech)), and the evaluating outputs a probability with which the main speakers speak coercively as an evaluation result relating to the state of the conversation (Chawla, P[0023]: "The output of a clustering model may include a list of class labels, a plurality of confidence scores indicating a likelihood that the class labels accurately identify a speaker", P[0045]: "CNN" (in the context of machine learning/CNN, confidence scores indicating a likelihood read on probability)).
Regarding claim 5, Chawla in view of Wang teaches the medium according to claim 1
Chawla further teaches:
wherein: the label is a value relating to whether the conversation is active or not, and the evaluating outputs a degree of activity of the conversation as an evaluation result relating to the state of the conversation (Chawla, P[0028], (speech vs non-speech reads on active or not), P[0048], (the scores for each conversation state are also a numeric measure of degree of activity)).
Regarding claim 6, Chawla in view of Wang teaches the medium according to claim 1
Chawla further teaches:
wherein [[if]] the generating and outputting outputs alert information in response to a determination in the evaluating, based on the dialogue information, that a coercive switch between the main speakers is detected based on the dialogue information, the evaluating outputs alert information being output (Chawla, P[0012], (evaluation triggered by dialogue surrounding the transitions), P[0049], (action associated with score))
Chawla does not teach:
in association with the timing an identified timing of the switch at which the coercive switch between the main speakers is detected
However, Wang teaches:
in association with the timing an identified timing of the switch at which the coercive switch between the main speakers is detected (Wang, P[0032], (a specific timing is identified by Wang's change point indicator zt and the fact that the change happens at time t; the evaluation is performed on segments surrounding the transition and the alert is inherently timestamped to that specific switch)).
Regarding claim 7, Chawla in view of Wang teaches the medium according to claim 1
Chawla further teaches:
wherein [[if]] the generating and outputting outputs alert information in response to a determination in the evaluating, based on the dialogue information, that a coercive switch between the main speakers is detected in the conversation a number of times equal to or above a threshold (Chawla, P[0040]-P[0049], (iterative threshold checking where predetermined alert condition is the threshold which results in the action, one or more actions can include several "output" options as specified in P[0049]), P[0048], (non-empathetic state shows coercive switch)) based on the dialogue information, the evaluating outputs alert information.
Regarding claim 8, Chawla in view of Wang teaches the medium according to claim 1
Chawla further teaches:
wherein [[if]] the generating and outputting outputs alert information in response to a determination in the evaluating, based on the dialogue information, that the conversation is detected as being inactive a number of times equal to or above a threshold (Chawla, P[0040]-P[0049], (iterative threshold checking where predetermined alert condition is the threshold which results in the action, one or more actions can include several "output" options as specified in P[0049]), P[0028], (speech vs non-speech)) based on the dialogue information, the evaluating outputs alert information.
Regarding claim 9, claim 9 recites the apparatus corresponding to the computer executable instructions in a non-transitory computer readable medium presented in claim 1 and is rejected under the same grounds as above.
Additionally, the combination further discloses or makes obvious:
A conversation evaluation apparatus comprising processing circuitry configured to (Wang, P[0065])
Regarding claim 10, claim 10 recites the method corresponding to the computer executable instructions in a non-transitory computer readable medium presented in claim 1 and is rejected under the same grounds as above.
Additionally, the combination further discloses or makes obvious:
A conversation evaluation method executed under control of a hardware processor of a computer, the method comprising (Wang, P[0060]):
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHASHIDHAR S MANOHARAN whose telephone number is (571)272-6772. The examiner can normally be reached M-F 8:00-4:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SHASHIDHAR SHANKAR MANOHARAN/Examiner, Art Unit 2655
/ANDREW C FLANDERS/Supervisory Patent Examiner, Art Unit 2655