Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1 and 3-20 are pending. Claim 2 is canceled. Independent Claims 1, 19, and 20 have been amended to incorporate the substance of canceled Claim 2 and further limiting language.
This Application was published as U.S. 20250028742.
Apparent priority: 17 July 2023.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-9, 15-16, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kishan (U.S. 20100268534) in view of Biswas (U.S. 20200027572).
Regarding Claim 1, Kishan teaches:
1. A method comprising:
receiving one or more audio recordings that are part of a conversation by one or more speakers; [Kishan, Abstract. Figures 1 and 2 showing multiple users with User ID 1, User ID 2 in Figure 1 and User A, User B, User C in Figure 2 each having his own microphone and participating in a conversation via their “communication device 104/108.”]
using speech recognition to convert the audio recordings to full text transcriptions of the audio recordings; [Kishan, Figure 1, “Transcription Application 110/111” on each user computing device and “Transcription Applications 224A, 224B, 224C” in Figure 2 on “Computing Devices 222A, 222B, 222C.” Figure 1, “Transcribed Text” on each “computing device.” Figures 4A, 4B, 410, 430: “Recognize Input, Save Recognized Text to Document with Timestamp.”]
applying artificial intelligence to generate text snippets from the full text transcriptions; and [Kishan, Figure 3B teaches redacting or changing the transcript before sending it to be merged with the other transcripts. “[0041] FIG. 3B is similar to FIG. 3A except that additional privacy is provided, by needing consent to release the transcript after the conversation or some part thereof concludes, instead of beforehand (if consent is used at all) as in dynamic live transcription….” “[0042] This addresses privacy because each user's own voice is separately recognized, and in this mode users need to explicitly opt-in to share their transcription side with others. User's may review (or have a manager/attorney review) their text before releasing, and the release may be a redacted version. A section of transcribed speech that is removed or changed may be simply removed, or marked as intentionally deleted or changed. A user may make the release contingent on the other user's release, for example, and the timestamps may be used to match each user's redacted parts to the other's redacted parts for fairness in sharing.” Figure 4B, 440: “[0058] In the post-transcription consent model, the sending at step 440 may be to an intermediary service or the like that only forwards the text if the other user's text is received. Some analysis may be performed to ensure that each user is sending corresponding text and timestamps that correlate, to avoid a user sending meaningless text in order to receive the other user's correct transcripts; an audio recording may ensure that the text can be recreated, manually if necessary. Merging may also take place at the intermediary, which allows matching up redacted portions, for example.”]
posting the text snippets alongside the corresponding audio recordings. [Kishan, Figures 3A and 3B show the screen monitor with the posted merged text portions and the recording available for review. Figure 1, “Merge/Release mechanism 130.” Figure 4A, 4B, 414. “[0043] To help maintain context and for other reasons, the actual audio may be recorded and saved, and linked to by links embedded in the transcribed text, for example. Note that the audio recording may have a single link thereto, with the timestamps used as offsets to the appropriate time of the speech. In on implementation, the transcript is clickable, as each word is time-stamped (in contrast to only the utterance). Via interaction with the text, the text or any part thereof may be copied and forwarded along with the link (or link/offset/duration) to another party, which may then hear the actual audio. Alternatively, the relevant part of the audio may be forwarded as a local copy (e.g., a file) with the corresponding text.” “[0036] …As another example, written call transcripts (automatically generated with the users' consent as needed) may be unified with other text communication, such as seamlessly threaded with e-mail, instant messaging, document collaboration, and so forth. …” See Also [0037]. “[0050] … Similarly, step 414 represents receiving the text from the other user or users, and dynamically merging it into the transcript based on its timestamp data. An alternative is for the clients to upload their individual results to a central server, which then handles merging. Merging can be done for both the transcript and the audio.” Figure 4B, 442: “[0057] If allowed, step 440 sends the document to the other user for merging with that user's recognized text. Step 442 receives the other document for merging, merges it, and outputs it in some suitable way, such as a document or email thread for saving. Note that the receiving, merging and/or outputting at step 442 may be done at each user's machine, or at a central server.” “[0052] When the transcription application is done, the transcription may be output in some way. For example, it may become part of an email chain as described above, saved in conjunction with an audio recording, and so forth.”]
Kishan does teach an automatic feature for releasing the snippet/selecting the snippets that can be released: “[0042] … A user may make the release contingent on the other user's release, for example, and the timestamps may be used to match each user's redacted parts to the other's redacted parts for fairness in sharing.” “[0058] … Some analysis may be performed to ensure that each user is sending corresponding text and timestamps that correlate, to avoid a user sending meaningless text ….”
However, Kishan does not use the term “AI,” and a secondary reference is added to address this terminology.
Biswas teaches:
applying artificial intelligence to generate text snippets from the full text transcriptions; and [Biswas summarizes the transcripts of a conversation between customers and customer service reps (Figure 1A) and can even provide portions of the summary (Figure 1D) to the customer. Thus, it generates snippets in one or two consecutive rounds. Biswas uses a trained neural network model for summarization/generating snippets. See Figure 2, 220, 225. “[0075] FIG. 2 is a diagram illustrating an example 200 of training and using a machine learning model in connection with generating a summary of a multi-speaker conversation. …” “[0083] As shown by reference number 220, the machine learning system may train a machine learning model using the set of observations and using one or more machine learning algorithms, …” The features/parameters trained for, as shown in Figure 2 at 210, include the number of speakers, topic models, and relative summary length.]
Kishan and Biswas both pertain to summarizing conversations, and it would have been obvious to use the machine learning system of Biswas to train the summarization model of Kishan as a more modern method of performing the task. This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
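For purposes of illustrating the claimed pipeline as mapped above, a minimal sketch follows. It is not the implementation of Kishan or Biswas; the function names (transcribe_audio, generate_snippets, post_conversation) and the first-sentences snippet heuristic are hypothetical stand-ins for a speech recognizer and a trained summarization model.

```python
# Illustrative sketch only; not the implementation of Kishan or Biswas.
# All function names and the snippet heuristic are hypothetical assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    audio_path: str   # link back to the corresponding audio recording
    snippet: str      # generated text snippet
    full_text: str    # full text transcription


def transcribe_audio(audio_path: str) -> str:
    """Placeholder for a speech-recognition step (a real system would call a speech-to-text engine)."""
    return f"full transcription of {audio_path}"


def generate_snippets(transcript: str, max_sentences: int = 2) -> str:
    """Trivial extractive stand-in for an AI summarizer: keep the first few sentences."""
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences])


def post_conversation(audio_paths: List[str]) -> List[Post]:
    """Receive recordings, transcribe, generate snippets, and post each snippet alongside its audio."""
    thread: List[Post] = []
    for path in audio_paths:
        full_text = transcribe_audio(path)
        snippet = generate_snippets(full_text)
        thread.append(Post(audio_path=path, snippet=snippet, full_text=full_text))
    return thread


if __name__ == "__main__":
    for post in post_conversation(["speaker_a_001.wav", "speaker_b_001.wav"]):
        print(post.audio_path, "->", post.snippet)
```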
Regarding Claim 2 (canceled; its substance is now recited in amended Claim 1), Kishan teaches:
2. The method of claim 1, wherein generating the text snippets occurs asynchronously as the audio recordings are received. [Kishan generates the transcript asynchronously in that it is done “independent of transmission” (Abstract), and this teaching corresponds to the definition provided in [0014] of the instant Application. “[0006] Briefly, various aspects of the subject matter described herein are directed towards a technology by which speech from communicating users is separately recognized as text of each user. The recognition is performed independent of any transmission of that speech to the other user, e.g., on each user's local computing device. The separately recognized text is then merged into a transcript of the communication.” “[0007] In one aspect, speech is received from a first user who is speaking with a second user. The speech is recognized independent of any transmission of that speech to the second user (e.g., on a recognition channel that is independent of the transmission channel). Recognized text corresponding to speech of the second user is obtained and merged with the text of the first user into a transcript. Audio from separate streams may also be merged.” Kishan also generates snippets by removing/changing parts of the transcript, and this occurs after the transcript is generated, i.e., asynchronously. “[0042] … User's may review (or have a manager/attorney review) their text before releasing, and the release may be a redacted version. A section of transcribed speech that is removed or changed may be simply removed, or marked as intentionally deleted or changed. ….” The snippets are generated by the person redacting the transcript.]
Regarding Claim 3, Kishan teaches:
3. The method of claim 2, wherein new text snippets are generated when speakers speak, and the new text snippets are added to previously generated text snippets to create a growing thread of the conversation. [Kishan teaches the machine/system adding the newly arriving transcript that may be redacted ([0042]) leading to the generation of snippets which are subsequently placed in a thread. Figure 3B shows how snippets are added to the thread as approval is received at 344. “[0041] FIG. 3B is similar to FIG. 3A except that additional privacy is provided, by needing consent to release the transcript after the conversation or some part thereof concludes, instead of beforehand (if consent is used at all) as in dynamic live transcription. One difference in FIG. 3B from FIG. 3A is a placeholder 344 that marks the other user's transcribed speech as having taken place, but not yet being available, awaiting the other user's consent to obtain it.”]
Regarding Claim 5, Kishan generates the snippets by user modification (redaction) of the transcript rather than by AI topic detection.
Biswas teaches:
5. The method of claim 1, wherein generating the text snippets from the full text transcriptions comprises:
the artificial intelligence detecting topics from the full text transcriptions and then selecting one or more sentences from the full text transcriptions about the detected topics. [Biswas, Figure 1C shows that the summarization module 112 summarizes the speaker transcripts based on key terms 140 extracted by the topic model to generate the conversation summary 145. “[0050] As shown in FIG. 1C, and by reference number 140, the summarization system 106 may summarize the speaker transcripts based on key terms. For example, the summarization system 106 (e.g., using the summarization module 112) may generate a first transcript summary of the first speaker transcript based on the set of key terms associated with the first topic (hereafter referred to “first set of key terms”), as described below. Additionally, the summarization system 106 (e.g., using the summarization module 110) may generate a second transcript summary of the second speaker transcript based on the set of key terms associated with the second topic (hereafter referred to “second set of key terms”), as described below.”]
Rationale for combination as provided for Claim 1. Summarization is from Biswas and details of it come from Biswas under the same rationale.
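For illustration only, the following is a toy sketch of topic-driven extractive snippet selection in the spirit of the claim language and Biswas's key-term-based summarization. It is not Biswas's trained neural network or topic model; the keyword-frequency heuristic, stop-word list, and function names are illustrative assumptions.

```python
# Toy sketch of topic-driven extractive snippet selection; not Biswas's
# implementation. A keyword-frequency heuristic stands in for a trained topic model.
import re
from collections import Counter
from typing import List

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "in", "it", "that", "i", "you"}


def detect_key_terms(transcript: str, n_terms: int = 5) -> List[str]:
    """Approximate 'topic detection' by taking the most frequent content words."""
    words = [w for w in re.findall(r"[a-z']+", transcript.lower()) if w not in STOPWORDS]
    return [term for term, _ in Counter(words).most_common(n_terms)]


def select_snippets(transcript: str, key_terms: List[str], max_sentences: int = 3) -> List[str]:
    """Select the sentences that mention the detected key terms most often."""
    sentences = [s.strip() for s in re.split(r"[.!?]", transcript) if s.strip()]
    scored = sorted(
        sentences,
        key=lambda s: sum(s.lower().count(t) for t in key_terms),
        reverse=True,
    )
    return scored[:max_sentences]


if __name__ == "__main__":
    text = ("My router keeps dropping the connection. The connection fails every "
            "evening. I already restarted the router. The weather was nice today.")
    terms = detect_key_terms(text)
    print("key terms:", terms)
    print("snippets:", select_snippets(text, terms, max_sentences=2))
```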
Regarding Claim 6, Kishan generates the snippets by user modification (redaction) of the transcript.
Biswas teaches:
6. The method of claim 1, wherein generating the text snippets from the full text transcriptions comprises:
the artificial intelligence selecting portions of the full text transcriptions that summarize what speakers are saying. [Biswas, Figure 1A showing the “summarization module 112” and Figure 1D, first a summary is generated at 112 and stored at 155 and then a portion of the conversation summary may further be selected at 160. Both operate on the full transcripts obtained by the “transcription system 104.”]
Rationale as provided for Claim 1. AI was introduced from Biswas.
Regarding Claim 7, Kishan teaches “topic-clustering” in [0045] as one of the methods of mining the conversation data but does not use it for generating the snippets, which are generated by a user redacting or changing portions of the transcript.
Biswas teaches:
7. The method of claim 1, wherein generating the text snippets from the full text transcriptions comprises:
the artificial intelligence detecting a topic of an original posting that started the conversation and [Biswas is directed to summarizing a conversation between a customer and a customer service rep based on topics that are detected by extracting keywords. Figure 1C, “Topic Models.” “[0036] A topic model may refer to a model (e.g., a statistical model) that analyzes text …. Based on analyzing the text, the topic model may identify one or more topics associated with the text and identify a set of key terms associated with a respective topic of the one or more topics….” “A system may separate a transcript of a conversation into a first section corresponding to a first speaker in the conversation, and a second section corresponding to a second speaker in the conversation. The system may…determine, based on one or more topic models, a first set of key terms associated with the first speaker transcript and a second set of key terms associated with the second speaker transcript. …” Abstract.]
then selecting text snippets from replies that are about the detected topic. [Biswas, Figure 1C shows that the summarization module 112 summarizes the speaker transcripts based on key terms 140 extracted by the topic model to generate the conversation summary 145. [0050]. Figure 1D shows that the summarization module 112 provides a portion/snippets of the conversation summary 160 to the user device 102. Considering that Biswas is set in the context of customer service it would involve questions from the customer and answers by the representative/second summary. “[0071] As shown in FIG. 1D, and by reference number 160, the summarization system 106 may provide the conversation summary. For example, the summarization system 106 may provide a portion of the conversation summary using the customer information. For instance, the summarization system 106 may transmit a portion of the conversation summary (e.g., a portion or an entirety of the second speaker transcript summary) to the customer to memorialize a resolution of an issue associated with the conversation and to prevent additional telephone calls from the customer regarding the same issue….”]
Rationale for combination as provided for Claim 1. Summarization by AI is from Biswas and details of it come from Biswas under the same rationale.
Regarding Claim 8, Kishan lets the user redact and modify the transcript.
Biswas teaches:
8. The method of claim 1, wherein generating the text snippets from the full text transcriptions comprises:
the artificial intelligence selecting portions of the full text transcriptions that include words or phrases that are trending online and/or that are preselected. [Biswas generates its summaries based on key terms that are preselected. Figure 1A, “Key Term Extraction Module 110” and Figure 1C showing the speaker key terms being extracted according to the Topic Models.]
Rationale for combination as provided for Claim 1. Summarization by AI is from Biswas and details of it come from Biswas under the same rationale.
Regarding Claim 9, Kishan selects portions by redaction or modification but does not necessarily perform standard summarization.
Biswas teaches:
9. The method of claim 1, wherein generating the text snippets from the full text transcriptions comprises:
the artificial intelligence summarizing the full text transcriptions. [Biswas, Figures 1A and 1B show summarization system 106 operating on the output of the transcription system 104.]
Rationale for combination as provided for Claim 1. Summarization by AI is from Biswas and details of it come from Biswas under the same rationale.
Regarding Claim 15, Kishan teaches:
15. The method of claim 1, wherein the text snippets are in multiple languages based on languages spoken in the audio recordings. [Kishan is directed to the use of a personalized language model in order to achieve better transcripts. Abstract. Kishan does mention foreign language as well: “[0044] Another type of interaction may tie the transcript to a dictionary or search engine. For example, by hovering the mouse pointer over a transcript, foreign language dictionary software may provide instant translations for the hovered-over word (or phrase). …” This teaching implies that the transcript includes at least some phrases in a different language.]
Regarding Claim 16, Kishan does not elaborate on the automatic process of selecting and posting the snippets/redacted transcripts.
Biswas teaches:
16. The method of claim 1, wherein the artificial intelligence dynamically switches between different techniques for generating the text snippets based on a success metric. [Biswas summarizes based on keywords and then provides a portion (snippet) at 160 of Figure 1D based on different criteria. “[0071] As shown in FIG. 1D, and by reference number 160, the summarization system 106 may provide the conversation summary. For example, the summarization system 106 may provide a portion of the conversation summary using the customer information. For instance, the summarization system 106 may transmit a portion of the conversation summary (e.g., a portion or an entirety of the second speaker transcript summary) to the customer to memorialize a resolution of an issue associated with the conversation and to prevent additional telephone calls from the customer regarding the same issue. …” The metric used in Biswas is success in resolving the issue. (Claim 17 defines the “success metric” and a different reference is cited for it.)]
Rationale for combination as provided for Claim 1. Summarization by AI is from Biswas and details of it come from Biswas under the same rationale.
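For illustration of the Claim 16 language only, the following hypothetical sketch shows one way a system might dynamically switch between snippet-generation techniques based on an observed success metric. It is not drawn from Biswas; the techniques, the moving-average update, and the metric values (which would come from downstream engagement data of the kind discussed for Claim 17) are assumptions.

```python
# Hypothetical sketch of switching snippet-generation techniques based on a
# success metric; illustrative of the claim language, not of Biswas.
from typing import Callable, Dict


def first_sentences(transcript: str) -> str:
    """Technique A: keep the first couple of sentences."""
    return ". ".join(transcript.split(". ")[:2])


def keyword_sentences(transcript: str) -> str:
    """Technique B: keep sentences containing a keyword of interest (fallback to a prefix)."""
    return ". ".join(s for s in transcript.split(". ") if "issue" in s.lower()) or transcript[:80]


class SnippetSelector:
    """Keep a running success score per technique and use the best-scoring one."""

    def __init__(self, techniques: Dict[str, Callable[[str], str]]):
        self.techniques = techniques
        self.scores: Dict[str, float] = {name: 0.0 for name in techniques}

    def generate(self, transcript: str) -> str:
        best = max(self.scores, key=self.scores.get)
        return self.techniques[best](transcript)

    def record_success(self, technique: str, metric: float) -> None:
        # Exponential moving average of the observed success metric.
        self.scores[technique] = 0.7 * self.scores[technique] + 0.3 * metric


if __name__ == "__main__":
    selector = SnippetSelector({"first": first_sentences, "keyword": keyword_sentences})
    selector.record_success("keyword", 0.9)  # e.g., more listeners/replies observed
    selector.record_success("first", 0.2)
    print(selector.generate("The issue was resolved. Thank you for calling. Goodbye."))
```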
Regarding Claim 18, Kishan teaches:
18. The method of claim 1, wherein the one or more audio recordings comprise a plurality of audio recordings that are part of the conversation between multiple speakers. [Kishan, Figures 1 and 2 show the conversation between multiple speakers. “Described is a technology that provides highly accurate speech-recognized text transcripts of conversations, particularly telephone or meeting conversations….” Abstract. “[0022] … As can be readily appreciated, more than two users may be participating in the conversation. Further, not all users in the conversation need to be participating in the transcription process.”]
Claim 19 is a computer program product claim with limitations corresponding to the limitations of method Claim 6 and is rejected under a similar rationale. Hardware components, including a processor and a computer readable medium, are shown in Kishan, Figure 5.
Claim 20 is a system claim with limitations corresponding to the limitations of Claim 1 and is rejected under a similar rationale. Hardware components, including a processor and a computer readable medium, are shown in Kishan, Figure 5.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Kishan and Biswas further in view of Adlersberg (U.S. 11334618).
Regarding Claim 4, Kishan teaches that the snippets include links to the audio recordings. [0043]. Biswas does not teach providing the audio again, because the participants have already had their conversation.
Adlersberg teaches:
4. The method of claim 1, wherein the audio recordings are posted before the text snippets are available, and the text snippets are posted later alongside the audio recordings. [Adlersberg, “Devices, systems, and methods of Capturing the Moment in audio discussions and recordings. A user utilizes an electronic device to record audio, or to participate in an audio conversation or an audio/video conversation. During the conversation, the user clicks on presses a button to create an audio bookmark at desired time-points. Later, the system generates separate a short audio-clip for a few sentences that were spoken before and after each such audio bookmarks. The system further generates an aggregated clip or summary of the bookmarked segments, as well as textual transcription of the bookmarked content.” “… The generated “audio trailer” may be downloaded or saved by User Adam as an audio file, or may be shared or sent or posted; and may be accompanied by a textual transcript that corresponds to the speech portions that were said within those time-slots that are part of the “audio trailer”, to thus provide a transcript version or a textual representation of that “audio trailer”….” 10:34-41.]
Kishan/Biswas and Adlersberg pertain to processing conversations, and it would have been obvious to post the audio as it comes in, followed by a transcript generated from the audio and any summaries/snippets of that transcript, as provided in Adlersberg. This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Kishan and Biswas further in view of Church (U.S. 20190370283).
Regarding Claim 10, Kishan selects portions by redaction or modification but does not necessarily perform standard summarization. The summarization of Biswas is not random; it is based on extracted keywords that are topic specific.
Church teaches:
10. The method of claim 1, wherein generating the text snippets from the full text transcriptions comprises:
the artificial intelligence selecting randomly from the full text transcriptions. [Church, “It is desirable to have audio and/or video systems and processing tools that can automatically record audio/video and analyze such recordings to capture material that may be relevant to a user. In one or more embodiments disclosed herein, a recording may be condensed by using one or more tools, including but not limited to, converting speech to text and searching for relevant content, keywords, and the like; detecting and sorting speakers and/or content (e.g., events in a conversation); removing non-substantive content (e.g., silences and other irrelevant content); adjusting the audio for increased playback speed; using prosodic and other indicia in the audio to identify areas of interest; performing diarization; using pseudo-random or random sample to select content; and other methods to extract information to provide a summary or representation of recorded content for review by a user.” “[0063] In embodiments, recorded data, e.g., recorded data that has undergone one or more of the non-substantive content removal processes discussed in Section B above, may be randomly or pseudo-randomly selected to be sampled for removal (or non-removal) of certain substantive content in order to generate a digest file that may serve as a consolidated content of the original recording. It is understood that the consolidated content may be the result of any type of content reduction. For example, in embodiments, rather than generating a truly random sample or non-substantive content, a digest file and the amount of randomly or pseudo-randomly selected recorded speech contained therein may be tailored to comprise recorded speech that is selected based on an association with a particular speaker, event (e.g., music), time (e.g., a child's regular playtime), keyword, or any other combination of one or more parameters, such as a user-selected runtime.”]
Kishan/Biswas and Church pertain to processing conversations and it would have been obvious to use the random selection of Church in place of the summarization of the combination as one method of generating summaries. This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
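For illustration only, the following is a short sketch of pseudo-random snippet sampling in the spirit of Church's digest generation ([0063]). The seed, sample size, and sentence splitting are illustrative assumptions, not Church's actual implementation.

```python
# Hedged sketch of pseudo-random snippet sampling in the spirit of Church's
# digest generation; parameters are illustrative assumptions only.
import random
import re
from typing import List


def random_snippets(transcript: str, n: int = 3, seed: int = 42) -> List[str]:
    """Pseudo-randomly sample sentences to form a condensed digest of the transcript."""
    sentences = [s.strip() for s in re.split(r"[.!?]", transcript) if s.strip()]
    rng = random.Random(seed)                   # pseudo-random: reproducible
    chosen = rng.sample(sentences, k=min(n, len(sentences)))
    return sorted(chosen, key=sentences.index)  # keep original speaking order


if __name__ == "__main__":
    demo = ("We discussed the budget. Then we reviewed the schedule. "
            "Lunch was ordered. The deadline moved to Friday. Meeting adjourned.")
    print(random_snippets(demo, n=2))
```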
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Kishan and Biswas further in view of Mara (U.S. 9569432).
Regarding Claim 11, Kishan teaches that the transcripts are redacted. This process can result in any combination of sentences passing through. Biswas does not select only the first sentences or the last sentences.
Mara teaches:
11. The method of claim 1, wherein generating the text snippets from the full text transcriptions comprises: the artificial intelligence selecting the first sentences, middle sentences or last sentences of the full text transcriptions and combining the selected sentences. [Mara: “…. In one implementation, the snippet selection circuit 130 generates the snippet from the content. For example, the snippet selection circuit 130 can generate snippets according to a snippet generation criteria. For example, if the content is a news article, the snippet selection circuit 130 can generate a snippet based on the snippet generation criteria indicating that the snippet should include one or more elements of the news article, such as a headline, an identified portion of text (e.g., the first sentence), a quote in the news article, an image or picture corresponding to the news article, or combination thereof. …. For example, a first snippet of a news article can have a headline, a thumbnail, and a first portion of text that corresponds to the first paragraph of the news article. A different snippet of the same news article can have the same headline, the same thumbnail, but a first portion of text that corresponds to the third paragraph of the news article.” 5:18-50.]
Kishan/Biswas and Mara pertain to processing conversations and generating summaries and it would have been obvious to use the method of selection of Mara in place of the summarization of the combination as one method of generating summaries. This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Claims 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Kishan and Biswas further in view of Goel (U.S. 20190294668).
Regarding Claim 12, Kishan teaches that the recordings are associated with the transcript so that the transcript can be corrected later: “[0058] … an audio recording may ensure that the text can be recreated, manually if necessary….” Biswas summarizes based on keywords and then provides a portion (snippet) at 160 of Figure 1D based on different criteria.
Goel teaches:
12. The method of claim 1, wherein the artificial intelligence also uses the audio recordings to generate the text snippets from the full text transcriptions. [Goel teaches that emotion/tone are deduced from the audio as context and are used for generating a summary. See Figure 2, “emotion estimation unit 214” and Figure 9, “tone analysis” of “speech” leading to “emotion extraction estimation.” See [0057]. Figures 2 and 7-8, “summary generation unit 204.” “[0064] … The multimedia analysis engine 110 generates the emotions based on at least one of … a speech tone analysis method and a background audio score analysis method. The multimedia analysis engine 108 generates the insights based on the generated analytics and the emotions.” “[0047] The summary generation unit 204 can be configured for generating the summary for the contents of the multimedia. The summary can include at least one of a text summary and a video/visual summary….” “[0072] Embodiments herein enable the multimedia analysis engine 110 to generate the visual summary based on at least one of the classified image frames, the identified keywords and/or keyphrases, the textual summary/the transcript, the detected objects and/or actions, the detected emotions and so on….” “[0077] Visual summary/video summary: The visual summary can be an automatic summarization of the content of the multimedia that is represented in the form of audio portions, the visual portions and so on. The visual summary can represent the most relevant content of the multimedia in a condensed form. The visual summary may be quick to read and view, so that the user can obtain the accurate information what has been spoken or visualized inside the multimedia. …” Visual Summary and Insight can both be snippets of the Claim and depend on Emotion detected from tone. “[0019] FIG. 9 is an example diagram illustrating generation of emotions and related analytics and insights, according to embodiments as disclosed herein;”]
Kishan/Biswas and Goel pertain to processing conversations, and it would have been obvious to refer back to the audio as done in Goel to generate a summary of higher confidence. This combination falls under simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Regarding Claim 13, Goel teaches:
13. The method of claim 12, wherein the artificial intelligence
determines emphases in the full text transcriptions based on voice intonations from the corresponding audio recordings, and [Goel: “[0004] There are also few unsolved problems in conventional media space, due to which the user may not able to consume the multimedia productively or effectively. The unsolved problems can be, but not limited to, selection of the multimedia which caters to user's media consumption (what is my situation), identifying intent (what I want to achieve), generation of insights (for example in case of a product survey/usage), generation of inferential data, categorization of emotions for better understanding of themes/emphasis/speaker intent/customer satisfaction, performing a search within the multimedia and across the multimedia with cross-linked and specific common information, performing zoom in to relevant portions, locating/filtering/avoiding media satisfying certain criteria, using the multimedia itself as a tool to evaluate user' feedback and product use statistics and glean valuable information and so on.” “[0059] In an embodiment, the insights generation unit 216 interprets a list of detected emotions (by the emotion estimation unit 214) and provides inferences. For example, the insights generation unit 216 provides the inferences about at least one of perceptions and emphases of a speaker, underlying themes in the contents of the multimedia, perceptions of satisfaction or dissatisfactions of products, interpersonal/human-object interactions and so on….”]
generates the text snippets based on the emphases in the full text transcriptions. [Goel, Visual Summary and Insight can both be snippets of the Claim and depend on Emotion detected from tone and emphases. “[0077] Visual summary/video summary: The visual summary can be an automatic summarization of the content of the multimedia that is represented in the form of audio portions, the visual portions and so on. The visual summary can represent the most relevant content of the multimedia in a condensed form. The visual summary may be quick to read and view, so that the user can obtain the accurate information what has been spoken or visualized inside the multimedia. …” “[0019] FIG. 9 is an example diagram illustrating generation of emotions and related analytics and insights, according to embodiments as disclosed herein;”]
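For illustration of the Claim 13 language only, the following hypothetical sketch ranks sentences by emphasis inferred from voice intonation. It assumes per-sentence acoustic features (pitch, energy) have already been extracted elsewhere; it is not Goel's emotion-estimation pipeline, and the field names and scoring formula are assumptions.

```python
# Hypothetical sketch of using voice intonation (pitch/energy) to pick emphasized
# sentences; not Goel's pipeline. Acoustic features are assumed to be precomputed.
from dataclasses import dataclass
from typing import List


@dataclass
class SpokenSentence:
    text: str
    mean_pitch_hz: float     # hypothetical acoustic features from the audio
    mean_energy_db: float


def emphasis_score(s: SpokenSentence, base_pitch: float, base_energy: float) -> float:
    """Higher pitch/energy relative to the conversation baseline -> more emphasis."""
    return ((s.mean_pitch_hz - base_pitch) / base_pitch
            + (s.mean_energy_db - base_energy) / abs(base_energy))


def emphasized_snippets(sentences: List[SpokenSentence], top_k: int = 2) -> List[str]:
    base_pitch = sum(s.mean_pitch_hz for s in sentences) / len(sentences)
    base_energy = sum(s.mean_energy_db for s in sentences) / len(sentences)
    ranked = sorted(sentences,
                    key=lambda s: emphasis_score(s, base_pitch, base_energy),
                    reverse=True)
    return [s.text for s in ranked[:top_k]]


if __name__ == "__main__":
    convo = [
        SpokenSentence("We met last week.", 110.0, -30.0),
        SpokenSentence("This deadline is critical!", 160.0, -18.0),
        SpokenSentence("Lunch was fine.", 115.0, -29.0),
    ]
    print(emphasized_snippets(convo, top_k=1))
```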
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Kishan and Biswas further in view of Marey (U.S. 20220012296).
Regarding Claim 14, Kishan, in [0044] quoted above, teaches that words or phrases of a posted transcript may need translation but does not teach detecting the spoken language of each speaker. Biswas does not include translation.
Marey teaches:
14. The method of claim 1, further comprising:
detecting a language spoken by one of the speakers; and [Marey, Figure 3 shows the translation options. “[0023] … In some embodiments, artificial intelligence models (e.g., neural networks) may be used to classify text into a particular language and perform machine translations of the text in the source language into a target language. …” “[0022] In some embodiments, a language of one or more posts on the social media platform may be considered by the system when generating recommended posts to a user. For example, FIG. 3 shows exemplary displays 300 in which user 304 has posted original post 306 in a particular language (e.g., Arabic).….”]
translating text snippets in different languages to the detected spoken language. [Marey, paragraphs [0022]-[0023] teach the various options for translation of the subsequent posts to the detected language of the post of the user 304. “[0023] In some embodiments, the system may generate as the recommended posts translated versions of posts 308, 310, semantically similar versions of posts 308, 310 but in the language of the original post, and/or commonly used text strings (e.g., from a religious text associated with the language of the user, such as passages that are commonly used in the context of the original post)….”]
Kishan/Biswas and Marey pertain to processing conversations, and it would have been obvious to translate the generated snippets as done in Marey in order to accommodate different languages. This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
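For illustration of the Claim 14 language only, the following sketch detects a speaker's language and translates snippets into it. The stop-word detector and the translate() placeholder are assumptions; a real system, as Marey describes in [0023], would use trained classification and machine-translation models.

```python
# Hedged sketch of language detection plus snippet translation. The detector is a
# trivial stop-word heuristic and translate() is a placeholder; no real MT API is used.
from typing import Dict, List, Set

COMMON_WORDS: Dict[str, Set[str]] = {
    "en": {"the", "and", "is", "you", "thanks"},
    "es": {"el", "la", "y", "gracias", "es"},
}


def detect_language(text: str) -> str:
    """Pick the language whose common words appear most often in the text."""
    tokens = set(text.lower().split())
    return max(COMMON_WORDS, key=lambda lang: len(tokens & COMMON_WORDS[lang]))


def translate(text: str, target_lang: str) -> str:
    """Placeholder for a machine-translation call."""
    return f"[{target_lang}] {text}"


def localize_snippets(snippets: List[str], speaker_utterance: str) -> List[str]:
    """Translate any snippet not already in the speaker's detected language."""
    target = detect_language(speaker_utterance)
    return [s if detect_language(s) == target else translate(s, target) for s in snippets]


if __name__ == "__main__":
    snippets = ["The router issue is resolved.", "Gracias, el problema es la red."]
    print(localize_snippets(snippets, "gracias, y el problema es la red"))
```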
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Kishan and Biswas further in view of Khanzode (U.S. 20160179954).
Regarding Claim 17, Kishan and Biswas do not teach the type of success metric used for determining the snippets.
Khanzode teaches:
17. The method of claim 16, wherein the success metric is based on at least one of (i) how many speakers listen to the audio recordings after viewing the text snippets, (ii) how many speakers participate in the conversation by replying after viewing the text snippets, and (iii) how many speakers share the conversation after viewing the text snippets. [Khanzode teaches that the portions/web pages that are more popular are assigned higher popularity scores which teaches the success metric of the Claim. “[0009] In this example, at least a portion of the metadata derivatives may include the participation score of the certain user with respect to the file. As yet another example, the files may include multiple web pages and at least a portion of the primary metadata may include a number of comments posted to the web pages. In this example, the method may include deriving the metadata derivatives by calculating popularity scores for the web pages based at least in part on the number of comments posted to the web pages. In this example, at least a portion of the metadata derivatives may include the popularity scores.” “[0057] As another example, files 210(1)-(N) may represent web pages and metadata 212(1)-(N) may represent a number of comments posted to files 210(1)-(N). In this example, mining module 106 may derive metadata derivatives from metadata 212(1)-(N) by calculating popularity scores for files 210(1)-(N) based on the number of comments posted to files 210(1)-(N). The metadata derivatives may include the popularity scores. The term “popularity score,” as used herein, generally refers to any type or form of score and/or classification that indicates the popularity level of certain digital content. For example, mining module 106 may give web pages that include more than five hundred comments a high popularity score, web pages that include between fifty and five hundred comments a medium popularity score, and web pages that include fewer than fifty comments a minimal popularity score.”]
Kishan/Biswas and Khanzode pertain to processing conversations, and it would have been obvious to use the success metric of Khanzode with the system of the combination to provide some gauge for the generated summaries. This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
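For illustration only, the following sketch applies Khanzode-style popularity scoring as a success metric for posted snippets. The three-tier thresholds (more than 500 comments high, 50 to 500 medium, fewer than 50 minimal) follow Khanzode [0057]; the engagement fields mirroring the three options recited in Claim 17 are hypothetical.

```python
# Sketch of Khanzode-style popularity scoring used as a success metric for posted
# snippets. Thresholds follow Khanzode [0057]; the engagement fields are hypothetical.
from dataclasses import dataclass


@dataclass
class SnippetEngagement:
    listens_after_view: int   # (i) speakers who listened to the audio after viewing the snippet
    replies_after_view: int   # (ii) speakers who replied after viewing the snippet
    shares_after_view: int    # (iii) speakers who shared the conversation after viewing the snippet


def popularity_score(comment_count: int) -> str:
    """Khanzode's three-tier popularity classification by comment count."""
    if comment_count > 500:
        return "high"
    if comment_count >= 50:
        return "medium"
    return "minimal"


def success_metric(e: SnippetEngagement) -> float:
    """Simple aggregate of the three engagement signals recited in Claim 17."""
    return float(e.listens_after_view + e.replies_after_view + e.shares_after_view)


if __name__ == "__main__":
    print(popularity_score(620))                         # -> "high"
    print(success_metric(SnippetEngagement(12, 4, 3)))   # -> 19.0
```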
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Kanuganti (U.S. 20210182020) Figure 2A, Audio Processing System 206 leads to storing Emotion in 212. Figure 2B shows that the Emotion information is obtained from Audio, Photos, and Social posts. Tone is one of the indicators of Emotion. The generated Transcripts store the Emotion/Tone as metadata: “[0103] The audio processing system can generate one or more emotion parameters 264. The audio processing system can identify emotions associated with a conversation by detecting audio characteristics (e.g., speech rate, tone, pitch, intonation, energy level) in the speech often associated with certain types of emotions. …” Then any generated summaries would use the context parameters including emotion/tone: “[0140] The data structures can be configured to store content data (e.g., emails, tweets, audio transcriptions), and context data or context parameters, which includes metadata (e.g., people, time, location, source) and generated data (e.g., emotions, summaries) as described herein. ….” “[0142] … As another example, a memory chunk can include, as generated data, a summary of all of the utterances in the child memory blocks.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659